The truth is that there is always something going on. With over 50 shared IIS web servers across two datacenters, we are always troubleshooting something. Because of the dynamic nature of a shared hosting environment and the sheer number of domains and applications we host, it is impractical for us to monitor each individual domain. We can tell whether a server or service is operational as a whole, but not whether an individual domain is. At any given time, one user can monopolize an entire server's CPU and disk I/O resources by executing buggy code, effectively knocking out everyone else. We may see this event as a CPU spike and investigate further, but because our sensors sample 5-minute averages, short I/O spikes often resolve before we get to them. On the other hand, for serious conditions such as a degraded RAID array, we may opt to bring the server offline in order to speed up the rebuild or minimize the possibility of data loss.
This may be more evident to customers who host a large number of domains with us, distributed over many different servers. To them it may look somewhat like a game of “whack-a-mole”.
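To illustrate why a sensor that reports 5-minute averages can miss a short spike, here is a minimal sketch in Python. All of the numbers (20% baseline load, a 20-second spike at 100%, an 80% alert threshold, one reading per second) are made up for illustration and are not our actual monitoring settings.

```python
# Illustrative only: every value here is a hypothetical assumption,
# not a real sensor configuration.
WINDOW_SECONDS = 300      # one 5-minute sampling window, one reading per second
BASELINE_LOAD = 20.0      # assumed normal CPU load, in percent
SPIKE_LOAD = 100.0        # assumed load while one site runs buggy code
SPIKE_SECONDS = 20        # assumed duration of the spike
ALERT_THRESHOLD = 80.0    # assumed alert level on the averaged value


def window_average(samples):
    """Return the arithmetic mean of the readings in one window."""
    return sum(samples) / len(samples)


# Build one window of per-second readings with a short spike in the middle.
samples = [BASELINE_LOAD] * WINDOW_SECONDS
for second in range(100, 100 + SPIKE_SECONDS):
    samples[second] = SPIKE_LOAD

average = window_average(samples)
peak = max(samples)
alert_fired = average >= ALERT_THRESHOLD

print(f"peak load:        {peak:.1f}%")
print(f"5-minute average: {average:.1f}%")
print(f"alert fired:      {alert_fired}")
```

Under these assumptions the server is pegged at 100% for 20 seconds, yet the averaged reading comes out at about 25%, well below the alert level. Sites on that server would feel the slowdown, while the sensor shows only a small bump.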