Background
Computers are very good at tracking what needs to be done and when. They have numerous sensors that can detect whether the hardware is running normally or starting to fail. They can also be taught to gather information about the operating system (Windows), database systems, and important applications (such as your line-of-business systems) to make sure they are operating at peak efficiency.
For years, people have used these capabilities to build software that monitors environments, locally and remotely. When a problem is detected, the information from the remote management software is used to assign people to fix the issue.
That’s where we were several years ago when we were asked to provide managed services for some of our larger customers. We ran remote management software across their large environments, and it alerted us whenever one of them had an issue. One of our biggest challenges was that problems arose so quickly and forcefully (and usually without warning) that by the time we could begin fixing them, they had escalated past the point where a simple solution was possible.
Examples of When This Could or Does Happen
These are some examples where Dexterity would be or has been very effective:
A More Realistic View
Issues like these cause trouble for most or all of your servers at the same time, so they all need to be addressed at the same time. If the servers are fixed one at a time, any of several outcomes may follow:
- The first server to be fixed could automatically be handed all of the workload that the remaining servers can’t handle, causing it to return to a failed state within a short period of time.
- A fix applied to one server could begin a chain reaction that would make other servers more difficult to fix.
- By the time the first server is fixed, the other servers may be past the point where relatively simple fixes will be effective.
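The first outcome in the list above can be illustrated with some simple arithmetic. The sketch below is only an illustration: the function name, server count, capacity, and load figures are all hypothetical and are not taken from any real environment or from Dexterity itself.

```python
def overloaded(total_load, healthy_servers, capacity_per_server):
    """Return True if the healthy servers cannot carry the total load.

    Assumes (hypothetically) that load is spread evenly across
    whichever servers are currently healthy.
    """
    if healthy_servers == 0:
        return True
    return total_load / healthy_servers > capacity_per_server


# Hypothetical farm: 4 servers, each able to handle 100 units of work,
# with 320 units of work arriving in total.
TOTAL_LOAD = 320
CAPACITY = 100

# Fixing one server at a time: the single repaired server inherits
# all 320 units and is pushed straight back into failure.
one_at_a_time = overloaded(TOTAL_LOAD, healthy_servers=1,
                           capacity_per_server=CAPACITY)

# Fixing all four together: each server carries 80 units and stays healthy.
all_together = overloaded(TOTAL_LOAD, healthy_servers=4,
                          capacity_per_server=CAPACITY)
```

With these assumed numbers, the lone repaired server is overloaded, while the same fix applied to all four servers at once leaves each of them within capacity.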
An Enterprise Immune System
Dexterity is designed to be like an immune system for an enterprise.
First, it routinely performs many of the maintenance tasks normally done by administrators or scheduled scripts, plus several others specifically designed to maintain good performance of enterprise line-of-business applications.
At regular intervals, it samples the health of the system, looking for deviations from normal patterns of behavior. It checks log files, services, the time the system has spent on certain activities since the previous check, the status of important system indicators, and more. If it notices an emerging pattern that indicates a period of unusually high activity may be approaching, it accelerates the maintenance activities that keep applications performing well, preventing problems such as running out of disk space, databases becoming less efficient, or tasks getting overloaded. If the pattern reverses (a false alarm, or a rise in activity that was only short-lived), the remediation schedule returns to normal maintenance levels.
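The sample, compare, and accelerate loop described above might be sketched roughly as follows. This is not Dexterity's actual implementation: the class name, the intervals, the window size, and the use of a simple standard-deviation test are all assumptions chosen only to make the idea concrete.

```python
from collections import deque
from statistics import mean, stdev


class AdaptiveScheduler:
    """Hypothetical sketch: shorten the maintenance interval while a
    health sample deviates from the recent baseline, and return to the
    normal interval once samples look normal again."""

    def __init__(self, normal_interval=3600, fast_interval=300,
                 window=20, threshold=3.0, min_spread=1.0):
        self.normal_interval = normal_interval  # seconds between routine runs
        self.fast_interval = fast_interval      # seconds between accelerated runs
        self.threshold = threshold              # allowed deviation, in spreads
        self.min_spread = min_spread            # floor so a flat baseline is not oversensitive
        self.baseline = deque(maxlen=window)    # recent "normal" samples only

    def observe(self, value):
        """Record one health sample; return the maintenance interval to use."""
        if len(self.baseline) >= 2:
            spread = max(stdev(self.baseline), self.min_spread)
            if abs(value - mean(self.baseline)) > self.threshold * spread:
                # Unusual activity: accelerate maintenance, and keep the
                # outlier out of the baseline so "normal" does not drift.
                return self.fast_interval
        self.baseline.append(value)
        return self.normal_interval


# Usage with made-up samples (for example, percent disk activity).
scheduler = AdaptiveScheduler(window=10)
intervals = [scheduler.observe(v) for v in [50, 52, 49, 51, 50, 48, 95, 51]]
```

In this sketch the spike to 95 switches maintenance to the fast interval, and the next ordinary sample returns it to the normal interval, mirroring the false-alarm case. A production system would likely look at several indicators and require a sustained trend rather than a single sample.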
All of this happens automatically, on every affected server, moments after the behavior changes. In a typical remote monitoring and staff resolution scenario, it could take an hour or more to locate the source of the problem and mobilize staff to correct it. Even once the staff know what the problem is, they are at a disadvantage whenever the number of machines starting to fail exceeds the number of people working to resolve the issue.