Correlation of Metrics for Automated Performance Problem Diagnosis

Motivation

In the last decade, application performance management (APM) solutions have been developed that support enterprises with monitoring capabilities and early detection of performance problems. Leading APM solutions mostly support only alerting and visualization of performance-relevant measures. The configuration of the software instrumentation, the diagnosis of performance problems, and the isolation of the concrete root cause(s) often remain error-prone and frustrating manual tasks. To this day, these tasks are performed by performance experts, who are costly and rare. To improve this situation, NovaTec Consulting GmbH and the University of Stuttgart (Reliable Software Systems Group) launched the collaborative research project diagnoseIT on "Expert-guided Automatic Diagnosis of Performance Problems in Enterprise Applications". The core idea is to formalize APM expert knowledge in order to automatically execute recurring APM tasks, such as configuring a meaningful software instrumentation and diagnosing performance problems to isolate their root cause. By delegating these tasks to diagnoseIT, experts no longer have to deal with similar problems over and over again. Instead, they can focus on more challenging (and interesting) tasks.

Problem

The automated diagnosis analyzes traces (individually or as a stream provided by the monitoring capabilities of APM solutions) based on an extensible set of rules. When a symptom is detected in a trace, the root cause diagnosis is started without the need for human interaction. Rules that localize the problem are applied first, followed by technology-specific and/or domain-specific rules, which attach semantic meaning to the isolated root cause. However, a performance problem that causes high CPU utilization can also slow down traces that are executed in parallel. In this case, many of the detected performance problems are merely side effects of the first one. To isolate the root cause in such a scenario, metrics collected from active and passive resources must be used as complementary information in the diagnosis process.
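To make the idea of correlating traces with resource metrics more concrete, the following is a minimal sketch in Python. It is not part of diagnoseIT or the Common Trace API; all names (Trace, CpuSample, classify_slow_traces) and both thresholds are hypothetical and chosen only for illustration. The sketch labels a slow trace as "cpu-correlated" if the mean CPU utilization sampled during its execution exceeds a threshold, and as "trace-local" otherwise.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trace:
    """Hypothetical, simplified trace: an ID plus start and end time in seconds."""
    trace_id: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


@dataclass
class CpuSample:
    """Hypothetical CPU measurement: timestamp in seconds, utilization in [0, 1]."""
    timestamp: float
    utilization: float


def mean_cpu_during(trace, samples):
    """Mean CPU utilization over the samples taken while the trace was running."""
    window = [s.utilization for s in samples
              if trace.start <= s.timestamp <= trace.end]
    return mean(window) if window else None


def classify_slow_traces(traces, samples, slow_threshold=1.0, cpu_threshold=0.9):
    """Label each slow trace by whether it overlaps a period of high CPU load.

    Both thresholds are illustrative assumptions, not values from diagnoseIT.
    """
    labels = {}
    for trace in traces:
        if trace.duration < slow_threshold:
            continue  # no symptom detected, nothing to diagnose
        cpu = mean_cpu_during(trace, samples)
        if cpu is not None and cpu >= cpu_threshold:
            labels[trace.trace_id] = "cpu-correlated"
        else:
            labels[trace.trace_id] = "trace-local"
    return labels


# Example data: the CPU is saturated during the first four seconds only.
traces = [
    Trace("t1", 0.0, 2.5),
    Trace("t2", 1.0, 3.0),
    Trace("t3", 10.0, 10.2),
    Trace("t4", 20.0, 22.0),
]
samples = ([CpuSample(float(ts), 0.95) for ts in range(0, 4)]
           + [CpuSample(float(ts), 0.30) for ts in range(4, 24)])

labels = classify_slow_traces(traces, samples)
print(labels)
```

In this sketch, slow traces that overlap the saturated period are candidates for side effects of a shared resource problem, whereas a slow trace under low CPU load more likely carries its own root cause. A real detection concept would have to consider further active and passive resources and fit into the rule processing of diagnoseIT.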

Tasks

  • Integration of the passive and active resource monitoring data in the generic data access model Common Trace API (CTA)
  • Development of detection concepts and rules to correlate the information from traces and passive or active resources
  • Evaluation of the concepts and rules in a case study

Challenges

  • Alignment of the detection concept with the diagnoseIT rule processing concept

Locations

  • Stuttgart (preferred)
  • Frankfurt (remote supervision)
  • Munich (remote supervision)
  • Berlin (remote supervision)

Contact