Initial results from Javier Álvarez’s thesis work on modelling failure in distributed systems have been accepted for the 34th Symposium on Reliable Distributed Systems (SRDS 2015).
Javier Álvarez Cid-Fuentes, Claudia Szabo and Katrina Falkner, Online Behavior Identification in Distributed Systems. In Proceedings of the 34th Symposium on Reliable Distributed Systems (SRDS 2015).
The diagnosis, prediction, and understanding of unexpected behavior is crucial for long running, large scale distributed systems. However, existing works focus on the identification of faults in specific time moments preceded by significantly abnormal metric readings, or require a previous analysis of historical failure data. In this work, we propose an online behavior classification system to identify a wide range of undesired behaviors, which may appear even in healthy systems, and their evolution over time. We employ a two-step process involving two online classifiers on periodically collected system metrics to identify at runtime normal and anomalous behaviors such as deadlock, starvation and livelock, without any previous analysis of historical failure data. Our approach achieves over 80% accuracy in detecting unexpected behaviors and over 90% accuracy in identifying their type with a short delay after the anomalies appear, and with minimal expert intervention. Our experimental analysis uses system execution traces obtained from a Google cluster and from our in-house distributed system with varied behaviors, and shows the benefits of our approach as well as future research challenges.
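To make the two-step idea in the abstract concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not name the classifiers used, so this sketch swaps in a running z-score detector for step one (normal vs. anomalous) and a nearest-centroid classifier for step two (type of anomaly). The class names, the two example metrics (CPU utilisation and blocked-thread count), the threshold and the labelled examples are all assumptions made for illustration.

```python
import math


class OnlineAnomalyDetector:
    """Step 1 (hypothetical): flag a metric sample using running z-scores."""

    def __init__(self, n_metrics, threshold=3.0, warmup=5):
        self.n = 0
        self.mean = [0.0] * n_metrics
        self.m2 = [0.0] * n_metrics  # running sum of squared deviations
        self.threshold = threshold   # assumed cut-off, in standard deviations
        self.warmup = warmup         # samples accepted before judging any

    def is_anomalous(self, x):
        if self.n < self.warmup:
            return False
        for xi, mu, m2 in zip(x, self.mean, self.m2):
            std = math.sqrt(m2 / (self.n - 1))
            if abs(xi - mu) > self.threshold * max(std, 1e-9):
                return True
        return False

    def update(self, x):
        # Welford's online update of the per-metric mean and variance.
        self.n += 1
        for i, xi in enumerate(x):
            delta = xi - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (xi - self.mean[i])


class OnlineTypeClassifier:
    """Step 2 (hypothetical): nearest-centroid labelling of anomalous samples."""

    def __init__(self):
        self.centroids = {}  # label -> (sample count, mean vector)

    def learn(self, x, label):
        count, mean = self.centroids.get(label, (0, [0.0] * len(x)))
        count += 1
        mean = [m + (xi - m) / count for m, xi in zip(mean, x)]
        self.centroids[label] = (count, mean)

    def predict(self, x):
        if not self.centroids:
            return "unknown"
        return min(self.centroids,
                   key=lambda label: math.dist(x, self.centroids[label][1]))


def classify_stream(samples, detector, typer):
    """Run both steps over periodically collected metric samples."""
    results = []
    for x in samples:
        if detector.is_anomalous(x):
            results.append(typer.predict(x))
        else:
            results.append("normal")
            detector.update(x)  # only normal samples refine the baseline
    return results
```

In this sketch, a sample that the first step considers normal updates the baseline, and only samples flagged as anomalous reach the second step. The second step is given a few labelled examples up front (for instance one each for "deadlock" and "livelock"), which stands in for the expert input; how the paper itself obtains labels with minimal expert intervention is not described in the abstract.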