Dr Javier Alvarez receives his PhD with a thesis entitled Adaptive Anomalous Behavior Identification in Large-Scale Distributed Systems. Dr Alvarez’ thesis also receives a Dean’s Commendation for Research Excellence.
Thesis abstract: Distributed systems have become pervasive. From laptops and mobile phones, to servers and data centers, most computers communicate and coordinate their activities through some kind of network. Moreover, many economic and commercial activities of today’s society rely on distributed systems. Examples range from widely used large-scale web services such as Google or Facebook, to enterprise networks and banking systems. However, as distributed systems become larger, more complex, and more pervasive, the probability of failures or malicious activities also increases.
The negative effects of failures in distributed systems range from economic losses, to sensitive information leaks, to loss of life in extreme cases. As an example, reports show that the cost of downtime in industry ranges from $100K to $540K per hour on average. These undesired consequences could, in many cases, be avoided with better monitoring tools that promptly inform system administrators of the presence of anomalies in the system. However, anomaly detection in distributed systems remains an open problem due to key challenges such as the difficulty of processing large amounts of information, the decentralized nature of this kind of system, the huge variety of anomalies that can appear, and the difficulty of characterizing these anomalies.
This thesis contributes a novel framework for the online detection and identification of anomalies in large-scale distributed systems. Our framework periodically collects system performance metrics, and builds a behavior characterization from these metrics in a way that maximizes the distance between normal and anomalous behaviors. Our framework then uses machine learning techniques to detect previously unseen anomalies, and to identify the type of known anomalies with high accuracy, while overcoming some of the limitations of existing work in the area. Our framework does not require historical data, which is unavailable in most real-world scenarios; performs a behavioral system analysis able to reveal a wide range of anomalies; can be employed in a plug-and-play manner without code instrumentation; adapts to changes in the system behavior; and allows for a flexible deployment that can be tailored to numerous scenarios with different requirements.
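As a rough illustration of the detect-then-identify loop described above, the following Python sketch processes metric samples one at a time. It is not the thesis's algorithm: the z-score characterization, the cosine-similarity matching, the short warm-up on samples assumed to be normal, and every name in it (`OnlineAnomalySketch`, `observe`, `label`, the thresholds, the "cpu-hog" label) are assumptions chosen only to keep the example small.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class OnlineAnomalySketch:
    """Hypothetical toy detector: running z-scores plus nearest known anomaly."""

    def __init__(self, threshold=3.0, warmup=5, min_similarity=0.8):
        self.threshold = threshold            # max |z| still considered normal
        self.warmup = warmup                  # samples assumed normal at start
        self.min_similarity = min_similarity  # cosine needed to match a known type
        self.n = 0
        self.mean = []
        self.m2 = []                          # running sum of squared deviations
        self.known = {}                       # anomaly label -> reference z-vector

    def _update(self, sample):
        # Welford's online update of the per-metric mean and variance.
        if not self.mean:
            self.mean = [0.0] * len(sample)
            self.m2 = [0.0] * len(sample)
        self.n += 1
        for i, x in enumerate(sample):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (x - self.mean[i])

    def _zscores(self, sample):
        # Characterize a sample by its deviation from the learned normal behavior.
        if self.n < 2:
            return [0.0] * len(sample)
        out = []
        for x, mu, m2 in zip(sample, self.mean, self.m2):
            std = math.sqrt(m2 / (self.n - 1))
            out.append((x - mu) / std if std > 0 else 0.0)
        return out

    def label(self, name, sample):
        """Register a sample as an example of a known anomaly type."""
        self.known[name] = self._zscores(sample)

    def observe(self, sample):
        """Return 'normal', the label of a known anomaly, or 'unknown-anomaly'."""
        if self.n < self.warmup:
            self._update(sample)
            return "normal"
        z = self._zscores(sample)
        if max(abs(v) for v in z) <= self.threshold:
            self._update(sample)  # keep adapting to gradual behavior changes
            return "normal"
        best, best_sim = "unknown-anomaly", self.min_similarity
        for name, ref in self.known.items():
            sim = cosine(z, ref)
            if sim > best_sim:
                best, best_sim = name, sim
        return best


if __name__ == "__main__":
    detector = OnlineAnomalySketch()
    # Each sample is a hypothetical (cpu_utilization, memory_utilization) pair.
    for s in [(0.50, 0.40), (0.52, 0.41), (0.48, 0.39),
              (0.51, 0.40), (0.49, 0.42), (0.50, 0.38)]:
        detector.observe(s)
    print(detector.observe((0.95, 0.40)))  # unknown-anomaly
    detector.label("cpu-hog", (0.95, 0.40))
    print(detector.observe((0.93, 0.41)))  # cpu-hog
```

Anomalous samples are deliberately excluded from the running statistics in this sketch, so that a burst of faults does not shift the notion of normal behavior; only samples judged normal are used to adapt.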
We demonstrate the applicability of our framework through an extensive experimental analysis in three scenarios with different behavioral characteristics: a distributed system that exhibits complex anomalous behaviors, a large-scale system, and a large-scale network containing malicious nodes. For each scenario, we provide a deep understanding of the system behavior from the point of view of its performance metrics, define how our framework can be employed to fulfill the scenario-specific requirements, and evaluate our framework’s performance in detecting and identifying different anomalies.