Sunday, 6 November 2016

DevOps + Machine Learning

Alice is an enthusiastic DevOps engineer who manages a cloud-based data center running a high-availability web service. She has to constantly monitor the health of the nodes and take corrective measures when problems arise.
When a failure occurs, the HA setup fails over to the backup systems, but the DevOps responsibility includes tracing the trail of the failure and preventing a repeat incident.
For this, Alice has to audit the accesses, the requests, the database queries and other parameters of the system to find anomalies.

Anomalies are patterns that do not conform to normal behavior. Normal behavior itself is subjective: it varies from one site to another and can be shaped by external events.
Anomalies are difficult to catch in software. In some scenarios humans fare much better, but as the dimensionality increases there is a need for machine learning to step in. And when a system services a high volume of data, it simply overwhelms human capacity.

The mathematics of anomaly detection is well known and proven, but choosing the data to feed into the system requires domain expertise, and sometimes even tweaks to the product engineering.

Machine learning based on these principles is starting to appear in many areas, such as:
  • Fraud Detection
  • Medical diagnosis
  • Security systems
The challenge Alice faces, though, is different: she has to treat the product implementation as a black box, and any metrics have to be gathered in a way that does not alter the behavior of the product itself.

She starts by collecting different metrics that together make up a time series:
  • Number of requests
  • CPU Load
  • Network Load
In addition, she collects attributes of the data itself that might influence one another; this gives her context. For example: the payload length of a request, the HTTP method, authentication state, response code, and response length.
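As a sketch, these metrics and contextual attributes can be captured in a simple record and flattened into a feature vector for a detection model. The field names and the one-hot encoding below are illustrative assumptions, not taken from any particular product:

```python
from dataclasses import dataclass

@dataclass
class RequestSample:
    """One point in the monitoring time series, with contextual attributes.
    All field names here are illustrative, not from a real system."""
    timestamp: float
    requests_per_sec: float
    cpu_load: float
    network_kbps: float
    payload_length: int
    http_method: str
    authenticated: bool
    response_code: int
    response_length: int

def to_feature_vector(s: RequestSample) -> list:
    """Flatten a sample into numeric features a detection model can consume."""
    return [
        s.requests_per_sec,
        s.cpu_load,
        s.network_kbps,
        float(s.payload_length),
        1.0 if s.http_method == "POST" else 0.0,  # crude one-hot for method
        1.0 if s.authenticated else 0.0,
        float(s.response_code),
        float(s.response_length),
    ]
```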

Then there are kinds of data that would be extremely hard for her to collect. If she were monitoring traffic to an e-commerce website that sells Indian cricket team merchandise, she would see huge jumps or falls in transactions depending on how the team was performing, or in the run-up to a world cup.

If she wants to go a step further, she can collect more attributes for each state in the time series, such as:
  • Request to response time
  • System Load
  • Database Load
  • Concurrency
  • Memory in use
Using these she hopes to detect:
  • Point Anomaly: a single data point that is unusual compared to the rest of the points; the classic outlier.
  • Context Anomaly: a point that looks normal in isolation but is unusual in its context (for example, a traffic level that is normal at noon but suspicious at 3 a.m.).
  • Collective Anomaly: similar to a point anomaly, but a whole collection of points or attributes is unusual together (more than one dimension).
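The simplest of the three, the point anomaly, can be caught with a plain z-score test. This is a minimal sketch; the 3-sigma threshold is a common convention, not something the post prescribes:

```python
from statistics import mean, stdev

def point_anomalies(series, threshold=3.0):
    """Return indices of values more than `threshold` standard
    deviations away from the mean of the series."""
    mu = mean(series)
    sigma = stdev(series)
    if sigma == 0:
        return []  # a perfectly flat series has no outliers
    return [i for i, x in enumerate(series) if abs(x - mu) / sigma > threshold]
```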
There are two main techniques that Alice can choose from:
  • Deterministic thresholds work much like medical monitoring machines, where a line graph tracking the pulse raises an alarm on crossing a threshold. The problem is that the number of attributes to monitor can be extremely high, so setting thresholds becomes difficult.
  • Statistical models use correlation, where the relations between metrics are determined statistically (for example, with the Pearson correlation coefficient). There has been a spurt of interest in this method, as explained in this medical example or network data example.
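The statistical approach can be sketched with a plain Pearson coefficient: if two metrics such as request rate and CPU load normally move together, a drop in their correlation over a recent window is itself an anomaly signal. The 0.8 expected-correlation value below is an illustrative assumption:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    if vx == 0 or vy == 0:
        return 0.0  # one series is constant; correlation undefined
    return cov / (vx * vy)

def correlation_break(requests, cpu, expected=0.8):
    """Flag an anomaly when two normally-correlated metrics decouple."""
    return pearson(requests, cpu) < expected
```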
Alice knows, though, that two challenges need to be addressed before the system can be relied on for industrial adoption.

Moving Normals
The best time-series models are those that can learn online and adjust their definition of normal behavior over time.
Suppose Alice sets up the system and it works very well, but then the EC2 instance is moved to a smaller instance type to save cost. CPU usage would shoot up and would need a new normal. A more everyday case is the baseline traffic on a Sunday compared to the surge on a Monday.
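One way to let "normal" drift with the data is an exponentially weighted moving average and variance. This is a minimal sketch of such online baseline adaptation; the alpha value and the 3-sigma band are illustrative choices:

```python
class AdaptiveBaseline:
    """Exponentially weighted moving mean/variance, so the definition of
    'normal' drifts with the data. alpha controls how quickly the
    baseline forgets the old normal."""

    def __init__(self, alpha=0.1, z_threshold=3.0):
        self.alpha = alpha
        self.z = z_threshold
        self.mean = None
        self.var = 0.0

    def update(self, x):
        """Return True if x is anomalous vs the current baseline, then adapt."""
        if self.mean is None:
            self.mean = x  # first observation defines the initial normal
            return False
        diff = x - self.mean
        std = self.var ** 0.5
        anomalous = std > 0 and abs(diff) > self.z * std
        # Adapt even on anomalies, so a persistent shift (like a smaller
        # EC2 instance) gradually becomes the new normal.
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return anomalous
```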

Workflows

Workflows can give very useful insights into normal behavior, and therefore help detect abnormal behavior. The order in which we do things can be an extremely useful signal.
If we said the exact same words as in the sentence above but in a different order, it would make no sense, even though it would carry all the same information. So even though Alice has all the information, this might be why she has not yet hit on the perfect solution.

For example, being able to trace every database query back to the originating API would give the detection engine more context.
There are some web frameworks that allow this behavior like explained in this Django framework example.
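A framework-agnostic sketch of the idea, using Python's contextvars to tag each database query with the endpoint that triggered it; the function names here are hypothetical illustrations, not Django APIs:

```python
import contextvars

# The endpoint currently being served; "unknown" outside any request.
current_endpoint = contextvars.ContextVar("current_endpoint", default="unknown")

query_log = []  # (endpoint, sql) pairs the detection engine could consume

def handle_request(endpoint, handler):
    """Hypothetical web-framework entry point: remember which endpoint
    is running while its handler executes."""
    token = current_endpoint.set(endpoint)
    try:
        return handler()
    finally:
        current_endpoint.reset(token)

def log_query(sql):
    """Hypothetical database-layer hook: record the query together with
    the endpoint that caused it."""
    query_log.append((current_endpoint.get(), sql))

# Queries issued inside the handler are attributed to /orders.
handle_request("/orders", lambda: log_query("SELECT * FROM orders"))
```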

Anomaly detection is, in essence, an unsupervised learning problem: the numbers are fed in, and if they actually capture the state of the system, there is a good chance an anomaly will be detected. But how should its performance be measured?
  • False Positive / False Negative Rate
  • Detection turnaround time
  • Cost of failure
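The first of these can be computed directly from predicted flags and ground-truth labels. A minimal sketch; detection turnaround time and cost of failure need timestamps and business context that this toy example omits:

```python
def evaluation_metrics(predicted, actual):
    """Compare predicted anomaly flags against ground-truth labels.
    Both arguments are equal-length boolean sequences."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    return {
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }
```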
The industry trend over the past few years has been wider adoption of machine learning techniques for unusual problems, and DevOps will be an area of hot research. The volume of data to process and the need for real-time analysis will be key motivations, while choosing the features to capture will remain the biggest hurdle in the next stage.

Enjoyed reading this? Want to know more about some of our expert solutions? Write to me at: /

