Thursday, 27 July 2017

Setting up Alert Manager

In the last post of the monitoring-with-Prometheus series, we saw how to install and run Prometheus and the node exporter.


Alert Rules?
Now our node exporters are running on port 9100. You can open http://IP:9100/metrics in a browser and look at a few of the metrics, e.g. node_cpu, which is what we will use to derive the machine's CPU usage.
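For example, you can pull the raw metrics from the exporter on the command line and filter for the CPU series (a quick check, assuming the exporter is reachable on port 9100; replace IP with your host):

curl -s http://IP:9100/metrics | grep '^node_cpu'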


Once a machine is being monitored, the important thing is to get alerts on specific resource usage. Say, when CPU usage rises above 80%, we should get an alert so that we can take some preventive action. We can set alert rules to achieve this.


In the prometheus folder, you can create a new file with any name; here we are creating "alert.rules" as below.


   description = "This device's memory usage has exceeded the threshold with a value of {{ $value }}.",
 }


ALERT filesystem_threshold_exceeded
 IF 100 *(1 - (node_filesystem_free{mountpoint="/"}  / node_filesystem_size{ mountpoint="/"}) ) > 80
 LABELS {severity="CRITICAL"}
 ANNOTATIONS {
   summary = "{{ $labels.group }}-{{ $labels.instance }} High filesystem usage is detected",
   description = "This device's filesystem usage has exceeded the threshold with a value of {{ $value }}.",
 }


In the above file we have set basic alert rules to watch whether the instance is up, plus CPU, memory, and filesystem usage.
Let's look at this file in detail.


ALERT InstanceDown    
 IF up == 0
 FOR 10m
 LABELS { severity = "CRITICAL" }
 ANNOTATIONS {
   summary = "Instance down",
   description = "{{ $labels.group }}-{{$labels.instance}} - instance has been down for more than 10 minute.",
 }


The first line defines the alert name.
The second line's IF condition is a Prometheus query that checks whether the instance is up and running. If the instance stays down for 10 minutes, Prometheus will fire an alert for it. Now we need to send this alert by email; let's see how to do it.
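Before wiring the rules into Prometheus, you can sanity-check the expression by hand through the HTTP API (assuming Prometheus is listening on its default port 9090; replace IP with your host):

curl -s 'http://IP:9090/api/v1/query?query=up'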
First we need to configure the alert rules in the Prometheus server. In the prometheus.yml file we will make a small modification for this:


global:
 evaluation_interval: 15s
 external_labels:
   monitor: 'Prometheus_Alerts'
 scrape_interval: 15s
rule_files:
 - "alert.rules"
We have added the rule_files section and mentioned our alert.rules file. We can list as many rule files here as we need.
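Optionally, you can validate both files before reloading, using the promtool binary shipped in the Prometheus 1.x tarball (run from the prometheus folder; treat this as an optional check):

./promtool check-config prometheus.yml
./promtool check-rules alert.rules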


Now, as we have changed the configuration file, we need to reload Prometheus with the below curl command:


curl -X POST http://IP:9090/-/reload
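Alternatively, sending a SIGHUP signal to the Prometheus process reloads the configuration as well (a one-liner, assuming pgrep is available and the binary is named prometheus):

kill -HUP $(pgrep -x prometheus)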
You can see the alerts in the Prometheus web UI at http://IP:9090/alerts




What is AlertManager?
Alertmanager can be configured to send Prometheus alerts to your mailbox or Slack, trigger automated calls in critical situations, and so on.
Prometheus with Alertmanager is a perfect match for monitoring your infrastructure and getting notified instantly at an early stage, so that we can take a look before any mishap. You can monitor CPU usage, memory, availability, and other instance metrics.


Though we set up node exporters and Prometheus for the infrastructure, we cannot keep watching all the values ourselves to monitor the instances. Therefore we need an alerting tool which will send alerts to specified receivers when a metric crosses its threshold.


In this post we will cover the installation of Alertmanager and configure it to send alerts to your email IDs. For the Prometheus and node exporter setup, refer to the previous post.


Setup Alertmanager


wget https://github.com/prometheus/alertmanager/releases/download/v0.8.0/alertmanager-0.8.0.linux-amd64.tar.gz


tar -xvf alertmanager-0.8.0.linux-amd64.tar.gz
mv alertmanager-0.8.0.linux-amd64 alertmanager


cd alertmanager


Sample of alertmanager.yml


global:
route:
  receiver: default
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 4h
  routes:
  - receiver: email-QA
    match:
      group: 'QA'
  - receiver: email-prod
    match:
      group: 'Production'
receivers:
- name: "default"
  email_configs:
  - to: 'default.receivers@example.com'
    from: 'SMTP_username'
    smarthost: 'SMTP_server_address:PORT'
    auth_username: 'SMTP_username'
    auth_identity: 'SMTP_username'
    auth_password: 'password'
- name: "email-QA"
  email_configs:
  - to: 'QA_receivers@example.com'
    from: 'SMTP_username'
    smarthost: 'SMTP_server_address:PORT'
    auth_username: 'SMTP_username'
    auth_identity: 'SMTP_username'
    auth_password: 'password'
- name: "email-prod"
  email_configs:
  - to: 'Prod_operation@example.com'
    from: 'SMTP_username'
    require_tls: true
    smarthost: 'SMTP_server_address:PORT'
    auth_username: 'SMTP_username'
    auth_identity: 'SMTP_username'
    auth_password: 'password'


In the above file, replace the mailbox details with your own. As you can see, we have split the email receivers per environment: for the QA group of machines we can have one set of receivers, and for production we can add the support groups. This helps in segregating emails and in deciding priorities for the different alerts. We could also share one set of SMTP settings instead of repeating the same details for each receiver; for demonstration purposes we have shown that different mailbox details can be used within the same file.
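Note that the routes above match on a group label, which has to be attached to the targets on the Prometheus side. A minimal sketch of what that could look like in prometheus.yml, where the host names and the QA/Production split are placeholders for your own targets:

scrape_configs:
- job_name: 'node'
  static_configs:
  - targets: ['QA_HOST:9100']
    labels:
      group: 'QA'
  - targets: ['PROD_HOST:9100']
    labels:
      group: 'Production'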
 repeat_interval: 4h


This makes sure that a notification for the same alert is re-sent only after 4 hours, so that the support team does not receive the same alerts again and again.


Now , Let’s Run Alert Manager
./alertmanager -config.file=alertmanager.yml &


Alertmanager runs on port 9093. You can open this port in the security group, and from a browser you can access http://IP:9093
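You can also quickly confirm from the command line that it is up and serving HTTP before opening the browser (prints the HTTP status code; replace IP with your host):

curl -s -o /dev/null -w "%{http_code}\n" http://IP:9093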


After Alertmanager has started successfully, we need to re-run Prometheus as below.


First, stop the running Prometheus process (as we had not provided the Alertmanager URL in the previous post).
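One way to do that, assuming Prometheus was started in the background as a process named prometheus:

pkill -x prometheus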


Now restart it, pointing it at Alertmanager:
./prometheus -alertmanager.url=http://IP:9093 &


You will start getting alert notifications in your mailbox. The Prometheus node exporter provides instance-level metrics, but we also want to monitor the various processes in our application, so our next step will be application monitoring, to get metrics for each process with the same setup.
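To verify the whole pipeline end to end, one simple (and deliberately disruptive) test, assuming node_exporter runs as a background process on a non-production target, is to stop it and wait out the 10-minute window of the InstanceDown rule; an email for that instance should then arrive:

pkill -x node_exporter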

In next post we will see how to monitor process with prometheus.
