In the last post of the Monitoring with Prometheus series, we saw how to install and run Prometheus and Node Exporter.
Alert Rules
Now our Node Exporters are running on port 9100. You can open http://IP:9100/metrics in a browser and try a few metrics to see their values, e.g. node_cpu, which gives us CPU time counters for the machine.
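Each metric on the /metrics page is a plain-text line of name, labels, and value. As a rough sketch of what that format looks like, here is a minimal Python parser for one such line (the sample value is made up):

```python
import re

def parse_sample(line):
    """Split a 'name{labels} value' exposition line into (name, labels dict, float value)."""
    m = re.match(r'(\w+)\{(.*)\}\s+(\S+)', line)
    name, label_str, value = m.group(1), m.group(2), float(m.group(3))
    labels = dict(re.findall(r'(\w+)="([^"]*)"', label_str))
    return name, labels, value

# A made-up sample line in the Prometheus text exposition format.
name, labels, value = parse_sample('node_cpu{cpu="cpu0",mode="idle"} 12345.6')
print(name, labels["mode"], value)  # node_cpu idle 12345.6
```

This only handles the simple `name{labels} value` case, which is enough to see how label pairs like `mode="idle"` end up attached to each sample.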
Once a machine is being monitored, the important thing is to get alerts on specific resource usage. Say, when CPU usage rises to 80%, we should get an alert so that we can take preventive action. We can set alert rules to achieve this.
In the Prometheus folder, create a new file with any name; here we are creating alert.rules as below:
ALERT InstanceDown
  IF up == 0
  FOR 10m
  LABELS { severity = "CRITICAL" }
  ANNOTATIONS {
    summary = "Instance down",
    description = "{{ $labels.group }}-{{ $labels.instance }} - instance has been down for more than 10 minutes.",
  }

ALERT NodeCPUUsage
  IF (100 - (avg by (instance) (irate(node_cpu{job="node",mode="idle"}[1m])) * 100)) > 75
  FOR 2m
  LABELS { severity = "CRITICAL" }
  ANNOTATIONS {
    summary = "{{ $labels.group }}-{{ $labels.instance }}: High CPU usage detected",
    description = "CPU usage is above 75% (current value is: {{ $value }})",
  }

# Fires when available memory (free + cached + buffers) drops below ~7000 MB.
ALERT mem_threshold_exceeded
  IF (node_memory_MemFree{job='node'} + node_memory_Cached{job='node'} + node_memory_Buffers{job='node'}) / 1000000 < 7000
  LABELS { severity = "CRITICAL" }
  ANNOTATIONS {
    summary = "{{ $labels.group }}-{{ $labels.instance }} High memory usage detected",
    description = "This device's memory usage has exceeded the threshold with a value of {{ $value }}.",
  }

ALERT filesystem_threshold_exceeded
  IF 100 * (1 - (node_filesystem_free{mountpoint="/"} / node_filesystem_size{mountpoint="/"})) > 80
  LABELS { severity = "CRITICAL" }
  ANNOTATIONS {
    summary = "{{ $labels.group }}-{{ $labels.instance }} High filesystem usage detected",
    description = "This device's filesystem usage has exceeded the threshold with a value of {{ $value }}.",
  }
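The arithmetic behind the filesystem rule is easy to check by hand. A small Python sketch with hypothetical byte counts (the function name and numbers are illustrative, not part of Prometheus):

```python
def filesystem_usage_percent(free_bytes, size_bytes):
    """Mirrors the rule's expression: 100 * (1 - free / size)."""
    return 100 * (1 - free_bytes / size_bytes)

# Hypothetical numbers: 15 GB free on a 100 GB root filesystem.
usage = filesystem_usage_percent(15e9, 100e9)
print(usage > 80)  # about 85% used, above the 80% threshold, so the alert would fire
```

The CPU rule works the same way in reverse: it averages the per-second rate of the idle counter and subtracts it from 100 to get the busy percentage.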
In the above file we have set basic alert rules to watch whether an instance is up and to track CPU, memory, and filesystem usage.
Let's look at the first rule in detail:
ALERT InstanceDown
  IF up == 0
  FOR 10m
  LABELS { severity = "CRITICAL" }
  ANNOTATIONS {
    summary = "Instance down",
    description = "{{ $labels.group }}-{{ $labels.instance }} - instance has been down for more than 10 minutes.",
  }
The first line defines the alert name.
The IF condition on the second line is a Prometheus query that checks whether the instance is up and running. If the instance stays down for 10 minutes, Prometheus fires the alert. Next we need to send this alert by email; let's see how to do that.
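The FOR clause means the condition must hold continuously for the whole duration before the alert fires; until then Prometheus keeps it "pending". A rough Python sketch of that state machine (the class and names are illustrative, not Prometheus internals):

```python
FOR_DURATION = 600  # seconds, i.e. the rule's 10m

class PendingAlert:
    def __init__(self):
        self.active_since = None  # timestamp when the condition first became true

    def evaluate(self, condition_true, now):
        """Return 'inactive', 'pending', or 'firing' for this evaluation cycle."""
        if not condition_true:
            self.active_since = None      # condition cleared: reset the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now       # condition just became true
        if now - self.active_since >= FOR_DURATION:
            return "firing"               # down for 10m or more: alert fires
        return "pending"

alert = PendingAlert()
print(alert.evaluate(True, 0))    # pending
print(alert.evaluate(True, 300))  # pending (only 5m so far)
print(alert.evaluate(True, 600))  # firing
```

Note that a single successful scrape resets the timer, which is exactly why FOR avoids paging on brief network blips.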
First we need to register the alert rules with the Prometheus server. In prometheus.yml we make a small modification for this configuration:
global:
  evaluation_interval: 15s
  external_labels:
    monitor: 'Prometheus_Alerts'
  scrape_interval: 15s

rule_files:
  - "alert.rules"
We have added the rule_files section and listed our alert.rules file; any number of rule files can be added here.
Since we have changed the configuration file, we need to reload Prometheus with the curl command below:
curl -X POST http://IP:9090/-/reload
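The same reload can be triggered from code. A minimal sketch using only the Python standard library (the helper name is made up; replace the host with your server's address):

```python
from urllib import request

def build_reload_request(host, port=9090):
    """Build the POST request that asks Prometheus to reload its configuration."""
    return request.Request(f"http://{host}:{port}/-/reload", method="POST")

req = build_reload_request("localhost")
print(req.get_method(), req.full_url)  # POST http://localhost:9090/-/reload
# Send it with request.urlopen(req) once the server is up.
```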
What is Alertmanager?
Even with Node Exporter and Prometheus in place for our infrastructure, we cannot keep checking every value on every instance ourselves. We need an alerting tool that sends alerts to specified receivers when a metric crosses its threshold.
Alertmanager can be configured to send Prometheus alerts to your mailbox or Slack, or even to trigger automated calls in critical situations. Prometheus with Alertmanager is a perfect match for monitoring your infrastructure and getting notified instantly at an early stage, before any mishap: you can watch CPU usage, memory, availability, and other instance metrics.
In this post we will cover installing Alertmanager and configuring it to send alerts to your email addresses. For the Prometheus and Node Exporter setup, refer to the previous post.
Setup Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.8.0/alertmanager-0.8.0.linux-amd64.tar.gz
tar -xvf alertmanager-0.8.0.linux-amd64.tar.gz
mv alertmanager-0.8.0.linux-amd64 alertmanager
cd alertmanager
A sample alertmanager.yml:
global:

route:
  receiver: default
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 4h
  routes:
    - receiver: email-QA
      match:
        group: 'QA'
    - receiver: email-prod
      match:
        group: 'Production'

receivers:
  - name: "default"
    email_configs:
      - to: 'default.receivers@example.com'
        from: 'SMTP_username'
        smarthost: 'SMTP_server_address:PORT'
        auth_username: 'SMTP_username'
        auth_identity: 'SMTP_username'
        auth_password: 'password'
  - name: "email-QA"
    email_configs:
      - to: 'QA_receivers@example.com'
        from: 'SMTP_username'
        smarthost: 'SMTP_server_address:PORT'
        auth_username: 'SMTP_username'
        auth_identity: 'SMTP_username'
        auth_password: 'password'
  - name: "email-prod"
    email_configs:
      - to: 'Prod_operation@example.com'
        from: 'SMTP_username'
        require_tls: true
        smarthost: 'SMTP_server_address:PORT'
        auth_username: 'SMTP_username'
        auth_identity: 'SMTP_username'
        auth_password: 'password'
In the file above, replace the mailbox details with your own. As you can see, we have split the email receivers by environment: QA machines get one set of receivers, while production alerts go to the support groups. This helps segregate emails and decide priorities between them. Shared SMTP settings could also go in the global section instead of being repeated for each receiver; for demonstration purposes we have shown that different mailbox details can be used within the same file.
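The routing above is simply first-match on the alert's labels. A simplified Python sketch of that dispatch (the label values mirror the config; the function itself is illustrative, not Alertmanager's code):

```python
# First-match routing, like the `routes` block above: the first route whose
# `match` labels are all present on the alert wins, otherwise the default receiver.
ROUTES = [
    ("email-QA",   {"group": "QA"}),
    ("email-prod", {"group": "Production"}),
]
DEFAULT_RECEIVER = "default"

def pick_receiver(alert_labels):
    for receiver, match in ROUTES:
        if all(alert_labels.get(k) == v for k, v in match.items()):
            return receiver
    return DEFAULT_RECEIVER

print(pick_receiver({"group": "QA", "instance": "10.0.0.5:9100"}))  # email-QA
print(pick_receiver({"group": "Staging"}))                          # default
```

This is why the `group` label we attach in the alert rules matters: it is what steers each alert to the right team's mailbox.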
repeat_interval: 4h
This ensures that the next alert for the same rule is triggered only after 4 hours, so the support team does not receive the same alert again and again.
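In effect, Alertmanager remembers when it last notified for a group and stays quiet until repeat_interval has elapsed. A toy sketch of that throttling (timestamps in seconds; the names are illustrative):

```python
REPEAT_INTERVAL = 4 * 3600  # 4h, as in the config above

last_sent = {}  # alert group -> timestamp of the last notification

def should_notify(group, now):
    """Send at most one notification per group per repeat_interval."""
    sent_at = last_sent.get(group)
    if sent_at is None or now - sent_at >= REPEAT_INTERVAL:
        last_sent[group] = now
        return True
    return False

print(should_notify("InstanceDown", 0))         # True  - first notification
print(should_notify("InstanceDown", 3600))      # False - only 1h has passed
print(should_notify("InstanceDown", 4 * 3600))  # True  - 4h elapsed, notify again
```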
Now, let's run Alertmanager:
./alertmanager -config.file=alertmanager.yml &
Alertmanager runs on port 9093. Open this port in your security group and you can access it from the browser at http://IP:9093
After Alertmanager is running successfully, we need to restart Prometheus and point it at Alertmanager, as below (the -alertmanager.url flag is for Prometheus 1.x):
./prometheus -config.file=prometheus.yml -alertmanager.url=http://localhost:9093 &
You will start getting alert notifications in your mailbox. Prometheus Node Exporter provides instance-level metrics, but we also want to monitor the various processes in our application, so our next step will be application monitoring: getting metrics for each process with the same setup.
In the next post we will see how to monitor processes with Prometheus.