Prometheus Cheatsheet

Prometheus is an open-source monitoring and alerting system that collects and processes metrics from sources such as servers, applications, and services. It was originally developed at SoundCloud and is now part of the Cloud Native Computing Foundation (CNCF).

It uses a pull-based model: the server scrapes metrics over HTTP from configured targets and stores them in a time series database. Its own query language, PromQL, is used to analyze and visualize the stored data.

Prometheus evaluates alerting rules and hands the resulting alerts to Alertmanager, a companion component that sends notifications to channels such as email, Slack, and PagerDuty when certain conditions are met. It also integrates with third-party tools such as Grafana for visualization.

Prometheus is a popular choice for monitoring containerized environments such as Kubernetes due to its scalability, flexibility, and compatibility with various data sources. It can also be used to monitor traditional infrastructure such as servers, databases, and networks.

Prometheus Setup

Install

Update

Prerequisites:

Ensure you have sudo access on the server (commands require elevated privileges)
Prometheus configuration files: /etc/prometheus
Prometheus executables: /usr/local/bin
Prometheus systemd service: prometheus.service

# Show the current Prometheus version
prometheus --version

Stop Prometheus services:

# Stop the running Prometheus service until the update is complete
systemctl stop prometheus.service

Setup Prometheus binaries:

# Download the Prometheus archive
wget https://github.com/prometheus/prometheus/releases/download/v<VERSION>/prometheus-<VERSION>.linux-amd64.tar.gz

# Unpack the archive
tar -xvf prometheus-<VERSION>.linux-amd64.tar.gz

Note: Refer to the official Prometheus Download page to get the latest stable version for the Linux binary. Release tags carry a v prefix, while the archive name does not.

Update Prometheus:

# Copy the executables and change the ownership
cp prometheus-<VERSION>.linux-amd64/{prometheus,promtool} /usr/local/bin
chown prometheus:prometheus /usr/local/bin/{prometheus,promtool}

# Copy the console libraries and change the ownership
cp -r prometheus-<VERSION>.linux-amd64/{consoles,console_libraries} /etc/prometheus
chown -R prometheus:prometheus /etc/prometheus

Verify and restart services:

# Verify the content of the Prometheus data directory
ls -l /var/lib/prometheus

# Restart the services
systemctl daemon-reload
systemctl start prometheus.service
systemctl enable prometheus.service

# Check the status of the service
systemctl status prometheus.service

Verify the Prometheus version:

# Show the current Prometheus version
prometheus --version

Metrics

Types of metrics

Counter: A metric that represents a single monotonically increasing counter whose value can only increase or be reset to zero.
Gauge: A metric that represents a single value that can arbitrarily go up and down.
Histogram: A metric that samples observations and counts them in configurable buckets. It also provides a sum of all observed values and a count of observations.
Summary: Similar to a histogram, a summary samples observations, but instead of counting them in buckets, it calculates configurable quantiles over a sliding time window.

Metric naming conventions

Use lowercase letters and separate words with underscores.
Use meaningful and consistent names.
Start metric names with the name of the service or application.
Use labels to provide additional context to a metric.

Metric syntax

metric_name{instance="instance", job="job", label_name="label_value", ...}          metric_value

Note: See Instrumentation best practices
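
As a rough illustration of the syntax above, the following Python sketch renders a single sample line. The function name format_sample and the example metric and label values are hypothetical, and only the basic escaping of label values is handled.

```python
def format_sample(name, labels, value):
    """Render one sample as: name{label="value",...} value"""
    def escape(text):
        # Backslash, double quote and newline are escaped inside label values.
        return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    if labels:
        pairs = ",".join(f'{key}="{escape(val)}"' for key, val in sorted(labels.items()))
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


line = format_sample(
    "http_requests_total",
    {"job": "api", "instance": "10.0.0.1:8080", "method": "GET"},
    1027,
)
print(line)
```

Labels are sorted here only to make the output deterministic; Prometheus does not require a particular label order.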

PromQL

Selecting Metrics

# Select latest sample for series with a given metric name:
node_cpu_seconds_total

# Select 5-minute range of samples for series with a given metric name:
node_cpu_seconds_total[5m]

# Select only series with a given metric name and label values:
node_cpu_seconds_total{cpu="0",mode="idle"}

# Select data from one day ago and shift it to the current time:
process_virtual_memory_bytes offset 1d

Note: label matchers (=: equality, !=: non-equality, =~: regex match, !~: negative regex match)
node_cpu_seconds_total{cpu!="1",mode=~"user|system"}
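
Selectors like these can also be sent to the server's HTTP API. The sketch below assumes a Prometheus server reachable at http://localhost:9090 (an assumption, adjust as needed) and uses the /api/v1/query endpoint for instant queries; the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build the URL for an instant query against /api/v1/query."""
    query = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query


def parse_vector(payload):
    """Turn an instant-vector response into (labels, float value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    return [(item["metric"], float(item["value"][1])) for item in payload["data"]["result"]]


def instant_query(base_url, promql, timeout=10):
    """Run the query over HTTP; needs a reachable Prometheus server."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_vector(json.load(resp))


# Only the URL is built here, so the example runs without a server.
url = build_query_url("http://localhost:9090", 'node_cpu_seconds_total{cpu="0",mode="idle"}')
print(url)
```

Calling instant_query("http://localhost:9090", "up") would return one (labels, value) pair per matching series when a server is available.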

Rates of increase for counters

# Per-second rate of increase, averaged over last 5 minutes:
rate(demo_api_request_duration_seconds_count[5m])

# Per-second rate of increase, calculated over last two samples in a 1-minute time window:
irate(demo_api_request_duration_seconds_count[1m])

# Absolute increase over last hour:
increase(demo_api_request_duration_seconds_count[1h])
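
To show why these functions are safe to use on counters that restart, here is a simplified Python sketch of the idea. It treats any drop in value as a counter reset, and it deliberately skips the extrapolation to the window boundaries that PromQL applies, so real results will differ slightly. The function names are hypothetical.

```python
def simple_increase(samples):
    """Total increase over a window of (timestamp, value) counter samples.

    A drop in value is treated as a counter reset, so the counter is assumed
    to have restarted from zero. No extrapolation is applied.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


def simple_rate(samples):
    """Per-second rate between the first and last sample in the window."""
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    elapsed = samples[-1][0] - samples[0][0]
    return simple_increase(samples) / elapsed


# Scraped every 15 s; the process restarts between t=30 and t=45.
window = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 20.0), (60, 50.0)]
print(simple_increase(window))  # 30 + 30 + 20 + 30 = 110.0
print(simple_rate(window))      # 110 divided by 60 seconds
```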

Aggregating over multiple series

Note: Available aggregation operators (sum(), min(), max(), avg(), stddev(), stdvar(), count(), count_values(), group(), bottomk(), topk(), quantile())

# Sum over all series:
sum(node_filesystem_size_bytes)

# Preserve the instance and job label dimensions:
sum by(job, instance) (node_filesystem_size_bytes)

# Aggregate away the instance and job label dimensions:
sum without(instance, job) (node_filesystem_size_bytes)
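
The difference between by and without can be sketched in plain Python. This is an illustration of the grouping semantics only, with hypothetical helper names and made-up sample values; series are modelled as (labels, value) pairs without the metric name, since aggregation drops it.

```python
from collections import defaultdict


def sum_by(series, keep):
    """Sum values, grouping on the labels in `keep` (like sum by(...))."""
    groups = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k in keep))
        groups[key] += value
    return dict(groups)


def sum_without(series, drop):
    """Sum values, grouping on every label except those in `drop`."""
    groups = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        groups[key] += value
    return dict(groups)


series = [
    ({"job": "node", "instance": "a:9100", "mountpoint": "/"}, 100.0),
    ({"job": "node", "instance": "a:9100", "mountpoint": "/data"}, 400.0),
    ({"job": "node", "instance": "b:9100", "mountpoint": "/"}, 200.0),
]
by_instance = sum_by(series, {"job", "instance"})
without_instance = sum_without(series, {"job", "instance"})
print(by_instance)       # one total per (instance, job) pair
print(without_instance)  # one total per mountpoint
```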

Math between series

Note: Available arithmetic operators (+, -, *, /, %, ^)

# Add all equally-labelled series from both sides:
node_memory_MemFree_bytes + node_memory_Cached_bytes

# Add series, matching only on the instance and job labels:
node_memory_MemFree_bytes + on(instance, job) node_memory_Cached_bytes

# Add series, ignoring the instance and job labels for matching:
node_memory_MemFree_bytes + ignoring(instance, job) node_memory_Cached_bytes

# Explicitly allow many-to-one matching:
rate(demo_cpu_usage_seconds_total[1m]) / on(instance, job) group_left demo_num_cpus

# Include the version label from "one" (right) side in the result:
node_filesystem_avail_bytes * on(instance, job) group_left(version) node_exporter_build_info

Filtering series by value

Note: Available comparison operators (==, !=, >, <, >=, <=)

# Only keep series with a sample value greater than a given number:
node_filesystem_avail_bytes > 10*1024*1024

# Only keep series from the left-hand side whose sample values are larger than their right-hand-side matches:
go_goroutines > go_threads

# Instead of filtering, return 0 or 1 for each compared series:
go_goroutines > bool go_threads

# Match only on specific labels:
go_goroutines > bool on(job, instance) go_threads

Set operations

# Include any label sets that are either on the left or right side:
up{job="prometheus"} or up{job="node"}

# Include any label sets that are present both on the left and right side:
node_network_mtu_bytes and (node_network_address_assign_type == 0)

# Include any label sets from the left side that are not present in the right side:
node_network_mtu_bytes unless (node_network_address_assign_type == 1)

# Match only on specific labels:
node_network_mtu_bytes and on(device) (node_network_address_assign_type == 0)

Quantiles from histograms

# 90th percentile request latency over last 5 minutes, for every label dimension:
histogram_quantile(0.9, rate(demo_api_request_duration_seconds_bucket[5m]))

# 90th percentile request latency over last 5 minutes, for only the path and method dimensions:
histogram_quantile(
  0.9,
  sum by(le, path, method) (
    rate(demo_api_request_duration_seconds_bucket[5m])
  )
)
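
The estimate behind histogram_quantile() can be sketched as linear interpolation inside the bucket that contains the requested rank. The Python version below is a simplified illustration for classic cumulative buckets, assuming 0 < q <= 1, at least one finite bucket, and non-negative bucket bounds; it is not the server's implementation, and the bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    Observations are assumed to be spread evenly inside each bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if index == len(buckets) - 1:
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper = buckets[index][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    in_bucket = buckets[index][1] - below
    return lower + (upper - lower) * (rank - below) / in_bucket


# Cumulative counts per "le" bound, as a rate() over _bucket series would give.
buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]
print(histogram_quantile(0.75, buckets))
print(histogram_quantile(0.99, buckets))
```

Because the result is interpolated, its accuracy depends on how well the bucket boundaries fit the observed latencies.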

Changes in gauges

# Per-second derivative using linear regression:
deriv(demo_disk_usage_bytes[1h])

# Absolute change in value over last hour:
delta(demo_disk_usage_bytes[1h])

# Predict value in 1 hour, based on last 4 hours:
predict_linear(demo_disk_usage_bytes[4h], 3600)
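
Both deriv() and predict_linear() rest on a least-squares line fitted through the samples in the range. The following Python sketch shows that idea under simplifying assumptions (at least two samples with distinct timestamps, made-up values); the helper names mirror the PromQL functions but are otherwise hypothetical.

```python
def linear_regression(samples):
    """Least-squares slope and intercept for (timestamp, value) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    return slope, mean_v - slope * mean_t


def deriv(samples):
    """Per-second change of the fitted line."""
    return linear_regression(samples)[0]


def predict_linear(samples, now, seconds_ahead):
    """Value of the fitted line `seconds_ahead` seconds after `now`."""
    slope, intercept = linear_regression(samples)
    return slope * (now + seconds_ahead) + intercept


# Disk usage sampled hourly over four hours, growing by 360 units per hour.
usage = [(0, 1000.0), (3600, 1360.0), (7200, 1720.0), (10800, 2080.0), (14400, 2440.0)]
print(deriv(usage))
print(predict_linear(usage, now=14400, seconds_ahead=3600))
```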

Aggregating over time

Note: See all available <aggregation>_over_time() functions.

# Average within each series over a 5-minute period:
avg_over_time(go_goroutines[5m])

# Get the maximum for each series over a one-day period:
max_over_time(process_resident_memory_bytes[1d])

# Count the number of samples for each series over a 5-minute period:
count_over_time(process_resident_memory_bytes[5m])

Time

# Get the Unix time in seconds at each resolution step:
time()

# Get the age of the last successful batch job run:
time() - demo_batch_last_success_timestamp_seconds

# Find batch jobs which haven't succeeded in an hour:
time() - demo_batch_last_success_timestamp_seconds > 3600

Dealing with missing data

# Create one output series when the input vector is empty:
absent(up{job="some-job"})

# Create one output series when the input range vector is empty for 5 minutes:
absent_over_time(up{job="some-job"}[5m])

Manipulating labels

# Join the values of two labels with a - separator into a new endpoint label:
label_join(rate(demo_api_request_duration_seconds_count[5m]), "endpoint", "-", "method", "path")

# Extract part of a label and store it in a new label:
label_replace(up, "hostname", "$1", "instance", "(.+):(\\d+)")
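
The behaviour of these two functions on a single label set can be approximated in Python. This sketch assumes the regex has to match the whole source value and supports only numbered groups such as $1; Python's re module also differs in places from the RE2 syntax Prometheus uses, so treat it as an illustration rather than an exact port.

```python
import re


def label_replace(labels, dst, replacement, src, regex):
    """Return a copy of `labels` with `dst` set from a regex match on `src`.

    If the regex does not match the whole source value, the labels are
    returned unchanged.
    """
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        return dict(labels)
    # Translate $1-style references into Python's \g<1> form.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    result = dict(labels)
    result[dst] = match.expand(template)
    return result


def label_join(labels, dst, separator, *sources):
    """Return a copy of `labels` with `dst` set to the joined source values."""
    result = dict(labels)
    result[dst] = separator.join(labels.get(name, "") for name in sources)
    return result


target = {"instance": "localhost:9090", "job": "prometheus"}
print(label_replace(target, "hostname", "$1", "instance", r"(.+):(\d+)"))
print(label_join({"method": "GET", "path": "/api/foo"}, "endpoint", "-", "method", "path"))
```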

Subqueries

# Calculate the 5-minute-averaged rate over a 1-hour period, at the default subquery resolution (= global rule evaluation interval):
rate(demo_api_request_duration_seconds_count[5m])[1h:]

# Calculate the 5-minute-averaged rate over a 1-hour period, at a 15-second subquery resolution:
rate(demo_api_request_duration_seconds_count[5m])[1h:15s]

# Using the subquery result to get the maximum rate over a 1-hour period:
max_over_time(
  rate(
    demo_api_request_duration_seconds_count[5m]
  )[1h:]
)

Recording rules

Recording rules allow you to precompute frequently used or computationally expensive queries.
Recording rules are defined in separate rule files, which the Prometheus configuration file loads through its rule_files setting.
Recording rules are created using the record clause in a rule block.

Alerting

Alerting rules

Alerting rules allow you to define conditions for when an alert should be triggered.
Alerting rules are defined in separate rule files, which the Prometheus configuration file loads through its rule_files setting.
Alerting rules are created using the alert clause in a rule block.
Alerting rules have a for clause that specifies the duration that the alert condition must be true before the alert is triggered.
Alerting rules have a labels clause that specifies the labels to add to the alert when it is triggered.
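
The effect of the for clause can be pictured as a small state machine: an alert is pending while its condition holds but the duration has not yet elapsed, and firing afterwards. The Python sketch below models that for a single alert; the function name and the evaluation values are hypothetical.

```python
def alert_states(condition_by_step, step_seconds, for_seconds):
    """Return the alert state at each rule evaluation.

    `condition_by_step` holds True/False for whether the rule expression
    returned a result at that evaluation. The alert stays pending until the
    condition has held for `for_seconds`, then becomes firing.
    """
    active_since = None
    states = []
    for step, condition in enumerate(condition_by_step):
        now = step * step_seconds
        if not condition:
            # Any evaluation without a result resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# Evaluated every 60 seconds with a for duration of 2 minutes.
states = alert_states([False, True, True, True, False, True], 60, 120)
print(states)
```

A short dip in the condition sends the alert back to inactive, which is why a for duration helps filter out brief spikes.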

Alertmanager

Alertmanager is a tool that handles alerts sent by Prometheus.
Alertmanager can group, deduplicate, and route alerts to different receivers based on their labels.
Alertmanager can send alerts via email, PagerDuty, Slack, and other channels.

Instrumentation

Client libraries

Prometheus provides client libraries for popular programming languages such as Go, Java, Python, Ruby, and others.
Client libraries allow you to instrument your code to expose metrics to Prometheus.
Client libraries provide metric types such as counters, gauges, histograms, and summaries.
Client libraries support labels on metrics and expose the collected metrics over HTTP so that Prometheus can scrape them.

Prometheus Python Client

Note: See the official GitHub repository for the Prometheus Python Client.

# Installation
pip install prometheus-client

Note: This package can be found on PyPI.

# Example for pushing metrics to the Prometheus Pushgateway with a Python script:
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# Use a dedicated registry so only this metric is pushed
registry = CollectorRegistry()
metric_name = "NAME"
metric_description = "DESCRIPTION"
metric_value = 0.0  # METRIC_VALUE: must be a number, not a string
label_names = ["LIST_OF_LABEL_NAMES"]
label_values = ["LIST_OF_LABEL_VALUES"]

g = Gauge(metric_name, metric_description, label_names, registry=registry)
# labels() returns the child series for these label values; set the value on it
g.labels(*label_values).set(metric_value)

push_to_gateway("GATEWAY_URL", job="JOB_NAME", registry=registry)

Exporters

Exporters are agents that collect metrics from third-party systems and expose them in a format that Prometheus can understand.
Exporters are available for systems such as Apache, MySQL, Nginx, and Linux hosts (through node_exporter), among many others.
Exporters are usually run as separate processes or containers.
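
To make the idea concrete, here is a minimal exporter sketch that uses only the Python standard library. The metric name demo_queue_depth, the port 9999, and the read_queue_depth() helper are all hypothetical stand-ins; a real exporter would query the third-party system there and would normally be built on a client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def read_queue_depth():
    # Hypothetical stand-in for asking the third-party system for its state.
    return 7


def render_metrics(queue_depth):
    """Render one gauge in the Prometheus text exposition format."""
    return (
        "# HELP demo_queue_depth Number of items waiting in the queue.\n"
        "# TYPE demo_queue_depth gauge\n"
        f"demo_queue_depth {queue_depth}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(read_queue_depth()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9999):
    """Serve /metrics until interrupted (not called automatically here)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


print(render_metrics(read_queue_depth()))
```

Calling serve() starts the HTTP server; Prometheus would then scrape the /metrics path on that port as a regular target.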

Configuration

Configuration file

The Prometheus configuration file is a YAML file that specifies the configuration of the Prometheus server.
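The configuration file specifies the targets to scrape, the scrape and rule evaluation intervals, and other settings. Storage settings such as the retention time are typically set through command-line flags instead.
The configuration file also lists the rule files that contain alerting rules and recording rules.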

Command-line options

The Prometheus server has many command-line options that can be used to customize its behavior.
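Command-line options configure system-level parameters, such as storage locations and how much data to keep, while the configuration file covers scraping and rule files.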
Command-line options can be used to specify the configuration file location, the web interface settings, and other settings.

Prometheus Querying Basics

Prometheus Querying Operators

Prometheus Querying Functions

Prometheus Querying Examples
