Understanding Prometheus - Metrics, Data Types, and Querying

An example of Prometheus data:

http_request_total{method="GET", endpoint="/contact-us", status="200"} 1 2 3 4 5 
http_request_total{method="POST", endpoint="/auth", status="400"} 1 6 8 10 15

Key Concepts

Metric: The quantity being measured, identified by a metric name (e.g.: http_request_total)
Metric label: A key-value pair of metadata attached to the measurement (e.g.: method="GET")
Sample: A single data point, made of a float64 value and a millisecond-precision timestamp (e.g.: 5)
Series: A unique combination of metric name and label set (e.g.: http_request_total{method="GET", endpoint="/contact-us", status="200"} and http_request_total{method="POST", endpoint="/auth", status="400"})
Time series: The stream of samples recorded for one series over time (e.g.: 1 2 3 4 5)
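To make these terms concrete, here is a minimal Python sketch of the data model. It is not how Prometheus stores data internally; the names series_key, store, and append, and the 15-second timestamps, are assumptions made up for illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

# A series is identified by its metric name plus its full label set.
SeriesKey = Tuple[str, FrozenSet[Tuple[str, str]]]
# A sample is a (timestamp_seconds, value) pair; values are floats.
Sample = Tuple[float, float]


def series_key(name: str, **labels: str) -> SeriesKey:
    """Build the identity of a series from metric name and labels."""
    return (name, frozenset(labels.items()))


# Maps each series to its time series (the samples collected over time).
store: Dict[SeriesKey, List[Sample]] = {}


def append(name: str, ts: float, value: float, **labels: str) -> None:
    """Append one sample to the time series identified by name + labels."""
    store.setdefault(series_key(name, **labels), []).append((ts, float(value)))


# Timestamps are hypothetical: one scrape every 15 seconds.
for i, v in enumerate([1, 2, 3, 4, 5]):
    append("http_request_total", i * 15, v,
           method="GET", endpoint="/contact-us", status="200")
for i, v in enumerate([1, 6, 8, 10, 15]):
    append("http_request_total", i * 15, v,
           method="POST", endpoint="/auth", status="400")

get_key = series_key("http_request_total",
                     method="GET", endpoint="/contact-us", status="200")
print(len(store))          # number of distinct series
print(store[get_key][-1])  # latest sample of the GET series
```

Changing any label value (for example status="500") produces a different key, and therefore a new series.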

Data Types

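Instant vector: A set of series, each with a single sample at the same timestamp (e.g.: http_request_total{method="GET"})
Range vector: A set of series, each with a range of samples over time (e.g.: http_request_total{method="GET"}[5m])
Scalar: A single floating-point number (e.g.: 2)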

Metric Types

Prometheus supports four metric types:

Gauge: Value can go up and down (e.g.: logged_users)
Counter: Value only increases, apart from resetting to zero when the process restarts (e.g.: http_request_total)
Histogram: Exposes <metric_name>_bucket, <metric_name>_sum, and <metric_name>_count. Use histogram_quantile() to calculate quantiles server-side (e.g.: http_request_duration_seconds)
Summary: Similar to Histogram, but quantiles are calculated client-side (in the application), so the quantiles cannot be aggregated across instances.
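The server-side quantile calculation for histograms can be sketched as follows. This is a simplified Python approximation of what histogram_quantile() does with classic cumulative buckets (linear interpolation inside the bucket that contains the target rank). It ignores edge cases the real function handles, and the bucket data is made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes 0 <= q <= 1, buckets sorted by upper bound,
    at least one finite bucket, and +Inf as the last bucket.
    """
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite upper bound instead.
                return buckets[i - 1][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical http_request_duration_seconds_bucket values (le, cumulative count).
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(histogram_quantile(0.95, buckets))
```

Because the calculation only needs bucket counts, buckets from several instances can be summed first and the quantile computed afterwards, which is what makes histograms aggregatable.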

PromQL

Operator Precedence

PromQL binary operators are listed below from highest to lowest precedence. Operators on the same level are left-associative, except ^, which is right-associative.

^
*, /, %, atan2
+, -
==, !=, <=, <, >=, >
and, unless
or

Modifiers

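@ 1609746000 - evaluate the selector as if the query time were the Unix timestamp 1609746000
offset 5m - evaluate the selector as if the query time were 5 minutes earlier

Both modifiers must directly follow the selector, inside any function call (e.g.: sum(http_request_total{method="GET"} offset 5m), not sum(http_request_total{method="GET"}) offset 5m)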
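The effect of both modifiers on an instant selector can be sketched in Python as a shift of the evaluation time. The function name instant_value and the sample data are hypothetical. The 5-minute lookback mirrors the Prometheus default, but the exact boundary handling here is an assumption.

```python
from typing import List, Optional, Tuple

LOOKBACK_SECONDS = 300  # default lookback window of 5 minutes


def instant_value(samples: List[Tuple[float, float]],
                  eval_time: float,
                  offset: float = 0.0,
                  at: Optional[float] = None) -> Optional[float]:
    """Return the value an instant selector would see for one series.

    `at` plays the role of the @ modifier (fixed evaluation time) and
    `offset` the role of the offset modifier (shift back in time). The
    offset is applied relative to the @ time when both are given.
    """
    t = (at if at is not None else eval_time) - offset
    candidates = [(ts, v) for ts, v in samples
                  if t - LOOKBACK_SECONDS < ts <= t]
    # The most recent sample at or before t wins; None means "no data".
    return max(candidates)[1] if candidates else None


# Hypothetical series: one sample per minute.
samples = [(0, 1.0), (60, 2.0), (120, 3.0), (180, 4.0), (240, 5.0)]
print(instant_value(samples, eval_time=240))              # plain selector
print(instant_value(samples, eval_time=240, offset=120))  # offset 2m
print(instant_value(samples, eval_time=240, at=60))       # @ 60
```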

Vector Matching

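Vector scalar:
- The operator is applied to every sample in the vector
- Example: http_request_total / 2
Vector Vector:
- Types of matching:
  - One-to-One
  - One-to-Many
  - Many-to-One
- By default, series match when all of their labels are identical
- Customize the matching labels with on(<labels>) or ignoring(<labels>)
- Use group_left or group_right to declare which side is the "many" side (group_left: left side, group_right: right side)
- Use group_left(<labels>) to copy labels from the "one" side into the result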

Example:

method_code:http_errors:rate5m{method="get", code="500"}  24
method_code:http_errors:rate5m{method="get", code="404"}  30
method_code:http_errors:rate5m{method="put", code="501"}  3
method_code:http_errors:rate5m{method="post", code="500"} 6
method_code:http_errors:rate5m{method="post", code="404"} 21

method:http_requests:rate5m{method="get", foo="bar"}  600
method:http_requests:rate5m{method="del", foo="bar1"}  34
method:http_requests:rate5m{method="post", foo="bar2"} 120

method_code:http_errors:rate5m / on(method) group_left(foo) method:http_requests:rate5m

{method="get", code="500", foo="bar"} 0.04
{method="get", code="404", foo="bar"} 0.05
{method="post", code="500", foo="bar2"} 0.05
{method="post", code="404", foo="bar2"} 0.175

The put and del series have no match on the other side, so they are dropped. With a plain group_left (no label list), the foo label is not copied into the result.
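The many-to-one match in this example can be sketched in Python. The sketch matches the two vectors on the method label and copies foo from the "one" side, which is the behavior of on(method) group_left(foo). It uses 120 for the post request rate, the value consistent with the 0.05 and 0.175 results listed above. The function name divide_on_group_left is hypothetical.

```python
from typing import Dict, List, Tuple

Labels = Dict[str, str]
Vector = List[Tuple[Labels, float]]

errors: Vector = [
    ({"method": "get", "code": "500"}, 24),
    ({"method": "get", "code": "404"}, 30),
    ({"method": "put", "code": "501"}, 3),
    ({"method": "post", "code": "500"}, 6),
    ({"method": "post", "code": "404"}, 21),
]
requests: Vector = [
    ({"method": "get", "foo": "bar"}, 600),
    ({"method": "del", "foo": "bar1"}, 34),
    ({"method": "post", "foo": "bar2"}, 120),
]


def divide_on_group_left(many: Vector, one: Vector,
                         on: List[str], include: List[str]) -> Vector:
    """Sketch of `many / on(<on>) group_left(<include>) one`."""
    index: Dict[Tuple[str, ...], Tuple[Labels, float]] = {}
    for labels, value in one:
        key = tuple(labels.get(name, "") for name in on)
        if key in index:
            # The "one" side must be unique per matching key.
            raise ValueError("duplicate series on the 'one' side")
        index[key] = (labels, value)

    result: Vector = []
    for labels, value in many:
        key = tuple(labels.get(name, "") for name in on)
        if key not in index:
            continue  # series without a partner are dropped
        one_labels, one_value = index[key]
        out = dict(labels)  # the result keeps the "many" side's labels
        for name in include:
            if name in one_labels:
                out[name] = one_labels[name]  # copied from the "one" side
        result.append((out, value / one_value))
    return result


for labels, value in divide_on_group_left(errors, requests, ["method"], ["foo"]):
    print(labels, value)
```

Passing an empty include list corresponds to a plain group_left: the same four series come back, but without the foo label.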

Common Prometheus Functions

changes(): Number of times a series' value changed within the range
time(): Evaluation timestamp, in seconds since the Unix epoch
timestamp(): Timestamp of each sample in an instant vector
Derivative and Rate:
- deriv(): gauge
- rate(), irate(): counter
Delta and Increase:
- delta(), idelta(): gauge
- increase(): counter
irate() vs rate():
- irate(): (last - second-to-last datapoint) / time between them, so it reacts quickly to changes
- rate(): average per-second increase over the whole range, extrapolated to the range boundaries
Aggregation:
- <aggregation>: sum, count, max, min, avg, etc.: Aggregates across series (group by labels)
- <aggregation>_over_time(): Aggregates each series across time (over a range vector)
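The difference between rate() and irate() can be sketched in Python. Both functions below are simplifications. In particular, simple_rate divides by the time between the first and last sample and skips the extrapolation to the window boundaries that the real rate() performs. The names and timestamps are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp_seconds, counter_value)


def counter_increase(prev: float, cur: float) -> float:
    """Increase between two counter samples, treating a drop as a reset to 0."""
    return cur - prev if cur >= prev else cur


def simple_irate(samples: List[Sample]) -> float:
    """Per-second rate from the last two samples in the window."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return counter_increase(v0, v1) / (t1 - t0)


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second rate over all samples in the window (no extrapolation)."""
    total = sum(counter_increase(prev, cur)
                for (_, prev), (_, cur) in zip(samples, samples[1:]))
    return total / (samples[-1][0] - samples[0][0])


# The POST series from the top of these notes, one sample every 15 seconds.
window = [(0, 1.0), (15, 6.0), (30, 8.0), (45, 10.0), (60, 15.0)]
print(simple_rate(window))   # smoothed over the whole window
print(simple_irate(window))  # only reflects the most recent jump
```

This is why rate() is usually preferred for alerting and slow-moving graphs, while irate() suits graphs of volatile, fast-moving counters.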

Examples:

sum(http_request_total)

Result (evaluated at the latest sample):

{} 20 # 5+15

sum_over_time(http_request_total[5m])

Result (assuming all five samples fall within the 5m window):

{method="GET", endpoint="/contact-us", status="200"} 15 # 1+2+3+4+5
{method="POST", endpoint="/auth", status="400"} 40 # 1+6+8+10+15
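The two aggregation directions can be checked with a few lines of Python over the sample data from the top of these notes. The sketch assumes the query is evaluated at the last sample and that all five samples of each series fall inside the range window. The function names are hypothetical.

```python
from typing import Dict, List, Tuple

# Label values (method, endpoint, status) -> samples, oldest first.
series: Dict[Tuple[str, str, str], List[float]] = {
    ("GET", "/contact-us", "200"): [1, 2, 3, 4, 5],
    ("POST", "/auth", "400"): [1, 6, 8, 10, 15],
}


def sum_across_series(data: Dict[Tuple[str, str, str], List[float]]) -> float:
    """Like sum(metric): add the latest sample of every series together."""
    return sum(samples[-1] for samples in data.values())


def sum_over_time(data: Dict[Tuple[str, str, str], List[float]]
                  ) -> Dict[Tuple[str, str, str], float]:
    """Like sum_over_time(metric[range]): add up the samples of each series."""
    return {labels: sum(samples) for labels, samples in data.items()}


print(sum_across_series(series))  # one value, labels aggregated away
print(sum_over_time(series))      # one value per series, labels kept
```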

Prometheus Client Library Usage

Instrumentation
Writing exporters
Pushing metrics to Pushgateway

Storage

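Do not put Prometheus's local storage on NFS: the storage documentation lists NFS as unsupported because of the risk of unrecoverable data corruption. Use a local disk instead.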

Agent Mode

Disables querying, alerting, and recording rules
Scrapes metrics from targets and forwards them to other instances via remote write

Service Discovery

Static: Define targets in the config file with static_configs
*_sd_configs: Use built-in discovery mechanisms (e.g.: ec2_sd_configs, kubernetes_sd_configs, file_sd_configs)
Custom: Use file_sd_configs and update the target file periodically; Prometheus picks up the changes.
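For the custom approach, a small script can regenerate the target file on a schedule. The Python sketch below writes the JSON format that file-based discovery reads (a list of groups, each with "targets" and optional "labels"). The function name write_file_sd, the file name, and the target addresses are hypothetical. Writing to a temporary file and renaming it is a design choice to avoid Prometheus reading a half-written file.

```python
import json
import os
import tempfile
from typing import Dict, List, Union

TargetGroup = Dict[str, Union[List[str], Dict[str, str]]]


def write_file_sd(path: str, groups: List[TargetGroup]) -> None:
    """Atomically write a file-based service discovery target file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file gets a .tmp suffix so a *.json glob will not pick it up.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(groups, f, indent=2)
    os.chmod(tmp_path, 0o644)   # mkstemp creates the file as 0600
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


# Hypothetical targets, written to a throwaway directory for the demo.
with tempfile.TemporaryDirectory() as demo_dir:
    target_file = os.path.join(demo_dir, "targets.json")
    write_file_sd(target_file, [
        {"targets": ["10.0.0.1:9100", "10.0.0.2:9100"],
         "labels": {"env": "dev"}},
    ])
    with open(target_file) as f:
        print(f.read())
```

In practice the script would run from cron or a similar scheduler and write into the directory referenced by the files pattern of the scrape config.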

Each scrape config can set, among others:

scrape_interval
scrape_timeout
proxy_url
metrics_path

Relabeling

relabel_configs: Rewrite a target's labels and scrape parameters before scraping (e.g.: Blackbox exporter)
metric_relabel_configs: Rewrite or drop samples after scraping, before ingestion (e.g.: remove unwanted metrics)
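The "remove unwanted metrics" case can be sketched in Python as a drop rule: the values of the source labels are joined with a separator and the series is dropped when the regex matches the whole string. Prometheus anchors relabeling regexes at both ends, which re.fullmatch imitates here. Python's regex dialect differs from the RE2 syntax Prometheus uses, and the function name apply_drop is hypothetical.

```python
import re
from typing import Dict, List

Labels = Dict[str, str]


def apply_drop(series: List[Labels], source_labels: List[str],
               regex: str, separator: str = ";") -> List[Labels]:
    """Sketch of a relabel rule with action "drop"."""
    pattern = re.compile(regex)
    kept = []
    for labels in series:
        # Missing labels are treated as empty strings.
        value = separator.join(labels.get(name, "") for name in source_labels)
        if pattern.fullmatch(value):
            continue  # the whole value matched: drop this series
        kept.append(labels)
    return kept


# Hypothetical scraped series; __name__ holds the metric name.
scraped = [
    {"__name__": "go_gc_duration_seconds"},
    {"__name__": "http_request_total", "method": "GET"},
    {"__name__": "cargo_total"},
]
for labels in apply_drop(scraped, ["__name__"], "go_.*"):
    print(labels)
```

Because of the anchoring, cargo_total survives even though it contains "go_"; only names that start with go_ are dropped.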

Alerting in Prometheus

Prometheus evaluates alerting rules and sends firing alerts to Alertmanager
Prometheus itself does not handle notifications; Alertmanager does
Alertmanager routes alerts to receivers by matching on their labels
Labels: alert identity
Annotations: longer-form description
Annotations support templating with Go template syntax
Labels can be referenced in annotations with {{ $labels.foo }}

Alertmanager

Silencing alerts use cases:

Provisioning new servers
Decommissioning servers
Maintenance

Inhibiting:

Suppress notifications for a group of alerts while another alert is firing
Example: A cluster-down alert inhibits memory or disk check alerts
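The inhibition logic can be sketched in Python: a target alert is muted when some firing source alert matches the rule and both alerts carry the same values for a set of "equal" labels. This is an illustration of the idea, not Alertmanager's implementation. The function name is_inhibited, the alert names, and the use of exact-match dictionaries in place of Alertmanager's matchers are assumptions.

```python
from typing import Dict, List

Alert = Dict[str, str]  # an alert is represented by its label set


def matches(alert: Alert, matchers: Dict[str, str]) -> bool:
    """True when the alert carries every label value in the matchers."""
    return all(alert.get(name) == value for name, value in matchers.items())


def is_inhibited(alert: Alert, firing: List[Alert],
                 source_match: Dict[str, str],
                 target_match: Dict[str, str],
                 equal: List[str]) -> bool:
    """Sketch of one inhibit rule applied to one alert."""
    if not matches(alert, target_match):
        return False  # the rule does not apply to this alert
    for source in firing:
        if source is alert or not matches(source, source_match):
            continue
        # Missing labels are treated like empty values.
        if all(source.get(name, "") == alert.get(name, "") for name in equal):
            return True
    return False


# Hypothetical firing alerts.
cluster_down = {"alertname": "ClusterDown", "cluster": "a"}
disk_a = {"alertname": "DiskFull", "cluster": "a"}
disk_b = {"alertname": "DiskFull", "cluster": "b"}
firing = [cluster_down, disk_a, disk_b]

for alert in firing:
    muted = is_inhibited(alert, firing,
                         source_match={"alertname": "ClusterDown"},
                         target_match={"alertname": "DiskFull"},
                         equal=["cluster"])
    print(alert["alertname"], alert["cluster"], "inhibited" if muted else "notifies")
```

The equal list is what keeps the cluster "a" outage from hiding disk alerts on cluster "b".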
