Prometheus introduction - How we show new colleagues the power of fire

A gentle introduction to the wonderful world of metrics

This blogpost is essentially a 1:1 copy of our internal how-to on the basics of Prometheus. While it’s relatively common for developers to hear about it, the actual deployments are still quite rare. The internal how-to aims to give a step-by-step introduction to the wonderful world of metrics, which can benefit developers and operations people alike. We felt this introduction could be of use to a wider audience than just our colleagues, so we decided to publish it almost as-is on our blog.

What is Prometheus

Prometheus is a time-series database, meaning that it’s a database that stores series of timestamped data points. To be exact, Prometheus only stores double-precision floating-point values. To tell different kinds of data apart, every series is identified by a metric name and an optional set of labels, which are key/value pairs.

Prometheus is written in Go, and, because of that, it ships as a single executable binary that can be compiled for any supported architecture.

Comparison with Nagios

Nagios, Icinga, and similar monitoring solutions usually perform what are called black-box checks. They run agents that are constantly peeking and prodding different machines and services, asking them “are you okay?”, and waiting for the response. While this is fine for basic alive/dead checks, it doesn’t help much when we need to diagnose the cause of the issue. There’s also the issue of NRPE plugins, which are basically remote code executions as a service. They allow the operator to run any command they wish with any parameters, but if configured incorrectly - which is often the case - they allow parameter injection and other forms of abuse.

Prometheus promotes white-box monitoring instead. The applications themselves produce all of the instrumentation data, with as much granularity as they want, producing much more comprehensive information than just alive/dead. Prometheus has become the de facto standard for monitoring cloud native software - a lot of it exposes metrics in Prometheus style already, and we just need to collect them. That’s the case for Docker, etcd, Kubernetes, cAdvisor, and many more. There’s also a lot of middleware available for older software that has not yet begun exposing metrics.

Prometheus architecture

Prometheus architecture diagram

The heart of Prometheus lies in its server. This is the engine that collects all of the data and stores it in its internal tsdb, or “time-series database”. It also offers an API for querying and a very basic website for making ad-hoc queries and exploring the data. There are many ways to configure the scrape targets for Prometheus - all of the usual service discovery mechanisms like Consul, DNS, Kubernetes, and Marathon are there. For everything else, we can use the file_sd option, which watches a JSON file on disk and reloads the targets whenever the contents of the file change. This is also what backs our in-house machine-based service discovery, where a cronjob periodically checks our machine database for new targets, checks which exporters are running there, and updates the configuration files. We use consul-sd for our Nomad workloads.

Exposition format

When you look at the exposition format example below, you will notice that the first metric has the name node_cpu_seconds_total, a value of 3.92238188e+06, and a set of labels inside curly braces {cpu="0",mode="idle"}. Labels can be omitted entirely, together with the curly braces; when present, they are separated by commas. A label value can be any UTF-8 string, so yes, you can use status="💩". Labels are very useful for compartmentalizing information, doing joins, and other useful stuff that we’ll mention later.

You should also include HELP comments - lines that start with # - with every metric, so that anyone unsure what the metric actually means can easily look it up. Don’t worry, the client libraries for programming languages make this very easy. Almost all exporters found in the wild include the help lines.

You can read more thorough documentation of the format on the official website.

# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 3.92238188e+06
node_cpu_seconds_total{cpu="0",mode="iowait"} 394.28
node_cpu_seconds_total{cpu="0",mode="irq"} 0
node_cpu_seconds_total{cpu="0",mode="nice"} 0
node_cpu_seconds_total{cpu="0",mode="softirq"} 26041.52
node_cpu_seconds_total{cpu="0",mode="steal"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 62279.98
node_cpu_seconds_total{cpu="0",mode="user"} 157790.08
node_cpu_seconds_total{cpu="1",mode="idle"} 3.92604158e+06
node_cpu_seconds_total{cpu="1",mode="iowait"} 407.08
node_cpu_seconds_total{cpu="1",mode="irq"} 0
node_cpu_seconds_total{cpu="1",mode="nice"} 0
node_cpu_seconds_total{cpu="1",mode="softirq"} 20548.33
node_cpu_seconds_total{cpu="1",mode="steal"} 0
node_cpu_seconds_total{cpu="1",mode="system"} 62158.23
node_cpu_seconds_total{cpu="1",mode="user"} 158312.68

Metric names follow a common structure:

  • node - application prefix. This means that the data came from node-exporter
  • cpu - the thing that is measured
  • seconds_total - unit and the keyword total which means it’s a value that keeps accumulating, i.e. a Counter (see Metric Types)

When designing your own metrics, please use the base units suggested in the documentation.

Basic exporters that we use

  • node-exporter - installed on every physical and virtual machine. Exports statistics about network, I/O, CPU, memory, filesystems, and so on.
  • cAdvisor - exposes CPU, memory, network and I/O usage from containers. These are installed on our nomad clusters, accessible under the container_ prefix.

We use many, many more exporters for different purposes, but the two mentioned above are universal, and you can count on them being installed everywhere. We currently gather data from Nginx, HAProxy, Varnish, Apache, Unbound (the local resolver on every node), Socket2amqp, Elasticsearch, and more. Feel free to explore the data as much as you want - when you type in a prefix, the query box will suggest metric names to you.

PromQL basics

Every time Prometheus scrapes metrics, it adds some metadata to each metric presented. One of them is called instance - this should be a machine name, container name, or something similar - basically, you need to be able to identify where the data came from. The other one is job - a collection of instances that produce the same data. So, we get one job for node-exporter, one job for nginx, one job for haproxy, and so on. We will rely on these labels all the time, so keep them in mind.

Let’s see some examples and focus on the metric we’ve seen before - node_cpu_seconds_total.

First steps

When you try to execute this query: node_cpu_seconds_total on our Prometheus instance, you’ll see a warning sign similar to this:

Warning! Fetched 58672 series, but displaying only first 10000

Don’t even try to click on the Graph tab yet - your browser will either crash or become extremely slow trying to show you 10,000 lines at the same time.

This happens because there are 8 different mode labels for each CPU core, and most of our servers have more than 20 cores - so at least 160 different series for a single machine. That seems like a lot, and it’s not even that useful in the first place - each value is just the number of seconds a given CPU has spent in a given mode since boot.

Rate

Let’s move on and focus on a single machine for now - node_cpu_seconds_total{instance="phabricator"} - I’m using our Phabricator instance in the hope that this manual will still be current in the distant future. This is much more manageable, so let’s click the Graph tab - and we’ll see that these are basically just straight lines. That’s because the values are constantly increasing, but they have been doing so for so long that the per-second differences are too small to be seen here. We would like to see per-second differences, and that’s where the rate() function comes in handy: rate(node_cpu_seconds_total{instance="phabricator"}[5m]). Just applying it to the same data yields much more useful results.

The rate function takes a metric with a time range enclosed in [square brackets] as a parameter. For example: rate(node_cpu_seconds_total{instance="phabricator"}[5m]). The 5-minute parameter means that rate takes 5-minute chunks and calculates the per-second rate of increase for each chunk, using the first and last samples collected in the chunk and extrapolating to the chunk boundaries. So, the bigger the time range, the smoother the graph looks - which is great for showing long-term trends without worrying about spikes.
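To build some intuition for what rate computes, here is a minimal sketch in Python, using made-up samples. It only shows the core idea - the real implementation additionally extrapolates to the window boundaries and handles counter resets:

```python
# Made-up (timestamp_seconds, counter_value) samples inside one 5-minute window.
samples = [(0, 100.0), (60, 160.0), (120, 220.0), (240, 340.0), (300, 400.0)]

def simple_rate(samples):
    """Per-second rate of increase between the first and last sample.
    A simplification: Prometheus also extrapolates to the edges of the
    window and accounts for counter resets."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)

print(simple_rate(samples))  # 1.0 - this made-up counter grows by 1 per second
```

With a wider window, short spikes between the first and last samples average out, which is exactly why longer ranges produce smoother graphs.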

On the other hand, if you want more accurate data to actually see the spikes, you can use irate instead. There’s a really good article written by Brian Brazil about the differences between these two.

Aggregation

We can use different aggregation functions to make more sense of the data we have. Applying the sum function - sum(rate(node_cpu_seconds_total{instance="phabricator"}[5m])) - gives us the total number of CPUs in the machine, because every CPU spends exactly one second per second across all of its modes combined. This should be a constant value, but might not be, due to any number of things (race conditions in measurements, virtual machines getting fewer CPUs after a reboot, etc.).

To get something much more useful, we can use the same sum function, but keep every label except cpu: sum(rate(node_cpu_seconds_total{instance="phabricator"}[5m])) without (cpu). This way, we see numbers similar to the ones we are used to from the top command - percentages (in the 0.0 - 1.0 interval, meaning 0% - 100%) - for each second and each processor mode.

Chances are, though, that most of the CPUs are idle, so we want to zoom in on the other modes and see them better. We just need to filter out the idle mode altogether: sum(rate(node_cpu_seconds_total{instance="phabricator", mode!="idle"}[5m])) without (cpu)

Filters & regexes

Until now, we’ve only graphed a single machine. We can also use a regex filter - =~ and !~ (not matching) in labels. So if we want to see the same as in the previous example, but for all of our beautiful CDN machines in Johannesburg, we use this filter: sum(rate(node_cpu_seconds_total{instance=~"za-jnbteraco-[0-9]+.*", mode!="idle"}[5m])) without (instance, cpu). We should see that, most of the time, the CPUs are doing I/O work, which is to be expected from CDN machines because that’s their job.

We can get basically the same graph by using by instead of without: sum(rate(node_cpu_seconds_total{instance=~"za-jnbteraco-[0-9]+.*", mode!="idle"}[5m])) by (mode) - sometimes one is more convenient than the other. by keeps only the labels that you list, while without keeps every label but the ones that you list. If a metric has loads of labels but you only want to sum or average by a single one, it’s easier to use by than to use without and list all the others.

What if we want to know which server is the most overloaded one - doing the most of iowait work - from the bunch we filtered above? That’s easy - just edit the filter to include iowait only and sum by instance label: sum(rate(node_cpu_seconds_total{instance=~"za-jnbteraco-[0-9]+.*", mode="iowait"}[5m])) by (instance) .

You can see from the examples above that even with just a single metric, we can graph some pretty diverse data! That’s one of the most powerful features of Prometheus, and you should get comfortable with the basics to be able to harness its full potential. You can find a lot of interesting calculations and graphs in our Grafana - if you’re unsure what a query means exactly, feel free to ask the authors of the dashboard or Stefan Safar.

More examples

We can also filter values by comparing them to a constant. Let’s say we want to see every Nomad host with less than 20GB of memory available. We just use the data reported by Nomad and get some interesting results: nomad_client_host_memory_free < 20*1024*1024*1024 .

We might also want to know which datacenter has the least amount of memory available in its Nomad cluster: sum(nomad_client_host_memory_free) by (datacenter) . You can see that we don’t have to use the rate function here, because the amount of free memory is a Gauge, not a Counter. This means that the value can increase or decrease, so we’re interested in the current value rather than the rate.

If we want to see the minimum available memory on a node for each datacenter, we can use the min function instead: min(nomad_client_host_memory_free) by (datacenter) . We’ve now lost the information about which node is the lowest one, but we can get it back by filtering on one of the datacenters: nomad_client_host_memory_free{datacenter="za-jnbteraco-prod-01"} .

Upstream documentation

You can see all of the query functions in the official documentation.

Metric types

There are four basic metric types in Prometheus. Although everything is stored as a floating-point value, they differ in the meaning of the value that is stored, and you want to use different functions with different types to get sensible outputs. The two most common ones are Counter and Gauge - both used in the previous section. We’ll spend more time with Histograms, as they are a bit more complicated but can be tremendously useful.

Counter

Counter is a value that moves in one direction only - up. This can be seconds spent in a CPU mode since boot, the number of HTTP responses with a 4xx status code, the number of requests served, the number of bytes/packets sent by a network card, and so on. You usually want to see per-second increases in these values - that’s why you need the previously mentioned rate function. rate takes care of counter resets - if it detects that a value is lower than the previous one, it assumes there was a reset. Thanks to that, you don’t have to worry about them at all, and you won’t see graphs dropping to zero just because you needed to do a bit of maintenance (e.g. restart a daemon).

There are more functions that might come in handy - increase shows the total increase of a counter over a given time range; resets tells you the number of counter resets.
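The reset handling can be sketched in a few lines of Python (a simplification of what rate and increase do, with made-up samples):

```python
def increase_with_resets(values):
    """Total increase of a counter over a series of samples.
    A sample lower than its predecessor is treated as a counter reset,
    so the whole current value counts towards the increase."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += cur if cur < prev else cur - prev
    return total

# Counter climbs to 50, the daemon restarts (drops to 5), then climbs to 30:
print(increase_with_resets([10, 50, 5, 30]))  # 70.0 = 40 + 5 + 25
```

Without the reset check, the naive difference 30 - 10 = 20 would wildly understate the real increase.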

Gauge

Gauges are values that can move both up and down - things like memory stats, optical receive power on SFP interfaces, load, fan speed, and so on. But never use rate on these metrics - the query will execute without hesitation, but you’ll get nonsensical results.

Interesting functions that you might find useful - changes - number of changes (up or down) in a given timeframe; delta - calculates the difference in value in a given timeframe; and predict_linear - extrapolates the current data to calculate expected value in future.

Histogram

Bi-modal histogram example

Histograms in Prometheus are an extension of the basic Counter type. For each bucket of values within a histogram, you create a new Counter series with a different le (less-than-or-equal-to) label. Prometheus client libraries make this easy for you - just call histogram.observe(value) and you don’t have to update all of the Counters manually. As less-than-or-equal-to suggests, the buckets are cumulative. This means that you can create a lot of different buckets in your exporter and safely drop some of them when ingesting with Prometheus, trading resolution for storage space.

http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320

As you can see in the example above, histograms also automatically track the sum of all observed values and their count, updated every time you call observe(). This makes it easy to calculate the mean, and may come in handy in other situations as well.
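For instance, using the numbers from the exposition above, the all-time mean request duration is just the sum divided by the count (in practice you would take rates of both to get a windowed average):

```python
# _sum and _count from the example exposition above.
duration_sum = 53423.0    # total seconds spent serving requests
duration_count = 144320   # total number of observations

mean_duration = duration_sum / duration_count
print(round(mean_duration, 3))  # roughly 0.37 seconds per request
# The PromQL equivalent over a window would be:
#   rate(http_request_duration_seconds_sum[5m])
#     / rate(http_request_duration_seconds_count[5m])
```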

histogram_quantile

While the histogram data is useful for visualizations, we also want to be able to monitor and alert on these metrics. Prometheus provides us with a histogram_quantile function that calculates a given quantile from the data it has. For example, for the above mentioned metric, we could use
histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[5m]))

Let’s break this down. The rate function gives us per-second values for each bucket, calculated over a 5-minute window. We then use this as the input for histogram_quantile, along with 0.9, meaning the 90th percentile. This query would result in a value of around 0.55425, meaning that 90 percent of requests complete within 0.55425 seconds. But if we only have data about the number of requests under 0.5 seconds and between 0.5 and 1 seconds, how can Prometheus know the exact value?
The answer is approximation. 0.9 * 144320 = 129888. This rank falls between the 0.5 and 1.0 buckets (cumulative counts 129389 and 133988), about 10.85% of the way between them. So we add 10.85% of the bucket width (0.5 seconds) to the lower bound and we have our answer: 0.5 + 0.05425 ≈ 0.55425.
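The interpolation can be sketched in a few lines of Python - a simplification of what histogram_quantile does (the real function operates on rates and handles edge cases like the +Inf bucket more carefully):

```python
# Cumulative (le, count) buckets from the example above.
buckets = [(0.05, 24054), (0.1, 33444), (0.2, 100392),
           (0.5, 129389), (1.0, 133988), (float("inf"), 144320)]

def histogram_quantile(q, buckets):
    rank = q * buckets[-1][1]              # 0.9 * 144320 = 129888
    lower_le, lower_count = 0.0, 0
    for le, count in buckets:
        if rank <= count:
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_le + fraction * (le - lower_le)
        lower_le, lower_count = le, count

print(round(histogram_quantile(0.9, buckets), 5))  # 0.55425
```

The even-spread assumption inside each bucket is exactly where the approximation error comes from.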

This means that histogram_quantile will always give us an answer, with varying accuracy. The answer would still be 0.55425 even if the true 90th percentile were actually very close to 1 second. So while we can graph the value that histogram_quantile gives us, it won’t mean much unless we have really small buckets. What we can do instead is use this knowledge to set up our buckets. If we want to alert on histogram_quantile, we should only ever compare it against a value that is equal to one of the bucket boundaries. This way we know for sure whether we are within our SLA or not. We also get the biggest upside of summaries - exact quantiles, at least at the bucket boundaries - with all the advantages of histograms, like being able to sum over them.

Histogram buckets

Histograms come with a set of exponential buckets by default. They are intended to cover a typical web/rpc request from milliseconds to seconds. While this is fine as a default, it’s usually in the “Not great, not terrible” area. If, for example, we want to measure network latency between servers in Germany and servers in South Africa, we know that the usual round trip time is around 175 milliseconds for Cape Town and 205 milliseconds for Johannesburg. We would expect that the quality of connection can vary a bit, so we would be really interested in the 100 to 400 millisecond area. We can create a bucket every 20 milliseconds for the aforementioned interval, and create 2-3 more on both sides. This way we get a much higher resolution for the most probable buckets, while keeping the bucket count relatively low on the outskirts.
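A sketch of what such a bucket list could look like - the exact boundaries here are our own illustration, not anything prescribed:

```python
# 20 ms steps across the interesting 100-400 ms band, plus a few coarser
# buckets on each side. Values are in seconds, following the base-unit rule.
fine = [round(0.1 + 0.02 * i, 2) for i in range(16)]  # 0.10, 0.12, ..., 0.40
buckets = [0.025, 0.05] + fine + [0.6, 1.0, 2.0]

print(len(buckets))  # 21 explicit buckets; client libraries add +Inf themselves
```

With a Python client you would pass such a list via the Histogram constructor’s buckets parameter instead of accepting the defaults.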

Bucket sizes give us huge potential for customization to specific use cases, but they come with a cost. A large number of buckets puts a heavy load on the Prometheus server, and differing bucket sizes may cause confusion, especially when displayed in Grafana. It might take a while to get used to this, and you should point it out at least to people who are on-call for the first time.

Histograms & Grafana

Screenshot of an opacity-based heatmap in Grafana
There are two basic settings for heatmaps in Grafana - opacity and spectrum. Opacity, as its name suggests, bases the opacity of the selected color on the number of samples in the bucket. This setting allows us to use sqrt as the scaling function and tune its exponent to our liking. That means we can tune it so low that we actually see every missed packet, which can be quite useful for visualizing packet loss, as you can see above. We used 0.2 as the exponent on this graph. There should be almost no packet loss during normal conditions, so we can react very quickly. We can usually measure increased latency almost immediately when our upstream provider has an outage and re-routes traffic through another network, so we know something is going on even before they let us know.

Screenshot of a spectrum-based heatmap in Grafana
The other setting is called spectrum. There are a lot of different color spectrum options to choose from, but some stand out more than others. You can see YlGnBu in action in the graph above. Don’t be afraid to try them all out - different options will make different datasets stand out. Just try to avoid red/green combinations, since red-green color blindness is the most common kind. In our experience, green combinations work best, and all options that include black seem to underperform.

Summary

Summaries are a special use case, and they need to be supported by the client library. You give them a quantile (like the 0.9 in the histogram example above, meaning the 90th percentile) and observe data the same way you would with a histogram. The client library keeps track of the observations and maintains accurate quantile values at all times. This may be handy if you have a hard SLA that you need to measure precisely.

However, the disadvantages are myriad. You cannot aggregate them in any meaningful way, they put a real burden on the client library (which should always have as little overhead as possible), and the quantiles have to be configured before deploying, with any change requiring a redeploy. It’s because of this that summaries are very rarely used.

For more discussion about histograms vs summaries, please refer to the awesome official documentation.

Client libraries

There are many client libraries for different languages. You can take a look at the official list, which lists official supported languages by the core Prometheus team, as well as various third party libraries.

The aim of the client libraries is to ease the implementation of metrics in your project. Take a look at the Python client library - there are some useful examples that will help you get started quickly.

Recording & alerting rules

Sometimes you need to create a really complicated query and watch it in Grafana. This might happen when you aggregate over a lot of source metrics (e.g. summing every container’s CPU usage). If such a query takes too long to execute and you want to graph it, recording rules can precompute it for you. We store these together with alerts in the rules/ subdirectory in a Git repository - you just need to create a diff and merge it afterwards. Recording rules execute on the Prometheus server periodically and store the results as a new metric that you can use like any other. There are naming conventions you should follow to make it easier for everyone else to use your metrics.

  - alert: SwitchPortDown
    expr: ifOperStatus != ifAdminStatus
    for: 1m
    labels:
      severity: critical
      team: cdn
    annotations:
      description: Port {{ $labels.ifName }} - {{ $labels.ifAlias }} is down, this may result in connectivity loss.
      runbook: SwitchPortDown
      summary: Port is down on {{ $labels.job }}
      title: SwitchPortDown

With alerts, things are a bit different. You prepare a query, and every time the query returns a result, the alert fires. It might not emit a notification immediately - you can control this via the for: directive (for: 1m in the example above), which ensures that the alert only fires if the query keeps returning a result for that long. This is measured for each individual result. Take a look at our existing alerts for more information on naming. You can also get inspired by the “Awesome Prometheus Alerts” list.

When Prometheus creates an alert, it just sits in there unless you configure an Alertmanager. We have a highly-available pair of Alertmanagers. Their job is to handle alerts from different Prometheus instances, group them, de-duplicate them, and route them to the correct destinations. We currently support sending alerts to a Slack channel, an email address, and to OpsGenie, which we use to page on-call team members. You can control which alerts get sent where in the Alertmanager configuration.

Alertmanager is fine, but it tends to get a bit crowded when there are a few alerts firing. That’s why we also deployed Karma, which acts as a front-end to Alertmanager and makes it easier to silence alerts, see how long they’ve been firing, and so on.

Advanced tips & tricks

Set operators, vector matches

I don’t want to just copy and paste, so if you need to filter one set of metrics by those present in another, create a union of two different metrics, or multiply one set of data with another based on only a few labels, take a look at the upstream documentation, which goes into just the right amount of detail on these.

Aggregation over time

Aggregations over time are great for dashboard building. They make it possible to evaluate the min/max/average rate of network transfers over the last 24 hours, 95th quantile of HTTP response time in the past week, and so on. They are quite easy to use, just keep in mind that if you want to use them in Grafana, you have to tick the Instant checkbox that is located under the query. This makes sure that Grafana gets only one current value instead of a set of values within the current time-range specified in Grafana.

predict_linear, changes, label_replace, label_join

  • predict_linear can be used to easily alert on things that will happen in the near future. It fits a straight line through the samples in the range you select and extrapolates the future value from it. You can use that to create an alert when free space on a device will drop below 5% within 12 hours, which should start alerting during the day so you can fix the issue long before going home (and you won’t get paged in the middle of the night). You still need an alert for critically low disk space (i.e. 0.5% or so), but it shouldn’t fire too often thanks to the predict_linear rule.
  • changes tells you how many times the metric changed value in the given timeframe. You can use this to detect flapping services.
  • label_replace helps when you have two very similar metrics and you want to unify your view of them. You can see it in action in our SFP Inventory dashboard, where we needed to combine data from Cisco devices gathered over SNMP with data from Arista devices gathered by the Arista EOS exporter. We just renamed the Arista labels with the same values so they match the SNMP ones:
    label_replace(label_replace(sum(arista_sfp_stats) without (sensor), "entPhysicalName", "$1", "device", "(.*)"), "entPhysicalDescr", "$1", "mediaType", "(.*)")
  • label_join is similar to label_replace, but joins multiple labels together to create a new label. You can stack these functions after each other to create your desired output.
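The idea behind predict_linear can be sketched as a least-squares fit over made-up samples (the real function works on a range vector and projects from the evaluation time):

```python
# Made-up (timestamp_seconds, free_bytes) samples: losing ~10 GB per hour.
samples = [(0, 100e9), (3600, 90e9), (7200, 80e9)]

def predict_linear(samples, seconds_ahead):
    """Least-squares line through the samples, projected forward from
    the last sample's timestamp. A simplification of the real function."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept

# 12 hours ahead we'd be ~120 GB below the current 80 GB, i.e. roughly -40e9 -
# a negative prediction is exactly the condition you'd alert on early.
print(predict_linear(samples, 12 * 3600))
```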

While label_replace and label_join are fine for some ad-hoc metrics, they should be used as a last resort - if you can, you should fix the source data to be consistent everywhere. If you cannot do that, there’s always relabeling, which we’ll cover in the next section.

relabel_configs, metric_relabel_configs

Both of these are configured per-job. The big difference between them is that relabel_configs are applied before scraping, so you can use them to ‘redirect’ the scrape to a different target - e.g. ping a set of hosts via the Blackbox exporter. On the other hand, metric_relabel_configs post-process the metrics that come from an exporter. You can drop some labels or even whole metrics, change their names, and apply many other mutations. Just don’t overuse this feature - it has to be executed on every scrape, so if at all possible, prefer fixing the exporter over relying on heavy relabeling.
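To make this concrete, here is a hypothetical metric_relabel_configs snippet (the job name, target, and metric chosen are made up for illustration) that drops an unneeded metric and copies one label into another:

```yaml
scrape_configs:
  - job_name: example            # hypothetical job name
    static_configs:
      - targets: ['localhost:9100']
    metric_relabel_configs:
      # Drop a metric we don't need, saving storage space.
      - source_labels: [__name__]
        regex: node_scrape_collector_duration_seconds
        action: drop
      # Copy the value of the label "device" into "entPhysicalName".
      - source_labels: [device]
        target_label: entPhysicalName
        action: replace
```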

Summary

We hope you found this summary useful. If you feel we missed something critical, would like to know more about Prometheus or something else we do, or possibly work with us, feel free to contact us (contact details are below).

Please check the original version of this article at