Intended audience: Beginner to intermediate
Last February, Roman Fiser, Showmax Head of Infrastructure, gave a presentation at the P2D2 conference - “Monitoring PostgreSQL - monitoring with a modern tech stack”.
We’ve decided to make the presentation free to the public - including all of the Alertmanager rules and Grafana Dashboards that we are currently using in production. And to tease you to continue reading, in an upcoming post, we will take a step beyond the current status and will share our vision of the future monitoring ecosystem.
As presented at P2D2 last year, our monitoring revolved around Prometheus working with PostgreSQL. Prometheus is an open-source solution that implements a highly dimensional data model. We chose to use it because it fundamentally stores all data as time series, the streams of timestamped values belonging to the same metric, and thanks to Prometheus we could implement a monitoring based on metrics with metadata (CPU, RAM, disk IO, disk utilization, etc.).
To be able to address different kinds of data, all of them have to have a name and a set of optional labels that are key/value pairs. Prometheus is written in Go, and, because of that, it consists of only one executable binary that can be run on any supported architecture that you set as your compile target.
We already published the Prometheus introduction where you may learn more. And in the shared p2d2 presentation you may find the main benefits for monitoring, here is a quick recap:
- Functions to filter, change, remove …. metadata while fetching them
- Multiple exporters - expose data via HTTP API
- Effective data fetching:
- Based on intervals measured in seconds
- Million of data points
- Notifications can be reported via Email, Slack, etc.
We set alert manager rules based on four SRE golden signals. The most important metrics which are latency, traffic, saturation and errors.
|Metric||Definition||What we monitor|
|Latency||The time it takes to service a request||pg_stat_statements - Average call time
pg_stat_statements - Maximum call time
|Traffic||A measure of how much demand is being placed on your system||pg_stat_statements - Calls
pg_stat_statements - Returned rows
System IO Statistics (IOPS, traffic)
|Errors||The rate of requests that fail||Rollback / Commit ratio
|Saturation||How “full” your service is||num_backends / max_connections
High IO utilization (iowait, await)
High CPU utilization
Within the logging pipeline we placed Elastic as the search and analytics engine, RabbitMQ is used for log messages delivery, Logstash for server‐side data processing and Kibana for visualizing data with charts and graphs in Elasticsearch. You may see the logging pipeline in detail on the below scheme.
Within the presentation you may find a comprehensive summary of the tech stack we use including:
- Specific examples of Prometheus Alertmanager rules for alerting
- A comprehensive walkthrough of Grafana dashboards
- Details on our logging pipeline, including our ELK stack and a few other