This post will take you through our journey of debugging a production issue that eventually led us to produce a new ethtool data exporter for Prometheus. We open-sourced this new exporter and it lives in our repository on GitHub. As always, patches, issues, and comments are more than welcome.
A few weeks ago, we experienced intermittent packet loss on some of our servers. We detected it, in most cases, during peak times - which is always a bad sign. The peaks only happened once or twice a week and, worst of all, they were not contained to a single datacenter. We measure loss and latency with an internal tool called prometheus-pinger and visualize the results with a tool called bloodbath (stay tuned, we are working on open-sourcing both as well).
Bloodbath showing network issues between servers
The difficult part about debugging issues like these is that you usually need to debug while the issue manifests itself. If it’s already gone by the time you open your laptop and fire up the VPN, you can usually only look at the logs and metrics and then hang your head and cry.
At first, we thought that the upstream links from South Africa to Germany were filling up. We investigated this possibility thoroughly and asked our upstream provider for their assistance, which was spot-on and timely. They pointed out that they were not seeing any packet loss between their routers, which meant that the culprit was inside our own network. We quickly investigated whether this could be caused by our switches because we were seeing packet loss even on the first hop, which should never happen. We checked all the cables, transceiver details on the switches, and servers, but everything seemed fine.
Right around that time we realized that this had to be an issue on the server side. We suspected that it had something to do with our distribution upgrade (we are using Debian GNU/Linux), which happened a few weeks before the issue appeared. We noticed that when we downloaded data from servers from Germany, which is mostly ingress, our egress went down by almost the same amount. We then used our Prometheus-fu to conjure a graph of combined ingress+egress traffic and, lo and behold, this value seemed capped - ingress traffic was affecting egress traffic.
Graph showing ingress and egress traffic of an affected node
The next time this happened, we were armed and ready. We hacked together an ugly Python script that parsed ethtool output and saved it to a .prom file so that node-exporter could pick it up.
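To make the idea concrete, here is a minimal sketch of such a script. It is not the script we ran in production, only an illustration: the `ethtool_` metric prefix, the function names, and the output path in the usage note are assumptions made up for this example. The `device` label matches the label name that node-exporter uses for network interfaces.

```python
import os
import re
import subprocess
import tempfile


def parse_ethtool_stats(text):
    """Parse `ethtool -S <device>` output into a {name: int} dict."""
    stats = {}
    for line in text.splitlines():
        # Statistic lines look like "     rx_packets: 12345".
        # rpartition splits on the last colon in the line.
        name, sep, value = line.strip().rpartition(":")
        if not sep or not name:
            continue
        try:
            stats[name.strip()] = int(value)
        except ValueError:
            # Skips the "NIC statistics:" header and non-numeric values.
            continue
    return stats


def to_prom_lines(device, stats, prefix="ethtool_"):
    """Render stats in the Prometheus text format (prefix is hypothetical)."""
    lines = []
    for name, value in sorted(stats.items()):
        # Metric names may only contain [a-zA-Z0-9_:], so replace the rest.
        metric = prefix + re.sub(r"[^a-zA-Z0-9_]", "_", name)
        lines.append(f'{metric}{{device="{device}"}} {value}')
    return lines


def write_prom_file(path, lines):
    """Write atomically so a scrape never sees a half-written file."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    os.chmod(tmp, 0o644)  # mkstemp creates the file readable by owner only
    os.replace(tmp, path)


def export(device, path):
    """Run ethtool for one device and write its counters to `path`."""
    result = subprocess.run(
        ["ethtool", "-S", device], capture_output=True, text=True, check=True
    )
    write_prom_file(path, to_prom_lines(device, parse_ethtool_stats(result.stdout)))
```

Calling `export("eth0", "/var/lib/node_exporter/ethtool_eth0.prom")` from cron or a timer would refresh the file, assuming that directory (a made-up path here) is the one node-exporter's textfile collector is configured to read. Counter names differ between drivers, so the resulting metric names will differ between network card vendors as well.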
We checked the ethtool output and confirmed that the network cards were sending pause frames and/or dropping packets in their queues. This was happening because the kernel couldn’t handle any more packets. We then finally found out that we had exactly one CPU fully loaded with softirq, and that this was also the source of the bandwidth capping. All of the softirqs were sent to a single CPU - CPU0.
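A quick way to spot this kind of imbalance is to compare two snapshots of /proc/softirqs, which holds cumulative per-CPU counters since boot. The sketch below is only an illustration with made-up function names, not the tooling we used: it parses the file and reports which CPU handled the largest share of a given softirq type between two readings.

```python
def parse_softirqs(text):
    """Parse /proc/softirqs content into (cpu_names, {softirq: [counts]})."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        table[name.strip()] = [int(count) for count in rest.split()]
    return cpus, table


def busiest_cpu_share(before, after):
    """Return (cpu_index, share) of the CPU with the most new softirqs.

    `before` and `after` are per-CPU counter lists from two snapshots.
    Returns (None, 0.0) when nothing changed between the snapshots.
    """
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total == 0:
        return None, 0.0
    index = max(range(len(deltas)), key=deltas.__getitem__)
    return index, deltas[index] / total
```

Reading the file twice a few seconds apart and passing the `NET_RX` rows to `busiest_cpu_share` gives a share close to 1.0 when a single CPU is handling all of the receive softirqs, which is the pattern described above.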
Graph showing dropped packets / pause frames on two interfaces on an affected node
Further investigation concluded that we had somehow lost the softirq load balancing for network cards from a single vendor after we upgraded distributions. We wrote a script that restores it and a systemd unit file so that the script would run exactly once after reboot. The issue is now completely gone.
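The script itself is not reproduced in this post. As a rough illustration of one common approach, and not necessarily what our script does, the sketch below spreads a network card's interrupts across CPUs round-robin by writing to /proc/irq/<n>/smp_affinity_list. The function names are hypothetical, and matching IRQs by interface name in /proc/interrupts is an assumption that depends on how the driver names its queues.

```python
import re


def nic_irqs(interrupts_text, device):
    """Return IRQ numbers in /proc/interrupts whose line mentions `device`."""
    # Word boundaries keep "eth1" from matching "eth10-TxRx-0".
    pattern = re.compile(rf"\b{re.escape(device)}\b")
    irqs = []
    for line in interrupts_text.splitlines()[1:]:  # skip the CPU header row
        number, sep, rest = line.partition(":")
        # Only numbered IRQs are relevant; rows like NMI or LOC are skipped.
        if sep and number.strip().isdigit() and pattern.search(rest):
            irqs.append(int(number))
    return irqs


def plan_affinity(irqs, cpu_count):
    """Assign each IRQ to one CPU, round-robin over the available CPUs."""
    return {irq: position % cpu_count for position, irq in enumerate(irqs)}


def apply_affinity(plan, proc_irq="/proc/irq"):
    """Write the plan to smp_affinity_list (requires root)."""
    for irq, cpu in plan.items():
        with open(f"{proc_irq}/{irq}/smp_affinity_list", "w") as handle:
            handle.write(str(cpu))
```

A script built on this would read /proc/interrupts, call `plan_affinity(nic_irqs(text, "eth0"), os.cpu_count())` and then `apply_affinity` on the result. Because the settings do not survive a reboot, running it from a systemd unit of `Type=oneshot` fits the "exactly once after reboot" requirement.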
We also found out that some of the information from ethtool might be useful for others. We found a Prometheus exporter for ethtool that already existed, but it had a few shortcomings. For one, it didn’t appear to be maintained anymore - the README file mentioned Ubuntu 14.04 which, while still supported, is pretty dated. We also discovered that it used ‘interface’ as the label for the interface name instead of the regular ‘device’ that node-exporter uses, which makes it needlessly inconsistent. Above all, we wanted to leverage our current infrastructure with node-exporters instead of setting up more scrapes for Prometheus.
We currently run this exporter on all of our CDN servers and plan to use it on every server that has at least a 10 Gbit/s network card. We find the data to be quite useful, but, because the source data is supplied by the device drivers, your mileage may vary. There is also almost no standardization, so if you use multiple network card vendors, you have to examine the data closely to find out what is useful to you and set up your alerts and dashboards accordingly.
We hope you find this new exporter useful. Let us know what you think!