eBPF-exporter turns memory errors into metrics
Hardware errors become frequent once an infrastructure grows to hundreds of servers. With disks everything is more or less clear: a disk dies, lags, or increments S.M.A.R.T. counters. Memory errors are more interesting.
EDAC
Memory errors are detected, and some of them corrected, with the help of ECC algorithms. This happens in two situations:
- When data is read from memory.
- During a scrub operation, which is a periodic read of memory followed by an ECC check of the data.
As a result, there are two types of errors: correctable and uncorrectable. In the first case the error is not only detected but also corrected. Unfortunately, this correction is not free: a server that has to correct many errors becomes sluggish, and the probability of an uncorrectable error grows. Hence the task: catch these errors and replace the failing DIMM in time. When the Linux kernel encounters an uncorrectable error, it panics.
ebpf-exporter
I once described how to work with eBPF using an example program that traces the Linux block layer. I have worked a lot with Prometheus, and trying ebpf-exporter, which turns the output of eBPF programs into metrics, had always been a tempting idea. But there was no serious reason to run this tool in production until we realized we could no longer use Graylog2 alerts for memory errors.
The problem is that at scale it is hard to predict the load on the logging system: even a minor bug may produce a flood of messages at any moment. This and some other problems led us to the conclusion that alerting based on unstructured logs is evil. An overloaded logging system may deliver an alert minutes or even hours after the event.
Node-exporter has a memory error collector, but it relies on the old sysfs interface: for it to work, the kernel has to be built with CONFIG_EDAC_LEGACY_SYSFS. It is better not to do that and to find another way.
To find another solution, we have to look at how the EDAC subsystem works. Looking at its source code, one can see the mc_event tracepoint, which fires when the kernel receives information about a memory error. We can use it to turn the number of EDAC errors into a metric and stop parsing logs.
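The fields this tracepoint exposes, which later become metric labels, can be listed straight from its format file. A minimal sketch in Python, assuming tracefs is mounted under /sys/kernel/debug/tracing (on newer systems it may live under /sys/kernel/tracing instead):

#!/usr/bin/env python3
# Print the field list of the ras:mc_event tracepoint: error_type,
# msg, label, error_count, mc_index, the layer indexes and so on.
# Assumption: tracefs is mounted under /sys/kernel/debug/tracing.
FORMAT_FILE = "/sys/kernel/debug/tracing/events/ras/mc_event/format"

with open(FORMAT_FILE) as f:
    print(f.read())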
To collect information from the mc_event tracepoint, rasdaemon was written. We could use it, but its output would still have to be brought into the desired form. Ebpf-exporter solves this problem, and it can be used for many other tasks as well.
No rocket science here: build a test kernel with CONFIG_EDAC_DEBUG, write a proof of concept with bpftrace, inject some errors, and verify that the tracepoint can be used. Then write a bcc-based tool (a sketch is shown below) and adapt it for ebpf-exporter.
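Here is a minimal sketch of such a bcc-based counter, not the production tool, just the idea: it attaches to the ras:mc_event tracepoint and sums the reported errors per memory controller and error type. The string fields of the tracepoint (msg, label), which the real exporter also turns into labels, are left out to keep the example short.

#!/usr/bin/env python3
# Count EDAC errors reported via the ras:mc_event tracepoint,
# keyed by (error_type, mc_index). Simplified sketch: the string
# fields of the tracepoint (msg, label) are not handled here.
from time import sleep
from bcc import BPF

bpf_text = r"""
BPF_HASH(counts, u64, u64);

TRACEPOINT_PROBE(ras, mc_event) {
    // Pack error_type and mc_index into a single key.
    u64 key = ((u64)args->error_type << 8) | args->mc_index;
    // error_count is how many errors this one event represents.
    counts.increment(key, args->error_count);
    return 0;
}
"""

b = BPF(text=bpf_text)
print("Tracing ras:mc_event, hit Ctrl-C to stop")
try:
    sleep(99999999)
except KeyboardInterrupt:
    pass

# error_type values follow enum hw_event_mc_err_type in
# include/linux/edac.h (0 = corrected, 1 = uncorrected, ...).
for k, v in b["counts"].items():
    print("error_type=%d mc_index=%d count=%d"
          % (k.value >> 8, k.value & 0xFF, v.value))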
The future has finally arrived, and memory errors are now metrics:
ebpf_exporter_mc_event_total{
    label="CPU_SrcID#0_MC#0_Chan#1_DIMM#0",
    lower_layer="255", mc_index="0",
    middle_layer="0", msg="memory read error",
    top_layer="1", type="err_corrected"} 2
ebpf_exporter_mc_event_total{
    label="CPU_SrcID#0_MC#0_Chan#1_DIMM#0",
    lower_layer="255", mc_index="0",
    middle_layer="0", msg="memory scrubbing error",
    top_layer="1", type="err_corrected"} 1
What’s next?
eBPF-exporter allows us to collect interesting stats from a server, not only about the software but about the hardware too. The repository contains an example program that counts the CPU cycles and instructions executed. This kind of metric demands knowledge of the Linux kernel and of the CPU architecture, but it is worth it.
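That example is not reproduced here; below is only a rough bcc-based sketch of the same idea, loosely modeled on bcc's llcstat tool: eBPF programs are attached to the hardware cycles and instructions perf events and accumulate per-CPU counts. The sample period of 1,000,000 is an arbitrary choice, and the resulting numbers are approximate (multiples of the sample period).

#!/usr/bin/env python3
# Roughly count CPU cycles and instructions per CPU with eBPF programs
# attached to hardware perf events. A sketch of the idea only, not the
# ebpf_exporter example mentioned above.
from time import sleep
from bcc import BPF, PerfType, PerfHWConfig

bpf_text = r"""
#include <linux/ptrace.h>
#include <uapi/linux/bpf_perf_event.h>

BPF_HASH(cycles, u32, u64);
BPF_HASH(instructions, u32, u64);

// Called once per sample; ctx->sample_period is the number of events
// counted since the previous sample.
int on_cycles(struct bpf_perf_event_data *ctx) {
    u32 cpu = bpf_get_smp_processor_id();
    cycles.increment(cpu, ctx->sample_period);
    return 0;
}

int on_instructions(struct bpf_perf_event_data *ctx) {
    u32 cpu = bpf_get_smp_processor_id();
    instructions.increment(cpu, ctx->sample_period);
    return 0;
}
"""

SAMPLE_PERIOD = 1000000  # arbitrary: one sample per million events

b = BPF(text=bpf_text)
b.attach_perf_event(ev_type=PerfType.HARDWARE,
                    ev_config=PerfHWConfig.CPU_CYCLES,
                    fn_name="on_cycles", sample_period=SAMPLE_PERIOD)
b.attach_perf_event(ev_type=PerfType.HARDWARE,
                    ev_config=PerfHWConfig.INSTRUCTIONS,
                    fn_name="on_instructions", sample_period=SAMPLE_PERIOD)

sleep(10)

for name in ("cycles", "instructions"):
    for k, v in sorted(b[name].items(), key=lambda kv: kv[0].value):
        print("cpu=%d %s=%d (approximate)" % (k.value, name, v.value))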
With such metrics it is possible to measure degradation after an upgrade and to characterize load accurately. And because eBPF is highly efficient, during root cause analysis we can get information about the execution of specific kernel functions on production servers without a significant impact on the workload.