Adolfo Ochagavía

Continuous benchmarking for rustls

Last December, I completed a half-year project to develop a continuous benchmarking system for the popular rustls library. My work was financed by ISRG, the makers of Let’s Encrypt, who are interested in rustls as a memory-safe alternative to OpenSSL. The thing is, replacing OpenSSL is only realistic if you offer at least on-par performance. But how do you achieve that? What do you measure to ensure performance keeps improving, and how do you avoid regressions?

Having been around in the Rust community for a long time, I was immediately reminded of the heroic efforts to speed up the Rust compiler in recent years. Besides coming up with suitable benchmarks, the compiler developers set up a system to automatically benchmark pull requests and report actionable results (see this article if you are curious). The reports help maintainers make informed decisions regarding performance when reviewing PRs (e.g. asking for changes because of a performance regression, or confirming that an optimization actually paid off). The system works in a way analogous to automated testing: it helps maintainers gain confidence in changes to the code. There is an additional challenge, though, because performance measurements tend to be very noisy.

With the above in mind, this is how I summarized our objective in rustls’ issue tracker (issue 1385):

It would be very useful to have automated and accurate feedback on a PR’s performance impact compared to the main branch. It should be automated, to ensure it is always used, and it should be accurate, to ensure it is actionable (i.e. too much noise would train reviewers to ignore the information). The approach used by rustc [the Rust compiler] is a good example to follow, though its development required a daunting amount of work.

After some initial research, I developed a design and discussed it with Nick Nethercote and Jakub Beránek. Both have been heavily involved in the development of the benchmarking setup for the Rust compiler, so I very much wanted to pick their brains before moving forward. Armed with their feedback and encouragement, I set out to create a somewhat similar system for rustls… and it worked! It has been live for a few months already.

Trophy case

Before going into the design itself, I can’t pass up the opportunity to show our current “trophy case”. These are examples of how the benchmarking system is already helping drive the development of rustls:

High-level overview and source code

If you are feeling adventurous, you can follow the step-by-step development of the benchmarking setup through these four issues in the issue tracker (and their associated pull requests). That’s asking a lot, so below is a summary of the final design for the rest of us:

  1. Hardware: the benchmarks run on a bare-metal server at OVHcloud, configured in a way that reduces variability of the results.
  2. Scenarios: we exercise the code for bulk data transfers and handshakes (full and resumed1), with code that has been carefully tuned to be as deterministic as possible.
  3. Metrics: we measure executed CPU instructions and wall-clock time (the former because of its stability, the latter because it is the metric end users care about).
  4. Reporting: once a benchmark run completes, its respective pull request gets a comment showing an overview of the results, highlighting any significant changes to draw the reviewer’s attention (here is an example). Cachegrind diffs are also available to aid in identifying the source of any performance difference.
  5. Tracking: each scenario keeps track of measured performance over time, to automatically derive a significance threshold based on how noisy the results are. This threshold is used during reporting to determine whether a result should be highlighted.
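To make the tracking step concrete, here is a minimal Rust sketch of the idea behind a noise-derived significance threshold. The statistic used (largest run-to-run change in the history, plus some slack) and all function names are invented for illustration; they are not rustls’ actual implementation.

```rust
/// Relative change between two measurements, e.g. 0.02 for +2%.
fn relative_change(old: f64, new: f64) -> f64 {
    (new - old) / old
}

/// Derive a significance threshold from the noise in historical results:
/// anything larger than the biggest change observed between consecutive
/// historical runs (plus 20% slack) is considered significant.
fn significance_threshold(history: &[f64]) -> f64 {
    let max_noise = history
        .windows(2)
        .map(|w| relative_change(w[0], w[1]).abs())
        .fold(0.0_f64, f64::max);
    max_noise * 1.2
}

fn main() {
    // Instruction counts from previous runs of one scenario (~0.1% noise).
    let history = [1_000_000.0, 1_001_000.0, 999_500.0, 1_000_300.0];
    let threshold = significance_threshold(&history);

    // A new result on a PR: roughly +3% instructions, well above the noise.
    let change = relative_change(1_000_300.0, 1_030_000.0);
    if change.abs() > threshold {
        println!("significant change: {:+.2}%", change * 100.0);
    }
}
```

The appeal of deriving the threshold per scenario is that stable scenarios get tight thresholds while noisy ones get lenient thresholds, so reviewers are only alerted when a change stands out from that scenario’s own noise.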

For the curious, the code for each benchmarked scenario is in the main rustls repository, under ci-bench. The code for the application that coordinates benchmark runs and integrates with GitHub lives in its own repository.

What about OpenSSL?

The continuous benchmarking system described above is ideal for tracking performance differences between versions of rustls, but it cannot be used to compare against OpenSSL2. Still, I did benchmark rustls against OpenSSL using a different method (see this post for details). The results show that in many scenarios rustls is faster and less memory hungry, but there are also areas where it falls behind OpenSSL (not for long, hopefully!).

Aside: shoutout to cachegrind

When developing the continuous benchmarks, one of the biggest challenges was to make them as deterministic as possible. The cachegrind tool was immensely valuable for that purpose, because it allows counting CPU instructions and diffing the results between two runs. That way you can see exactly which functions had a different instruction count, helping identify the source of non-determinism. Some of them were obvious (e.g. a randomized hash map), others were tricky to find (e.g. non-deterministic buffer growth). Thanks for this marvellous piece of software! It made me feel like a wizard.
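Conceptually, the diffing that made cachegrind so useful boils down to comparing per-function instruction counts between two runs and sorting by the size of the difference. The Rust sketch below illustrates just that comparison; the function names and counts are made up (in practice cachegrind gathers the real data from actual executions):

```rust
use std::collections::HashMap;

/// Compare per-function instruction counts from two runs and return the
/// functions whose counts differ, largest absolute difference first:
/// those are the likeliest sources of non-determinism.
fn diff_counts(
    before: &HashMap<&str, u64>,
    after: &HashMap<&str, u64>,
) -> Vec<(String, i64)> {
    let mut diffs: Vec<(String, i64)> = after
        .iter()
        .map(|(func, &n)| {
            let old = *before.get(func).unwrap_or(&0);
            (func.to_string(), n as i64 - old as i64)
        })
        .filter(|(_, d)| *d != 0)
        .collect();
    diffs.sort_by_key(|(_, d)| std::cmp::Reverse(d.abs()));
    diffs
}

fn main() {
    // Two hypothetical runs of the same benchmark: the handshake code is
    // deterministic, but a randomized hash map executes a varying number
    // of instructions and shows up in the diff.
    let before = HashMap::from([("handshake", 500_000u64), ("hash_insert", 12_000)]);
    let after = HashMap::from([("handshake", 500_000u64), ("hash_insert", 14_500)]);
    for (func, delta) in diff_counts(&before, &after) {
        println!("{func}: {delta:+} instructions");
    }
}
```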

Parting words

This was one of those contracts you feel afraid to accept because they are out of your comfort zone, yet end up taking in the hope you’ll figure things out. Fortunately, I was able to deliver the desired results while learning a lot. It even got me a glowing recommendation on LinkedIn by one of the founders of Let’s Encrypt, which to me is a true honor. A great way to close the year 2023!

  1. It is important to test both from-scratch (or full) and resumed handshakes, because the performance characteristics of the two are very different. ↩︎

  2. For one, CPU instruction counts are an unsuitable metric when comparing totally different codebases. Using the secondary wall-clock time metric is not an option either, because the scenarios are tweaked for determinism and to detect relative variations in performance, not to achieve the maximum possible throughput. ↩︎