Adolfo Ochagavía

Continuous benchmarking for rustls

Last December, I completed a half-year project to develop a continuous benchmarking system for the popular rustls library. My work was financed by ISRG, the makers of Let’s Encrypt, who are interested in rustls as a memory-safe alternative to OpenSSL. The thing is, replacing OpenSSL is only realistic if you offer at least on-par performance. But how do you achieve that? What do you measure to ensure performance keeps improving and to avoid regressions?

Having been around the Rust community for a long time, I was immediately reminded of the heroic efforts to speed up the Rust compiler over the past years. In addition to coming up with suitable benchmarks, the compiler developers set up a system to automatically benchmark pull requests and report actionable results (see this article if you are curious). The reports help maintainers make informed decisions about performance when reviewing PRs (e.g. asking for changes because of a performance regression, or confirming that an optimization actually paid off). The system works analogously to automated testing: it helps maintainers gain confidence in changes to the code. There is an additional challenge, though, because performance measurements tend to be very noisy.

With the above in mind, this is how I summarized our objective in rustls’ issue tracker (issue 1385):

It would be very useful to have automated and accurate feedback on a PR’s performance impact compared to the main branch. It should be automated, to ensure it is always used, and it should be accurate, to ensure it is actionable (i.e. too much noise would train reviewers to ignore the information). The approach used by rustc [the Rust compiler] is a good example to follow, though its development required a daunting amount of work.

After some initial research, I developed a design and discussed it with Nick Nethercote and Jakub Beránek. Both have been heavily involved in the development of the benchmarking setup for the Rust compiler, so I very much wanted to pick their brains before moving forward. Armed with their feedback and encouragement, I set out to create a somewhat similar system for rustls… and it worked! It has been live for a few months already.

Trophy case

Before going into the design itself, I can’t pass up the opportunity to show our current “trophy case”. These are examples of how the benchmarking system is already helping drive the development of rustls:

High-level overview and source code

If you are feeling adventurous, you can follow the step-by-step development of the benchmarking setup through these four issues in the issue tracker (and their associated pull requests). That’s asking a lot, so below is a summary of the final design for the rest of us:

  1. Hardware: the benchmarks run on a bare-metal server at OVHcloud, configured in a way that reduces variability of the results.
  2. Scenarios: we exercise bulk data transfers and handshakes (full and resumed¹), using code that has been carefully tuned to be as deterministic as possible (a concrete sketch follows this list).
  3. Metrics: we measure executed CPU instructions and wall-clock time (the former because of its stability, the latter because it is the metric end users care about).
  4. Reporting: once a benchmark run completes, its respective pull request gets a comment showing an overview of the results, highlighting any significant changes to draw the reviewer’s attention (here is an example). Cachegrind diffs are also available to aid in identifying the source of any performance difference.
  5. Tracking: measured performance is tracked over time for each scenario, to automatically derive a significance threshold based on how noisy its results are. This threshold is used during reporting to determine whether a result should be highlighted.
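
To make item 2 a bit more concrete, here is what “deterministic” looks like in practice: the handshake is driven entirely in memory, with TLS records shuttled between a ClientConnection and a ServerConnection through a plain byte buffer, so there are no sockets, no threads and no timing-dependent behavior. The following is a simplified sketch rather than the actual ci-bench code; it leaves out building the client and server configs (certificates, resumption settings, and so on), whose exact API depends on the rustls version:

```rust
use std::sync::Arc;

use rustls::{ClientConfig, ClientConnection, ServerConfig, ServerConnection};

/// Drive a full handshake entirely in memory, moving TLS records between the
/// client and the server through a reusable byte buffer (sketch, not the real
/// ci-bench code).
fn handshake(client_config: Arc<ClientConfig>, server_config: Arc<ServerConfig>) {
    let server_name = "localhost".try_into().unwrap();
    let mut client = ClientConnection::new(client_config, server_name).unwrap();
    let mut server = ServerConnection::new(server_config).unwrap();

    let mut buf = Vec::with_capacity(16 * 1024);
    while client.is_handshaking() || server.is_handshaking() {
        // Flush everything the client wants to send, then feed it to the server.
        while client.wants_write() {
            client.write_tls(&mut buf).unwrap();
        }
        let mut incoming = buf.as_slice();
        while !incoming.is_empty() {
            server.read_tls(&mut incoming).unwrap();
            server.process_new_packets().unwrap();
        }
        buf.clear();

        // Same thing in the other direction.
        while server.wants_write() {
            server.write_tls(&mut buf).unwrap();
        }
        let mut incoming = buf.as_slice();
        while !incoming.is_empty() {
            client.read_tls(&mut incoming).unwrap();
            client.process_new_packets().unwrap();
        }
        buf.clear();
    }
}
```

The bulk transfer scenarios follow the same in-memory pattern: once the handshake is done, application data is pumped through the connections in a loop, with no network or disk involved.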

For the curious, the code for each benchmarked scenario is in the main rustls repository, under ci-bench. The code for the application that coordinates benchmark runs and integrates with GitHub lives in its own repository.
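
That coordinating application is also where the significance threshold from item 5 comes from. The exact derivation is not important here, but the idea can be illustrated with something along these lines (illustrative only, not the real algorithm; recent_changes is a hypothetical input holding the relative change of each historical run against its predecessor):

```rust
/// Illustrative sketch: derive a per-scenario significance threshold from the
/// noise observed in that scenario's recent history. `recent_changes` holds
/// relative changes between consecutive historical runs (e.g. 0.004 for +0.4%).
fn significance_threshold(recent_changes: &[f64]) -> f64 {
    // Treat the scenario's own historical jitter as its noise level, with a
    // floor so that very quiet scenarios don't end up with an unrealistically
    // tight threshold.
    let noise = recent_changes
        .iter()
        .map(|change| change.abs())
        .fold(0.0_f64, f64::max);
    noise.max(0.002)
}

/// A result is only highlighted in the PR comment if it exceeds the threshold.
fn is_significant(change: f64, threshold: f64) -> bool {
    change.abs() > threshold
}
```

The practical effect is that noisier scenarios need a larger change before the report flags them, which is what keeps the PR comments actionable.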

What about OpenSSL?

The continuous benchmarking system described above is ideal for tracking performance differences between versions of rustls, but it cannot be used to compare against OpenSSL². Still, I did benchmark rustls against OpenSSL using a different method (see this post for details). The results show that rustls is faster and less memory-hungry in many scenarios, but there are also quite a few areas where it falls behind OpenSSL (not for long, hopefully!).

Aside: shoutout to cachegrind

When developing the continuous benchmarks, one of the biggest challenges was making them as deterministic as possible. The cachegrind tool was immensely valuable for that purpose, because it counts executed CPU instructions and lets you diff the results of two runs. That way you can see exactly which functions had a different instruction count, which helps pinpoint the sources of non-determinism. Some of them were obvious (e.g. a randomized hash map), others were tricky to find (e.g. non-deterministic buffer growth). Thanks for this marvellous piece of software! It made me feel like a wizard.
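
For the curious, wiring cachegrind into a benchmark runner is less exotic than it sounds. Here is a hedged sketch (again, not the actual ci-bench code): it runs a scenario binary under cachegrind and reads the instruction count from the output file, assuming cachegrind’s usual summary: line and a hypothetical bench_bin/scenario naming scheme:

```rust
use std::process::Command;

/// Run one benchmark scenario under cachegrind and return the number of
/// executed CPU instructions (the `Ir` event). Assumes valgrind is installed;
/// `bench_bin` and `scenario` are hypothetical names used for illustration.
fn count_instructions(bench_bin: &str, scenario: &str) -> u64 {
    let out_file = format!("cachegrind.out.{scenario}");
    let status = Command::new("valgrind")
        .arg("--tool=cachegrind")
        .arg("--cache-sim=no") // we only care about instruction counts
        .arg(format!("--cachegrind-out-file={out_file}"))
        .arg(bench_bin)
        .arg(scenario)
        .status()
        .expect("failed to spawn valgrind");
    assert!(status.success());

    // The output file ends with a line like `summary: 123456789`; with cache
    // simulation disabled it contains just the instruction count.
    let contents = std::fs::read_to_string(&out_file).unwrap();
    let summary = contents
        .lines()
        .find(|line| line.starts_with("summary:"))
        .expect("cachegrind output should contain a summary line");
    summary.trim_start_matches("summary:").trim().parse().unwrap()
}
```

Two such output files can then be compared with the cg_annotate/cg_diff tools that ship with valgrind, which is how the sources of non-determinism mentioned above were hunted down.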

Parting words

This was one of those contracts you feel afraid to accept because they are out of your comfort zone, yet end up taking anyway in the hope that you’ll figure things out. Fortunately, I was able to deliver the desired results while learning a lot. It even got me a glowing recommendation on LinkedIn from one of the founders of Let’s Encrypt, which to me is a true honor. A great way to close the year 2023!


  1. It is important to test both from-scratch (or full) and resumed handshakes, because the performance characteristics of the two are very different. ↩︎

  2. For one, CPU instruction counts are an unsuitable metric when comparing totally different codebases. Using the secondary wall-clock time metric is not an option either, because the scenarios are tweaked for determinism and to detect relative variations in performance, not to achieve the maximum possible throughput. ↩︎