Adolfo Ochagavía

The birth of a package manager

Since my time at the university, pursuing a Computer Science degree, I have always been fascinated by programming languages and the tooling around them: compilers, IDEs, package managers, etc. Eventually, that got me involved as a hobbyist in the development of the Rust compiler and rust-analyzer, but I never got the chance to work professionally on programming language tooling… until two months ago! In January, the nice folks at prefix.dev asked me to help them develop the rattler package manager, and there is lots to tell about what we have achieved since then, so buckle up!

Ehm… what is rattler?

Good question! Here is what the official announcement has to say:

[Rattler is] an open-source Rust library for working with the conda ecosystem.

If you are like me and have never heard of “the conda ecosystem” before, this description might leave you with more questions than you already had. The conda rabbit hole is deep, but we can get quite far with an oversimplification: the conda community maintains a repository of software packages [1]. Given a set of dependencies, rattler determines which exact versions need to be installed, and then installs them in a virtual environment. You can also use it as a CLI, as shown below [2], where we see rattler installing the cowpy package to a virtual environment, including a suitable version of Python that is actually able to run cowpy (note: cargo run --release is running the rattler CLI):

Technical challenges

Working on rattler, and probably on any package manager, brings quite a few interesting challenges. Let’s have a look at the two most relevant ones!

Dependency resolution

As a user of a package manager, you want to specify the names of the packages you are interested in, and maybe some additional constraints, like which versions are allowed. For instance, if you want to install numpy but are required to use Python 3.7, your package manager should take that into account and tell you to use numpy version 1.21.5 (instead of version 1.24.2, which is the newest one at the time of this writing, but no longer supports Python 3.7).
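To make the numpy example concrete, here is a minimal sketch of the simplest possible resolution step: picking the newest version that is compatible with a pinned Python. The index contents and the names `NUMPY_INDEX` and `newest_compatible` are made up for illustration, and the Python bounds are assumptions rather than data read from a real channel. This is not how rattler does it.

```python
# Toy index: each numpy version lists the Python range it supports.
# The bounds are illustrative assumptions, not data from a real channel.
NUMPY_INDEX = {
    "1.21.5": {"min_python": (3, 7), "max_python_exclusive": (3, 11)},
    "1.24.2": {"min_python": (3, 8), "max_python_exclusive": (3, 12)},
}


def parse_version(text):
    """Turn "1.21.5" into (1, 21, 5) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def newest_compatible(index, python_version):
    """Return the newest version whose Python range contains python_version."""
    python = parse_version(python_version)
    candidates = [
        version
        for version, bounds in index.items()
        if bounds["min_python"] <= python < bounds["max_python_exclusive"]
    ]
    if not candidates:
        return None  # nothing in the index works with this Python
    return max(candidates, key=parse_version)


print(newest_compatible(NUMPY_INDEX, "3.7"))
print(newest_compatible(NUMPY_INDEX, "3.10"))
```

A real resolver has to do this for every package at once, while also respecting the constraints that packages place on each other, which is where things get hard.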

In the conda ecosystem, every package (e.g. numpy) has multiple versions (e.g. 1.21.5, 1.24.2, etc). But it doesn’t end there! Even for a particular version, there might be more than one build. Python libraries such as PyTorch, for instance, provide different builds depending on which GPU you have (e.g. does it support CUDA?). This and other factors make it complex to resolve dependencies.

Fortunately, the problem of dependency resolution has been studied for a while, and there are production-grade open source solvers suited for the task. We are currently using a fork of libsolv, which relies on the technique of SAT solving. It is not perfect (software never is), but gets the job done.
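If you have never seen dependency resolution phrased as a SAT problem, the following sketch may help. Every candidate (package plus version) becomes a boolean variable, and every requirement becomes a clause that must hold. The variable names, the clauses and the brute-force search are all illustrative assumptions: libsolv uses far more sophisticated techniques and does not enumerate assignments like this.

```python
from itertools import product

# One boolean variable per candidate: True means "install it".
VARIABLES = ["numpy-1.21.5", "numpy-1.24.2", "python-3.7", "python-3.11"]

# Each clause is a list of (variable, wanted_value) pairs; a clause holds
# when at least one of its pairs matches the assignment.
CLAUSES = [
    [("numpy-1.21.5", True), ("numpy-1.24.2", True)],    # the user wants numpy
    [("python-3.7", True)],                              # the user pins Python 3.7
    [("numpy-1.21.5", False), ("numpy-1.24.2", False)],  # at most one numpy
    [("python-3.7", False), ("python-3.11", False)],     # at most one python
    [("numpy-1.21.5", False), ("python-3.7", True)],     # toy rule: 1.21.5 needs 3.7
    [("numpy-1.24.2", False), ("python-3.11", True)],    # toy rule: 1.24.2 needs 3.11
]


def satisfies(assignment, clauses):
    """Check that every clause has at least one matching pair."""
    return all(
        any(assignment[name] == wanted for name, wanted in clause)
        for clause in clauses
    )


def solve(variables, clauses):
    """Try every assignment and return the sets of packages that work."""
    solutions = []
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if satisfies(assignment, clauses):
            solutions.append({name for name, on in assignment.items() if on})
    return solutions


print(solve(VARIABLES, CLAUSES))
```

With four variables there are only 16 assignments to try. A real package index has hundreds of thousands of candidates, which is why a proper solver is needed.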

One interesting avenue of future work is to try to replace libsolv with a solver written in Rust, such as PubGrub. That way we could get rid of a bunch of unsafe code we are using to interface with libsolv through Rust’s FFI.

Performance tuning

Resolving and installing dependencies is a complex process that can take minutes, especially when done for the first time. This is annoying, particularly in the context of CI pipelines, where fast feedback is invaluable. Performance is one of the reasons why rattler is written in Rust. It should be able to set up a working Python environment in a much shorter timeframe than traditional Python-based tools such as miniconda!

Rust gives you pretty decent performance for free, but there is always room for more if you are willing to put in the effort! For instance, I built a prototype to generate Docker images from conda environments, which is very convenient for some scenarios [3]. Another example is @baszalmstra’s PR to sparsely load the package index, inspired by Cargo’s new sparse protocol, shaving seconds off the dependency resolution stage. And there are more performance improvements underway!

Speaking of performance improvements, I also got to build rattler-server, which resolves dependencies upon request, taking around 300 milliseconds instead of the 10 to 20 seconds it usually takes (even when the package index is cached locally). The performance boost is achieved with a clever trick, suggested by @wolfv, which consists of preloading the available dependencies in libsolv and caching the state of the solver in memory [4].
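The general shape of that trick can be sketched in a few lines. The version below is only an illustration of the idea (pay the loading cost once, then reuse the result for later requests). The class `SolverStateCache` and the function `fake_load_state` are hypothetical names, and a plain dictionary stands in for libsolv's state, so this should not be read as rattler-server's actual implementation.

```python
class SolverStateCache:
    """Keeps one preloaded state per (channel, platform) pair in memory."""

    def __init__(self, load_state):
        self._load_state = load_state  # slow function, called on a cache miss only
        self._states = {}
        self.misses = 0

    def get(self, channel, platform):
        key = (channel, platform)
        if key not in self._states:
            self.misses += 1
            self._states[key] = self._load_state(channel, platform)
        return self._states[key]


def fake_load_state(channel, platform):
    # Hypothetical stand-in for downloading the package index and
    # feeding it to the solver, which is the expensive part.
    return {"channel": channel, "platform": platform, "packages": ["numpy", "ncurses"]}


cache = SolverStateCache(fake_load_state)
first = cache.get("conda-forge", "linux-64")   # miss: loads the state
second = cache.get("conda-forge", "linux-64")  # hit: reuses the same object
print(cache.misses)
```

The first request for a channel and platform is slow, and every later request for the same pair skips the loading step entirely.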

More on rattler-server

If you feel like playing with rattler-server yourself, go ahead and clone the repo! All it takes is to run cargo run --release. If you are a Windows user, though, you will need to do this inside WSL, because we are using some libc functions that are otherwise unavailable (just for clarity: rattler is fully cross-platform, but rattler-server is not).

Once the server is running, you can try POSTing the following body to localhost:3000:

{
    "platform": "linux-64",
    "specs": ["numpy"],
    "virtual_packages": ["__unix"],
    "channels": ["conda-forge"]
}
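If you would rather send the request from a script, something along these lines should work using only Python's standard library. The address comes from the text above. The `Content-Type` header and the helper names `build_request` and `solve` are my own assumptions, so adjust them if the server expects something different.

```python
import json
import urllib.request

SERVER_URL = "http://localhost:3000"  # address mentioned in the text


def build_request(specs, platform="linux-64", channels=("conda-forge",),
                  virtual_packages=("__unix",)):
    """Build (but do not send) a POST request with the JSON body shown above."""
    body = {
        "platform": platform,
        "specs": list(specs),
        "virtual_packages": list(virtual_packages),
        "channels": list(channels),
    }
    return urllib.request.Request(
        SERVER_URL,
        data=json.dumps(body).encode("utf-8"),
        # Assumption: the server wants to be told the body is JSON.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def solve(specs):
    """Send the request to a running server and decode the JSON response."""
    with urllib.request.urlopen(build_request(specs)) as response:
        return json.load(response)


# With the server running: solve(["numpy"])
```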

Since this is the first request to the server, it will take between 10 and 15 seconds to download and cache the package index from conda-forge. Future requests should complete within 300 milliseconds. You can see for yourself by POSTing a new request for a different set of dependencies (e.g. installing the famous ncurses C library):

{
    "platform": "linux-64",
    "specs": ["ncurses"],
    "virtual_packages": ["__unix"],
    "channels": ["conda-forge"]
}

The response is too long to include here in its entirety, so below you can see a summarized version of it:

{
  "packages": [
    // Leading packages omitted for brevity...
    {
      "name": "ncurses",
      "version": "6.3",
      "build": "h27087fc_1",
      "build_number": 1,
      "subdir": "linux-64",
      "md5": "4acfc691e64342b9dae57cf2adc63238",
      "sha256": "b801e8cf4b2c9a30bce5616746c6c2a4e36427f045b46d9fc08a4ed40a9f7065",
      "size": 1025992,
      "depends": [
        "libgcc-ng >=10.3.0"
      ],
      "constrains": [],
      "license": "X11 AND BSD-3-Clause",
      "timestamp": 1649338526116,
      "fn": "ncurses-6.3-h27087fc_1.tar.bz2",
      "url": "https://conda.anaconda.org/conda-forge/linux-64/ncurses-6.3-h27087fc_1.tar.bz2",
      "channel": "https://conda.anaconda.org/conda-forge/"
    }
  ]
}

The response comprises a list of packages that satisfy the dependencies you specified in the request. While not visible in the example above, because it is summarized, it is interesting to note that the packages are sorted topologically. With this information, you can write custom tooling to initialize a virtual environment by downloading and installing the packages in order. This is what Outerbounds is doing, for instance, to set up their machine learning infrastructure (props to them, who generously sponsored the development of rattler-server!).
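In case “sorted topologically” does not ring a bell: it means every package appears after the packages it depends on. Here is a small sketch that checks this property and, for comparison, computes such an order with Python's `graphlib` (available since Python 3.9). The package list is a trimmed, partly made-up version of the response above (in particular, the empty `depends` of libgcc-ng is an assumption), and the helper names are hypothetical.

```python
from graphlib import TopologicalSorter


def dependency_names(package):
    """Extract bare names from specs such as "libgcc-ng >=10.3.0"."""
    return [spec.split()[0] for spec in package["depends"]]


def is_topologically_sorted(packages):
    """True if every dependency in the list comes before its dependent."""
    known = {package["name"] for package in packages}
    seen = set()
    for package in packages:
        for dependency in dependency_names(package):
            if dependency in known and dependency not in seen:
                return False
        seen.add(package["name"])
    return True


def install_order(packages):
    """Compute an order in which dependencies are installed first."""
    known = {package["name"] for package in packages}
    graph = {
        package["name"]: set(dependency_names(package)) & known
        for package in packages
    }
    return list(TopologicalSorter(graph).static_order())


# Trimmed example; libgcc-ng's empty dependency list is an assumption.
packages = [
    {"name": "libgcc-ng", "depends": []},
    {"name": "ncurses", "depends": ["libgcc-ng >=10.3.0"]},
]
print(is_topologically_sorted(packages))
print(install_order(list(reversed(packages))))
```

Because the server already returns the packages in such an order, a client does not need to sort anything and can install them one by one as they appear.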

Closing thoughts

Participating in the birth of rattler was an exciting experience! There is obviously still a lot to do, so if you are looking for an open source project to contribute to, this might be your chance. You might, for instance, want to fuzz the rattler_libsolv crate, which uses plenty of unsafe code for FFI (will you earn a place in the trophy case?). There is also a list of issues marked as “good first issue”, if you’d rather contribute with code. Or you could build your own tooling on top of rattler and tell the guys at prefix about it (they have a Discord server).

As for me, this week I started working on my next engagement, sponsored by Stormshield to enhance Quinn (the community-driven QUIC implementation in Rust). I’ll be resurrecting this PR as a first step and, if everything goes well, I will stay around for a few months to improve the library some more. I’ll make sure to write about it when there is more to tell!

In the meantime, if you have any comments, suggestions, ideas, etc. you want to share, feel free to contact me (details are on the Consulting page). You can also discuss on HN.

Less sexy than the above, but not less important, was figuring out how to properly test everything. Since my first approach to Rust, back in 2014, the Rust ecosystem has come a long way, and we currently have a bunch of very useful crates to aid with testing. Here are a few that proved especially useful:


  1. Actually, there are multiple repositories, called channels, each with its own maintainers; you can distribute arbitrary packages through them (binaries, C++ libraries, Python libraries, etc). You can read more about conda and its relationship with rattler in rattler’s readme. Also, the guys at prefix.dev are true conda wizards, so if you have questions you should definitely spam them on their Discord server.

  2. Props to @baszalmstra for the recording!

  3. @tdejager is working on turning the result into a Prefix.dev product, so stay tuned!

  4. In case you are curious, this is the relevant PR enhancing rattler to support keeping around an in-memory representation of libsolv’s state.