The birth of a package manager
Ever since my university days pursuing a Computer Science degree, I have been fascinated by programming languages and the tooling around them: compilers, IDEs, package managers, etc. That fascination eventually got me involved as a hobbyist in the development of the Rust compiler and rust-analyzer, but I never got the chance to work professionally on programming language tooling… until two months ago! In January, the nice folks at prefix.dev asked me to help them develop the rattler package manager, and there is lots to tell about what we have achieved since then, so buckle up!
Ehm… what is rattler?
Good question! Here is what the official announcement has to say:
[Rattler is] an open-source Rust library for working with the conda ecosystem.
If you are like me and have never heard of “the conda ecosystem” before, this description might leave you with more questions than you already had. The conda rabbit hole is deep, but we can get quite far with an oversimplification: the conda community maintains a repository of software packages[1]. Rattler is able to, given a set of dependencies, determine which exact versions need to be installed, and then proceed to install them in a virtual environment. You can also use it as a CLI, as shown below[2], where we see rattler installing the cowpy package to a virtual environment, including a suitable version of python that is actually able to run cowpy (note: cargo run --release is running the rattler CLI):
Technical challenges
Working on rattler, and probably on any package manager, puts quite a few interesting challenges in one’s path. Let’s have a look at the two most relevant ones!
Dependency resolution
As a user of a package manager, you want to specify the names of the packages you are interested in, and maybe some additional constraints, like which versions are allowed. For instance, if you want to install numpy but are required to use Python 3.7, your package manager should take that into account and tell you to use numpy version 1.21.5 (instead of version 1.24.2, which is the newest one at the time of this writing, but no longer supports Python 3.7).
In the conda ecosystem, every package (e.g. numpy) has multiple versions (e.g. 1.21.5, 1.24.2, etc). But it doesn’t end there! Even for a particular version, there might be more than one build. Python libraries such as PyTorch, for instance, provide different builds depending on which GPU you have (e.g. does it support CUDA?). This and other factors make it complex to resolve dependencies.
Fortunately, the problem of dependency resolution has been studied for a while, and there are production-grade open source solvers suited for the task. We are currently using a fork of libsolv, which relies on the technique of SAT solving. It is not perfect (software never is), but gets the job done.
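To give a flavour of what SAT solving means in this context, here is a toy sketch of mine (not libsolv’s actual encoding, which is far more sophisticated): each candidate package build becomes a boolean variable, each requirement or conflict becomes a clause, and the solver searches for an assignment that satisfies every clause. With only four variables, brute force does the trick:

// Toy SAT encoding of the numpy / Python 3.7 example above. Purely
// illustrative; libsolv builds its rules very differently.
const VARS: [&str; 4] = ["numpy-1.21.5", "numpy-1.24.2", "python-3.7", "python-3.8"];

// A literal is (variable index, required value); a clause is a disjunction of literals.
type Clause = Vec<(usize, bool)>;

fn main() {
    let clauses: Vec<Clause> = vec![
        vec![(0, true), (1, true)],             // the user asked for numpy: install some version of it
        vec![(0, false), (1, false)],           // ...but never two versions at once
        vec![(2, false), (3, false)],           // same for python
        vec![(1, false), (3, true)],            // numpy 1.24.2 requires python >= 3.8
        vec![(0, false), (2, true), (3, true)], // numpy 1.21.5 requires some python
        vec![(2, true)],                        // the environment is pinned to python 3.7
    ];

    // Brute force all 2^4 assignments and print the ones that satisfy every clause.
    for bits in 0u32..(1u32 << VARS.len()) {
        let value = |i: usize| (bits & (1u32 << i)) != 0;
        let satisfied = clauses
            .iter()
            .all(|clause| clause.iter().any(|&(var, polarity)| value(var) == polarity));
        if satisfied {
            let installed: Vec<&str> = (0..VARS.len()).filter(|&i| value(i)).map(|i| VARS[i]).collect();
            println!("solution: {installed:?}");
        }
    }
}

Running this prints a single solution, numpy 1.21.5 together with Python 3.7, which is exactly the outcome described above.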
One interesting avenue of future work is to try to replace libsolv with a solver written in Rust, such as PubGrub. That way we could get rid of a bunch of unsafe code we are using to interface with libsolv through Rust’s FFI.
Performance tuning
Resolving and installing dependencies is a complex process that can take minutes, especially when done for the first time. This is annoying, particularly in the context of CI pipelines, where fast feedback is invaluable. Performance is one of the reasons why rattler is written in Rust: it should be able to set up a working Python environment in a much shorter timeframe than traditional Python-based tools such as miniconda!
Rust gives you pretty decent performance for free, but there is always room for more if you are willing to put in the effort! For instance, I built a prototype to generate Docker images from conda environments, which is very convenient for some scenarios[3]. Another example is @baszalmstra’s PR to sparsely load the package index, inspired by Cargo’s new sparse protocol, shaving off seconds in the dependency resolution stage. And there are more performance improvements underway!
Speaking of performance improvements, I also got to build rattler-server, which resolves dependencies upon request, taking around 300 milliseconds instead of the 10 to 20 seconds it usually takes (even when the package index is cached locally). The performance boost is achieved with a clever trick, suggested by @wolfv, which consists of preloading the available dependencies in libsolv and caching the state of the solver in-memory[4].
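To illustrate the general shape of that trick, here is a sketch I put together (hypothetical types and names, not rattler-server’s actual code): the expensive-to-build package index is kept in an in-memory map keyed by channel and platform, so warm requests skip the slow loading step entirely.

use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::RwLock;

// Hypothetical stand-in for the expensive-to-build solver input (the parsed
// package index, preloaded so the solver can start immediately).
struct PreloadedIndex;

async fn download_and_parse(_channel: &str, _platform: &str) -> PreloadedIndex {
    // The slow, cold path: download and parse the package index for the
    // given channel/platform combination.
    PreloadedIndex
}

// Illustrative cache, keyed by (channel, platform).
#[derive(Clone, Default)]
struct IndexCache {
    inner: Arc<RwLock<HashMap<(String, String), Arc<PreloadedIndex>>>>,
}

impl IndexCache {
    async fn get_or_load(&self, channel: &str, platform: &str) -> Arc<PreloadedIndex> {
        let key = (channel.to_owned(), platform.to_owned());

        // Warm path: the index is already in memory, solving can start right away.
        if let Some(cached) = self.inner.read().await.get(&key) {
            return cached.clone();
        }

        // Cold path: load the index once and keep it around for future requests.
        let loaded = Arc::new(download_and_parse(channel, platform).await);
        self.inner.write().await.insert(key, loaded.clone());
        loaded
    }
}

#[tokio::main]
async fn main() {
    let cache = IndexCache::default();
    let _cold = cache.get_or_load("conda-forge", "linux-64").await; // slow: downloads and parses
    let _warm = cache.get_or_load("conda-forge", "linux-64").await; // fast: served from memory
}

The real trick additionally keeps libsolv’s own solver state alive between requests (see footnote [4]); the sketch only shows the caching half of the idea.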
More on rattler-server
If you feel like playing with rattler-server yourself, go ahead and clone the repo! All it takes is to run cargo run --release. If you are a Windows user, though, you will need to do this inside WSL, because we are using some libc functions that are otherwise unavailable (just for clarity: rattler is fully cross-platform, but rattler-server is not).
Once the server is running, you can try POSTing the following body to localhost:3000:
{
  "platform": "linux-64",
  "specs": ["numpy"],
  "virtual_packages": ["__unix"],
  "channels": ["conda-forge"]
}
Since this is the first request to the server, it will take between 10 and 15 seconds to download and cache the package index from conda-forge. Future requests should complete within 300 milliseconds. You can see for yourself by POSTing a new request for a different set of dependencies (e.g. installing the famous ncurses C library):
{
  "platform": "linux-64",
  "specs": ["ncurses"],
  "virtual_packages": ["__unix"],
  "channels": ["conda-forge"]
}
The response is too long to include here in its entirety, so below you can see a summarized version of it:
{
  "packages": [
    // Leading packages omitted for brevity...
    {
      "name": "ncurses",
      "version": "6.3",
      "build": "h27087fc_1",
      "build_number": 1,
      "subdir": "linux-64",
      "md5": "4acfc691e64342b9dae57cf2adc63238",
      "sha256": "b801e8cf4b2c9a30bce5616746c6c2a4e36427f045b46d9fc08a4ed40a9f7065",
      "size": 1025992,
      "depends": [
        "libgcc-ng >=10.3.0"
      ],
      "constrains": [],
      "license": "X11 AND BSD-3-Clause",
      "timestamp": 1649338526116,
      "fn": "ncurses-6.3-h27087fc_1.tar.bz2",
      "url": "https://conda.anaconda.org/conda-forge/linux-64/ncurses-6.3-h27087fc_1.tar.bz2",
      "channel": "https://conda.anaconda.org/conda-forge/"
    }
  ]
}
The response comprises a list of packages that satisfy the dependencies you specified in the request. While not visible in the summarized example above, it is interesting to note that the packages are sorted topologically. With this information, you can write custom tooling to initialize a virtual environment by downloading and installing the packages in order. This is what Outerbounds is doing, for instance, to set up their machine learning infrastructure (props to them for generously sponsoring the development of rattler-server!)
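To make that concrete, here is a rough sketch of what such custom tooling could look like; the struct and field names are my own, the fields mirror the response above, the JSON is trimmed to a single package, and the actual download/extract step is left as a comment:

use serde::Deserialize;

// Only the fields this sketch needs; the real response contains more (see above).
#[derive(Debug, Deserialize)]
struct SolvedEnvironment {
    packages: Vec<SolvedPackage>,
}

#[derive(Debug, Deserialize)]
struct SolvedPackage {
    name: String,
    version: String,
    url: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // In real tooling this would be the body returned by rattler-server;
    // trimmed to a single package here for brevity.
    let response = r#"{
        "packages": [
            {
                "name": "ncurses",
                "version": "6.3",
                "url": "https://conda.anaconda.org/conda-forge/linux-64/ncurses-6.3-h27087fc_1.tar.bz2"
            }
        ]
    }"#;

    let environment: SolvedEnvironment = serde_json::from_str(response)?;

    // Because the list is topologically sorted, installing the packages in the
    // order they appear guarantees that each package's dependencies are already
    // in place by the time it is installed.
    for package in &environment.packages {
        println!("installing {} {} from {}", package.name, package.version, package.url);
        // ...download and extract the archive here...
    }

    Ok(())
}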
Closing thoughts
Participating in the birth of rattler was an exciting experience! There is obviously still a lot to do, so if you are looking for an open source project to contribute to, this might be your chance. You might, for instance, want to fuzz the rattler_libsolv crate, which uses plenty of unsafe code for FFI (will you earn a place in the trophy case?). There is also a list of issues marked as “good first issue”, if you’d rather contribute code. Or you could build your own tooling on top of rattler and tell the guys at prefix about it (they have a Discord server).
As for myself, this week I started working on my next engagement, sponsored by Stormshield to enhance Quinn (the community-driven QUIC implementation in Rust). I’ll be resurrecting this PR as a first step, and if everything goes well I’ll stay around for a few months to improve the library some more. I’ll make sure to write about it when there is more to tell!
In the meantime, if you have any comments, suggestions, ideas, etc. you want to share, feel free to contact me (details are in the Consulting page). You can also discuss on HN.
Bonus track: testing-related crates
Less sexy than the above, but no less important, was figuring out how to properly test everything. Since my first encounter with Rust, back in 2014, the Rust ecosystem has come a long way, and we currently have a bunch of very useful crates to aid with testing. Here are a few that proved especially useful:
- insta: makes it a breeze to add snapshot tests (i.e. tests that assert values against a stored reference value).
- rstest: provides handy macros to write tests more easily. I found the combination of #[rstest] and #[case] especially useful to create parameterized tests (see the sketch after this list).
- testcontainers: facilitates integration testing by spinning up Docker containers and removing them afterwards.
- mockito: generates HTTP mocks, which you can use to test code that makes requests to HTTP endpoints.
- mock_instant: allows you to test code that uses Instant, without having to resort to sleeping or other dirty tricks.
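As an example of the rstest combination mentioned above, here is a tiny parameterized test; the supports_python_37 helper is made up purely for illustration:

use rstest::rstest;

// Made-up helper, just to have something to test: numpy dropped Python 3.7
// support in its 1.22 release (simplified; real version handling is richer).
fn supports_python_37(numpy_version: &str) -> bool {
    let minor: u32 = numpy_version
        .split('.')
        .nth(1)
        .and_then(|m| m.parse().ok())
        .unwrap_or(0);
    numpy_version.starts_with("1.") && minor < 22
}

// rstest turns every #[case] into its own test, so one function covers
// multiple inputs and expected outcomes.
#[rstest]
#[case("1.21.5", true)]
#[case("1.24.2", false)]
fn numpy_python_37_support(#[case] version: &str, #[case] expected: bool) {
    assert_eq!(supports_python_37(version), expected);
}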
[1] Actually, there are multiple repositories, called channels, each with its own maintainers; you can distribute arbitrary packages through them (binaries, C++ libraries, Python libraries, etc). You can read more about conda and its relationship with rattler in rattler’s readme. Also, the guys at prefix.dev are true conda wizards, so if you have questions you should definitely spam them on their Discord server.
[2] Props to @baszalmstra for the recording!
[3] @tdejager is working on turning the result into a Prefix.dev product, so stay tuned!
[4] In case you are curious, this is the relevant PR enhancing rattler to support keeping around an in-memory representation of libsolv’s state.