Derivation storing service · cull-os

I've been thinking about this for a bit, but what if we had a thing to store all versions of a certain derivation and its outputs, while making it possible for people to query them with names and versions? This is not so different from the current nix package management commands with added version storage.

But what I had in mind was letting people patch these derivations. By default, if you want to install a package X, which depends on package [email protected], but you have Y version 3.3 on your system, the tool could trivially pull X, rename it to a different drv path (calculated with the new Y version) and patch it. This would be done for any backwards compatible version upgrade, or hash mismatch for when the versions match

RGBCube (Oct 07 2024 at 08:23):

RGBCube (Oct 07 2024 at 13:27):

This would be a completely different thing from Cab (built on top of it; very optional) and I'm thinking of calling it the "Cab Package Registry" (cli tool would be aptly named cpr)

Andreas (Oct 10 2024 at 19:08):

I think Tim might have some comments on that one as well. I'll let him know so that he gets back to you.

Tim DeHerrera (Oct 10 2024 at 20:15):

I have already spent some months working on just such a format, and I think you may appreciate some of its novelties. So far I already have the publishing code and a convenient URI syntax for addressing them. I can open up the codebase to you if you'd like to take a look. Also happen to maybe introduce the concept in an meeting if you prefer

Tim DeHerrera (Oct 10 2024 at 20:17):

However, I'm not sure if what I am working on is exactly what you are looking for here. What I am working on is a sort of decentralized publishing mechanism for Nix (or Nix like, could easily work with Cab) code that would alleviate the growing burden of fetching an ever growing nixpkgs n times, along with some other nice properties, such as efficient and decentralized querying, and I also have plans for a full version resolution scheme for efficient dependency management

Tim DeHerrera (Oct 10 2024 at 20:20):

We also have a private mattermost instance for discussing the project and some other random stuff as well

Tim DeHerrera (Oct 10 2024 at 20:26):

One thing I am particularly proud of is a novel identifcation format which allows us to track a large number of these "atoms" as I call them across repositories without collisions, which will be extremely useful for efficiently caching them and returning results without any evaluation necessary (assuming its been cached).

RGBCube (Oct 11 2024 at 13:29):

This is interesting. Not sure what to make of the PR, but the concept I had in mind was fundamentally incompatible with using the cached derivations in Cab expressions, because every derivation must be produced by an expression itself, which the storing service does not provide.

RGBCube (Oct 11 2024 at 13:31):

Aside: I love the use of Result as Either and Ok, Err as named variants, will definitely use that in the future.

RGBCube (Oct 11 2024 at 13:33):

How do you "cache" the Nix expressions? That's the part I don't fully understand, as my understanding is that Nix always needs Nix code to evaluate derivations. Sure, primitive versions of caching is possible but how do you mitigate fetching nixpkgs _n_ times?

Tim DeHerrera (Oct 11 2024 at 15:46):

Well my approach is a fundamental shift in how we would run a Nix evaluation, based on 10 years of being annoyed at how costly evaluation can be sometimes, and the experience I gained in that time from Nix and software engineering in general.

So far includes two components, a novel Nix module system that actually provides sensible boundaries for Nix code for not much cost (it is actually insanely fast compared to the NixOS module system). This piece makes the Nix code bounded and trvially statically analyzeable. It perpusefully bucks the trend of depending on nixpkgs and being super complex and heavy. The core of it took me two hours to write and I have kept it purposefully small since. Besides performance and predictability, another nice feature is trivial tracing (unlike the monsterous traces produced by the nixos module system). There is still some work to do to integrate it more cleanly with my CLI, and I am almost at that stage.

As for the Rust based cli, there are a few core planned features but right now all I really have (although its pretty solid at this point) is a publish subcommand. This CLI is not meant to be just another Nix wrapper, but instead, a higher level frontend to a Nix like build service (I envision a Guix integration at some point in the future as well). Publishing is somewhat abstracted to allow different storage backends but the inaugural implementation, and probably source of truth for all other backends, is an implementation in pure git.

I use gitoxide to do some non-standard git things in a very efficient and secure manner. Essentially I detect my format (based on a unique file-extension), and create a detached (orphaned) history containing just that "atom" as I call it. Similar to a crate, it contains a manifest and some source files (in a self-contained directory). Since some usecases in Nix may be aided by shared static config, the source dir is optional.

These are then versioned (a semver is required in the manifest) and stored in unique gitrefs under a custom prefix. The contents are simply new git trees (references to already existing blobs). There is nowhere in the code that writes new blobs anywhere, so this essentially ensures that there is no way to "corrupt" the files during publishing, and also makes the format extremely light on gits store, since tree objects are akin to mere references, they are cheap to create, store and crucially, fetch.

I store some additional, non-standard, meta-data in the commit header, such as the commit from which the atom originates, and commit it using a constant timestamp and author information, making the atom commit fully reproducible. Not implemented yet, but I plan to also allow for optionally signing the atom commit with a tag object. This way one can short circuit manual verification if they trust the key, but even so, manually verifying the contents of an atom is trivial.

A ref pointing to the original source commit is also made in the atom's ref prefix, so that as long as an atom exists, so to will the source it came from. Verifying is as trivial as pulling it and checking the objectid of the source tree and manifest blob, if you are ever in doubt. You might be wondering "why bother"? I was trying to think of a system that could be incrementally adopted in a repo as large as nixpkgs, which would also make it more efficient to pull different packages from various points in history (different versions) without having to grab the entire nixpkgs tree n times, which is a persistent, and growing burden, that flakes actually made far worse.

I am close to ready to start working on a full blown version resolver that resolves version constraints for these atom's across repositories, and produces a truly minimal set of dependencies (much unlike our friend the flake), directly from source while also remaining extremely cheap to fetch and totally self-contained in git. The versions themselves are contained in the refspec, so resolving the atoms a repository has does not require fetching any of its contents, which is why I spent a lot of time ensuring I do not break this property when writing the publishing code.

I envision a 3rd critical piece which we have only been brainstorming, and only have a skeleton repo for so far, but it would essentially serve as a much more efficient backend for Nix like evaluators, and serve to decouple evaluation from the caller (our cli binary). This is where my atom id concept (implemented during publishing) comes in to bolster caching. An atom contains a unicode id in the manifest, however, in order to make its identity fully unambiguos and avoid an annoying global namespace (like rust crates suffer from), I also include the concept of a "root" which is used as a key input for a blake3 sum over the atom unicode id.

This "root" key, in the git implementation I just described, is the oldest, parentless commit in the repositories history (the origin of the repository). I decided on this as I was trying to think of a way to unambiguously identify a git repo, without replying on emphermal information like remote names, etc (since those can change but you still have the same code and history underneath).

If the very root of your history changes, you very clearly no longer have quite the same codebase, so to me it seemed like the perfect identifier. This also makes the atom format decentralized, just as git, since you can just publish them to as many remotes as you like. In any case, this ID will be used extensively to track information in this backend we are still designing. We plan to use capt-proto as the exchange format to make it extremely efficient to offload builds and evaluations to multiple instances.

Everything this backend builds or evals will be tracked, by atomid and then also by derivation information which already exists, creating a trivial mapping to the final artifact (if it exists). This is what will (once fully implemented) allow a user to simply call for a package and instantly start to download it, avoiding evaluation entirely (if it was already built).

I arrived at all this not to try to replace or even compete with Nix, but through my observation after all these years that Nix is actually quite good as a low-level build tool. It is a not a high level piece, and we shouldn't try to make it so (flakes), as it is fundamentally working at a different level of abstraction. Users simply don't care about derivations, as useful as they are. Everybody does already know what a version is though, and it is an abstraction we have sort of lost in Nix, pinning everything exactly. This model tries to essentially reintroduce it in a principled manner, while retaining the benefits of Nix and mitgating many of its pains.

So again, this is a backend agnostic system envisioned to provide a proper, user (and developer) level abstraction which makes working with the low-level concept of derivations a breeze, by abstracting the nuisance, creating clean boundaries, reducing reliance on Nix code outright (I envision a plugin system for the cli to generate inputs), and only use it strictly as the DSL it was designed to be, and not the monsterous and growing beast it has become a la the module system, etc.

None of the pieces I have so far are tied to Nix in any way, and that would be an indication of a boundary violation. It should remain abstract enough to be useful for any Nix like tool (and perhaps future usecases I haven't envisioned). In any case, we are going to be going public soon.

If I could give a high-level principle that has motivated this whole thing it would be something like, "nothing is cheaper than static". If you keep as much as possible about the build statically knowable (the manifest, the refs, the atomids in the backend, the evaluation and build history) you don't have to worry about some eval taking an indeterminate amount of time in between you and the thing you want to use, at least not more than once (for the inital build).

Sorry for the novel, but its hard to explain the plan without all the pieces :joy:

Tim DeHerrera (Oct 11 2024 at 15:51):

We plan to also release a whitepaper when we go public, so that I cam formalize all the novel pieces and make it clear our direction

Srid (Oct 11 2024 at 16:02):

Tim DeHerrera (Oct 11 2024 at 16:03):

Srid (Oct 11 2024 at 16:19):

Milestones would also be good to see. In particular, I want to see at what point I can replace flakes+nixpkgs use in real-world projects (while retaining same functionality).

Tim DeHerrera (Oct 11 2024 at 16:23):

Yep, that's what comes next now that the format is stable and the code for it is properly tested. Next piece is integrating my novel Nix module system with the atom format produced by the cli, which should be fairly simple, I am really only missing one piece there that I already have a plan for.

At that point it should be usable, although dependencies would have to be wired up manually, which is a pain. When it would be a truly nobrainer replacement for flakes is what comes after, I will start working on the resolver right after that, and once that's done, I can't imagine any reason why you would want a "flake" again at that point, even without the backend implemented. The resolver is a bit of a technical challenge, but there are some good libraries in the Rust ecosystem to lesson the burden somewhat.

Tim DeHerrera (Oct 11 2024 at 16:26):

Its also worth noting that I take abstraction and boundaries quite seriously after paying the price for over a decade of not having them respected. So my Nix module system is independantly useful inside of a flake if you really wanted to use it, just to help you organize your code.

RGBCube (Oct 11 2024 at 18:30):

RGBCube (Oct 11 2024 at 18:34):

I haven't fully understood the concept but it sounds really similar to how Radicle stores data, at least in the root level. Sharing that could simplify things a ton (and might even remove the need to explicitly store those files when interacting with a Radicle repository). Could you show me the codebase / practical examples on how the tool is used?

RGBCube (Oct 11 2024 at 18:34):

Tim DeHerrera (Oct 11 2024 at 20:08):

Radicle is more of a blockchain thing right? I am aware of it, tested it years ago. I just don't want any exeternal dependencies outside of git. We also talked about how Radicle is a bit redundant just recently, given that git is already decentralized by nature (even if it isn't used that way). I don't know enough about its architecture beyond that to know if there is cross-over though.

I have seen pubgrub as well. I am leaning toward resolvo though, at this point. Haven't started though, so it could change. No worries though, no rush

Tim DeHerrera (Oct 11 2024 at 20:08):

but yeah, I can open the codebase to you. As far as practical examples, that is where I am at now, trying to actually take the format that I have somewhat stabilized now, and put it to actual use

Tim DeHerrera (Oct 11 2024 at 20:09):

but you can use it to publish anything in the format right now, just need to put in some of the nix pieces to make a more "full demo". Also what is your GH handle, I'll add you to the repos

Tim DeHerrera (Oct 11 2024 at 20:15):

Perhaps the thing to emphasize is the "orphaned" and truncated history. And atom only contains the files pertinent to itself, with nothing else from the repo. Also has no history so there is no "fetching the entire damn history" for tools that don't know or care was a shallow clone is.

So with that, you could reference thousands of different package versions across nixpkgs (even if it stays a mono-repo) and never have to fetch the entire tree, only the data pertinent to your actual build (or whatever else you have, deployment, config, etc, etc)

Tim DeHerrera (Oct 11 2024 at 20:39):

Tim DeHerrera (Oct 11 2024 at 20:42):

Tim DeHerrera (Oct 11 2024 at 21:14):

actually one thing I do have implemented already is a more general replacement for flake uris. One that is extensible (one can configure any url shortener they like). So you could do something like:
my-repo::my-atom@^1 and the cli will look in your config file for an "alias" called my-repo and resolve it to whatever you have set, say example.com/work-repo.git. You can also add a fragment after like flake uri:
gh:owner/repo::atom, etc, etc

Tim DeHerrera (Oct 11 2024 at 21:29):

eka is the Rust cli, atom the Nix module system I've been talking about. I made some effort to document the latter, I have a wip branch with some higher level docs as well, but still early. eka is not very well documented at all, however I did just go through and ensure the atom crate is extensivvely documented in code at least, just not much higher level doc outside the clap cli help messages (which are hopefully helpful though)

Gonna start integrating the two right now actually, so there should be more cohesion between the two soonish

RGBCube (Oct 12 2024 at 04:51):

RGBCube (Oct 12 2024 at 04:53):

By limiting the allowed operations of the module system (no import) you can trivially track changes & re-eval only that subset of the tree to re-record the effects it produces

RGBCube (Oct 12 2024 at 04:54):

And this thing is trivially applicable to package evaluation as well, as the client can just assume that the atoms stored in each "branch"(?) are up to date. Updating them is the job of the author, after all

RGBCube (Oct 12 2024 at 04:56):

And since if you don't clean your atom history up, you get ever "derivation" that has ever existed stored in your git repository as atoms. Aha

RGBCube (Oct 12 2024 at 04:58):

But how do you traverse those headless atoms to find the version 3 of let's say root.foo? Assuming that the current version is newer and that older version is never referenced anywhere in the graph that touches master

Tim DeHerrera (Oct 12 2024 at 14:23):

❯ g ls-remote
From ssh://[email protected]/ekala-project/eka
50378b2dace5628160724e9e6e855bc0d062865a        HEAD
ceebaca6d44c4cda555db3fbf687c0604c4818eb        refs/atoms/ひらがな/0.1.0
a87bff5ae43894a158dadf40938c775cb5b62d4b        refs/atoms/ひらがな/_specs/0.1.0
9f17c8c816bd1de6f8aa9c037d1b529212ab2a02        refs/atoms/ひらがな/_srcs/0.1.0

This is how atom refs are stored. Essentially atoms are a way of bringing back versions in a meaningful way. They really should just be immutable (similar to version tags) unless something terrible happens (like a critical CVE) in which case you should maybe just delete the troubled version and release a patch update.

I use the semver crate, so you can also release dev or pre releases as well (anything semver accepts), but the point is, these are resolvable without a fetch (listing references is the only git operation that does not require fetching or cloning, which is the reason I store them this way). So you can list all the available versions of any repository pretty cheaply. The git protocol also includes server side ref filtering, so you can limit your search to just the refs/atoms heading, or even a subset of atoms.

I'm thinking maybe the backend build/eval service should have some sort of federated search protocol as well, perhaps (undecided), so that you can easily search multiple repos with one subcommand (say eka search)

The _specs ref above is a lightweight ref that only contains the manifest and lock. This is for the resolver.

In order to fully traverse the dependency tree, the resolver will have to read the manifests of candidates, so I want to keep those extremely cheap to fetch (in case the source directory gets large)

Tim DeHerrera (Oct 12 2024 at 14:33):

so even though atoms can live under any arbitrary directory in the repo, however many levels deep, they are identified not by their filepath, but by their:

[atom]
id = "my-atom"

This way, the tree of atoms is "flat" for any given repo, simplifying the search as well. For sanity, the source tree is searched before publishing to ensure no two atoms in the same commit have the same id, if they do, the request will fail unless/until you rename one. This is similar to cargo crates in a workspace. If any two have the same name, cargo will complain and not let you do anything.

RGBCube (Oct 12 2024 at 18:15):

That makes a ton of sense. But for the refs you might want to do refs/atoms/.../0.1.0/{atom,spec,src} for a better hierarchy

RGBCube (Oct 12 2024 at 18:17):

If I'm understanding correctly every "module" in this system will have to either have a user assigned atom id or an automatically generated one? How does your module system do it?

Tim DeHerrera (Oct 12 2024 at 18:26):

I did have that structure at one point, I've just been experimenting a bit, I might change it back actually, just need something that is fairly trivial to filter for just the atom ref itself during a search. git has its own semantic for filtering refs that is pretty similar to globbing in a shell, though I guess something like refs/atoms/*/*/atom would do it, so yeah you are probably right.

The only thing set in the manifest is one of these Ids. The other compoent, the root field is determined by where the atom is stored (why its generic). The git implementation uses my upstream patch to gitoxide to quickly calculate the very first commit in the history from which the atom is derived and uses that.

This gives us an unambiguos hash of this specific atom, independant from other atoms in other repositories, even if they have the same unicode name. We can store this hash in lock files, and backends can use it to disambiguate, essentially giving us a huge namespace, avoiding collisions even where two projects might use the same human readable unicode id.

Tim DeHerrera (Oct 12 2024 at 18:28):

more simply, an atom is akin to a rust crate, and a module is akin to a rust module (indicated by the mod.nix)

Tim DeHerrera (Oct 12 2024 at 18:36):

Right now I also post a tag containing this root and ensure that every atom comes from the posted root. It is a sort of "iitialization" akin to a git init, and also provides a sanity check that the remote and local repos agree on history. This isn't strictly necessary, I could technically remove it and am considering if I really need it or not.

It provides is a way for a remote user, who has not clone the repo, to calculate an atom's id, so for that reason alone I might just keep it, but also, it serves as a nice sanity check that we are not publishing atoms from inconsistent histories in a single repo. The initialization process technically targets a specific git remote, not the repository as a whole.

So if you wanted unrelated histories in one repo and wanted to publish atoms from both you'd just have to init the respective remotes with the relevant historical root. I was thinking you might use this for "secret" atoms that are published to a priviledged remote, as one possible usecase

Stream: cull-os

Topic: Derivation storing service

RGBCube (Oct 07 2024 at 08:22):

RGBCube (Oct 07 2024 at 08:23):

RGBCube (Oct 07 2024 at 13:27):

Andreas (Oct 10 2024 at 19:08):

Tim DeHerrera (Oct 10 2024 at 20:15):

Tim DeHerrera (Oct 10 2024 at 20:17):

Tim DeHerrera (Oct 10 2024 at 20:20):

Tim DeHerrera (Oct 10 2024 at 20:26):

RGBCube (Oct 11 2024 at 13:29):

RGBCube (Oct 11 2024 at 13:31):

RGBCube (Oct 11 2024 at 13:33):

Tim DeHerrera (Oct 11 2024 at 15:46):

Tim DeHerrera (Oct 11 2024 at 15:51):

Srid (Oct 11 2024 at 16:02):

Tim DeHerrera (Oct 11 2024 at 16:03):

Srid (Oct 11 2024 at 16:19):

Tim DeHerrera (Oct 11 2024 at 16:23):

Tim DeHerrera (Oct 11 2024 at 16:26):

RGBCube (Oct 11 2024 at 18:30):

RGBCube (Oct 11 2024 at 18:34):

RGBCube (Oct 11 2024 at 18:34):

Tim DeHerrera (Oct 11 2024 at 20:08):

Tim DeHerrera (Oct 11 2024 at 20:08):

Tim DeHerrera (Oct 11 2024 at 20:09):

Tim DeHerrera (Oct 11 2024 at 20:15):

Tim DeHerrera (Oct 11 2024 at 20:39):

Tim DeHerrera (Oct 11 2024 at 20:42):

Tim DeHerrera (Oct 11 2024 at 21:14):

Tim DeHerrera (Oct 11 2024 at 21:29):

RGBCube (Oct 12 2024 at 04:51):

RGBCube (Oct 12 2024 at 04:53):

RGBCube (Oct 12 2024 at 04:54):

RGBCube (Oct 12 2024 at 04:56):

RGBCube (Oct 12 2024 at 04:58):

Tim DeHerrera (Oct 12 2024 at 14:23):

Tim DeHerrera (Oct 12 2024 at 14:33):

RGBCube (Oct 12 2024 at 18:15):

RGBCube (Oct 12 2024 at 18:17):

Tim DeHerrera (Oct 12 2024 at 18:26):

Tim DeHerrera (Oct 12 2024 at 18:28):

Tim DeHerrera (Oct 12 2024 at 18:36):