Has anyone managed to do this on a personal account (not an org)? That is, to let a NixOS machine (ideally using containers) act as a self-hosted runner for many personal repositories.
For a single repository, I added a runner with config like this:
{ pkgs, lib, config, ... }:
{
  # TODO: Run inside container
  services.github-runners = {
    emanote-runner = {
      enable = true;
      name = "emanote-runner";
      # TODO: use sops-nix
      tokenFile = "/home/srid/runner.token";
      url = "https://github.com/srid/emanote";
      extraPackages = [ pkgs.cachix pkgs.nixci ];
      extraLabels = [ "nixos" ];
    };
  };
  # Trust each runner's user so it can talk to the Nix daemon.
  nix.settings.trusted-users =
    lib.mapAttrsToList
      (_: runner: runner.user)
      config.services.github-runners;
}
However, it appears you can't share a runner with other personal repos.
Apparently it is possible to share a runner across an organization's repositories. I'm not sure if this applies to user repos, though.
Looks like you can only share across an org, but not across personal repos. Sad.
So I decided to write a small module to help with that,
https://github.com/srid/nixos-config/pull/40
You still have to manually create the runner in Settings for each repository, and then copy-paste the token into the config for immediate deployment (if deployment isn't immediate, the token gets invalidated).
Runners in containers: https://github.com/srid/nixos-config/pull/41
Yeah, so on my nrdxp/nrdos repo I set up 16 runners to avoid the one-job-at-a-time bottleneck. Each one is in a nixos-container for a layer of isolation, but has the Nix store and daemon socket mounted so that the host daemon can manage and coordinate all the jobs, respecting its configured limits.
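A minimal sketch of one such containerised runner is below; the token path, repo URL, and exact bind mounts here are assumptions, not the actual config:

{ ... }:
{
  containers.runner-1 = {
    autoStart = true;
    # Share the host's store and daemon socket so the host daemon
    # coordinates all builds within its configured job/core limits.
    bindMounts = {
      "/nix/store" = { hostPath = "/nix/store"; isReadOnly = true; };
      "/nix/var/nix/daemon-socket" = { hostPath = "/nix/var/nix/daemon-socket"; isReadOnly = false; };
      # Registration token generated from the repo's Settings page (path assumed).
      "/run/runner.token" = { hostPath = "/var/lib/secrets/runner.token"; isReadOnly = true; };
    };
    config = { pkgs, ... }: {
      # The container does not run its own Nix daemon; builds go to the host.
      nix.enable = false;
      services.github-runners.runner-1 = {
        enable = true;
        url = "https://github.com/nrdxp/nrdos";
        tokenFile = "/run/runner.token";
      };
      system.stateVersion = "24.05";
    };
  };
}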
This is coupled with divnix/std-action, which I designed to do one eval up front and distribute the derivations to multiple build matrices after, which worked great since they are all on the same disk. One of the biggest bottlenecks in the past was the network cost of transferring the evaled derivations around.
You can see a sample run here:
https://github.com/nrdxp/nrdos/actions/runs/7719678719
Discovery is the only bottleneck, but once that's done all the builds can kick off in parallel each in its own runner.
Oh the discovery->matrix part is interesting. Is std-action really required here? I wonder how I can do the same from scratch.
It was originally designed to consume Standard's API, but the core idea is simple in practice and should be straightforward to port to the regular flake API if desired: nix eval once, producing a single JSON file containing a list of all the Nix derivations you plan to build. I came to this because Nix's eval cache is subpar at best, and because I noticed that multiple runners were evaling the same things over and over.
The real trouble I came upon is that the cost of evaluation in that context (of multiple runners building from a shared flake) is unknowable, especially since Nix severely lacks any real profiling tooling (for the language itself, at least). Better, then, to consolidate all evaluation up front, where the single-threaded Nix evaluator can share any state between derivations in a single pass, and where the cost at least becomes linear and predictable.
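For illustration, here is a rough sketch of the "single eval up front" idea using the plain flake API; the file name and the selection of outputs are assumptions, and this is not how std-action is actually implemented:

# ci-matrix.nix: evaluated once by a discovery job (e.g. via nix eval);
# emits a JSON-friendly list of { name, drvPath } entries that CI can
# fan out to a build matrix, one runner per derivation.
{ flake ? builtins.getFlake (toString ./.)
, system ? builtins.currentSystem
}:
let
  outputs = (flake.checks.${system} or { }) // (flake.packages.${system} or { });
in
builtins.attrValues
  (builtins.mapAttrs (name: drv: { inherit name; drvPath = drv.drvPath; }) outputs)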
I've observed that running nix build separately for each output (whether or not the outputs are specified in the same CLI invocation) slows down evaluation by O(n):
https://github.com/srid/devour-flake?tab=readme-ov-file#why
Which is how I came up with https://github.com/srid/nixci, and I use that in CI. But it doesn't have the nice per-output build status.
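The underlying trick (roughly what devour-flake does) is to avoid that O(n) cost by making one derivation depend on every flake output, so a single build performs a single evaluation. A hedged sketch, not the actual implementation:

# all-outputs.nix: a single derivation that depends on every package and
# check of the flake, so one `nix-build all-outputs.nix` evaluates once
# and realises all outputs.
{ flake ? builtins.getFlake (toString ./.)
, system ? builtins.currentSystem
, pkgs ? import <nixpkgs> { inherit system; }
}:
let
  drvs =
    builtins.attrValues (flake.packages.${system} or { })
    ++ builtins.attrValues (flake.checks.${system} or { });
in
pkgs.runCommand "all-outputs" { } ''
  # Interpolating the derivations makes them build-time dependencies.
  printf '%s\n' ${toString drvs} > $out
''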
Tim DeHerrera said:
Yeah, so on my nrdxp/nrdos repo I setup 16 runners to avoid the one job at a time bottleneck. [..]
Was this fully automated? Or, like me, did you have to manually create the token for each of those runners?
Oh wait, you are using the same token for all runners?
Srid said:
Oh wait, you are using the same token for all runners?
Assuming this token was created from the nrdxp/nrdos repository, your 16 runners can run only jobs from that repo, and not from elsewhere, such as nrdxp/foo.
For now, yes, they are only set up for that repo. I can set up more for other repos if/when needed. They have almost no overhead unless/until they start building anyway.
Actually, now that I think about it, I wonder if I even need to mount the /nix/store at all inside these containers. Since the host daemon will be the one to do the actual building it might not be necessary. I already have nix.enable = false
in their configs. I was wondering if I should actually make these VMs instead of containers for added security isolation.
Only the token needs to be bind mounted. I didn't have to mount /nix/store
Do you pass through the socket? In my case, for the sake of avoiding repeated work, I want the builds to all work from the same store, since they share dependencies.
But I think bind mounting the host daemon socket is enough to accomplish that.
Nope, I bind mount only the token.
It seems to automatically use the host nix store,
[root@github-runner-emanote:~]# mount| grep store
/dev/md/nixos on /nix/store type ext4 (ro,relatime,stripe=32)
I still want them all to speak to the same daemon, so that the set limits on jobs and cores are globally respected between them
I have two runners, and I see this:
❯ ps -ef | grep nix-daemon
root 1459 1 0 Jan29 ? 00:00:00 nix-daemon --daemon
root 132628 1459 0 Jan30 ? 00:00:00 nix-daemon 132605
root 132971 1459 0 Jan30 ? 00:00:00 nix-daemon 132948
Are these child daemons spawned for each container? And if so, do they inherit jobs/cores limits automatically?
The daemon does spawn child procs for each build, but if the host's socket is not available then they could be independent instances.
If /nix/store is auto-mounted though, perhaps the socket is as well. Not sure (not at my desk to check ATM).
Oh yea,
[root@github-runner-emanote:~]# mount| grep socket
/dev/md/nixos on /nix/var/nix/daemon-socket type ext4 (ro,relatime,stripe=32)
It mounts quite a few things,
[root@github-runner-emanote:~]# mount| grep /dev/md/nixos
/dev/md/nixos on / type ext4 (rw,relatime,stripe=32)
/dev/md/nixos on /run/host/os-release type ext4 (ro,nosuid,nodev,noexec,relatime,stripe=32)
/dev/md/nixos on /nix/store type ext4 (ro,relatime,stripe=32)
/dev/md/nixos on /nix/var/nix/daemon-socket type ext4 (ro,relatime,stripe=32)
/dev/md/nixos on /nix/var/nix/db type ext4 (ro,relatime,stripe=32)
/dev/md/nixos on /nix/var/nix/gcroots type ext4 (rw,relatime,stripe=32)
/dev/md/nixos on /nix/var/nix/profiles type ext4 (rw,relatime,stripe=32)
/dev/md/nixos on /etc/localtime type ext4 (ro,nosuid,nodev,relatime,stripe=32)
error (ignored): error: reached end of FramedSource
[..]
error: opening file '/nix/store/i8jjpg7im5jgr6dvr9ikylpa1szx1kpi-treefmt-check.lock': Permission denied
https://github.com/srid/haskell-flake/actions/runs/7885297185/job/21516934365
And this is happening only in GitHub runners inside containers. I wonder if it has to do with accessing the mounted Nix store from the container.
Have you seen this? @Tim DeHerrera
I had to run the following to fix this:
nix-store --verify --repair --check-contents
But the problem reappears when doing the CI build again.
hmm, haven't seen that one yet, have you tried making the daemon socket writable (mine is)?
Nope, but I'll try that if this happens again.
TIL that you can just use a single personal access token (with 'repo' scope) for all runners across one's personal repositories.
Now supports organizations:
services.easy-github-runners = {
  "srid/emanote" = { };
  "srid/haskell-flake" = { };
  "srid/nixos-config" = { };
  "srid/ema" = { };
  "EmaApps/orgself".owner = "srid";
};
https://github.com/srid/nixos-config/blob/master/nixos/easy-github-runners.nix
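Roughly, the idea is to expand each "owner/repo" attribute into a services.github-runners entry, all sharing one PAT. A simplified sketch of that idea follows; the token path is illustrative and the handling of the owner override is elided, see the linked file for the real module:

{ lib, config, ... }:
{
  options.services.easy-github-runners = lib.mkOption {
    default = { };
    type = lib.types.attrsOf (lib.types.submodule {
      options.owner = lib.mkOption {
        # GitHub user whose token registers runners for org-owned repos.
        type = lib.types.nullOr lib.types.str;
        default = null;
      };
    });
  };

  config.services.github-runners = lib.mapAttrs'
    (repo: _:
      # "srid/emanote" becomes the systemd-friendly runner name "srid-emanote".
      lib.nameValuePair (builtins.replaceStrings [ "/" ] [ "-" ] repo) {
        enable = true;
        url = "https://github.com/${repo}";
        # A single personal access token (repo scope) shared by all runners.
        tokenFile = "/home/srid/github-pat.token";
      })
    config.services.easy-github-runners;
}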
Just a heads up, I tried updating my nixos-unstable pin, and somebody has modified the GitHub runners module in a way that totally breaks my configuration. I cannot get the nix client to talk to the daemon and I am unsure why. I'll keep you posted if I figure it out.
Alright so two things:
Firstly, they appear to have removed the old services.github-runner option and now have only the newer services.github-runners option, which led to a reorganization of the code, which in turn broke my snippet that imports those options and uses them in containers. So that's the first thing.
Secondly, even taking them out of the container and running them directly on the host, I was still failing to communicate with the daemon. This appears to be because the dynamic users created by systemd were not trusted users. I thought I had already addressed that, but apparently not, or I reverted it or something.
In any case, I was able to add the dynamic users to the trusted users list like so:
nix.settings.trusted-users = lib.genList (i: "github-runner-nrdos-${toString i}") 16;
To match the names of the dynamically allocated users. I may swing back around and configure each runner in a minimal microvm for added isolation with just the host daemon socket mounted. But for now this is fine since I only have them configured for a private repository at this point.
seems kind of presumptuous to assume all runners should have Nix. Hopefully we can get this reviewed and merged quickly, cause I need it :sweat_smile:
https://github.com/NixOS/nixpkgs/pull/289607
I believe my module has always been using the new github-runners
service.
By the way, nix-darwin PR just got merged: https://github.com/LnL7/nix-darwin/pull/859
I'll play with macOS support.
The service is not new; what's new is that the code has been restructured, and my previous code broke.
The runner failed with a new error,
The local machine's clock may be out of sync with the server time by more than five minutes. Please sync your clock with your domain or internet time and try again.
The local time on the container and host is incorrect, though. Wonder how that happened ... I'm running this in Parallels.
Yup, it was Parallels. Rebooting the VM fixed it.
Let's see if disabling automatic time sync fixes it long-term,
I made the mistake of not setting a good disk size: 60GB (the default) quickly ran out in CI. Neither boot.growPartition nor systemd-repart worked to repartition root on boot after I resized the disk image in Parallels.
I upgraded nix-darwin
which has a revamped github-runners
module that explicitly writes the log files.
Which allowed me to actually observe crash logs like this:
-bash-3.2$ pwd
/var/lib/github-runners
-bash-3.2$ tail -5 nixci-1/_diag/Runner_20240325-120051-utc.log
[2024-03-25 12:00:53Z ERR Runner] GitHub.DistributedTask.WebApi.TaskAgentExistsException: A runner exists with the same name nixci-1.
at GitHub.Runner.Listener.Configuration.ConfigurationManager.ConfigureAsync(CommandSettings command) in /private/tmp/nix-build-github-runner-2.314.1.drv-0/src/src/Runner.Listener/Configuration/ConfigurationManager.cs:line 306
at GitHub.Runner.Listener.Runner.ExecuteCommand(CommandSettings command) in /private/tmp/nix-build-github-runner-2.314.1.drv-0/src/src/Runner.Listener/Runner.cs:line 126
[2024-03-25 12:00:53Z ERR Terminal] WRITE ERROR: A runner exists with the same name nixci-1.
[2024-03-25 12:00:53Z INFO Listener] Runner execution has finished with return code 1
-bash-3.2$
https://github.com/LnL7/nix-darwin/pull/893
As for this particular problem, there is replace = true;
which solves the issue.
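i.e., something along these lines for the runner from the log above (nixci-1):

# Re-register the runner even if one with the same name already exists.
services.github-runners.nixci-1.replace = true;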
3 messages were moved here from #nix > Darwin: launchctl logs for a service by Srid.
I decided to do this the other way: set up runners on the NixOS VM, and do remote builds on macOS as necessary.
Rough notes: https://github.com/srid/nixos-config/tree/master/clusters/github-runner
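For the "remote builds on macOS" part, a hedged sketch of what that could look like on the NixOS VM; the host alias, SSH user, and systems list are assumptions:

{
  # Offload Darwin builds from the NixOS VM to the macOS host over SSH.
  nix.distributedBuilds = true;
  nix.buildMachines = [{
    hostName = "macos-host";   # assumed SSH alias for the Mac
    sshUser = "builder";       # assumed build user on the Mac
    systems = [ "aarch64-darwin" ];
    maxJobs = 4;
    protocol = "ssh-ng";
  }];
}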
Alright, based on how I'm setting up CI for Juspay's GitHub, the new approach is not to use distributed builds, but to just dedicate runners to each host.
Module & tutorial incoming, but here's a sneak preview:
https://x.com/sridca/status/1804286220304028106
Awesome, I will look at this once I am out of the drama mindset brought on by the ongoing situation in the legacy NixOS community.
Here it is:
https://github.com/juspay/github-nix-ci
I'll announce it in a bit, after some final polish, and feedback (if any).
https://x.com/nixos_asia/status/1805609192843014276
https://discourse.nixos.org/t/github-nix-ci-for-self-hosting-github-runners-on-macos-linux/47642