Stream: nixos

Topic: Self-hosted GitHub runners


Srid (Jan 26 2024 at 16:56):

Has anyone managed to do this on a personal account (not an org)? I want a NixOS machine (ideally using containers) to act as a self-hosted runner for many personal repositories.

For a single repository, I added a runner with config like this:

{ pkgs, config, ... }:
{
  # TODO: Run inside container
  services.github-runners = {
    emanote-runner = {
      enable = true;
      name = "emanote-runner";
      # TODO: use sops-nix
      tokenFile = "/home/srid/runner.token";
      url = "https://github.com/srid/emanote";
      extraPackages = [ pkgs.cachix pkgs.nixci ];
      extraLabels = [ "nixos" ];
    };
  };
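  # Trust each runner's user so it can talk to the Nix daemon.
  # services.github-runners is an attrset, so iterate over it with mapAttrs.
  # When `user` is unset, fall back to the systemd dynamic user name.
  nix.settings.trusted-users =
    builtins.attrValues (builtins.mapAttrs
      (name: runner:
        if runner.user != null then runner.user else "github-runner-${name}")
      config.services.github-runners);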
}

However, it appears you can't share a runner with other personal repos.
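
For the `# TODO: use sops-nix` note in the config above, here is a minimal sketch of what that could look like, assuming the sops-nix NixOS module is already imported. The secrets file `./secrets.yaml` and the key name `github-runner-token` are hypothetical.

```nix
{ config, ... }:
{
  # Hypothetical encrypted file holding the runner token.
  sops.defaultSopsFile = ./secrets.yaml;

  # Hypothetical key name; sops-nix decrypts it to a file at activation time.
  sops.secrets."github-runner-token" = { };

  services.github-runners.emanote-runner = {
    enable = true;
    url = "https://github.com/srid/emanote";
    # Point the runner at the decrypted secret instead of a file in $HOME.
    tokenFile = config.sops.secrets."github-runner-token".path;
  };
}
```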

Tim DeHerrera (Jan 26 2024 at 19:41):

Apparently it is possible to share a runner across an organization's repositories. I'm not sure if this applies to user repos, though.

Tim DeHerrera (Jan 26 2024 at 19:42):

https://github.blog/2020-04-22-github-actions-community-momentum-enterprise-capabilities-and-developer-improvements/#share-self-hosted-runners-across-an-organization

Srid (Jan 27 2024 at 06:31):

Looks like you can only share across an org, but not across personal repos. Sad.

Srid (Jan 27 2024 at 06:31):

So I decided to write a small module to help with that,

https://github.com/srid/nixos-config/pull/40

Srid (Jan 27 2024 at 06:32):

You still have to manually create the runner in Settings for each repository, and then copy-paste the token into the config and deploy immediately (if the deployment is not immediate, the token gets invalidated).

Srid (Jan 27 2024 at 12:17):

Runners in containers: https://github.com/srid/nixos-config/pull/41

Tim DeHerrera (Jan 31 2024 at 15:43):

Yeah, so on my nrdxp/nrdos repo I set up 16 runners to avoid the one-job-at-a-time bottleneck. Each one is in a nixos-container for a layer of isolation, but has the Nix store and daemon socket mounted so that the host daemon can manage and coordinate all the jobs, respecting its configured limits.

This is coupled with divnix/std-action, which I designed to do one eval up front and distribute the derivations to multiple build matrices after, which worked great since they are all on the same disk. One of the biggest bottlenecks in the past was the network cost of transferring the evaled derivations around.
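
A rough sketch of generating that many runners from one definition. It declares them directly on the host rather than in containers, and it is not taken from the nrdxp/nrdos configuration; the `nrdos-N` runner names and the token path are assumptions.

```nix
{ lib, ... }:
let
  # Number of runners to declare; 16 matches the count mentioned above.
  runnerCount = 16;

  # Build one { name, value } pair per index: nrdos-0, nrdos-1, ...
  mkRunner = i: lib.nameValuePair "nrdos-${toString i}" {
    enable = true;
    url = "https://github.com/nrdxp/nrdos";
    # Hypothetical path; every runner reads the same token file.
    tokenFile = "/run/secrets/github-runner-token";
  };
in
{
  services.github-runners = lib.listToAttrs (lib.genList mkRunner runnerCount);
}
```

Each entry becomes its own runner service, so jobs from a build matrix can be picked up in parallel.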

Tim DeHerrera (Jan 31 2024 at 15:49):

You can see a sample run here:
https://github.com/nrdxp/nrdos/actions/runs/7719678719

Discovery is the only bottleneck, but once that's done all the builds can kick off in parallel each in its own runner.

Srid (Feb 01 2024 at 03:24):

Oh the discovery->matrix part is interesting. Is std-action really required here? I wonder how I can do the same from scratch.

Tim DeHerrera (Feb 01 2024 at 03:48):

It was originally designed to consume Standard's API, but it should be fairly simple to port to the regular flake API if desired. The core idea is to nix eval a single JSON file containing a list of all the Nix derivations you plan to build. I came to this because Nix's eval cache is subpar at best, and because I noticed that multiple runners were evaluating the same things over and over.

The real trouble I came upon is that the cost of evaluation in that context (of multiple runners building from a shared flake) is unknowable, especially since Nix has a severe lack of any real profiling tooling (for the language itself at least). Better then, to consolidate all evaluation up front, where the single threaded Nix evaluator can share any state between derivations in a single pass, and where the cost can at least become linear and predictable.
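
One way to sketch the "evaluate once, then fan out" idea with plain flakes is a helper that turns a `packages.<system>` attrset into a matrix of derivation paths. This is an assumption about how a port could look, not how divnix/std-action works; the file name `ci-matrix.nix` and the field names `name` and `drvPath` are made up for illustration.

```nix
# ci-matrix.nix (hypothetical helper)
# Takes nixpkgs' lib, then a flake's packages.<system> attrset, and returns
# an attrset shaped like a GitHub Actions matrix with one entry per package.
{ lib }:
packages:
{
  include = lib.mapAttrsToList
    (name: drv: {
      inherit name;
      # Reading drvPath forces evaluation of this package exactly once.
      inherit (drv) drvPath;
    })
    packages;
}
```

Exposed as a flake output, the result could be printed once with `nix eval --json` in a discovery job and handed to the build jobs as their matrix, so no build job has to re-evaluate the flake.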

Srid (Feb 01 2024 at 07:24):

I've observed that running nix build on several outputs (whether they are passed in the same CLI invocation or not) evaluates each output separately, so evaluation time grows as O(n) in the number of outputs,

https://github.com/srid/devour-flake?tab=readme-ov-file#why

Which is how I came up with https://github.com/srid/nixci and use that in CI. But it doesn't have the nice per-output build status.

Srid (Feb 01 2024 at 07:27):

Tim DeHerrera said:

Yeah, so on my nrdxp/nrdos repo I set up 16 runners to avoid the one-job-at-a-time bottleneck. [..]

Was this fully automated? Or, like me, did you have to manually create the token for each of those runners?

Srid (Feb 01 2024 at 07:28):

Oh wait, you are using the same token for all runners?

Srid (Feb 01 2024 at 07:50):

Srid said:

Oh wait, you are using the same token for all runners?

Assuming this token was created from nrdxp/nrdos repository, your 16 runners can run only jobs from that repo, and not from elsewhere, such as nrdxp/foo

Tim DeHerrera (Feb 01 2024 at 14:09):

For now yes, they are only set up for that repo. I can set up more for other repos if and when needed. They have almost no overhead unless/until they start building anyway.

Tim DeHerrera (Feb 01 2024 at 16:44):

Actually, now that I think about it, I wonder if I even need to mount the /nix/store at all inside these containers. Since the host daemon will be the one to do the actual building it might not be necessary. I already have nix.enable = false in their configs. I was wondering if I should actually make these VMs instead of containers for added security isolation.

Srid (Feb 02 2024 at 03:18):

Only the token needs to be bind mounted. I didn't have to mount /nix/store

Tim DeHerrera (Feb 02 2024 at 05:31):

Do you pass through the socket? In my case, for the sake of avoiding repeated work, I want the builds to all work from the same store, since they share dependencies.

But I think bind mounting the host daemon socket is enough to accomplish that.

Srid (Feb 02 2024 at 13:35):

Nope, I bind mount only the token.

https://github.com/srid/nixos-config/blob/9c392f468f2b341af8a83e8061ff1a92face28f5/nixos/github-runner.nix#L110-L131
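
A minimal sketch of a runner inside a NixOS container with only the token bind mounted. It is not a copy of the linked module: the mount point `/run/runner.token` and the `stateVersion` value are assumptions, while the container name, host token path, and runner settings are borrowed from earlier in the thread.

```nix
{ ... }:
{
  containers.github-runner-emanote = {
    autoStart = true;

    # The only explicit bind mount: the runner token, read-only.
    bindMounts."/run/runner.token" = {
      hostPath = "/home/srid/runner.token";
      isReadOnly = true;
    };

    # The container's own NixOS configuration.
    config = { ... }: {
      services.github-runners.emanote-runner = {
        enable = true;
        url = "https://github.com/srid/emanote";
        tokenFile = "/run/runner.token";
        extraLabels = [ "nixos" ];
      };
      # Assumed value; set it to the release the container was first built on.
      system.stateVersion = "23.11";
    };
  };
}
```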

Srid (Feb 02 2024 at 13:54):

It seems to automatically use the host nix store,

[root@github-runner-emanote:~]# mount| grep store
/dev/md/nixos on /nix/store type ext4 (ro,relatime,stripe=32)

Tim DeHerrera (Feb 02 2024 at 14:07):

I still want them all to speak to the same daemon, so that the set limits on jobs and cores are globally respected between them
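
The limits in question are set once on the host. A small sketch with illustrative numbers (the values below are assumptions, not taken from either setup):

```nix
{ ... }:
{
  nix.settings = {
    # Maximum number of builds the host daemon runs at the same time.
    max-jobs = 8;
    # Number of cores each individual build may use.
    cores = 4;
  };
}
```

Any client that builds through the host's daemon socket is subject to these settings, which is why sharing one daemon keeps the total load bounded regardless of how many runners are active.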

Srid (Feb 02 2024 at 14:24):

I have two runners, and I see this:

❯ ps -ef | grep nix-daemon
root        1459       1  0 Jan29 ?        00:00:00 nix-daemon --daemon
root      132628    1459  0 Jan30 ?        00:00:00 nix-daemon 132605
root      132971    1459  0 Jan30 ?        00:00:00 nix-daemon 132948

Are these child daemons spawned for each container? And if so, do they inherit jobs/cores limits automatically?

Tim DeHerrera (Feb 02 2024 at 14:28):

The daemon does spawn child procs for each build, but if the host's socket is not available then they could be independent instances.

If the /nix/store is auto mounted though, perhaps the socket is as well. Not sure (not at my desk to check ATM)

Srid (Feb 02 2024 at 14:30):

Oh yea,

[root@github-runner-emanote:~]# mount| grep socket
/dev/md/nixos on /nix/var/nix/daemon-socket type ext4 (ro,relatime,stripe=32)

Srid (Feb 02 2024 at 14:30):

It mounts quite a few things,

[root@github-runner-emanote:~]# mount| grep /dev/md/nixos
/dev/md/nixos on / type ext4 (rw,relatime,stripe=32)
/dev/md/nixos on /run/host/os-release type ext4 (ro,nosuid,nodev,noexec,relatime,stripe=32)
/dev/md/nixos on /nix/store type ext4 (ro,relatime,stripe=32)
/dev/md/nixos on /nix/var/nix/daemon-socket type ext4 (ro,relatime,stripe=32)
/dev/md/nixos on /nix/var/nix/db type ext4 (ro,relatime,stripe=32)
/dev/md/nixos on /nix/var/nix/gcroots type ext4 (rw,relatime,stripe=32)
/dev/md/nixos on /nix/var/nix/profiles type ext4 (rw,relatime,stripe=32)
/dev/md/nixos on /etc/localtime type ext4 (ro,nosuid,nodev,relatime,stripe=32)

Srid (Feb 13 2024 at 11:17):

error (ignored): error: reached end of FramedSource
[..]
error: opening file '/nix/store/i8jjpg7im5jgr6dvr9ikylpa1szx1kpi-treefmt-check.lock': Permission denied

https://github.com/srid/haskell-flake/actions/runs/7885297185/job/21516934365

And this is happening only with GitHub runners inside containers. I wonder if this has to do with how the container accesses the mounted Nix store.

Have you seen this? @Tim DeHerrera

Srid (Feb 13 2024 at 13:30):

I had to run the following to fix this:

nix-store --verify --repair --check-contents

But the problem reappears when doing the CI build again.

Tim DeHerrera (Feb 13 2024 at 14:42):

hmm, haven't seen that one yet, have you tried making the daemon socket writable (mine is)?

Srid (Feb 13 2024 at 14:44):

Nope, but I'll try that if this happens again.

Srid (Feb 14 2024 at 09:47):

TIL that you can just use a single personal access token (with 'repo' scope) for all runners across one's personal repositories.
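
A minimal sketch of what that enables: one runner per personal repository, all reading the same personal access token. The token path is hypothetical, and the repository list is only an example.

```nix
{ lib, ... }:
let
  # Example repositories under one personal account.
  repos = [ "emanote" "haskell-flake" "nixos-config" "ema" ];
in
{
  # genAttrs builds one attribute per repo name, each with its own runner.
  services.github-runners = lib.genAttrs repos (repo: {
    enable = true;
    url = "https://github.com/srid/${repo}";
    # Hypothetical path to a file containing the PAT with 'repo' scope.
    tokenFile = "/run/secrets/github-pat";
    extraLabels = [ "nixos" ];
  });
}
```

With a shared token, adding a repository becomes a one-line change instead of a trip to each repository's Settings page.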

Srid (Feb 14 2024 at 10:46):

easy-github-runners.nix

Now supports organizations:

services.easy-github-runners = {
  "srid/emanote" = { };
  "srid/haskell-flake" = { };
  "srid/nixos-config" = { };
  "srid/ema" = { };
  "EmaApps/orgself".owner = "srid";
};

https://github.com/srid/nixos-config/blob/master/nixos/easy-github-runners.nix

Tim DeHerrera (Feb 17 2024 at 17:47):

Just a heads up, I tried updating my nixos-unstable pin and somebody has modified the github runners in a way that my configuration now totally breaks. I cannot get the nix client to talk to the daemon and I am unsure why. I'll keep you posted if I figure it out

Tim DeHerrera (Feb 17 2024 at 18:51):

Alright so two things:

Firstly, they appear to have removed the old services.github-runner option and only have the newer services.github-runners option now, which led to a reorganization of the code, which means my snippet to import the options and use them in containers broke. So that's the first thing.

Even after taking them out of the container and running them directly on the host, I still failed to communicate with the daemon. This appears to be because the dynamic users created by systemd were not trusted users. I thought I had already addressed that, but apparently not, or I reverted it or something.

In any case, I was able to add the dynamic users to the trusted users list like so:
nix.settings.trusted-users = lib.genList (i: "github-runner-nrdos-${toString i}") 16;

To match the names of the dynamically allocated users. I may swing back around and configure each runner in a minimal microvm for added isolation with just the host daemon socket mounted. But for now this is fine since I only have them configured for a private repository at this point.

Tim DeHerrera (Feb 17 2024 at 20:30):

seems kind of presumptuous to assume all runners should have Nix. Hopefully we can get this reviewed and merged quickly, cause I need it :sweat_smile:
https://github.com/NixOS/nixpkgs/pull/289607

Srid (Feb 18 2024 at 14:51):

I believe my module has always been using the new github-runners service.

By the way, nix-darwin PR just got merged: https://github.com/LnL7/nix-darwin/pull/859

I'll play with macOS support.

Tim DeHerrera (Feb 18 2024 at 17:11):

The service is not new, what's new is that the code has been restructured and my previous code broke.

Srid (Feb 21 2024 at 13:10):

Parallels VM

The runner failed with a new error,

The local machine's clock may be out of sync with the server time by more than five minutes. Please sync your clock with your domain or internet time and try again.

The local time on the container and host is incorrect, though. Wonder how that happened ... I'm running this in Parallels.

Srid (Feb 21 2024 at 13:13):

Yup, it was Parallels. Rebooting the VM fixed it.

Srid (Feb 21 2024 at 13:15):

Let's see if disabling automatic time sync fixes it long-term,

image.png

Srid (Feb 21 2024 at 16:34):

I made the mistake of not setting a good disk size. 60GB (the default) quickly ran out in CI. Neither boot.growPartition nor systemd repart worked to repartition root on boot, after I resized the disk image in Parallels.

Srid (Mar 25 2024 at 12:03):

New nix-darwin module

I upgraded nix-darwin which has a revamped github-runners module that explicitly writes the log files.

Which allowed me to actually observe crash logs like this:

-bash-3.2$ pwd
/var/lib/github-runners
-bash-3.2$ tail -5 nixci-1/_diag/Runner_20240325-120051-utc.log
[2024-03-25 12:00:53Z ERR  Runner] GitHub.DistributedTask.WebApi.TaskAgentExistsException: A runner exists with the same name nixci-1.
   at GitHub.Runner.Listener.Configuration.ConfigurationManager.ConfigureAsync(CommandSettings command) in /private/tmp/nix-build-github-runner-2.314.1.drv-0/src/src/Runner.Listener/Configuration/ConfigurationManager.cs:line 306
   at GitHub.Runner.Listener.Runner.ExecuteCommand(CommandSettings command) in /private/tmp/nix-build-github-runner-2.314.1.drv-0/src/src/Runner.Listener/Runner.cs:line 126
[2024-03-25 12:00:53Z ERR  Terminal] WRITE ERROR: A runner exists with the same name nixci-1.
[2024-03-25 12:00:53Z INFO Listener] Runner execution has finished with return code 1
-bash-3.2$

Srid (Mar 25 2024 at 12:03):

https://github.com/LnL7/nix-darwin/pull/893

Srid (Mar 25 2024 at 12:04):

As for this particular problem, there is replace = true; which solves the issue.
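
A short sketch of where that option goes, using the runner name from the log above. The repository URL and token path are placeholders, not the actual values.

```nix
{ ... }:
{
  services.github-runners.nixci-1 = {
    enable = true;
    # Re-register over an existing runner with the same name
    # instead of failing with TaskAgentExistsException.
    replace = true;
    # Placeholder repository and token path.
    url = "https://github.com/example-owner/example-repo";
    tokenFile = "/run/secrets/github-runner-token";
  };
}
```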

Notification Bot (Mar 25 2024 at 12:06):

3 messages were moved here from #nix > Darwin: launchctl logs for a service by Srid.

Srid (Mar 26 2024 at 21:12):

I decided to do this the other way: set up runners on the NixOS VM, and do remote builds on macOS as necessary.

Rough notes: https://github.com/srid/nixos-config/tree/master/clusters/github-runner
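
A minimal sketch of the remote-build side on the NixOS VM, assuming the Mac is reachable over SSH. The host name, SSH user, key path, and job count are all hypothetical, and this is not taken from the linked notes.

```nix
{ ... }:
{
  # Let the local daemon hand builds to remote machines.
  nix.distributedBuilds = true;

  nix.buildMachines = [
    {
      hostName = "mac-builder";           # hypothetical host
      sshUser = "builder";                # hypothetical user on the Mac
      sshKey = "/root/.ssh/id_ed25519";   # hypothetical key path
      protocol = "ssh-ng";
      # Only Darwin derivations are sent to this machine.
      systems = [ "aarch64-darwin" ];
      maxJobs = 4;
    }
  ];
}
```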

Srid (Jun 21 2024 at 23:38):

Alright, based on how I'm setting up CI for Juspay on GitHub, the new approach is not to use distributed builds, but to dedicate runners to each host.

Module & tutorial incoming, but here's a sneak preview:

https://x.com/sridca/status/1804286220304028106

Andreas (Jun 22 2024 at 07:37):

Awesome, I will look at this, once I am out of drama mindset thanks to the ongoing legacy NixOS community.

Srid (Jun 24 2024 at 19:52):

Here it is:

https://github.com/juspay/github-nix-ci

I'll announce it in a bit, after some final polish, and feedback (if any).

Srid (Jun 25 2024 at 14:40):

https://x.com/nixos_asia/status/1805609192843014276

Srid (Jun 25 2024 at 14:44):

https://discourse.nixos.org/t/github-nix-ci-for-self-hosting-github-runners-on-macos-linux/47642


Last updated: Nov 15 2024 at 13:04 UTC