Stream: nixify-llm

Topic: ollama


view this post on Zulip Srid (Feb 20 2024 at 17:29):

image.png

https://github.com/ollama/ollama

You can run it from nixpkgs. Run these two commands in separate terminals:

nix run nixpkgs#ollama serve
nix run nixpkgs#ollama run llama2-uncensored

Works on macOS, at least.

(The query I used to test the model is inspired by the one Musk used to demonstrate Grok)

view this post on Zulip Andreas (Feb 20 2024 at 17:54):

Well it might work if ROCm works. Which it doesn't if I try. So I would probably have to build a shell for ROCm or something. Otherwise I will be using CPU...

view this post on Zulip Andreas (Feb 20 2024 at 17:58):

In case anyone likes to try this on AMD, here is the command to create a nice running docker container on Linux:

docker run -d --privileged --device /dev/kfd -v ollama:/root/.ollama -p 11434:11434 -e OLLAMA_DEBUG=1 -e ROCR_VISIBLE_DEVICES="0" -e HSA_OVERRIDE_GFX_VERSION=10.3.0 --name ollama ollama/ollama:0.1.24-rocm

The HSA_OVERRIDE_GFX_VERSION=10.3.0 environment variable is only needed if your GPU is not officially supported by ROCm, which is actually the case for most consumer and workstation GPUs except a handful. Depending on the GPU it might or might not work.

And then you can do a

docker exec -it ollama ollama run llama2

Or whatever model you want to run

view this post on Zulip Srid (Feb 20 2024 at 18:02):

@Andreas Does the docker container run do anything special compared to the pure Nix instance (nix run ..) above?

view this post on Zulip Andreas (Feb 20 2024 at 18:09):

well it has ROCm installed obviously so that the GPU can use it. However that is just the container for ROCm they provide. Judging from the dockerfile in the ollama GitHub repo, they build them on the basis of some CentOS images provided by AMD.

These in turn have their own repo, which I presume to be this one: https://github.com/ROCm/ROCm-docker

view this post on Zulip Andreas (Feb 20 2024 at 18:13):

This here might be the dockerfile for the base image on top of which ollama is compiling their Go binary:

https://github.com/ROCm/ROCm-docker/blob/master/dev/Dockerfile-centos-7-complete

view this post on Zulip Shivaraj B H (Mar 23 2024 at 20:00):

I created a process-compose-flake module for ollama for a personal project that I am working on, and here's how easy it is to set up:

{
  services.ollama = {
    enable = true;
    host = "100.71.49.133";
    # TODO: adding more than one model breaks in shellcheck
    models = [ "llama2-uncensored" ];
  };
}

view this post on Zulip Srid (Mar 23 2024 at 20:01):

nix run ... cmd?

view this post on Zulip Shivaraj B H (Mar 23 2024 at 20:03):

I will upstream this to services-flake; right now it is a module in the private repo.

More about the project soon.

view this post on Zulip Shivaraj B H (Mar 23 2024 at 20:30):

Here it is: https://github.com/juspay/services-flake/pull/137

view this post on Zulip Shivaraj B H (Mar 23 2024 at 20:32):

The nix run command is a bit long though:

 nix run "github:juspay/services-flake/ollama?dir=test"#ollama --override-input services-flake github:juspay/services-flake/ollama

This starts up the ollama server with no models though; let me use a smaller sample model in test.

view this post on Zulip Shivaraj B H (Mar 23 2024 at 20:34):

https://ollama.com/library/tinyllama looks like a good candidate

view this post on Zulip Shivaraj B H (Mar 23 2024 at 20:40):

Oh wait, I can’t really pull models in flake check because of sandbox mode. I think these models will have to be pre-fetched by another derivation.

view this post on Zulip Srid (Mar 23 2024 at 20:40):

A separate repo that uses the services-flake module to provide a single nix run .. to get some of these models up and running, even including a chat interface, would be cool.

view this post on Zulip Shivaraj B H (Mar 23 2024 at 21:02):

https://github.com/F1bonacc1/process-compose/issues/64 I can provide a single nix run command with this, basically allowing users to interact with the chat process in the process-compose window

view this post on Zulip Shivaraj B H (Mar 23 2024 at 21:03):

https://github.com/F1bonacc1/process-compose/issues/64#issuecomment-1974895517

I will try it out with 0.88.0

view this post on Zulip Srid (Mar 23 2024 at 21:07):

Curious if there's a chat bot web app that can interact with ollama

view this post on Zulip Shivaraj B H (Mar 23 2024 at 21:09):

There is this on top of search result: https://github.com/HelgeSverre/ollama-gui

view this post on Zulip Srid (Mar 23 2024 at 21:10):

Let's ship it. Gonna be a rad demo of services-flake.

view this post on Zulip Shivaraj B H (Mar 23 2024 at 21:11):

https://github.com/open-webui/open-webui there’s also this (much more popular), but looks like it is bloated with a lot of features.

view this post on Zulip Shivaraj B H (Mar 23 2024 at 21:13):

I will give both the UIs a shot though

view this post on Zulip Shivaraj B H (Mar 23 2024 at 21:16):

I will try this tomorrow, going to get some sleep now

view this post on Zulip Srid (Mar 23 2024 at 21:16):

https://github.com/NixOS/nixpkgs/pull/275448

view this post on Zulip Shivaraj B H (Mar 23 2024 at 21:24):

Shivaraj B H said:

https://github.com/F1bonacc1/process-compose/issues/64#issuecomment-1974895517

I will try it out with 0.88.0

Update: I didn't notice that STDIN support is not there, so even with the current PTY support this isn't possible in the process-compose window. Gotta take the WebUI route; anyway, the webui demo will be much cooler.

view this post on Zulip Shivaraj B H (Mar 26 2024 at 06:18):

Screenshot-2024-03-25-at-11.47.58PM.png
I have the UI running; it involved a lot of hacks to get it running because of the way open-webui uses the Python backend.

view this post on Zulip Shivaraj B H (Mar 31 2024 at 23:45):

nix run github:shivaraj-bh/nixify-ollama

view this post on Zulip Andreas (Apr 01 2024 at 09:05):

Unsurprisingly it does not use the GPU on my machine. We'd probably have to add all the ROCm related components.

view this post on Zulip Andreas (Apr 01 2024 at 09:06):

But other than that, nice work!

view this post on Zulip Andreas (Apr 01 2024 at 09:10):

oh I was a bit too quick: webui cannot find any models on my end...

view this post on Zulip Shivaraj B H (Apr 01 2024 at 11:21):

I faced this problem; I had to manually change the ollama server URL and press the refresh button. Will look for a way to fix this

view this post on Zulip Shivaraj B H (Apr 01 2024 at 11:21):

Anyways, open-webui is a bit too much; I am looking for a simpler UI which is easier to set up

view this post on Zulip Andreas (Apr 01 2024 at 11:22):

open-webui actually looks quite nice. I personally use my containerized ollama with other clients though.

view this post on Zulip Andreas (Apr 01 2024 at 11:22):

maybe we can work on making this ollama flake compatible with ROCm at some point. Did you test it with CUDA?

view this post on Zulip Shivaraj B H (Apr 01 2024 at 11:24):

GPU acceleration is next on my list after I get this UI to work out of the box.

view this post on Zulip Andreas (Apr 01 2024 at 11:25):

I mean it should just pick up the GPU if available. It's just that I don't keep the ROCm libraries on my system if I don't need them specifically. So there is not much to be found.

view this post on Zulip Shivaraj B H (Apr 01 2024 at 11:25):

Andreas said:

open-webui actually looks quite nice. I personally use my containerized ollama with other clients though.

Yes, it does. I only hate how I can’t configure things like disabling web auth. When I am using it in my private network, or for development, I don’t really need it.

view this post on Zulip Andreas (Apr 01 2024 at 11:29):

I only hate how I can’t configure things like disabling web auth

Yes, it appears to have features you might rather want for a somewhat larger deployment, like load balancing and such

view this post on Zulip Shivaraj B H (Apr 01 2024 at 11:31):

oh I was a bit too quick: webui cannot find any models on my end…

Screenshot-2024-04-01-at-4.58.42PM.png

The problem appears to be that I am not prefixing the protocol before the IP; as soon as I do that, it starts working. I will fix it

view this post on Zulip Shivaraj B H (Apr 01 2024 at 11:37):

I have also included an Up Next section in README: https://github.com/shivaraj-bh/nixify-ollama/blob/main/README.md

To track what I will be doing next

view this post on Zulip Shivaraj B H (Apr 01 2024 at 12:12):

Shivaraj B H said:

oh I was a bit too quick: webui cannot find any models on my end…

Screenshot-2024-04-01-at-4.58.42PM.png

The problem appears to be that I am not prefixing the protocol before the IP; as soon as I do that, it starts working. I will fix it

I have fixed it; the model selector should now work without requiring any hacks

view this post on Zulip Andreas (Apr 01 2024 at 16:24):

Yes, it's working now. Just a bit slow for my taste on CPU.

view this post on Zulip Andreas (Apr 01 2024 at 16:24):

but we are getting there

view this post on Zulip Andreas (Apr 01 2024 at 16:25):

I might look and see if I can get it to run with ROCm. So far I am only using containers for ROCm. But the stuff should all be in nixpkgs.

view this post on Zulip Shivaraj B H (Apr 01 2024 at 16:37):

Yes, I will continue working tomorrow. I am going to hit the sack for the day

view this post on Zulip Andreas (Apr 01 2024 at 16:38):

yeah it's somewhat late already where you are :smile:

view this post on Zulip Andreas (Apr 02 2024 at 17:21):

I decided to create an issue on your repo for ROCm support @Shivaraj B H

view this post on Zulip Shivaraj B H (Apr 02 2024 at 17:26):

Yup, I was checking out macOS support today, while I was in parallel setting up my new machine with Nvidia drivers to try out GPU acceleration

view this post on Zulip Shivaraj B H (Apr 02 2024 at 17:29):

I don’t have AMD hardware to test out ROCm, but I suppose I can test it on some cheap cloud instances

view this post on Zulip Andreas (Apr 02 2024 at 17:29):

that is nice, maybe we can adapt the Nvidia settings to run AMD stuff. Nice thing about ROCm is that it's just running on top of the AMDGPU kernel drivers.

view this post on Zulip Andreas (Apr 02 2024 at 17:29):

Shivaraj B H said:

I don’t have AMD hardware to test out ROCm, but I suppose I can test it on some cheap cloud instances

Or I'd be the one testing it.

view this post on Zulip Andreas (Apr 02 2024 at 17:37):

Is there an easy way to export an environment variable depending on the user's GPU architecture?

view this post on Zulip Andreas (Apr 02 2024 at 17:38):

Also, since I am on a multi GPU system, I'd need a way to set the GPU that ollama is to be using. In docker that is fairly easy to do. I wonder how to do this here.

view this post on Zulip Shivaraj B H (Apr 02 2024 at 17:52):

Andreas said:

Is there an easy way to export an environment variable depending on the user's GPU architecture?

What would this environment variable be used for?

view this post on Zulip Shivaraj B H (Apr 02 2024 at 17:53):

Andreas said:

Also, since I am on a multi GPU system, I'd need a way to set the GPU that ollama is to be using. In docker that is fairly easy to do. I wonder how to do this here.

How does one do this in docker? The multi-GPU scenario is new to me as well, so gotta play around with it a bit

view this post on Zulip Andreas (Apr 02 2024 at 17:57):

Yes, so on the env variable front most users will have to set HSA_OVERRIDE_GFX_VERSION=10.3.0 if they are on RDNA2 or RDNA arch GPUs. On RDNA3 this will be a different value. And it might not be needed if the GPU is officially supported by ROCm (which only very few are outside the data center).

For selecting the second GPU in my machine, I pass --device /dev/kfd and --device /dev/dri/renderD129 to the container, while --device /dev/dri/renderD128 would select the first GPU.

In addition I set ROCR_VISIBLE_DEVICES to either 0 or 1, which, I think, is there for isolating the GPUs.
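Putting those pieces together, something along these lines should pin the container to the second GPU (a sketch only: the renderD* device path, device index and image tag are taken from the commands above and will differ per machine):

docker run -d --device /dev/kfd --device /dev/dri/renderD129 \
  -v ollama:/root/.ollama -p 11434:11434 \
  -e ROCR_VISIBLE_DEVICES="1" \
  -e HSA_OVERRIDE_GFX_VERSION=10.3.0 \
  --name ollama-gpu1 ollama/ollama:0.1.24-rocm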

view this post on Zulip Shivaraj B H (Apr 02 2024 at 18:06):

Does ollama detect the GPU correctly inside a docker container?

view this post on Zulip Andreas (Apr 02 2024 at 18:23):

yes, no issues. I am using ollama's own container images. They are probably built on top of AMD's official images for ROCm.

view this post on Zulip Andreas (Apr 02 2024 at 18:27):

I think a big benefit of having this running via Nix would be a reduced footprint compared to these somewhat massive ROCm container images.

view this post on Zulip Andreas (Apr 02 2024 at 18:28):

ollama/ollama   0.1.24-rocm  9d567aacf463   7 weeks ago    20.9GB

That is the last image I pulled. Smaller might be better :grinning_face_with_smiling_eyes:

view this post on Zulip Shivaraj B H (Apr 02 2024 at 18:29):

I might have to look at how the container images are built, but I am quite certain it should be possible on native as well.

view this post on Zulip Andreas (Apr 02 2024 at 18:30):

it should be, yes

I have just seen that they added a bit more documentation to their images:

https://hub.docker.com/r/bergutman/ollama-rocm

view this post on Zulip Andreas (Apr 02 2024 at 18:31):

ah no, this is not the official one

view this post on Zulip Andreas (Apr 02 2024 at 18:32):

I guess this should be the official dockerfile:

https://github.com/ollama/ollama/blob/main/Dockerfile

view this post on Zulip Shivaraj B H (Apr 02 2024 at 19:09):

Andreas said:

ollama/ollama   0.1.24-rocm  9d567aacf463   7 weeks ago    20.9GB

That is the last image I pulled. Smaller might be better :grinning_face_with_smiling_eyes:

Damn, I just noticed the image size is 20GB

view this post on Zulip Andreas (Apr 02 2024 at 19:11):

yes, and since I already have multiple such images for different apps, you can see how that can easily become a bit cumbersome. Let's just say the image size is one of the many places AMD's ROCm could use some optimization.

But I am happy that it is working more or less nicely out of the box now. That is already a big thing. So they might get there eventually.

view this post on Zulip Shivaraj B H (Apr 02 2024 at 19:52):

It might just be very simple to enable ROCm, I am trying something out with CUDA, if that works out, I will post here and you can tell me if it works for you

view this post on Zulip Andreas (Apr 02 2024 at 19:57):

the only CUDA device I have is my laptop GPU with 2GB of VRAM. So you'd need to set a very small model for that. (Btw. defining models more dynamically might also be a nice add-on for this)

view this post on Zulip Andreas (Apr 02 2024 at 19:57):

but once you have a CUDA implementation going, I can see if I can derive a ROCm implementation from that one

view this post on Zulip Shivaraj B H (Apr 02 2024 at 20:01):

Shivaraj B H said:

It might just be very simple to enable ROCm, I am trying something out with CUDA, if that works out, I will post here and you can tell me if it works for you

my bad, I wasn’t clear here.

This is what I am trying for CUDA:

services.ollama."ollama" = {
  enable = true;
  package = pkgs.ollama.override { acceleration = "cuda"; };
  host = "0.0.0.0";
  models = [ "llama2-uncensored" ];
};

What I want you to try is:

services.ollama."ollama" = {
  enable = true;
  package = pkgs.ollama.override { acceleration = "rocm"; };
  host = "0.0.0.0";
  models = [ "llama2-uncensored" ];
};

view this post on Zulip Andreas (Apr 02 2024 at 20:02):

okay, but for that I'd need to actually provide the ROCm libraries somewhere, right? Otherwise there is nothing to pick up. Also ollama should detect that automatically, if it has been compiled with ROCm support. At least I believe it should.

view this post on Zulip Shivaraj B H (Apr 02 2024 at 20:03):

but for that I'd need to actually provide the ROCm libraries somewhere, right?

nixpkgs does it for you: https://github.com/NixOS/nixpkgs/blob/8a22284f51fcd7771ee65ba124175bf9b90505ad/pkgs/tools/misc/ollama/default.nix#L51-L62

view this post on Zulip Shivaraj B H (Apr 02 2024 at 20:04):

package = pkgs.ollama.override { acceleration = "rocm"; };

when you use the above override it will build ollama again with the ROCm libraries in its env, at least that is what I believe it does.

view this post on Zulip Andreas (Apr 02 2024 at 20:05):

alright, I will clone the repo and try if that works tomorrow

view this post on Zulip Shivaraj B H (Apr 02 2024 at 20:09):

after the override for cuda, ollama detected the libraries:
(from the logs of ollama serve)

time=2024-04-02T20:07:05.102Z level=INFO source=gpu.go:237 msg="Discovered GPU libraries: [/nix/store/ikmrvm9hp2j7mka3li49cycx8mzbw080-nvidia-x11-550.67-6.6.23/lib/libnvidia-ml.so.550.67]"

but it fails at a later stage:

nvmlInit_v2 err: 18
time=2024-04-02T20:07:05.106Z level=INFO source=gpu.go:249 msg="Unable to load CUDA management library /nix/store/ikmrvm9hp2j7mka3li49cycx8mzbw080-nvidia-x11-550.67-6.6.23/lib/libnvidia-ml.so.550.67: nvml vram init failure: 18"

I will investigate this tomorrow

view this post on Zulip Andreas (Apr 02 2024 at 20:10):

I am suspecting now already that for ROCm you might need other components too. Like rocmPackages.rocm-runtime for instance, or rocmPackages.rocm-smi maybe. Not sure.

view this post on Zulip Shivaraj B H (Apr 02 2024 at 20:13):

Yup, let’s check that out tomorrow. I will head out for today

view this post on Zulip Srid (Apr 02 2024 at 20:27):

/me hasn't had a chance to check this out because he is yet to find a reliable internet connection in Australia

view this post on Zulip Shivaraj B H (Apr 02 2024 at 20:57):

macOS support is still a WIP for the open-webui backend; it works well on Linux for now. This is because the lock.json only locks for one platform atm. I have to switch to pdm for package management instead of pip, ensuring multi-platform support.

view this post on Zulip Shivaraj B H (Apr 02 2024 at 20:59):

Shivaraj B H said:

after the override for cuda, ollama detected the libraries:
(from the logs of ollama serve)

time=2024-04-02T20:07:05.102Z level=INFO source=gpu.go:237 msg="Discovered GPU libraries: [/nix/store/ikmrvm9hp2j7mka3li49cycx8mzbw080-nvidia-x11-550.67-6.6.23/lib/libnvidia-ml.so.550.67]"

but it fails at a later stage:

nvmlInit_v2 err: 18
time=2024-04-02T20:07:05.106Z level=INFO source=gpu.go:249 msg="Unable to load CUDA management library /nix/store/ikmrvm9hp2j7mka3li49cycx8mzbw080-nvidia-x11-550.67-6.6.23/lib/libnvidia-ml.so.550.67: nvml vram init failure: 18"

I will investigate this tomorrow

@Andreas turns out it was just an incompatibility between the driver version installed on my system and the respective library exposed to ollama by the override. I matched them and now it works :tada:

view this post on Zulip Shivaraj B H (Apr 02 2024 at 21:00):

The speed difference is huge, like 20x. On my CPU I get about 10-11 tokens/second and with an RTX 3070 8GB VRAM mobile GPU, I get about 200-210 tokens/second

view this post on Zulip Shivaraj B H (Apr 02 2024 at 21:14):

Wait, the tokens-per-second figure is for prompt eval; generation eval is much slower for both of them. For CPU it is 3.9 tokens/second and for GPU it is 24.7 tokens/second, about 10x better on GPU.

view this post on Zulip Andreas (Apr 03 2024 at 08:44):

yes, I will try later today. Where can you see the generation speed btw.?

view this post on Zulip Andreas (Apr 03 2024 at 08:44):

but the huge speed increase is to be expected. That's why I wanted GPU to work so badly. :grinning_face_with_smiling_eyes:

view this post on Zulip Shivaraj B H (Apr 03 2024 at 08:50):

Andreas said:

yes, I will try later today. Where can you see the generation speed btw.?

You can find it in the logs of the ollama serve process after a request is complete. I am thinking of exposing a benchmark script that can be run to compare numbers on different hardware.
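In the meantime, a rough sketch of what such a benchmark could look like using only the HTTP API (assuming jq is available; the non-streaming /api/generate response reports prompt_eval_count/prompt_eval_duration and eval_count/eval_duration in nanoseconds):

curl -s http://localhost:11434/api/generate -d '{
  "model": "llama2-uncensored",
  "prompt": "Why is the sky blue?",
  "stream": false
}' | jq '{prompt_tps: (.prompt_eval_count / .prompt_eval_duration * 1e9),
         gen_tps: (.eval_count / .eval_duration * 1e9)}'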

view this post on Zulip Andreas (Apr 03 2024 at 08:51):

ah so it's in the logs. I never bothered to look at them so far. But I'd like to compare my two AMD cards with your Nvidia 3070.

Also speed will depend on the model I guess. You are still using llama-7b-uncensored, right?

view this post on Zulip Shivaraj B H (Apr 03 2024 at 08:54):

Andreas said:

ah so it's in the logs. I never bothered to look at them so far. But I'd like to compare my two AMD cards with your Nvidia 3070.

Also speed will depend on the model I guess. You are still using llama-7b-uncensored, right?

Yup, I am using llama-7b-uncensored. I tried llama-70b-uncensored on my GPU and it is extremely slow, like 0.3-0.5 tokens/second

view this post on Zulip Shivaraj B H (Apr 03 2024 at 08:55):

Didn’t even bother trying on CPU

view this post on Zulip Shivaraj B H (Apr 03 2024 at 08:56):

I will start playing around in the night today, after I am done with work

view this post on Zulip Andreas (Apr 03 2024 at 08:58):

Actually you did bother trying on CPU. :big_smile:

Because ollama most likely did fall back to CPU as you ran out of VRAM with your 8 GB.

view this post on Zulip Andreas (Apr 03 2024 at 08:59):

but for smaller models, there are quite a few capable models for code-related tasks that fit into 8GB of VRAM. And even a good amount that might fit my 2 GB of VRAM on my somewhat older laptop.

view this post on Zulip Shivaraj B H (Apr 03 2024 at 09:23):

Because ollama most likely did fall back to CPU as you ran out of VRAM with your 8 GB.

Right, that makes sense.

view this post on Zulip Shivaraj B H (Apr 03 2024 at 10:48):

E972A83F-3126-43BE-B30F-BC656E551C9A.jpg

Now, I can access it on my phone as well

view this post on Zulip Andreas (Apr 03 2024 at 13:31):

I am right now trying this snippet of code:

services.ollama."ollama" = {
  enable = true;
  package = pkgs.ollama.override { acceleration = "rocm"; };
  host = "0.0.0.0";
  models = [ "llama2-uncensored" ];
};

However the result is that Nix wants to compile rocblas-6.0.2 from source... or at least that's what my CPU fans sound like. Not ideal. I tried compiling the ROCm stack from source once. I have a Ryzen 3900X, but even with that machine this will take several hours, depending on how much it needs to compile.

That being said, how can I specify runtime dependencies for the service I am declaring? Because I have a feeling I might need some, as I do not have ROCm runtime packages available on my system by default.

view this post on Zulip Andreas (Apr 03 2024 at 13:36):

I will let it do its building and see what happens (most likely it will not work, since... ROCm)

view this post on Zulip Shivaraj B H (Apr 03 2024 at 14:00):

However the result is that Nix wants to compile rocblas-6.0.2 from source

assuming you are on x86_64-linux, it should be coming from cache: https://hydra.nixos.org/build/254389608
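A quick way to sanity-check that (assuming the rocmPackages.rocblas attribute path, and that the flake's locked nixpkgs matches what Hydra built) is to ask the binary cache directly:

nix path-info --store https://cache.nixos.org nixpkgs#rocmPackages.rocblas

If that fails, the evaluated output path isn't in the cache and a local rebuild is expected.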

view this post on Zulip Andreas (Apr 03 2024 at 14:00):

that would make sense, however it did not

view this post on Zulip Andreas (Apr 03 2024 at 14:02):

right now it has been stuck for the last 10 min for some reason, telling me:

[1/1/10 built, 57 copied (9806.3/9806.5 MiB), 1089.1 MiB DL] building rocblas-6.0.2 (buildPhase): 4 warnings generated when compiling for gfx942.

I will wait a bit longer and see if it gets back up on its feet again

view this post on Zulip Shivaraj B H (Apr 03 2024 at 14:02):

I will let it do its building and see what happens (most likely it will not work, since... ROCm)

The only problem I would see is the mismatch of versions between the drivers installed on your OS and the rocmPackages in the nixpkgs commit we are using in nixify-ollama's flake.nix

view this post on Zulip Shivaraj B H (Apr 03 2024 at 14:03):

Andreas said:

right now it has been stuck for the last 10 min for some reason, telling me:

[1/1/10 built, 57 copied (9806.3/9806.5 MiB), 1089.1 MiB DL] building rocblas-6.0.2 (buildPhase): 4 warnings generated when compiling for gfx942.

I will wait a bit longer and see if it gets back up on its feet again

It definitely is building from scratch, not a good sign.

view this post on Zulip Andreas (Apr 03 2024 at 14:03):

mismatch of versions between drivers installed on your OS

The drivers are the Linux kernel drivers for my kernel version. Which is the LTS kernel 6.1.82 right now.

It definitely is building from scratch, not a good sign.

Might this be due to the ollama package being overridden in the nixify flake?

view this post on Zulip Shivaraj B H (Apr 03 2024 at 14:08):

Might this be due to the ollama package being overridden in the nixify flake?

That should rebuild only the ollama package and not the rocmPackages

view this post on Zulip Andreas (Apr 03 2024 at 14:09):

okay, my base system is on NixOS stable, not unstable. Would that have an impact here?

view this post on Zulip Shivaraj B H (Apr 03 2024 at 14:10):

Andreas said:

okay, my base system is on NixOS stable, not unstable. Would that have an impact here?

probably, let’s wait for it to build. In any case, you can override rocmPackages as well (to match your system’s stable release), so not a problem

view this post on Zulip Andreas (Apr 03 2024 at 14:35):

since it's not moving anywhere, I decided to cancel it now

view this post on Zulip Andreas (Apr 03 2024 at 14:36):

yes, that is a bit of a sad state if Nix wants to rebuild stuff from scratch

view this post on Zulip Andreas (Apr 03 2024 at 14:37):

but since your flake itself uses nixpkgs unstable, normally it should pull in the binaries nonetheless, right?

view this post on Zulip Andreas (Apr 03 2024 at 14:45):

okay, so now I decided to downgrade your flake to the NixOS-23.11 branch, and now it is only compiling ollama itself.

view this post on Zulip Andreas (Apr 03 2024 at 17:29):

okay, so it compiled, but the webui now gets me an internal server error when trying to talk to the model.

view this post on Zulip Andreas (Apr 03 2024 at 17:33):

When I do a

curl http://localhost:11434/api/generate -d '{
  "model": "llama2-uncensored",
  "prompt":"Why is the sky blue?"
}'

I get curl: (52) Empty reply from server

view this post on Zulip Andreas (Apr 03 2024 at 17:37):

when I run that on my docker llama2, there is no issue, I get a stream of individual response tokens

view this post on Zulip Andreas (Apr 03 2024 at 17:42):

how would I set the environment for the process-compose flake services? I'd say passing OLLAMA_DEBUG=1 might help find out what isn't moving there...

view this post on Zulip Andreas (Apr 03 2024 at 17:54):

As for stats, I get this on my GPU:

[1712166751] print_timings: prompt eval time =      25.77 ms /     0 tokens (     inf ms per token,     0.00 tokens per second)
[1712166751] print_timings:        eval time =    2132.53 ms /   118 runs   (   18.07 ms per token,    55.33 tokens per second)
[1712166751] print_timings:       total time =    2158.30 ms

view this post on Zulip Andreas (Apr 03 2024 at 17:55):

no idea if this is good or bad. But the response comes pretty fast I'd say

view this post on Zulip Shivaraj B H (Apr 03 2024 at 17:59):

I’d need to expose an option to add extraEnvs

view this post on Zulip Andreas (Apr 03 2024 at 17:59):

and this is from my secondary GPU

[1712167117] print_timings: prompt eval time =      95.26 ms /    28 tokens (    3.40 ms per token,   293.94 tokens per second)
[1712167117] print_timings:        eval time =    5266.83 ms /   212 runs   (   24.84 ms per token,    40.25 tokens per second)
[1712167117] print_timings:       total time =    5362.09 ms

view this post on Zulip Shivaraj B H (Apr 03 2024 at 18:00):

These stats are from within docker, right?

view this post on Zulip Shivaraj B H (Apr 03 2024 at 18:01):

What GPUs are you running on your machine?

view this post on Zulip Andreas (Apr 03 2024 at 18:01):

Yeah the stats are from within the docker containers

Great idea having some env to set. Let's see how you implement it :smile:

Is there a way to attach to the tty of the running process in process-compose? That way I could see what ollama spits out in debug messages

view this post on Zulip Andreas (Apr 03 2024 at 18:01):

Shivaraj B H said:

What GPUs are you running on your machine?

1) Radeon Pro W6800
2) Radeon Pro W6600

view this post on Zulip Shivaraj B H (Apr 03 2024 at 18:01):

You could do “nix run — -t=false” to disable tui mode in process-compose

view this post on Zulip Andreas (Apr 03 2024 at 18:03):

So your 3070 was getting 25 tokens per second, yes? I find it odd that the W6600 should be that much faster with 40 tokens per second.

view this post on Zulip Shivaraj B H (Apr 03 2024 at 18:03):

Andreas said:

Shivaraj B H said:

What GPUs are you running on your machine?

1) Radeon Pro W6800
2) Radeon Pro W6600

That's 32 GB and 8 GB of VRAM respectively?

view this post on Zulip Shivaraj B H (Apr 03 2024 at 18:04):

Andreas said:

So your 3070 was getting 25 tokens per second, yes? I find it odd that the W6600 should be that much faster with 40 tokens per second.

Does that depend on the prompt maybe? I will try giving the same prompt as you

view this post on Zulip Shivaraj B H (Apr 03 2024 at 18:05):

And is your GPU overclocked by any chance?

view this post on Zulip Andreas (Apr 03 2024 at 18:05):

that is possible, I was just trying the prompt via curl that ollama has on GitHub, just with a different model

curl http://localhost:11434/api/generate -d '{
  "model": "llama2-uncensored",
  "prompt":"Why is the sky blue?"
}'

view this post on Zulip Andreas (Apr 03 2024 at 18:07):

Shivaraj B H said:

And is your GPU overclocked by any chance?

Nope, rather underclocked compared to the gaming variant. The TDP for the Radeon Pro W6600 is basically locked at around 100W and it's a one-slot card, which is very nice. The same goes for the W6800, which is the equivalent of the RX 6800 but uses less power. It does however have twice the VRAM.

view this post on Zulip Andreas (Apr 03 2024 at 18:08):

When it comes to gaming your 3070 should be rather close to my W6800 I'd say.

view this post on Zulip Andreas (Apr 03 2024 at 18:23):

Shivaraj B H said:

You could do “nix run — -t=false” to disable tui mode in process-compose

copy pasting this will only get me error: unrecognised flag '-t'

view this post on Zulip Shivaraj B H (Apr 03 2024 at 18:53):

Andreas said:

Shivaraj B H said:

You could do “nix run — -t=false” to disable tui mode in process-compose

copy pasting this will only get me error: unrecognised flag '-t’

My bad, it is nix run .#default -- -t=false

view this post on Zulip Andreas (Apr 03 2024 at 18:58):

(Yes that was me being stupid / lazy I guess)

okay let's see... there are quite a few errors coming from ollama

view this post on Zulip Shivaraj B H (Apr 03 2024 at 18:58):

Also, I have implemented the extraEnvs option. You can pull the latest changes on nixify-ollama and use this:

{
  services.ollama."ollama" = {
    enable = true;
    host = "0.0.0.0";
    models = [ "llama2-uncensored" ];
    extraEnvs = {
      OLLAMA_DEBUG="1";
    };
  };
}

view this post on Zulip Andreas (Apr 03 2024 at 18:59):

and at some point among the gigantic mess we see this:

[ollama ] CUDA error: invalid device function
[ollama ]   current device: 0, in function ggml_cuda_op_flatten at /build/source/llm/llama.cpp/ggml-cuda.cu:10012
[ollama ]   hipGetLastError()
[ollama ] GGML_ASSERT: /build/source/llm/llama.cpp/ggml-cuda.cu:256: !"CUDA error"
[ollama ] loading library /tmp/ollama1196279674/rocm/libext_server.so
[ollama ] SIGABRT: abort
[ollama ] PC=0x7f95e13c3ddc m=10 sigcode=18446744073709551610
[ollama ] signal arrived during cgo execution

My assumption is that I need to set an env var to get over this. So I will pull and try again.

view this post on Zulip Andreas (Apr 03 2024 at 19:16):

Success!!!

services.ollama."ollama" = {
  enable = true;
  package = pkgs.ollama.override { acceleration = "rocm"; };
  host = "0.0.0.0";
  models = [ "llama2-uncensored" ];
  extraEnvs = {
    HSA_OVERRIDE_GFX_VERSION = "10.3.0";
    OLLAMA_DEBUG = "1";
  };
};

view this post on Zulip Andreas (Apr 03 2024 at 19:17):

If I ask it about the Roman empire, I still get 56 tokens per second on the W6800

view this post on Zulip Andreas (Apr 03 2024 at 19:19):

sometimes it goes down to 48-49 tokens per second, but the model does not like to give long responses it seems

view this post on Zulip Andreas (Apr 03 2024 at 19:19):

but at least it is running now

view this post on Zulip Shivaraj B H (Apr 03 2024 at 19:20):

Andreas said:

If I ask it about the Roman empire, I still get 56 tokens per second on the W6800

Is that the same in docker with GPU acceleration?

view this post on Zulip Shivaraj B H (Apr 03 2024 at 19:20):

Andreas said:

Success!!!

services.ollama."ollama" = {
  enable = true;
  package = pkgs.ollama.override { acceleration = "rocm"; };
  host = "0.0.0.0";
  models = [ "llama2-uncensored" ];
  extraEnvs = {
    HSA_OVERRIDE_GFX_VERSION = "10.3.0";
    OLLAMA_DEBUG = "1";
  };
};

Great, will star these messages to document them later.

view this post on Zulip Andreas (Apr 03 2024 at 19:21):

Shivaraj B H said:

Andreas said:

If I ask it about the Roman empire, I still get 56 tokens per second on the W6800

Is that the same in docker with GPU acceleration?

I guess so, more or less

view this post on Zulip Andreas (Apr 03 2024 at 19:21):

I mean we could try and think how to structure the flake so that you can run with different GPU architectures.

view this post on Zulip Shivaraj B H (Apr 03 2024 at 19:22):

I mean we could try and think how to structure the flake so that you can run with different GPU architectures.

I will be providing two different nix-runnable apps: nix run .#with-cuda and nix run .#with-rocm. This is what I am thinking for now

view this post on Zulip Andreas (Apr 03 2024 at 19:23):

I'd just go for #cuda and #rocm ... and I think I'd also really like to disable the webui.

view this post on Zulip Shivaraj B H (Apr 03 2024 at 19:24):

The webui, I am thinking of enabling only for the default app. It can be disabled for the others

view this post on Zulip Shivaraj B H (Apr 03 2024 at 19:25):

let me add cuda and rocm really quick

view this post on Zulip Andreas (Apr 03 2024 at 19:25):

I'd say let me (or the consumer) choose to enable or disable it.

view this post on Zulip Shivaraj B H (Apr 03 2024 at 19:26):

In the future, I am also planning to export a home-manager module that will support systemd config for Linux and launchd config for macOS, to allow running the ollama server in the background

view this post on Zulip Andreas (Apr 03 2024 at 19:27):

I mean there is already a nixos module, isn't there?

view this post on Zulip Shivaraj B H (Apr 03 2024 at 19:27):

Yes, I can take inspiration from it, but can't reuse it on macOS or other Linux distros

view this post on Zulip Andreas (Apr 03 2024 at 19:28):

yes, that is true.

view this post on Zulip Shivaraj B H (Apr 03 2024 at 20:22):

Added support for cuda and rocm:

{
  default = {
    imports = [ common ];
    services.ollama-stack.open-webui.enable = true;
  };
  cuda = {
    imports = [ common ];
    services.ollama-stack.extraOllamaConfig = {
      package = pkgs.ollama.override { acceleration = "cuda"; };
    };
  };
  rocm = {
    imports = [ common ];
    services.ollama-stack.extraOllamaConfig = {
      package = pkgs.ollama.override { acceleration = "rocm"; };
    };
  };
}

Above is the definition for the three apps; you can enable or disable open-webui on any of them. You can even pass extraOllamaConfig. I didn't add HSA_OVERRIDE_GFX_VERSION yet because I want to understand more about why it is needed and how to determine the version

view this post on Zulip Srid (Apr 03 2024 at 20:27):

I have a Hetzner dedicated server (x86_64-linux) now; how do I try this out?

view this post on Zulip Shivaraj B H (Apr 03 2024 at 20:28):

Srid said:

I have a Hetzner dedicated server (x86_64-linux) now; how do I try this out?

nix run github:shivaraj-bh/nixify-ollama

view this post on Zulip Shivaraj B H (Apr 03 2024 at 20:32):

Shivaraj B H said:

Srid said:

I have a Hetzner dedicated server (x86_64-linux) now; how do I try this out?

nix run github:shivaraj-bh/nixify-ollama

And then to test it:

curl http://<hetzner-machine-ip>:11434/api/generate -d '{
  "model": “llama2-uncensored",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

or

open http://<hetzner-machine-ip>:1111

to open open-webui

view this post on Zulip Srid (Apr 03 2024 at 20:34):

What does the curl command do exactly? Why can't I type that prompt in the webui itself?

view this post on Zulip Srid (Apr 03 2024 at 20:34):

(I have firewall setup so this means I'd have to setup port forwarding for 1111 the webui, which is fine, but why should I have to do it for 11434?)

view this post on Zulip Andreas (Apr 03 2024 at 20:36):

@Shivaraj B H

On the HSA_OVERRIDE_GFX_VERSION variable. I didn't find any documentation on it. However, the whole ROCm stack can be compiled for different LLVM targets, which you can find here:

https://llvm.org/docs/AMDGPUUsage.html (there are quite a few recent ones not documented there, no idea why not)

You will see that gfx1030 corresponds to my Radeon Pro W6800 (which is basically the same as any consumer RX 6800). This is where I got the number from. Now usually that should not be necessary for my card as it is one of the few which have official support. However, as we have seen, it still is necessary to override this, as it will be for any other RDNA2 based card, and probably RDNA1. Basically you fake having a Radeon RX 6800 architecture to ROCm so that it runs anyway, which works just fine on my W6600 for instance.

If you have a more recent RDNA3 based card, you will need a different override. Most likely HSA_OVERRIDE_GFX_VERSION=11.0.0
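So, roughly, the override boils down to something like this (the exact values are assumptions; check the LLVM target list above for your card and pick only the line that applies):

# RDNA1/RDNA2 cards (e.g. RX 6000 / Radeon Pro W6x00): fake the gfx1030 target
export HSA_OVERRIDE_GFX_VERSION=10.3.0
# RDNA3 cards (e.g. RX 7000): most likely the gfx1100 target
export HSA_OVERRIDE_GFX_VERSION=11.0.0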

view this post on Zulip Shivaraj B H (Apr 03 2024 at 20:37):

Srid said:

What does the curl command do exactly? Why can't I type that prompt in the webui itself?

You can do the same thing from the webui as well

view this post on Zulip Shivaraj B H (Apr 03 2024 at 20:38):

Srid said:

(I have firewall setup so this means I'd have to setup port forwarding for 1111 the webui, which is fine, but why should I have to do it for 11434?)

only 1111 should suffice

view this post on Zulip Srid (Apr 03 2024 at 20:38):

image.png

Should I sign up?

view this post on Zulip Andreas (Apr 03 2024 at 20:38):

Srid said:

What does the curl command do exactly? Why can't I type that prompt in the webui itself?

The curl command is just talking to ollama's API directly. That API is running on port 11434, which the webui is most likely talking to as well.

view this post on Zulip Shivaraj B H (Apr 03 2024 at 20:39):

Srid said:

image.png

Should I sign up?

Yes, I was looking for ways to get rid of it, but open-webui has it hardcoded. You can sign up with any dummy email

view this post on Zulip Srid (Apr 03 2024 at 20:39):

As a "noob user" looking to explore this, I'd just want to do the minimal thing necessary to get this up and running. Would be good to document it in README

view this post on Zulip Srid (Apr 03 2024 at 20:40):

Okay I created an account. I suppose it stores it in a local DB?

view this post on Zulip Srid (Apr 03 2024 at 20:40):

image.png

Nice! :tada:

view this post on Zulip Shivaraj B H (Apr 03 2024 at 20:41):

Srid said:

As a "noob user" looking to explore this, I'd just want to do the minimal thing necessary to get this up and running. Would be good to document it in README

Yup, I was looking for simpler alternatives

view this post on Zulip Shivaraj B H (Apr 03 2024 at 20:41):

Srid said:

Okay I created an account. I suppose it stores it in local DB?

Yes

view this post on Zulip Srid (Apr 03 2024 at 20:42):

This would be a great example for services-flake.

view this post on Zulip Shivaraj B H (Apr 03 2024 at 20:44):

If you port-forward 11434 you can use the Enchanted iOS client to get a nice-looking mobile and desktop app as well.

view this post on Zulip Andreas (Apr 03 2024 at 20:45):

So @Shivaraj B H you broke ROCm again because the HSA_OVERRIDE_GFX_VERSION variable isn't set... how do you propose to set it so I don't have to walk into ./nix/ollama.nix?

view this post on Zulip Andreas (Apr 03 2024 at 20:45):

(Maybe we do that tomorrow :big_smile: I'll check out for today!)

view this post on Zulip Shivaraj B H (Apr 03 2024 at 20:46):

Andreas said:

So Shivaraj B H you broke ROCm again because the HSA_OVERRIDE_GFX_VERSION variable isn't set... how do you propose to set it so I don't have to walk into ./nix/ollama.nix?

You can add the envs here: https://github.com/shivaraj-bh/nixify-ollama/blob/017dca208fbec393f8c5c6b574c1c1234df176ce/flake.nix#L70-L72

view this post on Zulip Shivaraj B H (Apr 03 2024 at 21:30):

The repo is now renamed to ollama-flake: https://github.com/shivaraj-bh/ollama-flake/issues/2

view this post on Zulip Shivaraj B H (Apr 03 2024 at 22:01):

@Andreas I get about 77 tokens/second on the “why is the sky blue?” prompt, with GPU:

[ollama ] {"function":"print_timings","level":"INFO","line":257,"msg":"prompt eval time     =      62.70 ms /    28 tokens (    2.24 ms per token,   446.56 t
okens per second)","n_prompt_tokens_processed":28,"n_tokens_second":446.5638506562894,"slot_id":0,"t_prompt_processing":62.701,"t_token":2.2393214285714285,"
task_id":0,"tid":"139803154691776","timestamp":1712181596}
[ollama ] {"function":"print_timings","level":"INFO","line":271,"msg":"generation eval time =    1717.36 ms /   133 runs   (   12.91 ms per token,    77.44 t
okens per second)","n_decoded":133,"n_tokens_second":77.44463000100154,"slot_id":0,"t_token":12.91245112781955,"t_token_generation":1717.356,"task_id":0,"tid
":"139803154691776","timestamp":1712181596}

view this post on Zulip Andreas (Apr 04 2024 at 10:40):

I have a feeling you should have some commented-out env vars right in the main flake.nix ... usability.

Yes, I can get the 55 tokens/second I reported on the W6800. What you got now is more what I expected. Nvidia should be faster at this, also because CUDA will be more optimized than ROCm. We might try ROCm 6.0 at some point and see if this improves performance on AMD, because right now it is ROCm 5.7 I am running.

view this post on Zulip Andreas (Apr 04 2024 at 10:42):

Also a funny effect I had: ollama wouldn't shut down properly, some .ollama-unwrap process got stuck, didn't release the bound port 11434, and even on reboot systemd had quite some work to do to get the process killed. Not sure what happened there...

view this post on Zulip Andreas (Apr 04 2024 at 10:52):

Next thing I'd do for usability is add a curated list of models that are commented out by default (because let's be honest, llama2-7b is not the most talkative buddy) and document that a bit, so users can choose what they want.

view this post on Zulip Shivaraj B H (Apr 04 2024 at 11:35):

I have a feeling you should have some commented-out env vars right in the main flake.nix ... usability.

Right, the primary ones I would see people using are CUDA_VISIBLE_DEVICES, HIP_VISIBLE_DEVICES and of course HSA_OVERRIDE_GFX_VERSION

view this post on Zulip Shivaraj B H (Apr 04 2024 at 11:36):

What you got now is more what I expected

I investigated why the performance was poor earlier. Turns out, my GPU was not running in performance mode.

view this post on Zulip Shivaraj B H (Apr 04 2024 at 11:39):

Andreas said:

Next thing I'd do for usability is add a curated list of models that are commented out by default (because let's be honest, llama2-7b is not the most talkative buddy) and document that a bit, so users can choose what they want.

Along with that I should also document how to override cudaPackages or rocmPackages to match the ones installed on the system.

What I tend to do is nix run github:shivaraj-bh/ollama-flake#cuda --override-input nixpkgs flake:nixpkgs. In my configuration, I have the registry flake:nixpkgs pinned to the same nixpkgs I use to install the Nvidia drivers, so it should work out of the box. I will add this as a possible solution.

view this post on Zulip Shivaraj B H (Apr 04 2024 at 11:41):

If they are on a non-NixOS machine, then they will have to manually check the version of the drivers installed and use the compatible cudaPackages or rocmPackages

view this post on Zulip Andreas (Apr 04 2024 at 11:46):

Yes, all that sounds quite good. There is also ROCR_VISIBLE_DEVICES for GPU isolation.

view this post on Zulip Shivaraj B H (Apr 04 2024 at 11:51):

I believe it is HIP_VISIBLE_DEVICES, I don’t see ROCR_VISIBLE_DEVICES in ollama docs: https://github.com/ollama/ollama/blob/9768e2dc7574c36608bb04ac39a3b79e639a837f/docs/gpu.md?plain=1#L88-L93

view this post on Zulip Andreas (Apr 04 2024 at 11:53):

yeah maybe that one doesn't do much in our context.

AMD's docs are here for once: https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html

(At least they have some in this case)
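As a sketch of how these could be combined when starting the server directly (the device index and override value are assumptions for a second RDNA2 card; HIP_VISIBLE_DEVICES comes from ollama's GPU docs, ROCR_VISIBLE_DEVICES from AMD's isolation docs above):

HIP_VISIBLE_DEVICES=1 HSA_OVERRIDE_GFX_VERSION=10.3.0 ollama serve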

view this post on Zulip Shivaraj B H (Apr 04 2024 at 11:56):

Gotcha, will add that too!

view this post on Zulip Andreas (Apr 04 2024 at 11:57):

funny thing is that, according to those docs, CUDA_VISIBLE_DEVICES has the same effect as HIP_VISIBLE_DEVICES for compatibility reasons

view this post on Zulip Andreas (Apr 04 2024 at 12:19):

once you make another commit, I will test this again and see which flag does what in practice. You never know.

view this post on Zulip Shivaraj B H (Apr 04 2024 at 13:20):

done: https://github.com/shivaraj-bh/ollama-flake/commit/7b6375cc4849c6b3a91be5543043d3c820d312f6

view this post on Zulip Andreas (Apr 04 2024 at 18:05):

just trying this out on my Nvidia Laptop with 2GB of VRAM. It starts building ollama 0.1.29 from source. Any idea why this might be happening? I am also on NixOS stable there...

view this post on Zulip Andreas (Apr 04 2024 at 18:18):

so after building a bit, it actually works. I can run deepseek-coder:1.3-instruct on my crappy Nvidia MX 150.

And it gets me 27 tokens / sec. Not too bad for this older laptop GPU.

view this post on Zulip Andreas (Apr 04 2024 at 18:27):

I think it'd be nice to provide a list of models directly in flake.nix

view this post on Zulip Andreas (Apr 04 2024 at 18:27):

but I'll play with this setup a bit tomorrow for coding. Sadly it won't be very good at Nix.

view this post on Zulip Shivaraj B H (Apr 04 2024 at 18:28):

Andreas said:

just trying this out on my Nvidia Laptop with 2GB of VRAM. It starts building ollama 0.1.29 from source. Any idea why this might be happening? I am also on NixOS stable there...

If you are overriding with acceleration = "cuda", it will build ollama from scratch. Although, I have noticed that it doesn't do that if I use the garnix cache.

view this post on Zulip Shivaraj B H (Apr 04 2024 at 18:29):

Andreas said:

I think it'd be nice to provide a list of models directly in flake.nix

I was thinking of linking to ollama.com/library

view this post on Zulip Shivaraj B H (Apr 04 2024 at 18:30):

Andreas said:

but I'll play with this setup a bit tomorrow for coding. Sadly it won't be very good at Nix.

Someone’s gotta train it with good content. Unfortunately there isn’t much. Hopefully with Nixos.asia tutorials, we can bridge that gap

view this post on Zulip Andreas (Apr 04 2024 at 18:31):

I mean my idea was to try and finetune something at some point. I am still stuck at the point of "What's the right data that I need for that?" ... because if I understand it correctly, most base models haven't seen a whole lot of Nix code.

view this post on Zulip Shivaraj B H (Apr 04 2024 at 18:36):

Yup

view this post on Zulip Andreas (Apr 04 2024 at 19:12):

Getting back to the issue at hand: is there a good way of passing an externally defined config file to the flake that would specify the models I want to use? Otherwise I'd say it'd be nice to expose that list of models directly in flake.nix somehow.

view this post on Zulip Andreas (Apr 04 2024 at 19:13):

(I will be trying out the continue.dev extension in VSCode a bit more with this local A.I. and deepseek-coder:1.3b-instruct. So far it performs quite okay.)

view this post on Zulip Shivaraj B H (Apr 05 2024 at 11:36):

Andreas said:

Getting back to the issue at hand: is there a good way of passing an externally defined config file to the flake that would specify the models I want to use? Otherwise I'd say it'd be nice to expose that list of models directly in flake.nix somehow.

Nothing comes to mind right off the top of my head; I will give it some thought over the weekend

view this post on Zulip Andreas (Apr 05 2024 at 11:41):

it just came up because, with me not maintaining a fork of your repo, every time I do git pull it obviously annoys me because of the changes I made myself to the files. So how about reading the local network config and models from a TOML or YAML file that the user would have to create on their end and that is in .gitignore? You might provide a template in the repo for the user to adapt to their use case.

view this post on Zulip Shivaraj B H (Apr 05 2024 at 11:45):

Andreas said:

it just came up because, with me not maintaining a fork of your repo, every time I do git pull it obviously annoys me because of the changes I made myself to the files. So how about reading the local network config and models from a TOML or YAML file that the user would have to create on their end and that is in .gitignore? You might provide a template in the repo for the user to adapt to their use case.

Yup, that makes sense

view this post on Zulip Shivaraj B H (Apr 08 2024 at 11:28):

macOS support added: https://github.com/shivaraj-bh/ollama-flake/pull/3

Next steps:

view this post on Zulip Andreas (Apr 08 2024 at 11:42):

Awesome!

view this post on Zulip Andreas (Apr 08 2024 at 11:42):

And I will bet on others being thankful for that possibility as well

view this post on Zulip Srid (Apr 09 2024 at 06:02):

Is this expected? (on macOS, after downloading the model for about 30 mins)

image.png

view this post on Zulip Shivaraj B H (Apr 09 2024 at 06:03):

I have to debug this on macOS; usually a restart of that process and then the open-browser process solves it

view this post on Zulip Srid (Apr 09 2024 at 06:03):

Alright, just a temporary issue. Re-running it worked

view this post on Zulip Shivaraj B H (Apr 09 2024 at 06:04):

I think it has to do with the initial_delay_seconds of the readiness_probe because uvicorn takes a while to start

view this post on Zulip Shivaraj B H (Apr 09 2024 at 06:09):

And also, about the model pull taking a long time: I am thinking of creating something like dockerTools.pullImage for ollama models and serving it as a cache. I have observed that ollama pull starts off with good bandwidth and it just goes down to kbps in the end.

view this post on Zulip Shivaraj B H (Apr 09 2024 at 06:11):

In this way, I can also run flake checks without requiring an internet connection in sandbox mode

view this post on Zulip Shivaraj B H (Apr 09 2024 at 17:50):

@Andreas ollama-flake now exports a processComposeModule, see the examples: https://github.com/shivaraj-bh/ollama-flake/tree/main/example

view this post on Zulip Shivaraj B H (Apr 09 2024 at 17:51):

You can use them in any of your flakes now

view this post on Zulip Shivaraj B H (Apr 09 2024 at 17:52):

I will be updating the README with relevant details too

view this post on Zulip Shivaraj B H (Apr 09 2024 at 17:57):

I want to make some design changes, like decoupling the open-webui service from ollama itself, allowing them to be configured separately. I will deprioritise this for now, as I see a feature like dockerTools.pullImage for ollama models being more useful; I will research that a bit next.

view this post on Zulip Shivaraj B H (Apr 09 2024 at 17:58):

like decoupling the open-webui service from ollama itself

This will also allow configuring multiple frontends in the future, enable and disable them as you like.

view this post on Zulip Andreas (Apr 09 2024 at 18:13):

awesome, looking very nice! I will give it a try tomorrow maybe, or a bit later...

view this post on Zulip Shivaraj B H (Apr 10 2024 at 08:19):

Got my PR merged to open-webui: https://github.com/open-webui/open-webui/pull/1472

This has helped in packaging the frontend and backend in one single derivation.

Earlier I had to do a lot of hacky stuff to work around this:
Screenshot-2024-04-10-at-1.48.04PM.png

view this post on Zulip Shivaraj B H (Apr 10 2024 at 08:19):

Now it looks clean:
Screenshot-2024-04-10-at-1.48.46PM.png

view this post on Zulip Shivaraj B H (Apr 10 2024 at 08:24):

I will resume work on ollama-flake tonight, will solve some juspay/nix-health issues now.

view this post on Zulip Andreas (Apr 10 2024 at 09:22):

Beautiful! I have a feeling this will be the best ollama flake ever, period... once people know about it, that is. Maybe after it has good usability, you could post it somewhere to create publicity?

view this post on Zulip Shivaraj B H (Apr 19 2024 at 14:15):

So, I am back after a quick break. Before I left, I tried to create a derivation that would pull the model and cache it in /nix/store, just like dockerTools.pullImage does, but with docker images. Unfortunately, there is a blocker for this: https://github.com/ollama/ollama/issues/3369.

I will get back to this, once that issue is resolved (I tried to implement it for a bit, but it isn’t that straightforward). Looking forward to that.

For now, I will decouple the frontend service (open-webui) from the ollama backend service. Maybe also add another frontend service to it (to show how easily swap-able it can be), make a 0.1.0 release and announce it.

view this post on Zulip Andreas (Apr 19 2024 at 17:44):

it should be swap-able to some extent. Ollama has a long list of possible frontend services iirc. I am not sure that storing models in the nix store is a great strategy, as these can be fairly big, and you might want to add some from the ollama CLI after the fact. Also you certainly would not want your 30 GB model to be garbage collected while still using it from time to time, so you'd have to prevent that somehow. In docker the image might be stored in the nix store without issues as it is immutable once built. But docker volumes and their content would not be, I suppose.

idk if I am right though, let me know what you think.

view this post on Zulip Andreas (Apr 19 2024 at 18:23):

also: how does this work right now? I kinda don't see the command, and just checking nix flake show and nix flake info and looking into flake.nix didn't really tell me how to run the rocm service right now... I am in an LXC container right now, running some Debian 12 which doesn't want to give me podman on ZFS sadly, and I was trying to get this to work.

view this post on Zulip Shivaraj B H (Apr 20 2024 at 16:51):

Andreas said:

it should be swap-able to some extent. Ollama has a long list of possible frontend services iirc. I am not sure that storing models in the nix store is a great strategy, as these can be fairly big, and you might want to add some from the ollama CLI after the fact. Also you certainly would not want your 30 GB model to be garbage collected while still using it from time to time, so you'd have to prevent that somehow. In docker the image might be stored in the nix store without issues as it is immutable once built. But docker volumes and their content would not be, I suppose.

idk if I am right though, let me know what you think.

You might be right, I haven't fully evaluated this idea yet. Anyways, can't do much while the issue I linked above from ollama is still open.

view this post on Zulip Shivaraj B H (Apr 20 2024 at 16:52):

Andreas said:

also: how does this work right now? I kinda don't see the command, and just checking nix flake show and nix flake info and looking into flake.nix didn't really tell me how to run the rocm service right now... I am in an LXC container right now, running some Debian 12 which doesn't want to give me podman on ZFS sadly, and I was trying to get this to work.

I have updated the README now; you can use the flake template. For you, it will look like:

mkdir my-ollama-flake && cd ./my-ollama-flake
nix flake init -t github:shivaraj-bh/ollama-flake#rocm
nix run

view this post on Zulip Shivaraj B H (Apr 20 2024 at 16:53):

Or if you already have an existing flake that you would like to integrate this into, you can look at the examples and grab the necessary pieces

view this post on Zulip Shivaraj B H (Apr 20 2024 at 17:18):

I decided to not depend on services-flake in ollama-flake (yet to update the README). I was unnecessarily keeping only the ollama service in the former and everything else in the latter. Let's keep it simple! I have decided to bundle everything related to ollama in this single repository (from the server, to frontends, also the CLI clients and so on).

Aside, I do have some ambitious plans for ollama-flake in the future, one of which being:

Provide a just generate-doc <service> command in services-flake. This command will run a process-compose app that will start the ollama server (configured by ollama-flake), run a CLI client (gotta find something like smartcat for ollama), provide the context of docs of other services and the tests for <service> and out comes the doc for <service>.

Let’s see how this idea fares
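
To make the idea a bit more concrete, here is a purely illustrative shell sketch of what such a recipe could do under the hood; the model name, file layout (doc/, test/), and the way context is fed in are all assumptions, not the actual design:

# hypothetical sketch of `just generate-doc mysql`
service="mysql"
# assumed layout: existing docs under doc/, service tests under test/
{
  echo "Write documentation for the ${service} service, in the same style as the reference doc below."
  cat doc/postgres.md
  cat test/${service}*.nix
} | ollama run llama3 > "doc/${service}.md"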

view this post on Zulip Andreas (Apr 20 2024 at 17:27):

So if I get this correctly you want to write tests and docs in an automated fashion?

What would be the input of <service>? Some git repo with an existing codebase?

view this post on Zulip Shivaraj B H (Apr 20 2024 at 17:29):

Yes, at the moment I'm focused only on docs. <service> here could be any of the many that we support in services-flake: for example, Postgres, MySQL, Redis, and so on. We do have docs for Postgres, but not for MySQL and many other services.

view this post on Zulip Andreas (Apr 21 2024 at 10:59):

Alright... one thing I noticed that confused me is this: your flake is now not really a flake for providing ollama, but a set of flake templates for providing ollama. If this is to be permanent, maybe the repo should be renamed again?

view this post on Zulip Shivaraj B H (Apr 22 2024 at 06:12):

There are other repos that follow the same approach of providing flake templates/a flake module, like services-flake, rust-flake, and just-flake. I just went with the same naming convention. What do you think would be more appropriate?

view this post on Zulip Srid (Apr 28 2024 at 04:56):

FYI

https://discourse.nixos.org/t/ollama-cant-find-libstdc-so-6/44305

view this post on Zulip Shivaraj B H (Apr 28 2024 at 05:20):

I doubt he is using the Ollama package from nixpkgs; otherwise this wouldn't occur.

view this post on Zulip Srid (Apr 28 2024 at 05:21):

Aside: TIL that NixOS has a service for ollama:

https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/services/misc/ollama.nix

view this post on Zulip Shivaraj B H (Apr 28 2024 at 05:39):

Yes, it just starts the server; pulling the model is a manual process.
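
For reference, a minimal NixOS sketch of that split: the module only runs the daemon, and the model name below is just an example.

# configuration.nix: run the ollama daemon as a systemd service
services.ollama.enable = true;
# services.ollama.acceleration = "rocm";  # if the module version in use exposes this option

Then, on the running machine, models are still fetched by hand:

ollama pull llama3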

view this post on Zulip Srid (Apr 29 2024 at 08:18):

Ollama v0.1.33 with Llama 3, Phi 3, and Qwen 110B

https://news.ycombinator.com/item?id=40191723

view this post on Zulip Andreas (May 02 2024 at 16:52):

yeah I need to test ollama 0.1.33 actually

view this post on Zulip Srid (May 16 2024 at 05:31):

Looks like private-gpt uses ollama

view this post on Zulip Srid (May 16 2024 at 05:32):

It'd be interesting to be able to nix run it, and feed it local documents for querying.

The nixpkgs PR has a module that is NixOS only, obviously.

view this post on Zulip Shivaraj B H (May 16 2024 at 05:52):

This is cool, I was looking to add one more UI before I announce. This is a good candidate

view this post on Zulip Shivaraj B H (May 21 2024 at 20:25):

There is a package request for open-webui in nixpkgs; someone just referenced what I have packaged in ollama-flake: https://github.com/NixOS/nixpkgs/issues/309567#issuecomment-2105940033

If a PR gets merged in nixpkgs for the above issue, we can switch over to that.

view this post on Zulip Shivaraj B H (May 22 2024 at 09:01):

Nice to see features getting added to nixpkgs inspired by ollama-flake: https://github.com/NixOS/nixpkgs/pull/313606

view this post on Zulip Shivaraj B H (May 25 2024 at 16:47):

Screenshot-2024-05-25-at-10.15.52PM.png

private-gpt in ollama-flake! I will share the nix command for you to try out in a bit; I am yet to push the commits.

view this post on Zulip Shivaraj B H (May 25 2024 at 16:49):

In the screenshot you are seeing llama3:8b querying on https://nixos.asia/en/blog/replacing-docker-compose

view this post on Zulip Shivaraj B H (May 25 2024 at 17:21):

There you go:

nix run "github:shivaraj-bh/ollama-flake/private-gpt?dir=example/private-gpt" --refresh

view this post on Zulip Shivaraj B H (May 25 2024 at 17:24):

works on Linux and macOS

view this post on Zulip Srid (May 25 2024 at 17:24):

Trying it out ...

Should be --refresh not -refresh.

view this post on Zulip Shivaraj B H (May 25 2024 at 17:24):

Thanks, updated

view this post on Zulip Srid (May 25 2024 at 17:25):

And should be next to nix, not at the end

view this post on Zulip Srid (May 25 2024 at 17:25):

image.png

view this post on Zulip Shivaraj B H (May 25 2024 at 17:25):

And should be next to nix, not at the end

Seems to work even at the end

view this post on Zulip Srid (May 25 2024 at 17:26):

s/”/"/g

view this post on Zulip Srid (May 25 2024 at 17:26):

Is the readiness check failure of concern here?

image.png

view this post on Zulip Srid (May 25 2024 at 17:27):

Alright, what should I do now? It is running.

view this post on Zulip Shivaraj B H (May 25 2024 at 17:27):

The readiness thing happens on macOS; I believe you remember it happened with open-webui too

view this post on Zulip Srid (May 25 2024 at 17:27):

Srid said:

Is the readiness check failure of concern here?

image.png

Exit code -1, incidentally.

view this post on Zulip Shivaraj B H (May 25 2024 at 17:27):

I need to fix that; for now, restarting the process works

view this post on Zulip Srid (May 25 2024 at 17:28):

Okay, I restarted it, now what?

image.png

view this post on Zulip Srid (May 25 2024 at 17:28):

I expected it to automatically open something in my browser, TBH

view this post on Zulip Shivaraj B H (May 25 2024 at 17:29):

the web UI is available on port 8001; I need to add the open-browser process to this

view this post on Zulip Shivaraj B H (May 25 2024 at 17:29):

It's a TODO

view this post on Zulip Srid (May 25 2024 at 17:29):

Yea, that would be helpful - especially if it is cross-platform (xdg-open vs open?)

view this post on Zulip Srid (May 25 2024 at 17:29):

So there are these two things to address before we have nix run ... just work?

view this post on Zulip Andreas (May 25 2024 at 17:30):

is this supposed to run on AMD / ROCm as well? I might try later...

view this post on Zulip Shivaraj B H (May 25 2024 at 17:30):

Yea, that would be helpful - especially if it is cross-platform (xdg-open vs open?)

open-webui already has this, I need to generalise it: https://github.com/shivaraj-bh/ollama-flake/blob/b61859956129b63fc6e2c8ad1ab4c8d13cc6cc96/services/open-webui.nix#L87-L97
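
For the cross-platform part, the generalised helper would presumably boil down to something like this sketch; the port is the one mentioned above, and the exact script in open-webui.nix may differ:

url="http://localhost:8001"
# macOS ships `open`; most Linux desktops ship `xdg-open`
if [ "$(uname)" = "Darwin" ]; then
  open "$url"
else
  xdg-open "$url"
fi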

view this post on Zulip Shivaraj B H (May 25 2024 at 17:30):

is this supposed to run on AMD / ROCm as well? I might try later…

Yes

view this post on Zulip Shivaraj B H (May 25 2024 at 17:31):

Srid said:

s/”/"/g

Zulip does something with the text while copy-pasting: it changed the quotation marks, and also the "—" became "-".

Edit: I can’t reproduce, maybe it was something else.

view this post on Zulip Srid (May 25 2024 at 17:37):

Uploaded an Ikea invoice to it:

image.png

view this post on Zulip Srid (May 25 2024 at 17:37):

There's a broken image link, though.

Pointing to http://localhost:8001/file=/Users/srid/code/the-actualism-way/private_gpt/ui/avatar-bot.ico

(/Users/srid/code/the-actualism-way/ is the $PWD)

view this post on Zulip Srid (May 25 2024 at 17:38):

When I open that link, the text response is:

{"detail":"File not allowed: /Users/srid/code/the-actualism-way/private_gpt/ui/avatar-bot.ico."}

view this post on Zulip Srid (May 25 2024 at 17:38):

It is missing the ./data/ parent directory.

view this post on Zulip Shivaraj B H (May 25 2024 at 17:38):

Noted, 3 things to fix then!

view this post on Zulip Srid (May 25 2024 at 17:39):

But I can't find the .ico file in the ./data directory. So 4th thing?

view this post on Zulip Shivaraj B H (May 25 2024 at 17:39):

It is probably in private-gpt's source; might have to copy it from there.

Yes, it is: https://github.com/zylon-ai/private-gpt/blob/main/private_gpt/ui/avatar-bot.ico

view this post on Zulip Shivaraj B H (May 25 2024 at 17:41):

might have to copy it from there.

Or just point it to the /nix/store.. path

view this post on Zulip Shivaraj B H (May 25 2024 at 18:00):

Noted, 3 things to fix then!

Fixed one of them: https://github.com/shivaraj-bh/ollama-flake/commit/1c19aadfbb975b2f16f05163322b677d13201760

Now the browser should open as soon as private-gpt is healthy

view this post on Zulip Srid (May 25 2024 at 18:04):

Uploading https://www.foo.be/docs-free/social-architecture/book.pdf

Took maybe 30 seconds to process. But the results are underwhelming:

image.png

view this post on Zulip Srid (May 25 2024 at 18:05):

Also, must it always create $PWD/data? Why not ~/.ollama-flake/data?

view this post on Zulip Shivaraj B H (May 25 2024 at 18:05):

I think that's a better default; we can reuse the pre-loaded models

view this post on Zulip Shivaraj B H (May 25 2024 at 18:07):

Took maybe 30 seconds to process. But the results are underwhelming:

I am yet to play around with larger documents. It works fine with smaller ones; I have tried one with 2000-3000 words.

view this post on Zulip Srid (May 25 2024 at 19:19):

This thing creates a tiktoken_cache directory in $PWD for some reason.

view this post on Zulip Andreas (May 25 2024 at 19:22):

I played a little with some RAG stuff in the open webui, and it was underwhelming as well. There must be some kind of better approach I haven't yet figured out. Maybe we should ask some other people around here who might know more?

view this post on Zulip Andreas (May 25 2024 at 19:23):

At least larger books didn't really work at all. But maybe having smaller documents is somewhat important.

view this post on Zulip Shivaraj B H (May 31 2024 at 20:47):

upstreaming open-webui to nixpkgs: https://github.com/NixOS/nixpkgs/pull/316248

view this post on Zulip Andreas (Jun 01 2024 at 07:00):

Very nice!

