https://github.com/ollama/ollama
You can run it from nixpkgs. Run these two commands in separate terminals:
nix run nixpkgs#ollama serve
nix run nixpkgs#ollama run llama2-uncensored
Works on macOS, at least.
(The query I used to test the model is inspired by the one Musk used to demonstrate Grok)
Well it might work if ROCm works. Which it doesn't if I try. So I would probably have to build a shell for ROCm or something. Otherwise I will be using CPU...
In case anyone likes to try this on AMD, here is the command to create a nice running docker container on Linux:
docker run -d --privileged --device /dev/kfd -v ollama:/root/.ollama -p 11434:11434 -e OLLAMA_DEBUG=1 -e ROCR_VISIBLE_DEVICES="0" -e HSA_OVERRIDE_GFX_VERSION=10.3.0 --name ollama ollama/ollama:0.1.24-rocm
The HSA_OVERRIDE_GFX_VERSION=10.3.0 environment variable is only needed if your GPU is not officially supported by ROCm, which is the case for most consumer and workstation GPUs, except a handful. Depending on the GPU it might or might not work.
And then you can do a
docker exec -it ollama ollama run llama2
Or whatever model you want to run
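If you are not sure which models are already pulled inside the container, something like
docker exec -it ollama ollama list
should show them; ollama list just prints the locally available model tags.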
@Andreas Does the docker container run do anything special compared to the pure Nix instance (nix run ..) above?
well it has ROCm installed obviously so that the GPU can use it. However that is just the container for ROCm they provide. Judging from the Dockerfile in the ollama GitHub repo, they build them on the basis of some CentOS images provided by AMD.
These in turn have their own repo, which I presume to be this one: https://github.com/ROCm/ROCm-docker
This might be the Dockerfile for the base image on top of which ollama compiles their Go binary:
https://github.com/ROCm/ROCm-docker/blob/master/dev/Dockerfile-centos-7-complete
I created a process-compose-flake module for ollama. This is for a personal project that I am working on, and here’s how easy it is to set up:
{
services.ollama = {
enable = true;
host = "100.71.49.133";
# TODO: adding more than one model breaks in shellcheck
models = [ "llama2-uncensored" ];
};
}
nix run ... cmd?
I will upstream this to services-flake, it is right now in the private repo as a module.
More about the project soon.
Here it is: https://github.com/juspay/services-flake/pull/137
The nix run command is a bit long though:
nix run "github:juspay/services-flake/ollama?dir=test"#ollama --override-input services-flake github:juspay/services-flake/ollama
This starts up the ollama server with no models though; let me use a smaller sample model in test.
https://ollama.com/library/tinyllama looks like a good candidate
Oh wait, I can’t really pull models in flake check because of sandbox mode. I think these models will have to be pre-fetched by another derivation.
A separate repo that uses the services-flake module, to provide a single nix run .. to get some of these models up and running, even including a chat interface, would be cool.
https://github.com/F1bonacc1/process-compose/issues/64 I can provide a single nix run command with this, basically allowing users to interact with the chat process in the process-compose window
https://github.com/F1bonacc1/process-compose/issues/64#issuecomment-1974895517
I will try it out with 0.88.0
Curious if there's a chat bot web app that can interact with ollama
There is this at the top of the search results: https://github.com/HelgeSverre/ollama-gui
Let's ship it. Gonna be a rad demo of services-flake.
https://github.com/open-webui/open-webui there’s also this (much more popular), but looks like it is bloated with a lot of features.
I will give both the UIs a shot though
I will try this tomorrow, going to get some sleep now
https://github.com/NixOS/nixpkgs/pull/275448
Shivaraj B H said:
https://github.com/F1bonacc1/process-compose/issues/64#issuecomment-1974895517
I will try it out with 0.88.0
Update: I didn’t notice that the STDIN support is not there, so even with the current PTY support, this isn’t possible in the process-compose window. Gotta take the WebUI route, anyways, the webui demo will be much cooler.
Screenshot-2024-03-25-at-11.47.58PM.png
I have the UI running, it involved a lot of hacks to get it running because of the way open-webui uses the python backend.
nix run github:shivaraj-bh/nixify-ollama
Unsurprisingly it does not use the GPU on my machine. We'd probably have to add all the ROCm related components.
But other than that, nice work!
oh I was a bit too quick: webui cannot find any models on my end...
I faced this problem, I had to manually change the ollama server url and press the refresh button. Will look for a way to fix this
Anyways, open-webui is a bit too much, I am looking for a simpler UI, which is easier to set up
open-webui actually looks quite nice. I personally use my containerized ollama with other clients though.
maybe we can work at making this ollama flake compatible with ROCm at some point. Did you test it with CUDA?
GPU acceleration is next on my list after I get this UI to work out of the box.
I mean it should just pick up the GPU if available. It's just that I don't keep the ROCm libraries on my system if I don't need them specifically. So there is not much to be found.
Andreas said:
open-webui actually looks quite nice. I personally use my containerized ollama with other clients though.
Yes, it does. I only hate how I can’t configure things like disabling web auth. When I am using it in my private network, or for development, I don’t really need it.
I only hate how I can’t configure things like disabling web auth
Yes, it appears to have features you might rather want for a somewhat larger deployment, like load balancing and such
oh I was a bit too quick: webui cannot find any models on my end…
Screenshot-2024-04-01-at-4.58.42PM.png
The problem appears to be that I am not prefixing the protocol before the IP; as soon as I do that, it starts working. I will fix it
I have also included an Up Next section in the README, to track what I will be doing next: https://github.com/shivaraj-bh/nixify-ollama/blob/main/README.md
Shivaraj B H said:
oh I was a bit too quick: webui cannot find any models on my end…
Screenshot-2024-04-01-at-4.58.42PM.png
The problem appears to be that I am not prefixing the protocol before the IP; as soon as I do that, it starts working. I will fix it
I have fixed it, select model should now work without requiring any hacks
Yes, it's working now. Just a bit slow for my taste on CPU.
but we are getting there
I might look and see if I can get it to run with ROCm. So far I am only using containers for ROCm. But the stuff should all be in nixpkgs.
Yes, I will continue working tomorrow. I am going to hit the sack for the day
yeah it's somewhat late already where you are :smile:
I decided to create an issue on your repo for ROCm support @Shivaraj B H
Yup, I was checking out macOS support today, while I was in parallel setting up my new machine with Nvidia drivers to try out GPU acceleration
I don’t have AMD hardware to test out ROCm, but I suppose I can test it on some cheap cloud instances
that is nice, maybe we can adapt the Nvidia settings to run AMD stuff. Nice thing about ROCm is that it's just running on top of the AMDGPU kernel drivers.
Shivaraj B H said:
I don’t have AMD hardware to test out ROCm, but I suppose I can test it on some cheap cloud instances
Or I'd be the one testing it.
Is there an easy way to export an environment variable depending on the user's GPU architecture?
Also, since I am on a multi GPU system, I'd need a way to set the GPU that ollama is to be using. In docker that is fairly easy to do. I wonder how to do this here.
Andreas said:
Is there an easy way to export an environment variable depending on the user's GPU architecture?
What would this environment variable be used for?
Andreas said:
Also, since I am on a multi GPU system, I'd need a way to set the GPU that ollama is to be using. In docker that is fairly easy to do. I wonder how to do this here.
How does one do this in docker? Multi-gpu scenario is new to me as well, so gotta play around with it a bit
Yes, so on the env variable front, most users will have to set HSA_OVERRIDE_GFX_VERSION=10.3.0 if they are on RDNA2 or RDNA arch GPUs. On RDNA3 this will be a different value. And it might not be needed if the GPU is officially supported by ROCm (which only very few are outside the data center).
For selecting the second GPU in my machine, I pass --device /dev/kfd and --device /dev/dri/renderD129 to the container, while --device /dev/dri/renderD128 would select the first GPU. In addition I set ROCR_VISIBLE_DEVICES to either 0 or 1, which, I think, is there for isolating the GPUs.
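Putting that together, a minimal sketch of such an invocation for the second GPU would be something like this (the render node path, image tag and HSA override value are assumptions that depend on your system and card):
docker run -d --device /dev/kfd --device /dev/dri/renderD129 -v ollama:/root/.ollama -p 11434:11434 -e ROCR_VISIBLE_DEVICES="0" -e HSA_OVERRIDE_GFX_VERSION=10.3.0 --name ollama ollama/ollama:0.1.24-rocm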
Does ollama detect the GPU correctly inside a docker container?
yes, no issues. I am using ollama's own container images. They are probably built on top of AMD's official images for ROCm.
I think a big benefit of having this running via Nix would be a reduced footprint compared to these somewhat massive ROCm container images.
ollama/ollama 0.1.24-rocm 9d567aacf463 7 weeks ago 20.9GB
That is the last image I pulled. Smaller might be better :grinning_face_with_smiling_eyes:
I might have to look at how the container images are built, but I am quite certain it should be possible on native as well.
it should be, yes
I have just seen that they added a bit more documentation to their images:
https://hub.docker.com/r/bergutman/ollama-rocm
ah no, this is not the official one
I guess this should be the official dockerfile:
https://github.com/ollama/ollama/blob/main/Dockerfile
Andreas said:
ollama/ollama 0.1.24-rocm 9d567aacf463 7 weeks ago 20.9GB
That is the last image I pulled. Smaller might be better :grinning_face_with_smiling_eyes:
Damn, I just noticed the image size is 20GB
yes, and since I already have multiple such images for different apps, you can see how that can easily become a bit cumbersome. Let's just say the image size is one of the many places AMD's ROCm could use some optimization.
But I am happy that it is working more or less nicely out of the box now. That is already a big thing. So they might get there eventually.
It might just be very simple to enable ROCm, I am trying something out with CUDA, if that works out, I will post here and you can tell me if it works for you
the only CUDA device I have is my laptop GPU with 2GB of VRAM. So you'd need to set a very small model for that. (Btw. defining models more dynamically might also be a nice add-on for this)
but once you have a CUDA implementation going, I can see if I can derive a ROCm implementation from that one
Shivaraj B H said:
It might just be very simple to enable ROCm, I am trying something out with CUDA, if that works out, I will post here and you can tell me if it works for you
my bad, I wasn’t clear here.
This is what I am trying for CUDA:
services.ollama."ollama" = {
enable = true;
package = pkgs.ollama.override { acceleration = "cuda"; };
host = "0.0.0.0";
models = [ "llama2-uncensored" ];
};
What I want you to try is:
services.ollama."ollama" = {
enable = true;
package = pkgs.ollama.override { acceleration = "rocm"; };
host = "0.0.0.0";
models = [ "llama2-uncensored" ];
};
okay, but for that I'd need to actually provide the ROCm libraries somewhere, right? Otherwise there is nothing to pick up. Also ollama should detect that automatically, if it has been compiled with ROCm support. At least I believe it should.
but for that I'd need to actually provide the ROCm libraries somewhere, right?
nixpkgs does it for you: https://github.com/NixOS/nixpkgs/blob/8a22284f51fcd7771ee65ba124175bf9b90505ad/pkgs/tools/misc/ollama/default.nix#L51-L62
package = pkgs.ollama.override { acceleration = "rocm"; };
when you use the above override it will build ollama again with the rocm libraries in its env, at least that is what I believe it does.
alright, I will clone the repo and try if that works tomorrow
after the override for cuda, ollama detected the libraries:
(from the logs of ollama serve)
time=2024-04-02T20:07:05.102Z level=INFO source=gpu.go:237 msg="Discovered GPU libraries: [/nix/store/ikmrvm9hp2j7mka3li49cycx8mzbw080-nvidia-x11-550.67-6.6.23/lib/libnvidia-ml.so.550.67]"
but it fails at a later stage:
nvmlInit_v2 err: 18
time=2024-04-02T20:07:05.106Z level=INFO source=gpu.go:249 msg="Unable to load CUDA management library /nix/store/ikmrvm9hp2j7mka3li49cycx8mzbw080-nvidia-x11-550.67-6.6.23/lib/libnvidia-ml.so.550.67: nvml vram init failure: 18"
I will investigate this tomorrow
I already suspect that for ROCm you might need other components too, like rocmPackages.rocm-runtime for instance, or maybe rocmPackages.rocm-smi. Not sure.
Yup, let’s check that out tomorrow. I will head out for today
/me hasn't had a chance to check this out because he is yet to find a reliable internet connection in Australia
macOS support is still a WIP for the open-webui backend; it works well on Linux for now. It is because the lock.json only locks for one platform atm. I have to switch to pdm for package management instead of pip, ensuring multi-platform support.
Shivaraj B H said:
after the override for cuda, ollama detected the libraries:
(from the logs of ollama serve)
time=2024-04-02T20:07:05.102Z level=INFO source=gpu.go:237 msg="Discovered GPU libraries: [/nix/store/ikmrvm9hp2j7mka3li49cycx8mzbw080-nvidia-x11-550.67-6.6.23/lib/libnvidia-ml.so.550.67]"
but it fails at a later stage:
nvmlInit_v2 err: 18
time=2024-04-02T20:07:05.106Z level=INFO source=gpu.go:249 msg="Unable to load CUDA management library /nix/store/ikmrvm9hp2j7mka3li49cycx8mzbw080-nvidia-x11-550.67-6.6.23/lib/libnvidia-ml.so.550.67: nvml vram init failure: 18"
I will investigate this tomorrow
@Andreas turns out it was just incompatibility between the driver version installed on my system vs the respective library exposed to ollama by the override. I matched them and now it works :tada:
The speed difference is huge, like 20x. On my CPU I get about 10-11 tokens/second and with RTX 3070 8GB vram mobile GPU, I get about 200-210 tokens/second
Wait, the tokens/second figure above is for prompt eval; generation eval is much slower for both of them. For CPU it is 3.9 tokens/second and for GPU it is 24.7 tokens/second, about 10x better on GPU.
yes, I will try later today. Where can you see the generation speed btw.?
but the huge speed increase is to be expected. That's why I wanted GPU to work so badly. :grinning_face_with_smiling_eyes:
Andreas said:
yes, I will try later today. Where can you see the generation speed btw.?
You can find it in the logs of the ollama serve process after a request is complete. I am thinking of exposing a benchmark script that can be run to compare numbers on different hardware.
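Something like this could be the core of it (a rough sketch, assuming jq is available; if I remember the API right, the non-streaming /api/generate response reports prompt_eval_count/prompt_eval_duration and eval_count/eval_duration, with durations in nanoseconds):
curl -s http://localhost:11434/api/generate -d '{"model": "llama2-uncensored", "prompt": "Why is the sky blue?", "stream": false}' \
| jq '{prompt_tps: (.prompt_eval_count / .prompt_eval_duration * 1e9), gen_tps: (.eval_count / .eval_duration * 1e9)}'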
ah so it's in the logs. I never bothered to look at them so far. But I'd like to compare my two AMD cards with your Nvidia 3070.
Also speed will depend on the model I guess. You are still using llama-7b-uncensored, right?
Andreas said:
ah so it's in the logs. I never bothered to look at them so far. But I'd like to compare my two AMD cards with your Nvidia 3070.
Also speed will depend on the model I guess. You are still using llama-7b-uncensored, right?
Yup, I am using llama-7b-uncensored. I tried with llama-70b-uncensored on my GPU, it is extremely slow, like 0.3-0.5 tokens/second
Didn’t even bother trying on CPU
I will start playing around in the night today, after I am done with work
Actually you did bother trying on CPU. :big_smile:
Because ollama most likely did fall back to CPU as you ran out of VRAM with your 8 GB.
but for smaller models, there are quite a few capable models for code-related tasks that fit into 8GB of VRAM. And even a good amount that might fit my 2 GB of VRAM on my somewhat older laptop.
Because ollama most likely did fall back to CPU as you ran out of VRAM with your 8 GB.
Right, that makes sense.
E972A83F-3126-43BE-B30F-BC656E551C9A.jpg
Now, I can access it on my phone as well
I am right now trying this snippet of code:
services.ollama."ollama" = {
enable = true;
package = pkgs.ollama.override { acceleration = "rocm"; };
host = "0.0.0.0";
models = [ "llama2-uncensored" ];
};
However the result is that Nix wants to compile rocblas-6.0.2 from source... or at least that's what my CPU fans sound like. Not ideal. I tried compiling the ROCm stack from source once. I have a Ryzen 3900x, but even with that machine this will take several hours, depending on how much it needs to compile.
That being said, how can I specify runtime dependencies for the service I am declaring? Because I have a feeling I might need some, as I do not have ROCm runtime packages available on my system by default.
I will let it do its building and see what happens (most likely it will not work, since... ROCm)
However the result is that Nix wants to compile rocblas-6.0.2 from source
assuming you are on x86_64-linux, it should be coming from cache: https://hydra.nixos.org/build/254389608
that would make sense, however it did not
right now it has been stuck for some reason for the last 10 min, telling me:
[1/1/10 built, 57 copied (9806.3/9806.5 MiB), 1089.1 MiB DL] building rocblas-6.0.2 (buildPhase): 4 warnings generated when compiling for gfx942.
I will wait a bit longer and see if it gets back up on its feet again
I will let it do its building and see what happens (most likely it will not work, since... ROCm)
The only problem I would see is the mismatch of versions between drivers installed on your OS with that of rocmPackages in the nixpkgs commit we are using in nixify-ollama’s flake.nix
Andreas said:
right now it has been stuck for some reason for the last 10 min, telling me:
[1/1/10 built, 57 copied (9806.3/9806.5 MiB), 1089.1 MiB DL] building rocblas-6.0.2 (buildPhase): 4 warnings generated when compiling for gfx942.
I will wait a bit longer and see if it gets back up on its feet again
It definitely is building from scratch, not a good sign.
mismatch of versions between drivers installed on your OS
The drivers are the Linux kernel drivers for my kernel version. Which is the LTS kernel 6.1.82 right now.
It definitely is building from scratch, not a good sign.
Might this be due to the ollama package being overridden in the nixify flake?
Might this be due to the ollama package being overridden in the nixify flake?
That should rebuild only the ollama package and not the rocmPackages
okay, my base system is on NixOS stable, not unstable. Would that have an impact here?
Andreas said:
okay, my base system is on NixOS stable, not unstable. Would that have an impact here?
probably, let’s wait for it to build. In any case, you can override rocmPackages as well (to match your system’s stable release), so not a problem
since it's not moving anywhere, I decided to cancel it now
yes, that is a bit of a sad state if Nix wants to rebuild stuff from scratch
but since your flake itself uses nixpkgs unstable, normally it should pull in the binaries nonetheless, right?
okay, so now I decided to downgrade your flake to the NixOS-23.11 branch, and now it is only compiling ollama itself.
okay, so it compiled, but the webui now gets me an internal server error when trying to talk to the model.
When I do a
curl http://localhost:11434/api/generate -d '{
"model": "llama2-uncensored",
"prompt":"Why is the sky blue?"
}'
I get curl: (52) Empty reply from server
when I run that on my docker llama2, there is no issue, I get a stream of individual response tokens
how would I set the environment for the process compose flake services? I'd say passing OLLAMA_DEBUG=1 might help find out what isn't moving there...
As for stats, I get this on my GPU:
[1712166751] print_timings: prompt eval time = 25.77 ms / 0 tokens ( inf ms per token, 0.00 tokens per second)
[1712166751] print_timings: eval time = 2132.53 ms / 118 runs ( 18.07 ms per token, 55.33 tokens per second)
[1712166751] print_timings: total time = 2158.30 ms
no idea if this is good or bad. But the response comes pretty fast I'd say
I’d need to expose an option to add extraEnvs
and this is from my secondary GPU
[1712167117] print_timings: prompt eval time = 95.26 ms / 28 tokens ( 3.40 ms per token, 293.94 tokens per second)
[1712167117] print_timings: eval time = 5266.83 ms / 212 runs ( 24.84 ms per token, 40.25 tokens per second)
[1712167117] print_timings: total time = 5362.09 ms
These stats are from within the docker container, are they?
What GPUs are you running on your machine?
Yeah the stats are from within the docker containers
Great idea having some env to set. Let's see how you implement it :smile:
Is there a way to attach to the tty of the running process in process compose? That way I could see what ollama spits out in debug messages
Shivaraj B H said:
What GPUs are you running on your machine?
1) Radeon Pro W6800
2) Radeon Pro W6600
You could do “nix run — -t=false” to disable tui mode in process-compose
So your 3070 was getting 25 tokens per second, yes? I find it odd that the W6600 should be that much faster with 40 tokens per second.
Andreas said:
Shivaraj B H said:
What GPUs are you running on your machine?
1) Radeon Pro W6800
2) Radeon Pro W6600
That’s 32 GB and 8 GB VRAM respectively?
Andreas said:
So you 3070 was getting 25 tokens per second, yes? I find it odd that the W6600 should be that much faster with 40 tokens per second.
Does that depend on the prompt maybe? I will try giving the same prompt as you
And is your GPU overclocked by any chance?
that is possible, I was just trying the prompt via curl that ollama has on github. Just with the different model
curl http://localhost:11434/api/generate -d '{
"model": "llama2-uncensored",
"prompt":"Why is the sky blue?"
}'
Shivaraj B H said:
And is your GPU over clocked by any chance?
Nope, rather underclocked if compared to the gaming variant. The TDP for the Radeon Pro W6600 is basically locked at around 100W; it's a one-slot card, which is very nice. The same goes for the W6800, which is the equivalent of the RX 6800, but uses less power. It does however have twice the VRAM.
When it comes to gaming your 3070 should be rather close to my W6800 I'd say.
Shivaraj B H said:
You could do “nix run — -t=false” to disable tui mode in process-compose
copy pasting this will only get me error: unrecognised flag '-t'
Andreas said:
Shivaraj B H said:
You could do “nix run — -t=false” to disable tui mode in process-compose
copy pasting this will only get me error: unrecognised flag '-t'
My bad, it is nix run .#default -- -t=false
(Yes that was me being stupid / lazy I guess)
okay let's see... there are quite a few errors coming from ollama
Also, I have implemented the extraEnvs option. You can pull the latest changes on nixify-ollama and use this:
{
services.ollama."ollama" = {
enable = true;
host = "0.0.0.0";
models = [ "llama2-uncensored" ];
extraEnvs = {
OLLAMA_DEBUG="1";
};
};
}
and at some point among the gigantic mess we see this:
[ollama ] CUDA error: invalid device function
[ollama ] current device: 0, in function ggml_cuda_op_flatten at /build/source/llm/llama.cpp/ggml-cuda.cu:10012
[ollama ] hipGetLastError()
[ollama ] GGML_ASSERT: /build/source/llm/llama.cpp/ggml-cuda.cu:256: !"CUDA error"
[ollama ] loading library /tmp/ollama1196279674/rocm/libext_server.so
[ollama ] SIGABRT: abort
[ollama ] PC=0x7f95e13c3ddc m=10 sigcode=18446744073709551610
[ollama ] signal arrived during cgo execution
My assumption is that I need to set an env var to get over this. So I will pull and try again.
Success!!!
services.ollama."ollama" = {
enable = true;
package = pkgs.ollama.override { acceleration = "rocm"; };
host = "0.0.0.0";
models = [ "llama2-uncensored" ];
extraEnvs = {
HSA_OVERRIDE_GFX_VERSION = "10.3.0";
OLLAMA_DEBUG = "1";
};
};
If I ask it about the Roman empire, I still get 56 tokens per second on the W6800
sometimes it goes down to 48-49 tokens per second, but the model does not like to give long responses it seems
but at least it is running now
Andreas said:
If I ask it about the Roman empire, I still get 56 tokens per second on the W6800
Is that the same in docker with GPU acceleration?
Andreas said:
Success!!!
services.ollama."ollama" = { enable = true; package = pkgs.ollama.override { acceleration = "rocm"; }; host = “0.0.0.0”; models = [ "llama2-uncensored” ]; extraEnvs = { HSA_OVERRIDE_GFX_VERSION = “10.3.0”; OLLAMA_DEBUG = “1”; };
Great, will star these messages to document them later.
Shivaraj B H said:
Andreas said:
If I ask it about the Roman empire, I still get 56 tokens per second on the W6800
Is that the same in docker with GPU acceleration?
I guess so, more or less
I mean we could try and think how to structure the flake so that you can run with different GPU architectures.
I mean we could try and think how to structure the flake so that you can run with different GPU architectures.
I will be providing two different nix runnable apps, nix run .#with-cuda and nix run .#with-rocm. This is what I am thinking for now
I'd just go for #cuda and #rocm... and I think I'd also really like to disable the webui.
webui, I am thinking only for the default app. Can disable it for others
let me add cuda and rocm really quick
I'd say let me (or the consumer) choose to enable or disable.
In the future, I am also planning to export a home-manager module that will support systemd config for linux and launchd config for mac. To allow running ollama server in the background
I mean there is already a nixos module, isn't there?
Yes, I can take inspiration from it, but can’t reuse it in macos or other linux distros
yes, that is true.
Added support for cuda and rocm:
{
default = {
imports = [ common ];
services.ollama-stack.open-webui.enable = true;
};
cuda = {
imports = [ common ];
services.ollama-stack.extraOllamaConfig = {
package = pkgs.ollama.override { acceleration = "cuda"; };
};
};
rocm = {
imports = [ common ];
services.ollama-stack.extraOllamaConfig = {
package = pkgs.ollama.override { acceleration = "rocm"; };
};
};
}
Above is the definition for three apps; you can enable or disable open-webui on any of them. You can even pass extraOllamaConfig. I didn't add the HSA_OVERRIDE_GFX_VERSION yet because I want to understand more about why it is needed and how to determine the version
I have a Hetzner dedicated server (x86_64-linux) now; how do I try this out?
Srid said:
I have a Hetzner dedicated server (x86_64-linux) now; how do I try this out?
nix run github:shivaraj-bh/nixify-ollama
Shivaraj B H said:
Srid said:
I have a Hetzner dedicated server (x86_64-linux) now; how do I try this out?
nix run github:shivaraj-bh/nixify-ollama
And then to test it:
curl http://<hetzner-machine-ip>:11434/api/generate -d '{
"model": “llama2-uncensored",
"prompt": "Why is the sky blue?",
"stream": false
}'
or open http://<hetzner-machine-ip>:1111 to open open-webui
What does the curl command do exactly? Why can't I type that prompt in the webui itself?
(I have firewall setup so this means I'd have to setup port forwarding for 1111 the webui, which is fine, but why should I have to do it for 11434?)
@Shivaraj B H
On the HSA_OVERRIDE_GFX_VERSION variable: I didn't find any documentation on it. However, the whole ROCm stack can be compiled for different LLVM targets, which you can find here:
https://llvm.org/docs/AMDGPUUsage.html (there are quite a few recent ones not documented there, no idea why not)
You will see that gfx1030 corresponds to my Radeon Pro W6800 (which is basically the same as any consumer RX 6800). This is where I got the number from. Now usually that should not be necessary for my card, as it is one of the few that has official support. However, as we have seen, it still is necessary to override this. As it will be for any other RDNA2 based card, and probably RDNA1. Basically you fake having a Radeon RX 6800 architecture to ROCm so that it runs anyway. Which works just fine on my W6600 for instance.
If you have a more recent RDNA3 based card, you will need a different override, most likely HSA_OVERRIDE_GFX_VERSION=11.0.0
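If you want to double-check which gfx target your card actually reports, something like this should do it, assuming rocminfo is on your PATH (it ships with the ROCm stack):
rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u
gfx1030 then maps to the 10.3.0 override value.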
Srid said:
What does the curl command do exactly? Why can't I type that prompt in the webui itself?
You can do the same thing from webui as well
Srid said:
(I have firewall setup so this means I'd have to setup port forwarding for 1111 the webui, which is fine, but why should I have to do it for 11434?)
only 1111 should suffice
Should I sign up?
Srid said:
What does the curl command do exactly? Why can't I type that prompt in the webui itself?
The curl command is just talking to ollama's API directly. That API is running on port 11434. Which the webui is most likely talking to as well.
Srid said:
Should I sign up?
Yes, I was looking for ways to get rid of it. But open-webui has it hardcoded. You can sign up with any dummy email
As a "noob user" looking to explore this, I'd just want to do the minimal thing necessary to get this up and running. Would be good to document it in README
Okay I created an account. I suppose it stores it in local DB?
Nice! :tada:
Srid said:
As a "noob user" looking to explore this, I'd just want to do the minimal thing necessary to get this up and running. Would be good to document it in README
Yup, I was looking for simpler alternatives
Srid said:
Okay I created an account. I suppose it stores it in local DB?
Yes
This would be a great example for services-flake.
If you port forward 11434 you can use the Enchanted iOS client to get a nice looking mobile and desktop app as well.
So @Shivaraj B H you broke ROCm again because the HSA_OVERRIDE_GFX_VERSION variable isn't set... how do you propose to set it so I don't have to walk into ./nix/ollama.nix?
(Maybe we do that tomorrow :big_smile: I'll check out for today!)
Andreas said:
So Shivaraj B H you broke ROCm again because the HSA_OVERRIDE_GFX_VERSION variable isn't set... how do you propose to set it so I don't have to walk into ./nix/ollama.nix?
You can add the envs here: https://github.com/shivaraj-bh/nixify-ollama/blob/017dca208fbec393f8c5c6b574c1c1234df176ce/flake.nix#L70-L72
The repo is now renamed to ollama-flake: https://github.com/shivaraj-bh/ollama-flake/issues/2
@Andreas I get about 77 tokens/second on the “why is the sky blue?” prompt, with GPU:
[ollama ] {"function":"print_timings","level":"INFO","line":257,"msg":"prompt eval time = 62.70 ms / 28 tokens ( 2.24 ms per token, 446.56 t
okens per second)","n_prompt_tokens_processed":28,"n_tokens_second":446.5638506562894,"slot_id":0,"t_prompt_processing":62.701,"t_token":2.2393214285714285,"
task_id":0,"tid":"139803154691776","timestamp":1712181596}
[ollama ] {"function":"print_timings","level":"INFO","line":271,"msg":"generation eval time = 1717.36 ms / 133 runs ( 12.91 ms per token, 77.44 t
okens per second)","n_decoded":133,"n_tokens_second":77.44463000100154,"slot_id":0,"t_token":12.91245112781955,"t_token_generation":1717.356,"task_id":0,"tid
":"139803154691776","timestamp":1712181596}
I have a feeling you should have some commented out env vars right in the main flake.nix... usability.
Yes, I can get the 55 tokens/second I reported on the W6800. What you got now is more what I expected. Nvidia should be faster at this, also because CUDA will be more optimized than ROCm. We might try ROCm 6.0 at some point and see if this improves performance on AMD, because right now this is ROCm 5.7 I am running.
Also, a funny effect I had: ollama wouldn't shut down properly; some .ollama-unwrap process got stuck, didn't release the bound port 11434, and even on reboot systemd had quite some work to do to get the process killed. Not sure what happened there...
Next thing I'd do for usability is add a curated list of models that are commented out by default (because let's be honest, llama2-7b is not the most talkative buddy) and document that a bit. So users can choose what they want.
I have a feeling you should have some commented out env vars right in the main flake.nix ... usability.
Right, the primary ones I would see people using are CUDA_VISIBLE_DEVICES, HIP_VISIBLE_DEVICES and of course HSA_OVERRIDE_GFX_VERSION
What you got now is more what I expected
I investigated why the performance was poor earlier. Turns out, my GPU was not running in performance mode.
Andreas said:
Next thing I'd do for usability is add a curated list of models that are commented out by default (because let's be honest, llama2-7b is not the most talkative buddy) and document that a bit. So users can choose what they want.
Along with that I should also document how to override cudaPackages or rocmPackages to match the ones installed on the system.
What I tend to do is nix run github:shivaraj-bh/ollama-flake#cuda --override-input nixpkgs flake:nixpkgs. In my configuration, I have the registry flake:nixpkgs pinned to the same nixpkgs I use to install the nvidia drivers, so it should work out of the box. I will add this as a possible solution.
If they are on a non-NixOS machine, then they will have to manually check the version of the drivers installed and use the compatible cudaPackages or rocmPackages
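For the registry pinning part, something like this should do it (the branch here is just an example, pick whatever matches the nixpkgs your drivers come from):
nix registry add nixpkgs github:NixOS/nixpkgs/nixos-23.11
After that, flake:nixpkgs in the --override-input above resolves to that pinned revision.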
Yes, all that sounds quite good. There is also ROCR_VISIBLE_DEVICES for GPU isolation.
I believe it is HIP_VISIBLE_DEVICES, I don't see ROCR_VISIBLE_DEVICES in the ollama docs: https://github.com/ollama/ollama/blob/9768e2dc7574c36608bb04ac39a3b79e639a837f/docs/gpu.md?plain=1#L88-L93
yeah maybe that one doesn't do much in our context.
AMD's docs are here for once: https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html
(At least they have some in this case)
Gotcha, will add that too!
funny thing is that, according to those docs, CUDA_VISIBLE_DEVICES has the same effect as HIP_VISIBLE_DEVICES for compatibility reasons
once you make another commit, I will test this again and see which flag does what in practice. You never know.
done: https://github.com/shivaraj-bh/ollama-flake/commit/7b6375cc4849c6b3a91be5543043d3c820d312f6
just trying this out on my Nvidia laptop with 2GB of VRAM. It starts building ollama 0.1.29 from source. Any idea why this might be happening? I am also on NixOS stable there...
so after building a bit, it actually works. I can run deepseek-coder:1.3-instruct on my crappy Nvidia MX 150.
And it gets me 27 tokens / sec. Not too bad for this older laptop GPU.
I think it'd be nice to provide a list of models directly in flake.nix
but I'll play with this setup a bit tomorrow for coding. Sadly it won't be very good at Nix.
Andreas said:
just trying this out on my Nvidia Laptop with 2GB of VRAM. It starts building ollama 0.1.29 from source. Any idea why this might be happening? I am also on NixOS stable there...
If you are overriding with acceleration = "cuda", it will build ollama from scratch. Although, I have noticed that it doesn't do that if I use garnix cache.
Andreas said:
I think it'd be nice to provide a list of models directly in flake.nix
I was thinking of linking to ollama.com/library
Andreas said:
but I'll play with this setup a bit tomorrow for coding. Sadly it won't be very good at Nix.
Someone’s gotta train it with good content. Unfortunately there isn’t much. Hopefully with Nixos.asia tutorials, we can bridge that gap
I mean my idea was to try and finetune something at some point. I am still stuck at the point of "What's the right data that I need for that?" ... because if I understand it correctly, most base models haven't seen a whole lot of Nix code.
Yup
Getting back to the issue at hand: is there a good way of passing an externally defined config file to the flake that would specify the models I want to use? Otherwise I'd say it'd be nice to expose that list of models directly in flake.nix somehow.
(I will be trying out the continue.dev extension in VSCode a bit more with this local A.I. and deepseek-coder:1.3b-instruct. So far it performs quite okay.)
Andreas said:
Getting back to the issue at hand: is there a good way of passing an externally defined config file to the flake that would specify the models I want to use? Otherwise I'd say it'd be nice to expose that list of models directly in flake.nix somehow.
Nothing that comes to my mind right of the top, will give some thought to it over the weekend
it just came up because, with me not maintaining a fork of your repo, every time I do git pull it obviously gets annoying because of the changes I made myself to the files. So how about reading the local network config and models from a TOML or YAML file that the user would have to create on their end and that is in .gitignore? You might provide a template in the repo for the user to adapt to their use case.
Andreas said:
it just came up because, with me not maintaining a fork of your repo, every time I do git pull it obviously gets annoying because of the changes I made myself to the files. So how about reading the local network config and models from a TOML or YAML file that the user would have to create on their end and that is in .gitignore? You might provide a template in the repo for the user to adapt to their use case.
Yup, that makes sense
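Just to sketch the idea (names and layout made up by me, and with the caveat that pure flake evaluation only sees git-tracked files, so a git-ignored file would likely need --impure or some other escape hatch): the user could keep something like
# ollama.toml (listed in .gitignore, template provided in the repo)
host = "0.0.0.0"
models = [ "llama2-uncensored" ]
and the module could read it along these lines:
let cfg = builtins.fromTOML (builtins.readFile ./ollama.toml); in {
services.ollama."ollama" = {
enable = true;
inherit (cfg) host models;
};
}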
macOS support added: https://github.com/shivaraj-bh/ollama-flake/pull/3
Next steps:
processComposeModule for @Andreas to use ollama-flake without having to clone and pull my repo all the time (I will also include an example directory for how to use the module)
Awesome!
And I will bet on others being thankful for that possibility as well
Is this expected? (on macOS, after downloading the model for about 30 mins)
I have to debug this on macOS, usually a restart of that process and then the open-browser process solves it
Alright, just a temporary issue. Re-running it worked
I think it has to do with the initial_delay_seconds of the readiness_probe because uvicorn takes a while to start
And also about the model pull taking a long time, I am thinking of creating something like dockerTools.pullImage for ollama models and serving it as a cache. I have observed that ollama pull starts off with good bandwidth and it just goes down to kbps in the end.
In this way, I can also run flake checks, without requiring internet connection in sandbox mode
@Andreas ollama-flake now exports a processComposeModule, see the examples: https://github.com/shivaraj-bh/ollama-flake/tree/main/example
You can use them in any of your flake now
I will be updating the README with relevant details too
I want to make some design changes, like decoupling the open-webui service from ollama itself, allowing them to be configured separately. I will deprioritise this for now, as I see a feature like dockerTools.pullImage for ollama models being more useful; I will research a bit on that next.
like decoupling the open-webui service from ollama itself
This will also allow configuring multiple frontends in the future, enable and disable them as you like.
awesome, looking very nice! I will give it a try tomorrow maybe, or a bit later...
Got my PR merged to open-webui: https://github.com/open-webui/open-webui/pull/1472
This has helped in packaging the frontend and backend in one single derivation.
Earlier I had to do a lot of hacky stuff to work around this:
Screenshot-2024-04-10-at-1.48.04PM.png
Now it looks clean:
Screenshot-2024-04-10-at-1.48.46PM.png
I will resume work on ollama-flake tonight, will solve some juspay/nix-health issues now.
Beautiful! I have a feeling this will be the best ollama flake ever, period... once people know about it, that is. Maybe after it has good usability, you could post it somewhere to create publicity?
So, I am back after a quick break. Before I left, I tried to create a derivation that would pull the model and cache it in /nix/store, just like dockerTools.pullImage does with docker images. Unfortunately, there is a blocker for this: https://github.com/ollama/ollama/issues/3369.
I will get back to this, once that issue is resolved (I tried to implement it for a bit, but it isn’t that straightforward). Looking forward to that.
For now, I will decouple the frontend service (open-webui) from the ollama backend service. Maybe also add another frontend service to it (to show how easily swap-able it can be), make a 0.1.0 release and announce it.
it should be swap-able to some extent. Ollama has a long list of possible frontend services iirc. I am not sure that storing models in the nix store is a great strategy, as these can be fairly big, and you might want to add some from the ollama CLI after the fact. Also you certainly would not want your 30 GB model to be garbage collected while still using it from time to time, so you'd have to prevent that somehow. In docker the image might be stored in the nix store without issues as it is immutable once built. But docker volumes and their content would not be, I suppose.
idk if I am right though, let me know what you think.
also: how does this work right now? I kinda don't see the command, and just checking nix flake show and nix flake info and looking into flake.nix didn't really tell me how to run the rocm service right now... I am in an LXC container right now, running some Debian 12 which doesn't want to give me podman on ZFS sadly, and I was trying to get this to work.
Andreas said:
it should be swap-able to some extent. Ollama has a long list of possible frontend services iirc. I am not sure that storing models in the nix store is a great strategy, as these can be fairly big, and you might want to add some from the ollama CLI after the fact. Also you certainly would not want your 30 GB model to be garbage collected while still using it from time to time, so you'd have to prevent that somehow. In docker the image might be stored in the nix store without issues as it is immutable once built. But docker volumes and their content would not be, I suppose. idk if I am right though, let me know what you think.
You might be right, I haven't fully evaluated this idea yet. Anyways, can't do much while the issue I linked above from ollama is still open.
Andreas said:
also: how does this work right now? I kinda don't see the command, and just checking nix flake show and nix flake info and looking into flake.nix didn't really tell me how to run the rocm service right now... I am in an LXC container right now, running some Debian 12 which doesn't want to give me podman on ZFS sadly, and I was trying to get this to work.
I have updated the README now, you can use the flake template. For you, it will look like:
mkdir my-ollama-flake && cd ./my-ollama-flake
nix flake init -t github:shivaraj-bh/ollama-flake#rocm
nix run
Or if you already have an existing flake where you would like to integrate this into, you can look at the examples and grab the necessary pieces
I decided to not depend on services-flake in ollama-flake (yet to update the README). I was unnecessarily keeping only the ollama service in the former and the rest in the latter. Let's keep it simple! I have decided to bundle everything related to ollama in this single repository (from the server, to frontends, also the CLI clients and so on).
Aside, I do have some ambitious plans for ollama-flake in the future, one of which being:
Provide a just generate-doc <service> command in services-flake. This command will run a process-compose app that will start the ollama server (configured by ollama-flake), run a CLI client (gotta find something like smartcat for ollama), provide the context of docs of other services and the tests for <service>, and out comes the doc for <service>.
Let’s see how this idea fares
So if I get this correctly you want to write tests and docs in an automated fashion?
What would be the input of <service>? Some git repo with an existing codebase?
Yes, at the moment focused only on docs. <service> here could be one of many that we support in services-flake. For example, Postgres or MySQL or Redis and so on. We do have docs for Postgres, but not for MySQL and many other services.
Alright... one thing I noticed that confused me is this: your flake is now not really a flake for providing ollama, but a set of flake templates for providing ollama. If this is to be permanent, maybe the repo should be renamed again?
There are other repos that follow the same approach of providing flake-templates/flake-module, like services-flake, rust-flake, just-flake. I just went with the same naming convention. What do you think will be more appropriate?
FYI
https://discourse.nixos.org/t/ollama-cant-find-libstdc-so-6/44305
I doubt he is using the Ollama package from nixpkgs, otherwise this shouldn’t occur.
Aside: TIL that NixOS has a service for ollama:
https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/services/misc/ollama.nix
Yes, it just starts the server. Pulling the model is a manual process
Ollama v0.1.33 with Llama 3, Phi 3, and Qwen 110B
https://news.ycombinator.com/item?id=40191723
yeah I need to test ollama 0.1.33 actually
Looks like private-gpt uses ollama
It'd be interesting to be able to nix run it, and feed it local documents for querying.
The nixpkgs PR has a module that is NixOS only, obviously.
This is cool, I was looking to add one more UI before I announce. This is a good candidate
There is a package request for open-webui in nixpkgs, someone just referenced to what I have packaged in ollama-flake: https://github.com/NixOS/nixpkgs/issues/309567#issuecomment-2105940033
If a PR gets merged in nixpkgs for the above issue, we can switch over to that.
Nice to see features getting added to nixpkgs inspired by ollama-flake: https://github.com/NixOS/nixpkgs/pull/313606
Screenshot-2024-05-25-at-10.15.52PM.png
private-gpt in ollama-flake! I will share the nix command for you to try out in a bit, I am yet to push the commits.
In the screenshot you are seeing llama3:8b querying on https://nixos.asia/en/blog/replacing-docker-compose
There you go:
nix run "github:shivaraj-bh/ollama-flake/private-gpt?dir=example/private-gpt” --refresh
works on Linux and macOS
Trying it out ...
Should be --refresh, not -refresh.
Thanks, updated
And should be next to nix, not at the end
And should be next to nix, not at the end
Seems to work even at the end
s/”/"/g
Is the readiness check fail of concern here?
Alright, what should I do now? It is running.
readiness thing happens on macOS, I believe you remember it happened with open-webui too
Srid said:
Is the readiness check fail of concern here?
Exit code -1, incidentally.
I need to fix that, for now restarting the process works
Okay, I restarted it, now what?
I expected it to automatically open something in my browser, TBH
the browser is available on port 8001, I need to add the open-browser process to this
It's a TODO
Yea, that would be helpful - especially if it is cross-platform (xdg-open vs open?)
So there are these two things to address before we have nix run ... just work?
is this supposed to run on AMD / ROCm as well? I might try later...
Yea, that would be helpful - especially if it is cross-platform (xdg-open vs open?)
open-webui already has this, I need to generalise it: https://github.com/shivaraj-bh/ollama-flake/blob/b61859956129b63fc6e2c8ad1ab4c8d13cc6cc96/services/open-webui.nix#L87-L97
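Something along these lines is probably all the generalised open-browser process needs (just a sketch, assuming the private-gpt UI stays on port 8001):
URL="http://localhost:8001"
if [ "$(uname)" = "Darwin" ]; then open "$URL"; else xdg-open "$URL"; fi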
is this supposed to run on AMD / ROCm as well? I might try later…
Yes
Srid said:
s/”/"/g
zulip does something with the text while copy pasting, which changed the quotation marks and also the "—" became "-"
Edit: I can’t reproduce, maybe it was something else.
Uploaded an Ikea invoice to it.
There's a broken image link, though.
Pointing to http://localhost:8001/file=/Users/srid/code/the-actualism-way/private_gpt/ui/avatar-bot.ico
(/Users/srid/code/the-actualism-way/ is the $PWD)
When I open that link, the text response is:
{"detail":"File not allowed: /Users/srid/code/the-actualism-way/private_gpt/ui/avatar-bot.ico."}
It is missing the ./data/ parent directory.
Noted, 3 things to fix then!
But I can't find the .ico file in the ./data directory. So 4th thing?
It is probably in the private-gpt’s source, might have to copy it from there.
Yes, it is: https://github.com/zylon-ai/private-gpt/blob/main/private_gpt/ui/avatar-bot.ico
might have to copy it from there.
Or just point it to the /nix/store.. path
Noted, 3 things to fix then!
Fixed one of them: https://github.com/shivaraj-bh/ollama-flake/commit/1c19aadfbb975b2f16f05163322b677d13201760
Now the browser should open as soon as private-gpt is healthy
Uploading https://www.foo.be/docs-free/social-architecture/book.pdf
Took maybe 30 seconds to process. But the results are underwhelming:
Also, must it always create $PWD/data? Why not ~/.ollama-flake/data?
I think that's a better default; we can reuse the pre-loaded models
Took maybe 30 seconds to process. But the results are underwhelming:
I am yet to play around with larger documents. It works fine with smaller ones; I have tried one with 2000-3000 words.
This thing creates a tiktoken_cache directory in $PWD for some reason.
I played a little with some RAG stuff in the open webui, and it was underwhelming as well. There must be some kind of better approach I haven't yet figured out. Maybe we should ask some other people around here who might know more?
At least larger books didn't really work at all. But maybe having smaller documents is somewhat important.
upstreaming open-webui to nixpkgs: https://github.com/NixOS/nixpkgs/pull/316248
Very nice!