How to Run Claude Code With a Local LLM (2/3)
In the first part last week, we saw what we need to run Claude Code with a local LLM. In this second part, we take a closer look at the different models and how they perform on different machines. The "right" model does not help us much if we cannot run it with the needed context size or if it only produces a few tokens per second. This is the hard part of running Claude Code against a local LLM, and there is no solution that works everywhere.
Hardware Setup for the tests
Thanks to the support of friends, I could use these 4 machines to run the tests:
| Name | Device | CPU | GPU | RAM |
|---|---|---|---|---|
| HP | HP ZBook Ultra G1a | AMD Ryzen AI MAX+ PRO 395 | Radeon 8060S | 128 GB |
| ROG | Asus ROG Strix | Intel Ultra 9 275HX | NVIDIA RTX 5080 Laptop | 64 GB |
| 5090 | Acer Predator 7000 | Intel Ultra 9 285K | NVIDIA RTX 5090 | 128 GB |
| GB10 | DELL GB10 | NVIDIA GB10 Grace CPU | NVIDIA GB10 Blackwell GPU | 128 GB |
The HP and the ROG are laptops, while the 5090 and the GB10 are desktop computers.
Test cases
To get an idea of how the different models work, I used the following 5 test cases. They are small but still offer a few challenges to keep things interesting.
A: Create a file
The most basic task is to create a text file with some content. For that I use this prompt:
B: Build a NuGet package
For .NET / C#, I reuse the NuGet package prompt that I have shown in the past, with a slight modification so that it runs on all systems:
C: Task list in Python
Not all models are good at C#, which is why this task uses Python to create a task list tool. The requirements are straightforward, and all models should be able to implement them:
D: Broken CSV summariser
In this test, we have existing code that needs a fix for how the data is converted before it is processed. It should show us a bit of variation in how the models solve this problem:
The pre-existing code files are in this *.zip file.
E: Find Hashtags
The last challenge should show us how the model uses regular expressions and how it organises the logic to fix existing code. It may even give us some options on how to keep the order of the hashtags while checking for uniqueness:
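One way to keep the order of the hashtags while checking for uniqueness can be sketched as follows. This is only an illustration and not the code from the test case: the function name `find_hashtags` and the pattern `#\w+` are assumptions of mine about what counts as a hashtag.

```python
import re

# Assumption: a hashtag is '#' followed by letters, digits, or underscores.
HASHTAG_PATTERN = re.compile(r"#\w+")


def find_hashtags(text: str) -> list[str]:
    """Return each hashtag once, in the order of its first appearance."""
    matches = HASHTAG_PATTERN.findall(text)
    # dict keys are unique and keep insertion order (Python 3.7+),
    # so this removes duplicates without sorting the result.
    return list(dict.fromkeys(matches))


print(find_hashtags("#python is fun, #llm too, and #python again"))
# ['#python', '#llm']
```

A `set` would also remove the duplicates, but it would lose the order in which the hashtags appear in the text.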
Baseline
For our baseline, we run Claude Code with the different Claude models and see how long they take for our test cases:
| | Opus 4.8 | Sonnet 4.6 | Haiku 4.5 |
|---|---|---|---|
| A | 5s | 6s | 5s |
| B | 40s | 1m 6s | 1m 5s |
| C | 35s | 27s | 31s |
| D | 37s | 27s | 19s |
| E | 22s | 23s | 18s |
The outlier with Sonnet 4.6 and Haiku 4.5 comes from a problem with the NuGet package test (B). The code did not work, and it took an extra round to fix the problem.
The timings for the Claude models vary throughout the day, depending on how many developers run Claude Code. Your runs may be faster, but they may just as well take longer.
Results
Here are the times it took to get to a successful solution for each test case. In most cases the first shot hit the target, but sometimes I had to give it another try to complete the task.
All models used a context size of 50k tokens. That is only a fraction of the possible context size for the larger models, but with 50k I could run the models on all devices and got comparable results. The context size was a problem in only one run, but there the model drifted so far off track that I had to redo the test from the start.
For each test, I started with a fresh Claude Code session and ran the test case. Then I checked whether the task was a success. If it was not, I pasted the error message or a problem description back into the session, up to three times. That was enough to get the desired result with each model. The timing covers only the time Claude Code took to deliver the results (including the retries), but not the time I needed to verify the outcome.
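The measurement procedure can be written down as a small loop. This is a sketch of the idea only, since I ran the steps by hand: `run_agent` and `verify` are hypothetical callables that stand in for a Claude Code session and for my manual check.

```python
import time
from typing import Callable, Optional

MAX_FEEDBACK_ROUNDS = 3  # the error message is pasted back up to three times


def timed_test_case(
    run_agent: Callable[[str], None],      # hypothetical: sends text to the session
    verify: Callable[[], Optional[str]],   # hypothetical: returns an error text or None
    prompt: str,
) -> tuple[bool, float]:
    """Run one test case and return (success, seconds spent in the agent)."""
    agent_seconds = 0.0
    message = prompt
    for _ in range(1 + MAX_FEEDBACK_ROUNDS):
        start = time.perf_counter()
        run_agent(message)
        agent_seconds += time.perf_counter() - start  # retries are included
        error = verify()  # verification happens outside the timed span
        if error is None:
            return True, agent_seconds
        message = error  # feed the problem back for the next round
    return False, agent_seconds
```

Only the time inside `run_agent` is added up, which matches the timings in the tables: the retries count, the verification does not.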
Qwen3 Coder 30B
Qwen3 Coder (qwen/qwen3-coder-30b) is a 30B model with support for tool use. On all machines it targeted .NET 6 for the NuGet package, which I changed to .NET 10 (there is no .NET 6 on the machines). The other test cases did not need any changes.
| | HP | ROG | 5090 | GB10 |
|---|---|---|---|---|
| A | 6m 46s | 1m 52s | 5s | 16s |
| B | 4m 10s | 13m 1s | 14s | 2m 13s |
| C | 43m 8s | 48m 39s | 14s | 2m 15s |
| D | 27m 44s | 14m 59s | 39s | 1m 10s |
| E | 25m 57s | 14m 42s | 22s | 1m 10s |
If you have an RTX 5090 graphics card, then you can achieve a comparable speed to the Claude models.
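To put a number on "comparable", we can add up the five test cases for Opus from the baseline table and for Qwen3 Coder 30B on the 5090 from the table above. The helper `to_seconds` is my own small sketch for this post and not part of any tool used in the tests.

```python
import re


def to_seconds(duration: str) -> int:
    """Convert a duration such as '1m 6s' or '14s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?\s*(?:(\d+)s)?", duration.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {duration!r}")
    minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return minutes * 60 + seconds


# Test cases A to E, copied from the tables in this post.
opus = ["5s", "40s", "35s", "37s", "22s"]
qwen3_coder_5090 = ["5s", "14s", "14s", "39s", "22s"]

opus_total = sum(to_seconds(d) for d in opus)
qwen_total = sum(to_seconds(d) for d in qwen3_coder_5090)
print(opus_total, qwen_total)
# 139 94
```

With such a small sample, and with the hosted timings varying throughout the day, the totals only show that both are in the same range.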
Qwen3 Coder Next
Qwen3-Coder-Next is an 80B model that takes a lot more space than Qwen3 Coder. The increased size is enough to give the GB10 a significant advantage over an RTX 5090.
| | HP | ROG | 5090 | GB10 |
|---|---|---|---|---|
| A | 2m 32s | 2m 27s | 1m 36s | 25s |
| B | 3m 43s | 9m 51s | 3m 34s | 1m 9s |
| C | 6m 33s | 3m 41s | 3m 43s | 1m 25s |
| D | 2m 50s | 3m 57s | 2m 25s | 37s |
| E | 2m 59s | 3m 47s | 2m 17s | 39s |
This model is newer than the Qwen3 Coder 30B model and went for .NET 8 in the NuGet test case, so no changes were necessary.
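Why the size matters so much can be estimated with a rough calculation. The sketch below rests on assumptions: I assume 4 bits per weight, which may not match the quantization of the files used in the tests, and I ignore the memory for the context and the runtime. The 32 GB are the VRAM of a desktop RTX 5090.

```python
ASSUMED_BITS_PER_WEIGHT = 4  # assumption, not taken from the test setup
RTX_5090_VRAM_GB = 32        # VRAM of a desktop RTX 5090


def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough size of the model weights in GB (weights only, no context)."""
    return params_billion * bits_per_weight / 8


for name, params_billion in [("Qwen3 Coder 30B", 30), ("Qwen3-Coder-Next 80B", 80)]:
    size = weight_size_gb(params_billion, ASSUMED_BITS_PER_WEIGHT)
    verdict = "fits" if size <= RTX_5090_VRAM_GB else "does not fit"
    print(f"{name}: about {size:.0f} GB, {verdict} in {RTX_5090_VRAM_GB} GB VRAM")
```

Under these assumptions the 30B model fits into the VRAM of the RTX 5090, while the 80B model does not and must spill into the slower system memory. The GB10 can hold both models in its 128 GB of unified memory.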
Z.ai GLM 4.7
GLM-4.7 by Z.ai is a 30B model that is not far behind the speed of the Claude models on fast hardware. However, it was the only model that, in one test run, created a “yes” option for the say_no NuGet package.
| | HP | ROG | 5090 | GB10 |
|---|---|---|---|---|
| A | 1m 10s | 1m 24s | 9s | 30s |
| B | 18m 3s | 4m 24s | 32s | 1m 13s |
| C | 28m 17s | 5m 57s | 26s | 59s |
| D | 15m 3s | 3m 18s | 12s | 33s |
| E | 8m 50s | 4m 57s | 9s | 31s |
Unfortunately, the smaller GLM 4.6 model is not compatible with the tool calls of Claude Code. It would be even faster than the 4.7 model, but without a translation bridge it does not work in Claude Code.
Gemma 4 E4B
Gemma 4 effective 4B (gemma-4-e4b) is a relatively small model that works with reasonable speed on an Asus ROG laptop. However, it had a few problems with task C that required multiple rounds of bug fixing.
| | HP | ROG | 5090 | GB10 |
|---|---|---|---|---|
| A | 1m 16s | 9s | 4s | 18s |
| B | 11m 5s | 49s | 23s | 54s |
| C | 6m 57s | 3m 11s | 31s | 2m 9s |
| D | 6m 11s | 35s | 21s | 2m 35s |
| E | 12m 4s | 51s | 27s | 2m 10s |
Gemma4 31b
Gemma4 31b (google/gemma-4-31b) supports reasoning and tool use, but it is a bit too large for the ROG laptop. This model is also picky when it comes to the runtime in LM Studio. I had to switch to the Vulkan llama.cpp runtime on the HP and the CUDA llama.cpp runtime on the ROG, or each task would take up to an hour longer.
| | HP | ROG | 5090 | GB10 |
|---|---|---|---|---|
| A | 57s | 2m 55s | 16s | 58s |
| B | 14m 7s | 37m 46s | 1m 0s | 6m 9s |
| C | 12m 50s | 33m 47s | 50s | 4m 19s |
| D | 13m 49s | 49m 17s | 51s | 4m 41s |
| E | 22m 33s | 41m 14s | 41s | 4m 2s |
Devstral 2 small
Devstral 2 (mistralai/devstral-small-2-2512) by Mistral AI is an older model that is still often recommended. I would only use this model on an RTX 5090; otherwise it is too slow and makes too many mistakes that take time to fix.
| | HP | ROG | 5090 | GB10 |
|---|---|---|---|---|
| A | 15m 9s | 18m 52s | 8s | 5s |
| B | 47m 59s | 106m 40s | 1m 7s | 2m 36s |
| C | 93m 20s | 90m 57s | 1m 34s | 8m 43s |
| D | 63m 48s | 67m 11s | 44s | 2m 20s |
| E | 19m 27s | 72m 41s | 35s | 2m 51s |
OpenAI GPT OSS 120B
OpenAI’s GPT OSS 120B (openai/gpt-oss-120b) is the largest model in this test. I expected that I could only run it on the RTX 5090 and the GB10, but it also worked on the two laptops. With its size, we see the benefit of the larger unified memory, which gives the GB10 a massive advantage over the RTX 5090, while the HP laptop with its larger shared memory beats the ROG laptop.
| | HP | ROG | 5090 | GB10 |
|---|---|---|---|---|
| A | 3m 5s | 8m 17s | 3m 3s | 29s |
| B | 11m 30s | 13m 12s | 15m 39s | 2m 54s |
| C | 21m 29s | 32m 25s | 14m 51s | 2m 47s |
| D | 11m 28s | 22m 18s | 9m 6s | 5m 1s |
| E | 7m 23s | 20m 57s | 5m 41s | 1m 13s |
Unfortunately, the 20B version of this model has problems with the tool calls, which makes it useless for work with Claude Code. This is the same problem we have with the smaller GLM model.
Conclusion on speed
The 3 official models in Claude Code offer great responsiveness and quality that is not easy to match with a local LLM. As long as we can use the Claude models at the prices we have at the start of June, the hosted models are the sensible choice.
Should the prices go up significantly or no longer be capped with a monthly subscription, then we can choose between multiple local LLMs that deliver similar speed, provided we invest in powerful enough hardware. As we can see in this post with its small sample of test cases, it matters which model you use on your hardware. As long as the model fits into the (V)RAM of your GPU, you can get fast results. If it is too big, then performance goes down a lot. But there is more than the model and the hardware that needs to match.
In my first tests, the performance of the AMD-based HP laptop was terrible. Creating the NuGet package with the Gemma4 31b model took 78 minutes with the ROCm llama.cpp runtime. When I switched to the Vulkan llama.cpp runtime, the time went down to 14 minutes. On the other hand, the Vulkan runtime got me much worse results with the Z.ai GLM 4.7 model. Once you have a reasonable number of models on your shortlist, make sure you test not only the models themselves but also how they perform on the various runtimes we get in LM Studio.
Next
With this post, we got a lot of numbers on speed. Next week we take a deeper dive into the produced code and see what the local models created. Even when everything works, there are noteworthy differences that could influence our choice of a local model.