How to Run Claude Code With a Local LLM (2/3)

In the first part last week we saw what we need to run Claude Code with a local LLM. In this second part we take a closer look at the different models and how they perform on different machines, because the "right" model does not help us much if we cannot run it with the needed context size or if it only produces a few tokens per second. This is the hard part of running Claude Code against a local LLM, and there is no solution that works everywhere.

Hardware Setup for the tests

Thanks to the support of friends, I could use these 4 machines to run the tests:

| Name | Device | CPU | GPU | RAM |
|------|--------|-----|-----|-----|
| HP | HP ZBook Ultra G1a | AMD Ryzen AI MAX+ PRO 395 | Radeon 8060S | 128 GB |
| ROG | Asus ROG Strix | Intel Ultra 9 275HX | NVIDIA RTX 5080 Laptop | 64 GB |
| 5090 | Acer Predator 7000 | Intel Ultra 9 285K | NVIDIA RTX 5090 | 128 GB |
| GB10 | DELL GB10 | NVIDIA GB10 Grace CPU | NVIDIA GB10 Blackwell GPU | 128 GB |

The HP and the ROG are laptops, while the 5090 and the GB10 are desktop computers.

Test cases

To get an idea of how the different models work, I used the following 5 test cases. They are small but still offer a few challenges to keep things interesting.

A: Create a file

The most basic task is to create a text file with some content. For that I use this prompt:

Create a file called test_tools.txt with "hello world".

B: Build a NuGet package

For .NET / C#, I reuse this NuGet package prompt that I have shown in the past, with a slight modification so that it should run on all systems:

I need a NuGet package that I can install as a global tool. The tool 
should be named say_no. When we run it, it should pick one of 20 
predefined extensive reasons that basically say no but uses a lot more 
words in the context of software development. Crate the tool, show the 
commands to create the package to install it on the local machine. Only 
show the command, do not install it. Make it as simple as possible.

C: Task list in Python

Not all models are good at C#, which is why this task uses Python to create a task list tool. The requirements are straightforward and all models should be able to implement them:

Build a small command-line task tracker in Python.

It should support these commands:

python tasks.py add "Buy milk"
python tasks.py list
python tasks.py done 2
python tasks.py delete 2
python tasks.py list --all

Requirements:
- Store tasks in a local JSON file.
- Each task should have an id, description, created_at timestamp, and 
  completed flag.
- list should show only incomplete tasks by default.
- list --all should show completed and incomplete tasks.
- done ID should mark a task complete.
- delete ID should remove a task.
- Handle missing or corrupted JSON files gracefully.
- Use only the Python standard library.

Put the file into the current directory. 
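
The requirement that tends to separate solutions is the graceful handling of missing or corrupted JSON files. As a minimal sketch of that one piece, and not the code any of the tested models produced, the loading step could look like this. The file name `tasks.json` and the function name `load_tasks` are assumptions, since the prompt leaves both open:

```python
import json
from pathlib import Path

TASKS_FILE = Path("tasks.json")  # hypothetical file name, not given in the prompt


def load_tasks(path: Path = TASKS_FILE) -> list[dict]:
    """Return the stored tasks, or an empty list if the file is unusable."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return []  # first run: there is no file yet
    except (json.JSONDecodeError, UnicodeDecodeError, OSError):
        return []  # corrupted or unreadable file: start with an empty list
    # Valid JSON that is not a list of objects counts as corrupted too.
    if not isinstance(data, list) or not all(isinstance(t, dict) for t in data):
        return []
    return data
```

Returning an empty list keeps the `add` and `list` commands usable after a bad file. A stricter tool might prefer to print a warning or keep a backup of the broken file before overwriting it.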

D: Broken CSV summariser

In this test we have existing code that needs a fix in how the data is converted before it is processed. It should show us some variation in how the models solve this problem:

The CSV summariser currently calculates revenue incorrectly and does 
not handle blank lines well.

Update `summarise.py` so that it:

1. Reads an orders CSV file.
2. Calculates total revenue as `quantity * price` for each row.
3. Ignores blank lines.
4. Prints the result rounded to two decimal places.

Run the tests with:

pytest

Do not change the tests.

The pre-existing code files are in this *.zip file.
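
To give an idea of what the fix is about, here is a minimal sketch of the calculation the prompt asks for. It is not the code from the *.zip file; the function name `total_revenue` and the sample data are made up for illustration, and it assumes the columns are called `quantity` and `price` and that quantities are whole numbers:

```python
import csv
import io


def total_revenue(lines) -> float:
    """Sum quantity * price over all non-blank CSV rows."""
    # Drop blank lines before csv.DictReader sees them.
    rows = csv.DictReader(line for line in lines if line.strip())
    # CSV values arrive as strings, so convert them before multiplying.
    return sum((int(row["quantity"]) * float(row["price"]) for row in rows), 0.0)


# Hypothetical sample with a blank line in the middle.
sample = io.StringIO("item,quantity,price\napple,2,1.50\n\npear,3,0.25\n")
print(f"{total_revenue(sample):.2f}")  # prints 3.75
```

The conversion from strings to numbers is the part where solutions can differ, for example in whether they use `float`, `int`, or `decimal.Decimal` for the price.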

E: Find Hashtags

The last challenge should show us how the model uses regular expressions and how it organises the logic to fix existing code. It may even give us some options on how to keep the order of the hashtags while checking for uniqueness:

The `hashtags()` function currently extracts words that start with "#", 
but it does not clean them up correctly.

Update `hashtags.py` so that `hashtags(text)`:

1. Finds hashtag words in the input text.
2. Removes the leading "#".
3. Strips common trailing punctuation such as commas, periods, exclamation 
  marks, question marks, colons, and semicolons.
4. Lowercases each hashtag.
5. Removes duplicates while preserving the first occurrence order.
6. Ignores a lone "#" with no tag text.

Run the tests with:

pytest

Do not change the tests.
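
One possible shape of a solution, as a sketch and not the output of any tested model, combines a regular expression with `dict.fromkeys` to remove duplicates while keeping the order. The exact pattern is an assumption; the real tests may expect different behaviour for edge cases such as "##tag":

```python
import re

# A "#" at the start of a word, followed by at least one non-space
# character. A lone "#" does not match.
HASHTAG_RE = re.compile(r"(?<!\S)#(\S+)")
TRAILING_PUNCTUATION = ",.!?:;"


def hashtags(text: str) -> list[str]:
    cleaned = (
        match.rstrip(TRAILING_PUNCTUATION).lower()
        for match in HASHTAG_RE.findall(text)
    )
    # dict keys keep insertion order (Python 3.7+), so this removes
    # duplicates while preserving the first occurrence.
    return list(dict.fromkeys(tag for tag in cleaned if tag))


print(hashtags("Ship it! #Release, #release #CI."))  # ['release', 'ci']
```

A loop with a `seen` set and a result list is the other common way to keep the order; `dict.fromkeys` is just the shorter form of the same idea.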

Baseline

For our baseline we run the test cases in Claude Code with the different Claude models and see how long they take:

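| Test | Opus 4.8 | Sonnet 4.6 | Haiku 4.5 |
|------|----------|------------|-----------|
| A | 5s | 6s | 5s |
| B | 40s | 1m 6s | 1m 5s |
| C | 35s | 27s | 31s |
| D | 37s | 27s | 19s |
| E | 22s | 23s | 18s |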

The outlier with Sonnet 4.6 and Haiku 4.5 comes from a problem with the NuGet package test. The code did not work, and it took an extra round to fix the problem.

The timings for the Claude models vary throughout the day, depending on how many developers run Claude Code. You may get faster results, but it may just as well take longer to complete the test cases.

Results

Here are the times it took to get to a successful solution for each test case. In most cases the first shot hit the target, but sometimes I had to give it another try to complete the task.

All models used a context size of 50k tokens. That is only a fraction of the possible context size for the larger models, but with 50k I could run the models on all devices and get comparable results. The context size was a problem in only one run, but there the model drifted so far off that I had to redo the test from the start.

For each test, I started a fresh Claude Code session and ran the test case. Then I checked whether the task was a success. If it was not, I pasted the error message or problem description up to three times. That was enough to get the desired result with each model. The timing covers only the time Claude Code took to deliver the results (including the retries), but not the time I needed to verify the outcome.

Qwen3 Coder 30B

Qwen3 Coder (qwen/qwen3-coder-30b) is a 30B model with support for tool use. On all machines it targeted .NET 6 for the NuGet package, which I changed to .NET 10 (there is no .NET 6 on the machines). The other test cases did not need any changes.

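| Test | HP | ROG | 5090 | GB10 |
|------|----|-----|------|------|
| A | 6m 46s | 1m 52s | 5s | 16s |
| B | 4m 10s | 13m 1s | 14s | 2m 13s |
| C | 43m 8s | 48m 39s | 14s | 2m 15s |
| D | 27m 44s | 14m 59s | 39s | 1m 10s |
| E | 25m 57s | 14m 42s | 22s | 1m 10s |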

If you have an RTX 5090 graphics card, then you can achieve a speed comparable to the Claude models.
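
To make "comparable" a bit more concrete, the durations from the tables can be converted to seconds and summed. The following is a small sketch; the helper `to_seconds` is a made-up name, and the values are copied from the Opus column of the baseline and the 5090 column for Qwen3 Coder 30B:

```python
import re


def to_seconds(duration: str) -> int:
    """Convert a duration such as '2m 13s' or '39s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?\s*(?:(\d+)s)?", duration.strip())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {duration!r}")
    minutes, seconds = (int(part or 0) for part in match.groups())
    return minutes * 60 + seconds


# Values copied from the tables in this post (test cases A to E).
opus_baseline = ["5s", "40s", "35s", "37s", "22s"]
qwen3_coder_5090 = ["5s", "14s", "14s", "39s", "22s"]

baseline_total = sum(map(to_seconds, opus_baseline))
local_total = sum(map(to_seconds, qwen3_coder_5090))
print(f"Opus: {baseline_total}s, Qwen3 Coder 30B on the 5090: {local_total}s")
# prints: Opus: 139s, Qwen3 Coder 30B on the 5090: 94s
```

With only five small test cases and one run each, these totals are an indication and not a benchmark.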

Qwen3 Coder Next

Qwen3-Coder-Next is an 80B model that takes a lot more space than Qwen3 Coder. The increased size is enough to give the GB10 a significant advantage over an RTX 5090.

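| Test | HP | ROG | 5090 | GB10 |
|------|----|-----|------|------|
| A | 2m 32s | 2m 27s | 1m 36s | 25s |
| B | 3m 43s | 9m 51s | 3m 34s | 1m 9s |
| C | 6m 33s | 3m 41s | 3m 43s | 1m 25s |
| D | 2m 50s | 3m 57s | 2m 25s | 37s |
| E | 2m 59s | 3m 47s | 2m 17s | 39s |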

This model is newer than the Qwen3 Coder 30B model and went for .NET 8 in the NuGet test case, so no changes were necessary.

Z.ai GLM 4.7

GLM-4.7 by Z.ai is a 30B model that is not far behind the speed of the Claude models on fast hardware. However, this model was the only one that, in one test run, created a “yes” option for the say_no NuGet package.

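| Test | HP | ROG | 5090 | GB10 |
|------|----|-----|------|------|
| A | 1m 10s | 1m 24s | 9s | 30s |
| B | 18m 3s | 4m 24s | 32s | 1m 13s |
| C | 28m 17s | 5m 57s | 26s | 59s |
| D | 15m 3s | 3m 18s | 12s | 33s |
| E | 8m 50s | 4m 57s | 9s | 31s |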

Unfortunately, the smaller GLM 4.6 model is not compatible with the tool calls of Claude Code. It would be even faster than the 4.7 model, but without a translation bridge it does not work in Claude Code.

Gemma 4 E4B

Gemma 4 effective 4B (gemma-4-e4b) is a relatively small model that works with reasonable speed on an Asus ROG laptop. However, it had a few problems with task C that required multiple rounds of bug fixing.

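| Test | HP | ROG | 5090 | GB10 |
|------|----|-----|------|------|
| A | 1m 16s | 9s | 4s | 18s |
| B | 11m 5s | 49s | 23s | 54s |
| C | 6m 57s | 3m 11s | 31s | 2m 9s |
| D | 6m 11s | 35s | 21s | 2m 35s |
| E | 12m 4s | 51s | 27s | 2m 10s |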

Gemma4 31b

Gemma4 31b (google/gemma-4-31b) supports reasoning and tool use, but it is a bit too large for the ROG laptop. This model is also picky when it comes to the runtime in LM Studio. I had to switch to the Vulkan runtime on the HP and the CUDA llama.cpp runtime on the ROG, otherwise each task would have taken up to an hour longer.

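| Test | HP | ROG | 5090 | GB10 |
|------|----|-----|------|------|
| A | 57s | 2m 55s | 16s | 58s |
| B | 14m 7s | 37m 46s | 1m 0s | 6m 9s |
| C | 12m 50s | 33m 47s | 50s | 4m 19s |
| D | 13m 49s | 49m 17s | 51s | 4m 41s |
| E | 22m 33s | 41m 14s | 41s | 4m 2s |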

Devstral 2 small

Devstral 2 (mistralai/devstral-small-2-2512) by Mistral AI is an older model that is still often recommended. I would only use this model on an RTX 5090; on the other machines it is too slow and makes too many mistakes that take time to fix.

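| Test | HP | ROG | 5090 | GB10 |
|------|----|-----|------|------|
| A | 15m 9s | 18m 52s | 8s | 5s |
| B | 47m 59s | 106m 40s | 1m 7s | 2m 36s |
| C | 93m 20s | 90m 57s | 1m 34s | 8m 43s |
| D | 63m 48s | 67m 11s | 44s | 2m 20s |
| E | 19m 27s | 72m 41s | 35s | 2m 51s |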

OpenAI GPT OSS 120B

OpenAI’s GPT OSS 120B (openai/gpt-oss-120b) is the largest model in this test. I expected that I could only run it on the RTX 5090 and the GB10, but it also worked on the two laptops. With a model of this size we see the benefit of the larger unified memory, which gives the GB10 a massive advantage over the RTX 5090, while the HP laptop with its larger shared memory beats the ROG laptop.

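| Test | HP | ROG | 5090 | GB10 |
|------|----|-----|------|------|
| A | 3m 5s | 8m 17s | 3m 3s | 29s |
| B | 11m 30s | 13m 12s | 15m 39s | 2m 54s |
| C | 21m 29s | 32m 25s | 14m 51s | 2m 47s |
| D | 11m 28s | 22m 18s | 9m 6s | 5m 1s |
| E | 7m 23s | 20m 57s | 5m 41s | 1m 13s |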

Unfortunately, the 20B version of this model has problems with the tool calls, which makes it useless for work with Claude Code. It is the same problem we have with the smaller GLM model.

Conclusion on speed

The 3 official models in Claude Code offer great responsiveness and quality that are not easy to match with a local LLM. As long as we can run the Claude models at the prices we have at the start of June, the hosted models are the sensible choice.

Should the prices go up significantly or no longer be capped with a monthly subscription, then we can choose between multiple local LLMs that deliver similar speed, provided we invest in powerful enough hardware. As we can see in this post with its small sample of test cases, it matters which model you use on your hardware. As long as it fits into the (V)RAM of your GPU, you can get fast results. If it is too big, performance goes down a lot. But there is more than the model and the hardware that needs to match.

In my first tests the performance of the AMD-based HP laptop was terrible. Creating the NuGet package with the Gemma4 31b model took 78 minutes with the ROCm llama.cpp runtime. When I switched to the Vulkan llama.cpp runtime, the time went down to 14 minutes. On the other hand, the Vulkan runtime gave me much worse results with the Z.ai GLM 4.7 model. Once you have a reasonable number of models on your shortlist, make sure you test not only the models themselves but also how they perform on the various runtimes we get in LM Studio.
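
The point about fitting into the (V)RAM can be estimated with simple arithmetic before you download a model. The following is a rough sketch under a loud assumption: it assumes 4-bit quantised weights, which may not be the quantisation used in these tests, and it ignores the memory for the context (KV cache) and the runtime itself. The function name `weights_gib` is made up for this example:

```python
def weights_gib(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Rough size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Parameter counts as named in this post; 4-bit quantisation is an assumption.
models = [("Qwen3 Coder 30B", 30), ("Qwen3 Coder Next", 80), ("GPT OSS 120B", 120)]
for name, size in models:
    print(f"{name}: ~{weights_gib(size):.0f} GiB")
```

Under this assumption the three models need roughly 14, 37 and 56 GiB for the weights alone. An RTX 5090 has 32 GB of VRAM, so only the first one would fit completely. That matches the pattern in the tables, where the GB10 with its unified memory overtakes the RTX 5090 on the two larger models.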

Next

With this post we got a lot of numbers on speed. Next week we take a deeper dive into the produced code and see what the local models created, because even when everything works, there are noteworthy differences that could influence our choice of a local model.