How to Run Claude Code With a Local LLM (2/3)

In the first part last week we saw what we need to run Claude Code with a local LLM. In this second part we take a closer look at the different models and how they perform on different machines. The "right" model does not help us much if we cannot run it with the needed context size or if it only produces a few tokens per second. This is the hard part of running Claude Code against a local LLM, and there is no solution that works everywhere.

Hardware Setup for the tests

Thanks to the support of friends, I could use these 4 machines to run the tests:

Name	Device	CPU	GPU	RAM
HP	HP ZBook Ultra G1a	AMD Ryzen AI MAX+ PRO 395	Radeon 8060S	128 GB
ROG	Asus ROG Strix	Intel Ultra 9 275HX	NVIDIA RTX 5080 Laptop	64 GB
5090	Acer Predator 7000	Intel Ultra 9 285K	NVIDIA RTX 5090	128 GB
GB10	DELL GB10	NVIDIA GB10 Grace CPU	NVIDIA GB10 Blackwell GPU	128 GB

The HP and the ROG are laptops, while the 5090 and the GB10 are desktop computers.

Test cases

To get an idea of how the different models work, I used the following 5 test cases. They are small but still offer a few challenges to make it interesting.

A: Create a file

The most basic task is to create a text file with some content. For that I use this prompt:

`Create a file called test_tools.txt with "hello world".`

B: Build a NuGet package

For .NET / C#, I reuse a NuGet package prompt that I have shown in the past, slightly modified so that it runs on all systems:

I need a NuGet package that I can install as a global tool. The tool 
should be named say_no. When we run it, it should pick one of 20 
predefined extensive reasons that basically say no but uses a lot more 
words in the context of software development. Crate the tool, show the 
commands to create the package to install it on the local machine. Only 
show the command, do not install it. Make it as simple as possible.

C: Task list in Python

Not all models are good at C#, which is why this task uses Python to create a task list tool. The requirements are straightforward and all models should be able to implement them:

Build a small command-line task tracker in Python.

It should support these commands:

python tasks.py add "Buy milk"
python tasks.py list
python tasks.py done 2
python tasks.py delete 2
python tasks.py list --all

Requirements:
- Store tasks in a local JSON file.
- Each task should have an id, description, created_at timestamp, and 
  completed flag.
- list should show only incomplete tasks by default.
- list --all should show completed and incomplete tasks.
- done ID should mark a task complete.
- delete ID should remove a task.
- Handle missing or corrupted JSON files gracefully.
- Use only the Python standard library.

Put the file into the current directory. 
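The requirement that usually separates a careful solution from a quick one is the graceful handling of missing or corrupted JSON files. A minimal sketch of that part could look like the following. The file name `tasks.json` and the function name `load_tasks` are my own assumptions; the prompt only asks for "a local JSON file".

```python
import json
from pathlib import Path

# Hypothetical file name; the prompt does not prescribe one.
TASKS_FILE = Path("tasks.json")


def load_tasks(path: Path = TASKS_FILE) -> list[dict]:
    """Return the stored tasks, or an empty list if the file is missing or unusable."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return []  # first run: there is no file yet
    except (json.JSONDecodeError, UnicodeDecodeError):
        return []  # corrupted file: start with an empty list instead of crashing
    # Valid JSON that is not a list of objects is treated as corrupted too.
    if not isinstance(data, list) or not all(isinstance(t, dict) for t in data):
        return []
    return data
```

Returning an empty list keeps the `add` and `list` commands usable after a damaged file; a stricter tool might prefer to back up the broken file first.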

D: Broken CSV summariser

In this test we have existing code that needs a fix in how the data is converted before it is processed. It should show us a bit of variation in how the models solve this problem:

The CSV summariser currently calculates revenue incorrectly and does 
not handle blank lines well.

Update `summarise.py` so that it:

1. Reads an orders CSV file.
2. Calculates total revenue as `quantity * price` for each row.
3. Ignores blank lines.
4. Prints the result rounded to two decimal places.

Run the tests with:

pytest

Do not change the tests.

The pre-existing code files are in this *.zip file.
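To give an idea of the kind of fix the prompt asks for, here is a minimal sketch. It is not the content of the zip file: the column names `quantity` and `price`, the integer quantities, and the function name `total_revenue` are assumptions for illustration only.

```python
import csv
from decimal import Decimal


def total_revenue(csv_text: str) -> Decimal:
    """Sum quantity * price over all non-blank rows of a CSV with a header line."""
    # Drop blank and whitespace-only lines before the CSV parser sees them.
    lines = [line for line in csv_text.splitlines() if line.strip()]
    total = Decimal("0")
    for row in csv.DictReader(lines):
        # Convert the text fields before multiplying (assumed column names).
        total += int(row["quantity"]) * Decimal(row["price"])
    return total


sample = "quantity,price\n2,9.99\n\n1,0.50\n   \n3,1.25\n"
print(f"{total_revenue(sample):.2f}")  # prints 24.23
```

Using `Decimal` instead of `float` is one design choice among several; it avoids rounding surprises when the result is printed with two decimal places.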

E: Find Hashtags

The last challenge should show us how the model uses regular expressions and how it organises the logic to fix existing code. It may even give us some options on how to keep the order of the hashtags while checking for uniqueness:

The `hashtags()` function currently extracts words that start with "#", 
but it does not clean them up correctly.

Update `hashtags.py` so that `hashtags(text)`:

1. Finds hashtag words in the input text.
2. Removes the leading "#".
3. Strips common trailing punctuation such as commas, periods, exclamation 
  marks, question marks, colons, and semicolons.
4. Lowercases each hashtag.
5. Removes duplicates while preserving the first occurrence order.
6. Ignores a lone "#" with no tag text.

Run the tests with:

pytest

Do not change the tests.
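One way to meet these six requirements is sketched below. It is only an illustration, since the original `hashtags.py` and its tests are not shown here; the regular expression and the `TRAILING_PUNCTUATION` constant are my own assumptions. The order-preserving de-duplication relies on the fact that Python dictionaries keep insertion order.

```python
import re

# Assumed set of "common trailing punctuation" from the prompt.
TRAILING_PUNCTUATION = ",.!?:;"


def hashtags(text: str) -> list[str]:
    """Return cleaned, lowercased, de-duplicated hashtags in first-seen order."""
    cleaned = []
    # "#" followed by at least one character that is neither whitespace nor "#";
    # a lone "#" therefore never matches.
    for raw in re.findall(r"#([^\s#]+)", text):
        tag = raw.rstrip(TRAILING_PUNCTUATION).lower()
        if tag:  # skips cases such as "#!" where only punctuation followed the "#"
            cleaned.append(tag)
    # dict keys keep insertion order, so this removes duplicates without reordering.
    return list(dict.fromkeys(cleaned))


print(hashtags("Loving #Python, #python! and #LocalLLM. # alone #AI;"))
# prints ['python', 'localllm', 'ai']
```

A `set` plus a result list would work just as well for the uniqueness check; `dict.fromkeys` is simply the shortest variant.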

Baseline

For our baseline we run the test cases in Claude Code with the different hosted models and see how long they take:

	Opus 4.8	Sonnet 4.6	Haiku 4.5	Fable 5
A	5s	6s	5s	10s
B	40s	1m 6s	1m 5s	1m 28s
C	35s	27s	31s	40s
D	37s	27s	19s	58s
E	22s	23s	18s	26s

The outliers for Sonnet 4.6 and Haiku 4.5 in test B come from a problem with the NuGet package test. The code did not work, and it took an extra round to fix the problem.

The timings for the Claude models vary throughout the day, depending on how many developers run Claude Code. Your runs may be faster, but they may just as well take longer.

Results

Here is the time it took to get to a successful solution for each test case. In most cases the first shot hit the target, but sometimes I had to give it another try to complete the task.

All models used a context size of 50k tokens. That is only a fraction of the possible context size for the larger models, but with 50k I could run the models on all devices and get comparable results. The context size was a problem in only one case, but there the model drifted so far off track that I had to redo the test from the start.

For each test, I started a fresh Claude Code session and ran the test case. Then I checked if the task was a success. If it was not, I pasted the error message or a problem description, up to three times. That was enough to get the desired result with each model. The timing covers only the time Claude Code took to deliver the results (including the retries), but not the time I needed to verify the outcome.

Qwen3 Coder 30B

Qwen3 Coder (qwen/qwen3-coder-30b) is a 30B model with support for tool use. On all machines it targeted .NET 6 for the NuGet package, which I changed to .NET 10 (there is no .NET 6 on the machines). The other test cases did not need any changes.

	HP	ROG	5090	GB10
A	6m 46s	1m 52s	5s	16s
B	4m 10s	13m 1s	14s	2m 13s
C	43m 8s	48m 39s	14s	2m 15s
D	27m 44s	14m 59s	39s	1m 10s
E	25m 57s	14m 42s	22s	1m 10s

If you have an RTX 5090 graphics card, then you can achieve a comparable speed to the Claude models.
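To put a number on "comparable", the timings can be converted to seconds and summed per column. The following is a small sketch with the values copied from the Opus 4.8 baseline column and the 5090 column above; the helper name `to_seconds` is my own.

```python
import re


def to_seconds(duration: str) -> int:
    """Convert a duration such as "1m 6s", "40s", or "2m" to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?\s*(?:(\d+)s)?", duration.strip())
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {duration!r}")
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return minutes * 60 + seconds


# Timings for test cases A to E, copied from the tables in this post.
opus = ["5s", "40s", "35s", "37s", "22s"]
qwen3_coder_5090 = ["5s", "14s", "14s", "39s", "22s"]

for name, column in [("Opus 4.8", opus), ("Qwen3 Coder 30B on 5090", qwen3_coder_5090)]:
    print(name, sum(to_seconds(d) for d in column), "seconds")
# prints:
# Opus 4.8 139 seconds
# Qwen3 Coder 30B on 5090 94 seconds
```

With only five small test cases and a single run each, these totals are an indication rather than a benchmark.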

Qwen3 Coder Next

Qwen3-Coder-Next is an 80B model that takes a lot more space than Qwen3 Coder. The increased size is enough to give the GB10 a significant advantage over an RTX 5090.

	HP	ROG	5090	GB10
A	2m 32s	2m 27s	1m 36s	25s
B	3m 43s	9m 51s	3m 34s	1m 9s
C	6m 33s	3m 41s	3m 43s	1m 25s
D	2m 50s	3m 57s	2m 25s	37s
E	2m 59s	3m 47s	2m 17s	39s
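The effect of the model size can be estimated with rough arithmetic: the weights need about (number of parameters × bits per weight ÷ 8) bytes. The sketch below assumes 4-bit quantization, which is my assumption and not necessarily the quantization used in these tests. It also ignores the memory for the 50k context and the runtime overhead, so the real requirement is higher.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


VRAM_GB = 32  # dedicated memory of an RTX 5090
BITS = 4  # assumption: 4-bit quantization, not confirmed for these tests

for name, params in [("Qwen3 Coder 30B", 30), ("Qwen3 Coder Next 80B", 80)]:
    size = weight_size_gb(params, BITS)
    verdict = "fits" if size < VRAM_GB else "does not fit"
    print(f"{name}: ~{size:.0f} GB of weights, {verdict} in {VRAM_GB} GB")
# prints:
# Qwen3 Coder 30B: ~15 GB of weights, fits in 32 GB
# Qwen3 Coder Next 80B: ~40 GB of weights, does not fit in 32 GB
```

Under these assumptions the 80B model no longer fits into the dedicated memory of the RTX 5090, while the 128 GB of unified memory in the GB10 still has room for it.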

This model is newer than the Qwen3 Coder 30B model and went for .NET 8 in the NuGet test case, so no changes were necessary.

Z.ai GLM 4.7

GLM-4.7 by Z.ai is a 30B model that is not far behind the speed of the Claude models on fast hardware. However, this model was the only one that in one test run created a “yes” option for the say_no NuGet package.

	HP	ROG	5090	GB10
A	1m 10s	1m 24s	9s	30s
B	18m 3s	4m 24s	32s	1m 13s
C	28m 17s	5m 57s	26s	59s
D	15m 3s	3m 18s	12s	33s
E	8m 50s	4m 57s	9s	31s

Unfortunately, the smaller GLM 4.6 model is not compatible with the tool calls of Claude Code. It would be even faster than the 4.7 model, but without a translation bridge it does not work in Claude Code.

Gemma 4 E4B

Gemma 4 effective 4B (gemma-4-e4b) is a relatively small model that works with reasonable speed on an Asus ROG laptop. However, it had a few problems with task C that required multiple rounds of bug fixing.

	HP	ROG	5090	GB10
A	1m 16s	9s	4s	18s
B	11m 5s	49s	23s	54s
C	6m 57s	3m 11s	31s	2m 9s
D	6m 11s	35s	21s	2m 35s
E	12m 4s	51s	27s	2m 10s

Gemma 4 31b

Gemma 4 31b (google/gemma-4-31b) supports reasoning and tool use, but it is a bit too large for the ROG laptop. This model is also picky when it comes to the runtime in LM Studio. I had to switch to the Vulkan runtime on the HP and the CUDA llama.cpp runtime on the ROG; otherwise each task would take up to an hour longer.

	HP	ROG	5090	GB10
A	57s	2m 55s	16s	58s
B	14m 7s	37m 46s	1m 0s	6m 9s
C	12m 50s	33m 47s	50s	4m 19s
D	13m 49s	49m 17s	51s	4m 41s
E	22m 33s	41m 14s	41s	4m 2s

Devstral 2 small

Devstral 2 (mistralai/devstral-small-2-2512) by Mistral AI is an older model that is still often recommended. I would only use this model on an RTX 5090; otherwise it is too slow and makes too many mistakes that take time to fix.

	HP	ROG	5090	GB10
A	15m 9s	18m 52s	8s	5s
B	47m 59s	106m 40s	1m 7s	2m 36s
C	93m 20s	90m 57s	1m 34s	8m 43s
D	63m 48s	67m 11s	44s	2m 20s
E	19m 27s	72m 41s	35s	2m 51s

OpenAI GPT OSS 120B

OpenAI’s GPT OSS 120B (openai/gpt-oss-120b) is the largest model in this test. I expected that I could only run it on the RTX 5090 and the GB10, but it also worked on the two laptops. With its size we see the benefit of the larger unified memory, which gives the GB10 a massive advantage over the RTX 5090, while the HP laptop with its larger shared memory beats the ROG laptop.

	HP	ROG	5090	GB10
A	3m 5s	8m 17s	3m 3s	29s
B	11m 30s	13m 12s	15m 39s	2m 54s
C	21m 29s	32m 25s	14m 51s	2m 47s
D	11m 28s	22m 18s	9m 6s	5m 1s
E	7m 23s	20m 57s	5m 41s	1m 13s

Unfortunately, the 20B version of this model has problems with the tool calls, which makes it useless for work with Claude Code. This is the same problem we have with the smaller GLM model.

Conclusion on speed

The official models in Claude Code offer responsiveness and quality that are not easy to match with a local LLM. As long as we can use the Claude models at the prices we have at the start of June, the hosted models are the sensible choice.

Should the prices go up significantly or no longer be capped with a monthly subscription, then we can choose between multiple local LLMs that deliver similar speed, provided we invest in powerful enough hardware. As this post with its small sample of test cases shows, it matters which model you use on your hardware. As long as the model fits into the (V)RAM of your GPU, you can get fast results. If it is too big, performance drops sharply. But there is more than the model and the hardware that needs to match.

In my first tests the performance of the AMD-based HP laptop was terrible. Creating the NuGet package with the Gemma 4 31b model took 78 minutes with the ROCm llama.cpp runtime. When I switched to the Vulkan llama.cpp runtime, the time went down to 14 minutes. On the other hand, the Vulkan runtime gave me much worse results with the Z.ai GLM 4.7 model. Once you have a reasonable number of models on your shortlist, make sure you test not only the models themselves but also how they perform on the various runtimes we get in LM Studio.