This was the smoke test to see whether the model and Claude can talk to each other at all. In this test case there was no difference: all models created the file test_tools.txt with this content:
Minimalistic test cases like this one are also a great help when it comes to judging the quality of an implementation. If your judging function gives scores from 10 to 100 points for the same two-word string, then you know that you need more than just a few adjustments…
Test B: Build a NuGet package
Here things start to get interesting. At a minimum, we need two files, Program.cs and a *.csproj file, to build the project and turn it into a NuGet package.
Program.cs
The minimal Program.cs file comes from OpenAI GPT OSS 120B on the GB10 and looks like this:
    using System;

    // A simple global .NET tool that randomly selects an extensive "no" justification.
    // The reasons are deliberately verbose and framed in a software‑development context.
    string[] reasons = new string[]
    {
        "After careful consideration of the current sprint backlog and resource allocation, we cannot accommodate this request at this time.",
        "Given the existing technical debt and the need to prioritize critical bug fixes, it's advisable to decline this change.",
        "The proposed modification would introduce a breaking API contract that conflicts with our versioning policy, therefore we must reject it.",
        "Our performance benchmarks indicate that implementing this feature would degrade latency beyond acceptable thresholds, so we say no.",
        "Due to upcoming release freeze and the need to maintain stability for production customers, this request cannot be approved.",
        "The architectural guidelines dictate a single source of truth; adding this redundancy would violate those principles, thus we must refuse.",
        "Security audit findings highlight that this change could open attack vectors, making it unsafe to proceed.",
        "Our CI pipeline already runs at capacity, and adding additional build steps for this feature would cause timeouts, so we decline.",
        "The licensing implications of the third‑party library required for this request are incompatible with our compliance standards.",
        "Stakeholder consensus has not been reached on the priority of this work, leading us to defer it for now.",
        "Given the current sprint velocity and the risk of scope creep, we need to say no to maintain focus.",
        "The change would increase code complexity beyond our maintainability threshold, so we must reject it.",
        "Our monitoring alerts show that similar changes have caused regressions in the past; therefore we decline this one.",
        "The requested feature conflicts with our long‑term roadmap objectives and should not be pursued now.",
        "Resource constraints on the DevOps team make supporting this addition impractical at present.",
        "Compliance requirements mandate a review that we have not yet completed, so we cannot proceed.",
        "The proposed UI alteration would break accessibility standards, necessitating a refusal until it is remediated.",
        "Our load‑testing results indicate insufficient capacity for this change; we must say no to avoid service degradation.",
        "Integration testing coverage would be severely reduced by this addition, violating our quality gates.",
        "Given the current priority of security patches, this request is deprioritized and therefore denied."
    };

    var random = new Random();
    int index = random.Next(reasons.Length);
    Console.WriteLine(reasons[index]);
It uses top-level statements, the newer syntax for minimal applications in C#, and matches what you get from the Claude models. Most other local models go for the old style, with the code wrapped in a namespace block:
    // Program.cs – simple .NET global tool that prints a random, verbose "no" reason.
    // The tool is intentionally lightweight and has no external dependencies.
    using System;

    namespace say_no
    {
        class Program
        {
            static void Main(string[] args)
            {
                // A collection of 20 extensive reasons that effectively say "no"
                var reasons = new[]
                {
                    "After a thorough review of the current architecture and performance implications, we must decline this change.",
                    "Considering the existing technical debt and upcoming release schedule, implementing this would introduce unacceptable risk.",
                    "Given the constraints of our dependency graph and version compatibility, proceeding with this request is not feasible.",
                    "The proposed modification conflicts with core design principles outlined in our architecture guide, so we cannot accept it.",
                    "Due to limited sprint capacity and higher priority items, we must defer this change indefinitely.",
                    "Our current testing strategy would not provide sufficient coverage for such a change, making acceptance unsafe.",
                    "Integrating this feature would violate the established API contract and break downstream consumers.",
                    "The performance benchmarks indicate that this addition would degrade latency beyond acceptable thresholds.",
                    "Security review flags several potential vulnerabilities in the suggested approach; therefore we must reject it.",
                    "Resource allocation limits on our infrastructure prevent us from supporting this additional workload.",
                    "The change introduces a circular dependency that undermines module isolation, so it cannot be merged.",
                    "Our compliance checklist marks this modification as non‑compliant with regulatory requirements.",
                    "User experience guidelines dictate a different flow; implementing this would cause inconsistency.",
                    "Given the lack of clear acceptance criteria, we are unable to proceed with implementation.",
                    "The current codebase lacks the necessary abstractions to safely incorporate this change.",
                    "We have identified a more critical bug that must be addressed before considering this request.",
                    "Our continuous integration pipeline would fail due to missing environment variables required by this feature.",
                    "Stakeholder feedback has indicated a preference for an alternative solution, so we will not move forward with this one.",
                    "The proposed change exceeds the allocated budget for this quarter's development effort.",
                    "In summary, the combination of technical constraints and project priorities leads us to decline this request."
                };

                var random = new Random();

                // Select a random reason and write it to standard output.
                Console.WriteLine(reasons[random.Next(reasons.Length)]);
            }
        }
    }
This is perfectly fine and may still be the way most C# developers write their code. If you have a preference, make sure that you specify it in your prompt.
While most solutions match one of these two samples, there are two outliers worth a closer look. On the ROG, the Devstral 2 small model went for an approach with a separate reasons file:
This feature cannot be implemented at this time due to architectural constraints that would require significant refactoring of the existing codebase.
The current implementation does not support this functionality because it was designed with a different use case in mind, and adding this capability would violate the principle of separation of concerns.
While we understand the value of this feature, our team is currently focused on stabilizing the core platform before adding new capabilities to ensure long-term maintainability.
This change would introduce too many dependencies that could affect system performance and reliability, which are critical for our production environment.
The existing codebase uses patterns that make this specific implementation approach infeasible without creating technical debt that would be costly to maintain.
We've analyzed this request thoroughly and determined it falls outside the scope of our current product roadmap, which is focused on other strategic initiatives.
This functionality would require changes to multiple layers of our architecture, creating integration challenges that would take considerable time to resolve properly.
While technically possible, implementing this feature would introduce complexity that could negatively impact developer productivity and code maintainability.
Our security team has identified potential vulnerabilities in this approach that would need extensive mitigation before we could consider implementation.
The current system design doesn't accommodate this use case within acceptable performance constraints for our target user base.
This change would require API modifications that could break existing integrations, creating compatibility issues for our customers.
While the idea is interesting, it doesn't align with our long-term vision for how this product should evolve in the marketplace.
The implementation complexity and testing requirements for this feature would significantly delay other higher-priority initiatives.
Our infrastructure team has identified capacity limitations that make this particular approach unfeasible at our current scale.
This functionality would require data model changes that could impact existing workflows and business processes.
While we appreciate the suggestion, it doesn't address a critical pain point for our primary user segments based on recent customer feedback.
The technical debt accumulated in this area makes implementing new features extremely risky without first addressing fundamental issues.
Our compliance requirements prevent us from implementing solutions that don't meet specific regulatory standards for data handling.
This change would require changes to our monitoring and observability systems, creating operational complexity we're not prepared to manage.
While the concept is sound, the implementation details present significant challenges that would require more resources than currently available.
Our team has determined this feature falls into a category of "nice-to-have" rather than "must-have" based on our strategic priorities.
While that could work, the empty lines in the file are converted to reasons as well, so we often end up with no answer at all. That was not what I had in mind.
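The empty-line problem is easy to avoid when loading the file. The following is a minimal sketch in Python rather than the tool's C#, and it is not Devstral's code; the helper name load_reasons and the sample text are assumptions made up for illustration. It shows how a naive split keeps blank lines as "reasons", while a filtered load drops them:

```python
import random


def load_reasons(text: str) -> list[str]:
    """Split file content into reasons, dropping blank or whitespace-only lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical file content with blank separator lines between the reasons.
sample = "First reason.\n\nSecond reason.\n   \nThird reason.\n"

naive = sample.split("\n")      # keeps the empty lines as if they were reasons
reasons = load_reasons(sample)  # only the three real reasons remain

print(len(naive), len(reasons))
print(random.choice(reasons))   # never prints an empty answer
```

With the naive split, half of the entries in this sample are empty, which matches the symptom of the tool often printing nothing.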
The other noteworthy outlier was Z.ai GLM 4.7 on the ROG, which wanted to say yes so badly that it created a flag that allows our say_no tool to say yes:
    if (args.Length > 0 && args[0] == "--yes")
    {
        Console.WriteLine("Yes, absolutely. This aligns perfectly with our technical roadmap and architectural vision, and we can proceed with immediate implementation.");
    }
    else
    {
        // Select a random reason
        var random = new Random();
        var index = random.Next(reasons.Count);

        // Print the result
        Console.WriteLine(reasons[index]);
    }
An even more subtle bug was in the solution made by Z.ai GLM 4.7 on the 5090: it only came up with 6 reasons and repeated them until it reached the 20 I requested.
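Repeated entries like these are hard to spot by eye but trivial to detect with a small check. This is a sketch under assumptions: the function name repeated_reasons and the sample data are hypothetical, and the comparison ignores case and surrounding whitespace, which may or may not be what you want:

```python
from collections import Counter


def repeated_reasons(reasons):
    """Return every reason that occurs more than once, together with its count."""
    counts = Counter(r.strip().lower() for r in reasons)
    return {reason: n for reason, n in counts.items() if n > 1}


# Hypothetical data: 6 distinct sentences padded out to 20 entries.
distinct = [f"Reason number {i}." for i in range(6)]
padded = [distinct[i % 6] for i in range(20)]

print(len(padded), len(set(padded)))  # total entries versus distinct entries
print(repeated_reasons(padded))
```

A check like this fits well into an automated judging step, because the list length alone looks correct.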
Overall, I was positively surprised by how well the coding part of this test case was solved.
*.csproj
In the *.csproj file we got a wide range of solutions that went from only adding the minimal configuration to make it an exe to the full-blown option with all metadata fields set. To create the global tool with NuGet, we would need these two settings inside the *.csproj file:
In the configuration part only Gemma 4 E4B failed to deliver, while all other models got the important part right and had variations in the metadata they included.
Test C: Task list in Python
Here we used Claude Code to develop a fully functional task tracker. There should be enough examples in the training data, but we still ended up with a wide range of solutions. At first glance they all work, and we could run through the commands we specified in the prompt. A deeper analysis shows that problems are waiting for any user who does more than just a few operations.
The most common problem is the generation of the next Id. Here most LLMs generated code that looks at the current number of entries and adds 1. That works until we delete entries and then add new ones.
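The Id problem can be reproduced in a few lines. This is a minimal sketch, not taken from any of the tested models; the function names next_id_by_count and next_id_by_max are made up for illustration:

```python
def next_id_by_count(tasks):
    """Flawed: reuses an existing ID once entries have been deleted."""
    return len(tasks) + 1


def next_id_by_max(tasks):
    """Safer: always one above the highest ID currently stored."""
    return max((t["id"] for t in tasks), default=0) + 1


tasks = [{"id": 1}, {"id": 2}, {"id": 3}]
tasks = [t for t in tasks if t["id"] != 1]  # delete task 1, leaving IDs 2 and 3

print(next_id_by_count(tasks))  # 3, collides with the existing task 3
print(next_id_by_max(tasks))    # 4
```

Note that the max-based variant can still hand out the ID of a deleted task if that task had the highest ID. If IDs must never be reused, a counter stored alongside the tasks would be needed.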
Another way to mess real users up is the handling of invalid JSON. All but one of the solutions delete the JSON file and start fresh. Only OpenAI GPT OSS 120B on the 5090 moved the invalid JSON file to a *.bak file and then created a fresh one.
The argument parsing falls into two groups: those that use the argparse module and those that manually fiddle with sys.argv. While both work, the argparse approach is much more robust, and the module has been part of Python's standard library for years.
The best implementation that got all points right was from OpenAI GPT OSS 120B on the 5090. It was at the same level as Opus 4.8 and ahead of every other model, including the new Fable 5:
    #!/usr/bin/env python3
    """Simple command‑line task tracker.

    Usage:
        python tasks.py add "Buy milk"
        python tasks.py list          # show incomplete tasks
        python tasks.py list --all    # show all tasks, including completed ones
        python tasks.py done 2        # mark task ID 2 as completed
        python tasks.py delete 2      # remove task ID 2

    The script stores tasks in a JSON file (tasks.json) located next to the script.
    Each task is a dict with the following keys:
        id          – integer, unique identifier
        description – string, user‑provided text
        created_at  – ISO‑8601 timestamp of creation
        completed   – boolean flag indicating completion status
    """

    import argparse
    import json
    import os
    import sys
    from datetime import datetime
    from typing import List, Dict, Any

    # Path to the JSON file storing tasks. It lives next to this script.
    SCRIPT_DIR = os.path.abspath(os.path.dirname(__file__))
    TASKS_FILE = os.path.join(SCRIPT_DIR, "tasks.json")

    Task = Dict[str, Any]


    def load_tasks() -> List[Task]:
        """Load the task list from ``TASKS_FILE``.

        If the file does not exist an empty list is returned. If the JSON is
        malformed we back up the corrupted file and start with a fresh list.
        """
        if not os.path.exists(TASKS_FILE):
            return []
        try:
            with open(TASKS_FILE, "r", encoding="utf-8") as f:
                data = json.load(f)
            # Ensure we have a list of dicts – defensive programming.
            if isinstance(data, list):
                return data
            else:
                raise ValueError("JSON root is not a list")
        except (json.JSONDecodeError, OSError, ValueError) as exc:
            print(f"Warning: could not read tasks file ({exc}); starting with an empty list.", file=sys.stderr)
            # Preserve the bad file for inspection.
            backup = TASKS_FILE + ".bak"
            try:
                os.replace(TASKS_FILE, backup)
                print(f"Corrupted file renamed to {backup}", file=sys.stderr)
            except OSError:
                pass  # If we cannot back up, just ignore – the original will be overwritten.
            return []


    def save_tasks(tasks: List[Task]) -> None:
        """Write ``tasks`` back to ``TASKS_FILE`` using UTF‑8 and pretty printing."""
        with open(TASKS_FILE, "w", encoding="utf-8") as f:
            json.dump(tasks, f, indent=2, ensure_ascii=False)


    def add_task(description: str) -> None:
        tasks = load_tasks()
        next_id = max((t["id"] for t in tasks), default=0) + 1
        task: Task = {
            "id": next_id,
            "description": description,
            "created_at": datetime.now().isoformat(timespec="seconds"),
            "completed": False,
        }
        tasks.append(task)
        save_tasks(tasks)
        print(f"Added task {next_id}: {description}")


    def list_tasks(show_all: bool) -> None:
        tasks = load_tasks()
        filtered = tasks if show_all else [t for t in tasks if not t.get("completed", False)]
        if not filtered:
            print("No tasks to display.")
            return
        # Sort by ID for stable output.
        filtered.sort(key=lambda t: t["id"])
        for t in filtered:
            status = "✓" if t.get("completed") else " "
            created = t.get("created_at", "?")
            print(f"[{t['id']}] [{status}] {t['description']} (added: {created})")


    def mark_done(task_id: int) -> None:
        tasks = load_tasks()
        for t in tasks:
            if t["id"] == task_id:
                if t.get("completed"):
                    print(f"Task {task_id} is already marked as done.")
                else:
                    t["completed"] = True
                    save_tasks(tasks)
                    print(f"Task {task_id} marked as completed.")
                return
        print(f"Error: no task with ID {task_id} found.", file=sys.stderr)


    def delete_task(task_id: int) -> None:
        tasks = load_tasks()
        new_tasks = [t for t in tasks if t["id"] != task_id]
        if len(new_tasks) == len(tasks):
            print(f"Error: no task with ID {task_id} found.", file=sys.stderr)
            return
        save_tasks(new_tasks)
        print(f"Task {task_id} deleted.")


    def main() -> None:
        parser = argparse.ArgumentParser(description="Simple task tracker")
        subparsers = parser.add_subparsers(dest="command", required=True)

        # add command
        add_parser = subparsers.add_parser("add", help="Add a new task")
        add_parser.add_argument("description", help="Task description")

        # list command
        list_parser = subparsers.add_parser("list", help="List tasks")
        list_parser.add_argument("--all", action="store_true", help="Show completed and incomplete tasks")

        # done command
        done_parser = subparsers.add_parser("done", help="Mark a task as completed")
        done_parser.add_argument("id", type=int, help="ID of the task to mark as done")

        # delete command
        del_parser = subparsers.add_parser("delete", help="Delete a task")
        del_parser.add_argument("id", type=int, help="ID of the task to delete")

        args = parser.parse_args()

        if args.command == "add":
            add_task(args.description)
        elif args.command == "list":
            list_tasks(show_all=args.all)
        elif args.command == "done":
            mark_done(args.id)
        elif args.command == "delete":
            delete_task(args.id)
        else:
            parser.print_help()


    if __name__ == "__main__":
        main()
The worst example was Gemma 4 E4B on the ROG, which not only made the same mistakes as most other models but also came up with many more brittle ways to fail and topped all that with its own data structure.
This test case was a great reminder of how important thorough testing is to spot problems like the Id creation. After all, not only local LLMs produce this problem, but also some of the Claude models. Therefore, do not trust the code and test it yourself!
Test D: Broken CSV summariser
While this test case looks easy, it is a treasure trove of subtle bugs. For example, these two snippets look similar, but only one will do what one expects:
While solution A skips over rows with missing values, solution B gets an empty string back and continues until it crashes at the cast to int.
Only these two models generated code capable of handling real-world CSV issues (such as empty rows, invalid text or missing headers) by using defensive try-except blocks and safe default values for .get():
Gemma 4 E4B
OpenAI GPT OSS 120B
This is how OpenAI GPT OSS 120B solved this problem on the ROG laptop:
Claude Fable 5: Unsafe parsing; lacks error handling if fields contain non-numeric data.
Claude Haiku 4.5: Unsafe parsing & brittle read; lacks error handling and risks a KeyError on direct dictionary lookups.
Claude Opus 4.8: Unsafe parsing; lacks error handling and forces strict int() casting.
Claude Sonnet 4.6: Unsafe parsing; lacks error handling and forces strict int() casting.
CSV parsing is hard and easy to mess up - even with the official Claude models.
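The defensive pattern described in this section, try/except around the conversion plus safe defaults for .get(), can be sketched as follows. This is not the output of any tested model; the function name summarise, the column names name and amount, and the sample data are assumptions chosen for illustration:

```python
import csv
import io


def summarise(csv_text, column="amount"):
    """Sum an integer column, skipping rows that are missing or malformed."""
    total, skipped = 0, 0
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # A missing header or a short row yields None, so fall back to "".
        raw = (row.get(column) or "").strip()
        try:
            total += int(raw)
        except ValueError:
            skipped += 1  # empty or non-numeric value
    return total, skipped


# Hypothetical input: one empty value and one non-numeric value.
sample = "name,amount\nalice,10\nbob,\ncarol,abc\ndave,5\n"
print(summarise(sample))  # (15, 2)
```

Counting the skipped rows instead of silently dropping them is a design choice: it lets the caller decide whether a partially broken file is still acceptable.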
Test E: Find Hashtags
The hashtag challenge was solved much better than the previous one, partly because finding hashtags is a common example for text parsing, so there should be lots of examples in the training data.
The models that did particularly well, with optimal regular expressions and good overall performance, were these:
Qwen3 Coder 30B (5090 / ROG / GB10)
Gemma 4 E4B (5090 / HP)
As an example, this is how Qwen3 Coder 30B solved this problem on the 5090:
    import re


    def hashtags(text):
        # Find all hashtag words (including those with trailing punctuation)
        hashtag_pattern = r'#\S+'
        matches = re.findall(hashtag_pattern, text)

        # Process each match
        processed_hashtags = []
        seen = set()

        for match in matches:
            # Remove the leading '#'
            hashtag = match[1:]

            # Strip trailing punctuation (but keep internal punctuation like hyphens and underscores)
            hashtag = re.sub(r'[.,!?:;]+$', '', hashtag)

            # Lowercase
            hashtag = hashtag.lower()

            # Skip if empty after processing
            if not hashtag:
                continue

            # Only add if we haven't seen it before
            if hashtag not in seen:
                processed_hashtags.append(hashtag)
                seen.add(hashtag)

        return processed_hashtags
Another nice solution came from Gemma 31B on the GB10. Instead of manually maintaining a set() alongside a result list within the loop, this solution collects all tags first and then uses Python's built-in dict.fromkeys() trick to deduplicate them.
Since Python 3.7, dictionaries are guaranteed to maintain insertion order. Passing a list through dict.fromkeys() creates a dictionary where the tags are keys (which inherently removes duplicates), and calling list() on it converts it back. It is clean, highly idiomatic, fast, and runs in linear time. However, the less-than-optimal string handling around each tag pushed it out of the leading group:
    def hashtags(text):
        tags = []
        for word in text.split():
            if word.startswith("#"):
                tag = word[1:].rstrip(",.!?:;").lower()
                if tag:
                    tags.append(tag)
        # Deduplicate while preserving order
        return list(dict.fromkeys(tags))
At the end of the list was Sonnet 4.6, because it only used a list and made all checks against that “unique” list, which will hurt performance for long texts:
Finding those performance differences was not possible with the minimalistic tests we had to check the functionality. But let us keep that in mind for the production code we write with Claude Code, no matter the model and whether it runs in the cloud or on the local machine.
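To make the performance point concrete, here is a minimal sketch of the two deduplication styles. It is not Sonnet 4.6's actual code; both function names are made up for illustration. The results are identical, which is exactly why a small functional test cannot tell them apart:

```python
def unique_with_list(tags):
    """Order-preserving dedup using only a list: each 'in' check scans the list."""
    unique = []
    for tag in tags:
        if tag not in unique:  # O(len(unique)) per lookup
            unique.append(tag)
    return unique


def unique_with_set(tags):
    """Same result, but membership is checked against a set."""
    seen, unique = set(), []
    for tag in tags:
        if tag not in seen:    # O(1) on average
            seen.add(tag)
            unique.append(tag)
    return unique


tags = ["python", "ai", "python", "llm", "ai"]
print(unique_with_list(tags))
print(unique_with_set(tags))
```

With many distinct tags, the list-only variant does quadratic work in total, while the set-based one stays roughly linear.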
Conclusion on code quality
On the initial checks all models produced working code. Some got there at the first attempt; others needed a few extra rounds of passing the error messages back to Claude. In either case, what came out looked good and matched most, if not all, of my expectations.
A more thorough examination reveals hidden problems, like missing error handling, overly generous validation or repeated phrases to say no. Finding these kinds of problems is a much harder task and is often overlooked.
Even with the small set of test cases I used, I already got very mixed results for the different models. For example, Gemma 4 E4B did great in test case D but failed both B and C. Depending on the task you have, a model may be fast but wrong, or slower but produce better results.
Neither Z.ai GLM 4.7 nor Devstral 2 small convinced me. Too often they produced results that were not that good, and combined with the time they took on the laptops, it is just not worth waiting so long for such a result.
For the next tests I would focus on OpenAI GPT OSS 120B and Qwen3 Coder Next. They are not the fastest models, but they create a decent solution that covers more edge cases than most other models.
Next
After this deep dive into local LLMs and Claude Code we can go one step further and try to run GSD PI with a local LLM. After all, the changes in pricing in Claude Code hit tools like GSD PI with full force.