jLuger.de - My experience with local LLMs

I have been using local LLMs for several months, and after some cloud-based AI developer services increased their prices or limited usage (like Cursor, GitHub Copilot, ...), I want to share my experience. The good news is that local LLMs have improved and can deliver good results. Sometimes. The mileage still varies.

The setup
I have never used online tools/chats, and I never let the AI write files directly to disk. I have used the local LLMs only as chat, either via Open WebUI or the Continue plugin in IntelliJ. The software that executed the LLMs was Ollama. When using the Continue plugin I sometimes provided my code as file context. The LLM then output the whole file with the changes I wanted, and I copied & pasted the changes into the file.
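If you want to see what these front-ends do under the hood, here is a minimal sketch that talks directly to a local Ollama server over its HTTP API on the default port 11434. The model name is only an example; use whatever you have pulled.

  import json
  import urllib.request

  # Ask a locally running Ollama server a single question via its chat API.
  # Assumes Ollama listens on the default port 11434 and the model is pulled.
  def ask(prompt, model="mistral-small3.2"):
      payload = {
          "model": model,
          "messages": [{"role": "user", "content": prompt}],
          "stream": False,  # one complete answer instead of a token stream
      }
      req = urllib.request.Request(
          "http://localhost:11434/api/chat",
          data=json.dumps(payload).encode("utf-8"),
          headers={"Content-Type": "application/json"},
      )
      with urllib.request.urlopen(req) as resp:
          return json.loads(resp.read())["message"]["content"]

  print(ask("Explain what a ViewModel is in Android in two sentences."))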

The good
I've used the LLMs mostly to develop some simple Android apps. At first I asked the LLMs via chat to solve the problems I had. This provided some nice help, but I still had to adapt the answers to my code. Then I switched to the Continue plugin and provided the file I was working on as context. This increased my development speed a lot, because now I could copy & paste the answer directly into my code. It also worked with the MainActivity that IntelliJ generated at project creation. E.g. I just asked it to generate two text fields with labels and got the complete code. One time I noticed that I had given the wrong text to one of the labels. I changed the label text by hand in the code and then asked the LLM to do another change. Not only did it make this new change, it also noticed my label change and adjusted the variable name that belonged to the label.
Another time I found some code I wanted to use, but it wasn't in the right programming language. While the code was simple, it was a bit lengthy, so I asked an LLM to do the conversion for me. It actually worked.
I've also found some formulas in a picture. The formulas used fraction bars. I asked a vision model to extract the formulas. Not only was the extracted text correct, but when converting the fraction bars to the "/" sign, the LLM also added the brackets needed to keep the formulas correct.

The bad
While writing this text, I repeated the vision-model test, but with another model. This time it didn't add the brackets, which made the extracted formula wrong. Of course this could also happen with the big LLMs from the cloud, but with local LLMs it becomes a YOU problem. You have to test the models and pick one yourself.
I've also tried some code conversion where the code was pretty complex, or rather complex to translate (because language "a" offered features that didn't exist in language "b"), and there the LLM produced too much crap to be useful.
Speaking of crap: although the LLMs sometimes do really good work, a lot of the results were more of an "80%" solution. With some knowledge of the programming language/framework, the output was easy to fix, and that still saved some time. But you definitely don't want to use the output when you can't verify it.
Then there are small tasks where writing the query and waiting for the output took longer than just changing the code by hand.
I've also noticed that the quality of the output directly correlates with the quality of Google search results. When Google doesn't return an answer, don't waste your time asking the LLM. To be honest, I haven't usually done it in that order. Most of the time I got nonsense from the LLM and only then found that there was no easy solution on Google either. In the other situations the LLM output was right but sounded like nonsense, so I went to Google and found plenty of search results confirming that the LLM was right.
Another thing is that all LLMs have a cut-off date. This means they have no knowledge about anything after that date. E.g. when there is a new release of a library with breaking changes after the cut-off date, the LLM will generate code for the old version. While gpt-oss:20b was released on 2025-08-05, its cut-off date is 2024-06-01. That's a lot of time in IT for new library versions. Of course this can be fixed by extending the system prompt or via RAG, but again it becomes a YOU problem.
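To make the system-prompt workaround concrete, here is a hedged sketch. The library name and the renamed method are made up; the point is the pattern of pasting post-cut-off release notes into a system message, sent over the same local Ollama chat API as in the earlier sketch.

  import json
  import urllib.request

  # Hypothetical example: tell the model about a breaking change that
  # happened after its cut-off date by putting it into the system prompt.
  SYSTEM = (
      "You write Kotlin for Android. Note: in version 3.0 of the "
      "(hypothetical) acme-http library, AcmeClient.call() was renamed to "
      "AcmeClient.execute(). Never use the old name."
  )

  payload = {
      "model": "mistral-small3.2",  # example model
      "messages": [
          {"role": "system", "content": SYSTEM},
          {"role": "user", "content": "Write a small GET helper using acme-http."},
      ],
      "stream": False,
  }
  req = urllib.request.Request(
      "http://localhost:11434/api/chat",
      data=json.dumps(payload).encode("utf-8"),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      print(json.loads(resp.read())["message"]["content"])
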
On a side note, I once had the case that the LLM insisted it didn't know about a framework that seemed to have been around for quite some time. After some queries it hinted that there was no framework but a library of that name, yet it didn't provide any details about it. So I replaced "framework" with "library" in the question and suddenly the LLM knew a lot about it. That was very pedantic for a system that normally has no problem inventing facts.

Which model to choose
When you go to the Ollama model page you will find a huge list of LLMs you can download. If you look closely you can see tags like 4b or 8b. This is the number of parameters a model has, where b stands for billion. If you don't know what the number of parameters means, you can view it (really simplified) as the amount of brain power. The more the better.
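If you already have a few models pulled, you can also ask the Ollama server itself what is installed instead of browsing the website. A small sketch; the exact fields in the response are an assumption and may vary between Ollama versions.

  import json
  import urllib.request

  # List the locally pulled models with their parameter size, as reported
  # by the local Ollama server. Field names may differ between versions.
  with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
      data = json.loads(resp.read())

  for m in data.get("models", []):
      size = m.get("details", {}).get("parameter_size", "unknown size")
      print(m.get("name"), "-", size)
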
I have tested models ranging from 1.5b to 30b. While that is a factor of 20, all these models are considered small models. Rumors say that GPT-4 has about 1.76 trillion (that is 1760b) parameters. That's why 30b is still considered small. The 1.5b model I tested was deepseek-r1:1.5b. This was by far the worst model I've used. It was very hard to get any correct feedback out of it. Once I even managed to make it output incorrect sentences (grammar- and spelling-wise). On the other hand, you may have a use case where this is enough. But please don't use it to judge the power of local LLMs.
For 4b I've tested phi4-mini. It generates a lot of text that reads well and looks good. Unfortunately the number of incorrect facts and broken code snippets is still too high to be usable. I didn't have the time to find out how far I could get with careful prompt crafting and RAG, but if I had to use a really small model I would try this one.
With 7b/8b the fun part begins. I've used several models in this range, and here you get good enough facts/code. Of course they still hallucinate, but at a manageable level. And the big thing is: an 8b model has only one third of the brain power of a 24b model. You will notice this, because the 8b models only understood English or the language they were specifically fine-tuned for. The ones that claimed to be multilingual spoke pretty poor German and switched back to English several times, which I count as not speaking German. In my opinion the 8b models also need more carefully crafted prompts, and you have to scale down the complexity of the prompts more often.
The 24b models are the ones I have tested the most, mainly the mistral-small3 family (including devstral). While they aren't perfect, they work really well. The generated code is of high quality, and they are not only multilingual but can also describe images.
For a short time (they hadn't been available for long when I wrote this) I've also tested gpt-oss:20b and qwen3-coder:30b. After a few tests they seem to be at the same level as mistral-small3.2, but please test yourself. Your mileage may differ.
While I have advocated using the largest model, I will now do the opposite, because with the quality increase comes an increase in resource usage. You need enough memory to run the models. When I started a 24b model on my MacBook Air with 16GB RAM, it was even slower than on the 10th gen Intel quad core CPU (yes, C as in central), and there it was already too slow. You need about 1GB of RAM per 1b parameters. So a 14b model could run on my Air, but only if I didn't want to do anything else. In practice I'm limited to 8b models on the Air. On my Ryzen 5700G with 32GB RAM I could start mistral-small3.2 on the CPU (again the C) and it performed better than on the Intel CPU, but it was still a little bit too slow. And then OpenAI released gpt-oss:20b. While it isn't as fast as on my big MacBook (not the Air), it was fast enough on my Ryzen 5700G to be usable.
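The rule of thumb is easy to turn into a back-of-the-envelope check before pulling a model. This only encodes the rough 1GB per 1b parameters estimate from above; the headroom value is my assumption, and quantization and context length shift the real numbers.

  # Back-of-the-envelope check: roughly 1 GB of RAM per 1b parameters,
  # plus a couple of GB headroom for the OS and the context. Only a rough
  # estimate; quantization and context length change the real footprint.
  def fits_in_ram(model_params_b, total_ram_gb, headroom_gb=2):
      needed_gb = model_params_b * 1.0 + headroom_gb
      return needed_gb <= total_ram_gb, needed_gb

  for params in (8, 14, 24):
      ok, needed = fits_in_ram(params, total_ram_gb=16)
      print(f"{params}b model: ~{needed:.0f} GB needed, fits in 16 GB: {ok}")
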
Enough about model size. Let's get to another fun topic: licenses. All models listed on the Ollama site are open source models, but the allowed usage differs. Some models have well known and commercially friendly licenses like Apache 2.0 or MIT. Others have their own. Their own, as in: you'd better ask a lawyer what you are allowed to do and what not. To get to the license, select the model and, if the model comes in different sizes, select the size that you want to test. The text "license" in the table is clickable and will show you the full license text.
While you are on the model pages, you should also look at the context column. There you see how many tokens are allowed as input. While it is hard to fill even 4k tokens by hand, things change when you want to chat with documents. A larger context value means you can chat with larger documents.
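When a front-end doesn't expose that setting, you can pass the context window yourself. A minimal sketch, assuming the same local Ollama HTTP API as in the earlier examples; the num_ctx value must stay within what the model page lists.

  import json
  import urllib.request

  # Ask for a larger context window than the default by setting num_ctx.
  # The value must stay within what the chosen model actually supports.
  payload = {
      "model": "mistral-small3.2",      # example model
      "prompt": "Summarize the following document: ...",
      "stream": False,
      "options": {"num_ctx": 16384},    # context length in tokens
  }
  req = urllib.request.Request(
      "http://localhost:11434/api/generate",
      data=json.dumps(payload).encode("utf-8"),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      print(json.loads(resp.read())["response"])
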
And as if all of that weren't enough, I have to add a few words about reasoning models. In theory they provide better output because they think first. In practice the model spills out large amounts of text I don't care about and makes me wait longer than I'd like until I get the text I actually want to read. While you should be able to turn the thinking off, I only managed to do so for some deepseek models, not all of them. That's why I prefer non-reasoning models or models with low reasoning effort like gpt-oss:20b.

Context matters
"Context matters" is a phrase you hear with every LLM but my impression is that it is much more true with local LLMs. You really need to make sure that your first questions have enough context in it. Later in the chat you can go very sloppy and still get good answers. To understand this let me explain how LLMs work.
When you download a model, the downloaded files won't change until you download an update or retrain them. Normal chat is not retraining. So every time you load the model into memory you start with the same copy. But wait, aren't LLMs famous for remembering things from the chat? Yes, that is true, but only for the current session. Here is how it works: if you start a session and send a question to the LLM, you get a response. For the second question, though, you don't send only the second question; you take the first question, the answer to the first question and the second question as one text and send the whole text to the LLM. So when you send the fifth question, you send a text that already contains four questions and four answers before your fifth question. That's what allows an LLM to follow a conversation. It also explains why your first questions need to carry enough context: they stay part of every later request. And it has another implication, which gets its own section below.
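To make the mechanism visible, here is a minimal sketch of that loop against the local Ollama chat API used in the earlier examples; the model name is again just an example, and the growing history list is the whole "memory" of a session.

  import json
  import urllib.request

  # The "memory" of a chat session is nothing but this growing list.
  history = []

  def chat(user_text, model="mistral-small3.2"):
      history.append({"role": "user", "content": user_text})
      payload = {"model": model, "messages": history, "stream": False}
      req = urllib.request.Request(
          "http://localhost:11434/api/chat",
          data=json.dumps(payload).encode("utf-8"),
          headers={"Content-Type": "application/json"},
      )
      with urllib.request.urlopen(req) as resp:
          answer = json.loads(resp.read())["message"]["content"]
      # Keep the answer, so the next question is sent together with
      # everything that was said before.
      history.append({"role": "assistant", "content": answer})
      return answer

  chat("Here is my Activity code: ...")  # the first question carries the context
  chat("Now rename the second label.")   # resends question 1 + answer 1 + this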

Use the edit button
OK, this is the implication announced in the last section, but I think it is important enough to get its own section. When a person reacts wrongly to what we said, we explain how it was meant. As tempting as that is, don't do it with a local LLM. I had a really hard time convincing them to do the right thing afterwards. And even when I succeeded in convincing the LLM, it used the wrong output again in later questions. It turned out to be far easier to just edit the last prompt and try again. Also remember that LLMs have a temperature. This is usually a value between 0 and 1; the higher it is, the more randomness you get in the answer. That means asking the same question twice will result in different answers. While the quality will not jump from nonsense to high quality, a retry may add a much-needed keyword to the output.
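In terms of the earlier sketch, "use the edit button" just means replacing the last user message instead of appending a correction on top of the wrong answer, and the temperature is one more entry in the options. A hedged, self-contained sketch; the model name and the messages are placeholders.

  import json
  import urllib.request

  # A session where the last question produced a wrong answer.
  history = [
      {"role": "user", "content": "Generate two text fields with labels: ..."},
      {"role": "assistant", "content": "...the answer that went wrong..."},
  ]

  # "Edit button": drop the failed turn and resend a reworded question,
  # instead of appending a correction on top of the wrong answer.
  history = history[:-2] + [
      {"role": "user", "content": "Generate two text fields, each with its label above it: ..."},
  ]

  payload = {
      "model": "mistral-small3.2",        # example model
      "messages": history,
      "stream": False,
      "options": {"temperature": 0.3},    # lower value, less randomness
  }
  req = urllib.request.Request(
      "http://localhost:11434/api/chat",
      data=json.dumps(payload).encode("utf-8"),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      print(json.loads(resp.read())["message"]["content"])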

Check out the Ollama CLI
Ollama is a server, but it also has a CLI with the same name. There were two occasions when the front-end programs failed and the CLI saved me: once I wanted to use "few-shot learning", and another time I wanted a summary of a large text file. In both cases I could solve the task with the Ollama CLI after the front-ends refused to work. In the process I noticed that the generic chat interfaces limit the context token length to something that works with any model and don't adjust the limit per model. When you hit that limit, you need to raise it manually to the value that the selected LLM supports.
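For completeness, here is what the few-shot trick looks like when you build the prompt yourself. I'm showing it against the HTTP API used in the earlier sketches rather than the CLI, and the example pairs are made up; raising num_ctx as shown in the context section is what gets you past the token limit for large inputs.

  import json
  import urllib.request

  # Few-shot prompting: show the model a couple of input/output examples
  # (made up here) before the real input, all inside one prompt.
  examples = [
      ("ERROR 2024-01-02 disk full on /dev/sda1", "storage"),
      ("WARN  2024-01-03 certificate expires in 5 days", "security"),
  ]
  lines = ["Classify each log line into a category.", ""]
  for text, label in examples:
      lines += [f"Line: {text}", f"Category: {label}", ""]
  lines += ["Line: ERROR 2024-01-04 connection refused by database", "Category:"]

  payload = {
      "model": "mistral-small3.2",        # example model
      "prompt": "\n".join(lines),
      "stream": False,
      "options": {"num_ctx": 8192},       # raise for large prompts
  }
  req = urllib.request.Request(
      "http://localhost:11434/api/generate",
      data=json.dumps(payload).encode("utf-8"),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as resp:
      print(json.loads(resp.read())["response"])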

Conclusion

I think local LLMs have reached a state where they deserve to be tested. Yes, you need to learn a lot about LLMs and it will take some time. But a lot of that knowledge also applies to the big LLMs you use at cloud providers. Understanding it will also help you when you go back to a cloud provider.