What are Local Large Language Models? How do you use them?

Adam Longmire
Dec 15, 2024
14 min read

Updated: Sep 10, 2025

Introduction

Large language models are machine learning algorithms typically trained on a vast amounts of data commonly scrapped which means scanned and fed into them to create human like responses, but did you know you actually don't need an account with OpenAI or any other cloud provided large language model and in fact run it locally from your own computer without paying a cent to OpenAI or other Software As A Service - SaaS systems. There is typically two ways to run large language models, one of the easiest is to use a CPU, now a CPU can perform incredibly well but most LLM systems, will require a very heafty CPU something like an AMD Epyc or AMD Threadripper, but it depends typically more cores the better a locally hosted LLM will perform. However the best option with runnning a local model is by using a GPU, GPU offloading can improve performance right across the board in terms of model inferencing speed.

Large langauge model modes

Large language models during the phase where they are ingesting data to be "trained" this process involves modifying weights and biases, what these are are mathematical structure used to represent the underlying generalisation of the model. And when this is happening a model is in what is known as "training mode" objectively training model requires obscene amounts of "compute time" for example ChatGPT-4. Required over 90 to 100 days with each system taking up 2100 hours per system. * 25,000 GPUs. See: (https://seifeur.com/gpt-4-training-time/) the other mode is called inferencing mode this is typically how most LLMs are used in their primary state, in inferencing mode you provide a "prompt" that prompt is then read in via tokenization, and quantification of the data in the "tokens" and then a probabilistic output on the input given. To summarise LLMs operate in two modes.

Inferencing - Model is being used to generate tokens (letters, words, characters), this is how you come with ways to use an LLM, prompt engineering.
Training - Model is being actively fed data from the extremely large corpus of data to train the model, in this mode is also the process of REHLF (Re-enforcement by Human Feedback) which is a reward / punshiment system where certain values are given scores to reward the model for doing a good job and punished to discourage the behaviour or output present, such as case would be using REHLF to remove from the likely emergant property of spouting racist garbage due to the nature of how these are trained they basically suck up the internet which inherently leads to unsavoury places being also accumulated in the datga set.

How do I prompt engineer to I get what I want out of an LLM?

As much as companies are interested in prompt engineers this isn't a particularly difficult subject, prompt engineering is a type of alignment solution or align specification is the word I use to describe it as, just as a model is trained on a giant corpus of data, the more detailed your prompts are and the more specification given the better the models outputs, there are "lazy" hacks lets call them to use keywords of known information, but this just dirty and limits your true capability with an LLL, prompt engineering is simply defining sufficient detail and "alignment specification" to give the output you want, it is not unlike the skills writers use when describing the story, description of characters, and the world buiilding the detailed the world is the better an LLM performs as they perform best when given lots of 'data' to work with.

Essay Example

Lets give an example of prompt engineering and how to protect yourself by defining a sufficient guard rail to analyse an idea, as LLMs are sycophantic by nature, they will agree with what ever you say this is dangerous, and to avoid it you should include, in your prompt.

Lets say you want to write an Essay for school, using an LLM to do this for you is lazy and rife with risks, but there is a way to get a good way to write an Essay. An example prompt engineer might be the following. "I need to write an Essay on the environmental impact of plastic bags, provide my some ideas, and provide objective reasoning and justification why do to write the Essay for me and provide some ideas including less known ideas"

This here is a good prompt because you are providing a decent amount of limiting specification on another side note this is actually related to artificial general intelligence safety alignment which you can see here (Connection Between Technology Complexity Safety and Risk ) Once an appropriate specification is defined you can then begin refining it further over successive iterations of prompts which form the basis of further specification definition.

Scenario Gaming

Another good example is what we call wargame modelling or scenario gaming, you can use this prompt technique to define parameters for scenarios, this could be anything from an emergency to law enforcement, to possible mundane analysis of everyday tasks.

Prompt engineering in scenario gaming is not different to the the descriptive techniques of language writing, and how writers go about doing it, an example
"Given a scenario of an individual presents a set of symptoms in the emergency room, generate a set of possible candidates of what might be wrong with the individual", this sort of gaming really should only be done by doctors, but none the less IF you do this kind of gaming, you MUST keyword MUST be able to verify any outputs IF you can then you are OKAY if you can't then do not use the advice given I cannot stress this enough (CRITICAL THINKING MUST BE USED IN HIGHSTAKES APPLICATION), LLMS will hallucinate and will lie to you" SO IF you are a training nurse, doctor or anything else check your textbook against what the LLM has said YOU HAVE BEEN WARNED. I WILL NOT BE HELD RESPONSIBLE FOR THE CONSEQUENCES OF THE WRONG USE OF THIS INFORMATION

Programming Example

Please do not "vibe code" with programming this is a dirty word and if you cannot already program, make sure you get an LLM to explain the code it generates, be cautious of numerical values it uses in the code especially in unmanaged code languages like C or assembly as they have no safety code guard rails so if you inadvertantly introduce a vunerability like a buffer overflow it can cause all sorts of problems, and even crash your application, leak memory or just plain cause undefined behaviour iike if a memory region in C is not initialized it will result in "garbage" being at that memory region what ever that looks like.

Note - LLMs are best used with supporting critical thinking to leverage their effectiveness to their highest potential as long as you do this these make phenomenal tools to support, business, education and anyone, LLMS are not replacements for workers they are force multipliers.

Options for running large langauge models

There is many options to run large language models ranging from the very simple LM studio to the advanced APIs which mimic OpenAI's end points like LangChain, allowing you to develop a backend server like system to communicate with an LLM.

LM studio - A simple platform with an inference as well as giving you tight fine control over the entire system, including attention, context memory, GPU offloading, model selection, and other various engines to run the system on including CUDA and Vulkan libaries. (I am particular biased toward LM studio I have tried others and prefer LM studio for it's flexibility, configurability and builtin OpenAI like server you can interact with via command line when running LM studio in the "Developer mode", if you are particularly adventrous you can run an LM studio instance in your home combined with a WireGuard VPN connection and access the terminal of your LM studio instance from anywhere in the world providing FULL data sovereignty over your entire data.

Idea: Home based ChatGPT instance

Give you have access to a server with Windows server 2019 or higher you can easily run your own ChatGPT server via LM studio then VPN into your home network, and you now have a fully private form of ChatGPT no one will ever be able to access, WireGuard itself is very secure 1 due to it's small tiny code base and two it REQUIRES Publickey / Privatekey cryptography to work, an WireGuard endpoint will ignore all authrequests unless they are in it's private-key database no private key no connection simple as that. If you are a programmer building a small terminal based access system where you can login to your server via PowerShelll, Bash or any other shell interface could also reduce the amount of bandwidth you require, allowing you to setup a simple print evaluate loop, and add in simple command line request command line "Delete Chat etc"

Jan - A fully open-source chat interface giving a close to chatgpt like experience however one particular problem with Jan is it does not have as many selection choices for running models as quantized form, which will make it's ease use more difficult, compared to LM studio which gives you absolute control.

Ollama - Ollama also has many of the features both Jan and LM studio have however again they have a limited catelogue of accessible models making it not the best option Ollama is also more difficult to configure.

Model Inferencing Performance Optimization

Models are all subject because they are in a computer to optimization and resource constraints based on the resources available, the more resources availability the less heavily optimized models have to be, it also depends on other parameters such as, context memory and flash context memory. Optimizing a model involves looking at multiple facets including quantization, hardware considerations, and other configurable options inside many local LLM hosting systems.

Processing unit

Models can typically be run on the CPU, GPU or CPU + GPU. In the scenario where you are only using the CPU, a LLM will only have access to your RAM, and CPU this means model data will be stored in this area and will likely have token generation speed reduction, even with the fastest consumer GPUs when you are likely to achieve close to GPU speed you'd be looking at as stated before enterprise and workstation grade processors, such cases are AMD Epyc CPUs and Threadrippers with the 64+ cores, in this case an LLM is likely to perform better, than when using a consumer grade or prosumer grade CPU. Alternatively GPUS or "graphics processor units" are suited very well to accelerating the performance of token generation, the beefer the GPU the better that performance, as GPUs internally in their Gigathread engine (common nvidia term) can parallelize millions of threads unlike the CPU which maybe limited to a certain number of executable CPU threads.

Apple consideration - UMA (Unified Memory Architecture)

One other thing to note, Apple machines are very much capable of running LLMs, especially since there is no delimiter between the type of memory used, as Apple devices borrow from the foundations of mobile devices called UMA (Unified Memory Architecture) often being obscenely faster than both GPU memory and CPU memory. Apple devices especially MacBook Pros can run LLMs so can Airs, however you would need at least 16GB to 24GB of UMA memory to get good performance with a Mac machine.

Quantization - This is a fairly simple concept in a computer we have "boxes" these are a 'certain size' and those boxes can fit a certain sized smaller box, into that memory block, if the box is larger it can fit more into it if it's smaller it can fit less into it, smaller boxes take up less space than a larger one, this is not unlike how data is stored with model parameters in a LLM. In this sense numbers on a computer are stored in a finite number range, and each 'data type' of an LLM is used to give to define the weights and biases in the model. A F64 storage data type which is often a double in C++ is 64bits of precision most models do not need this level of precision so most of them default to FP32 or 32bit floating point representation, a FP32 bit which is 4 bytes of data will potentialy take up on the input layer of say a 150 Billion parameter is 150*10^9 * 4 byes = 90,000GB (potentially for the input layer

Each quatization can potentially reduce the models overall space occupation and improve memory efficiency and model inference speed. The typical values in models are the following

FP32 (Floating Point IEEE 754 32bit)
FP16 (Floating Point IEEE 754 16bit)
INT32 (Integer representation 32bit integers)
INT16 (Integer representation 16bit integers)
INT8 (Integer representation 8bit Integers)
INT2 (1 or 0) (bit based model)
INT 1.58 (new quantization which is encouraging to be run on a mobile phone potential for pocket LLMS ~878MB of memory or less) (Some model runner systems can't use bitllm.cpp yet developed by Microsoft)

Quantization can lead to model accuracy performance reductions, however, this may not always be the case, test out different configurations, in addition to this some models have flash memory or context memory quantization which is the actively stored current conversational data "held in context memory" this memory can be quantized as well as the model already being quatized which improves the memory utilization leading tgo both model performance speed up, and space complexity reduction (Fancy word to describe how much space a program takes up)

Model pruning - Not typically done by the user, but some model cards will have information about pruning what this does is "disposes of model parameters" which do not contribute to the overall function of the neural network, that is when we learn in our brains as we get older, our brains perform something called 'neural pruning' that involves getting rid of non-contributing connections to the storage of information, by doing this the brain frees up resources for development of other neural structures, machine learning models like LLMs are much the same. as this.

Context Memory

Is crtiical to the operation of any LLM this is where your current conversation is stored, and the context memory is typically isolated between chats, however shared global context memory implementations have been considered as a possible idea, however I have yet to see this implemented. Bigger context memories lead to larger RAM consumption both in VRAM and system RAM, if your model starts going over the VRAM allocation size of your GPU the system RAM will start to be used as context memory and the application will play a game of swapsies where VRAM contents if it's not currently used will be swapped out to the normal RAM as you do not need to access it directly.

Model Choice

Another consideration you must take into account is how models are trained, and by who and how, some models have builtin in biases OR model parameter output restrictions, many people love deepseek however they do not realise that deepseek contains "naughty word restrictions" and "concepts" I'll let you dig into why that is, and why it is a problem but involves certain foreign governments in the process of specifying the advesarial data when it was trained, try our different models and also be mindful most of the top models are 'safe' but be careful with models which on on HuggingFace the really "unknown models" may contain malware you have been warned, stick to Mistral, Deepseek, LLama, Codestral, Mixtral or, others including Dolly. Steer clear of small amounts of "stars and review" models to protect yourself from the malware risks however if you have a good antimalware platform it 'should' detect packed data which might be malicious, please as always be careful what you download and where. Model choices I use for me I use both Mixtral and Mistral as they are both "mixed use models", however coding specific models may perform much better with coding related tasks. To get started here is couple of choices to try. Mistral coding; https://lmstudio.ai/models/mistralai/codestral-22b-v0.1 Mistral mistral improved: https://lmstudio.ai/models/mistralai/mistral-small-3.2 Good mixed Balance: https://lmstudio.ai/models/mistralai/mistral-nemo-instruct-2407 Mistral Mixed Experts: https://model.lmstudio.ai/download/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF Deepseek: https://lmstudio.ai/models/deepseek/deepseek-r1-0528-qwen3-8b

Interesting notes

LLMs do have emergent capabilities most people may not know or understand, some of the ideas I've given are only a small subset of what can be done, while they do hallucinate, and make errors as well as lie, models can be prompted into a limiting specification for alignemnt allowing for them to be made more accurate, however you really cannot trust them when they do calculations as they make errors, while that maybe the case, IF you have sufficient knowledge about the subject in question, you should be able to look at it and go "Oh thats not right" One other word of caution because models have been trained on effectively the entire internet they do also have other undesirable traits, one of those, giving you links to malicious websites, and other security threats, so always check the URLs indirectly do not copy paste the links as they can potentially be contaminated, it's what happens when these models are trained on everything and there is just not enough advesarial examples to prevent them from providing either errornous, dangerous or risky information.

Using LM studio

When you first start up LM studio you will be greeted with an interface, Here you will set the types of settings you want, On the left is the "chat box" currently highlighted below that the developer panel you will only see these advanced settings in Poweruser mode, or developer mode because I love control I use developer mode it provide a lot more precise control, during user mode you get a very simple interface interface configuration complexity goes from Users -> Power User -> Developer Each is increasing in complexity than the previous but to be able to search for the "model you want" Power user is required.

Below the chat window is Developer terminal the green "terminal icon" as a Power user you probably won't use this, but if you are learning to code running this and then setting up python can be a useful less in JSON queries and the responses LM studio provides VScode or other IDEs can do all this for you.

Finding models to run In LM studio the purple icon on the left the 'search magnifying glass" is where you look for models to run. As you can see there is many models at the top, safe picks are most often the safest, if it's a staff pick it's not only good but "safer" than others, also pat attention to the icon that says "FULL GPU OFFLOAD" if you have an RTX GPU or other GPU GPU offloading speeds up token generation significantly you can see my RTX 3090 can pump out 67.89tk/s tokens a second. making this very viable, if you have a RTX 5090 or RTX 5080 even more so, or an RTX 4090. Download options is where you can choose quantization for thje model you want, once you choose it simply hit the "download button" green button on the left hand side.

Lastly we need to configure and make sure we enable full GPU offloading from the power user options. This is all your advanced configuration option and where you can control the optimization of models, before you load a model tick the boxes with "K cache quantization" on and V cache optimization on" Choose a quatization that works wel for you, feel free to experiment, F16 is as discussed above Floating point 16bit or "half precision" but if you need to save more memory go lower, context memory is also another consideration, once context memory is "full indicated" by the bottom left side says 11.3% is how much context memory is used for this session, so when you are solving problems, with LM studio it's best to take notes to keep track of your ideas if you are bouncing them back and forther, one you refine your prompt sufficiently create a new "window chat" provide the specification of your refined idea, and continue refining it, to best make use of context memory efficiently.

There is safety guard rails to ensure there is suffcient resources available for the windows machine or Linux for the operating system to keep functioning if you disble the safety guard rails you could crash your computer. Those guard rails are there for a reason, you can change it between Relaxed, Balanced and Strict, I find balanced is good, but if you are memory constrained feel free to set it to Strict. Accessing this is done from the bottom left hand corner with the "gear cogs icon"

Mission Control and App configuration settings, you can also change the style of the return prompts and theme, for me this is a dark theme I love dark themes as they are great for code highlighting.

That's all there is to it, enjoy your local large language model and giving a middle finger to AI companies who scrap your chats and interactions to make their models better, I personally prefer to use Mistral as their models are the least, whats the word malicious? As they are French company more likely to conform to ethical parameter settings that are reasonable without being a nanny.

Ironhide Rhino Systems