The LLM Revolution Was a Trillion-Dollar Mistake: companies thought they could be profitable with the cloud, but local might actually do it, especially when data sovereignty is involved.

Adam Longmire
Feb 6
9 min read

Introduction

I am sure many companies have been "using AI" to enhance their businesses, but the reality is that off-premises cloud AI carries real costs, both in token charges and in its cost to the environment. A small business can fully leverage a custom-built "AI accelerator" system using off-the-shelf parts and common open-standard libraries.

There is a misconception that AI and LLMs are completely useless. They are, if you lack control over the system, but there are numerous ways to take that control back, and one of those is local LLMs. These are exactly what the name says: large language foundation models hosted locally, which can do many different tasks and can match, or sometimes exceed, the most expensive AI plans.

So where do you, as a company, start with locally hosted LLMs? There are several ways to get started. One is to host an LLM on a single system, specifically a workstation. Another is to host a single LLM on a shared server instance. The last is to build a semi-distributed, data-center-like architecture on site.

Cloud is not always the solution

Cloud computing platforms are nice when operational expenditure (OpEx) does not exceed the initial capital expenditure (CapEx) of owning the hardware. But many token-based "plans" for large language models create significant bottlenecks in how you can use an LLM, and they also remove general control over your data.

While cloud computing can offer benefits such as high availability, this is not always a "good thing". We have had multiple cases where a single piece of code has taken down entire networks because of insufficient redundancy and single points of failure, and there have been at least three or four occasions where Amazon AWS was taken offline, crippling large parts of the internet with it. On-premises AI inference is, in general, potentially safer, especially if you leverage the best of hybrid cloud, keeping your inference engine on site and some data access and assets off premises. Where you need "full" control over your business or your company's data, running a cloud instance is not the best choice.

Yes, hosting it yourself places more responsibility on your own hardware. But with a well-designed, robust network, sensible hardware choices, disk redundancy and reliable power delivery, a local AI workload can easily outperform a cloud-based one, and the initial capital expenditure can potentially pay for itself fairly quickly. Some cloud services, such as high-performance AWS instances, can cost $8,000+ USD per month. For about that much you could have bought a personal cloud unit and had it paid off within the month.
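The CapEx versus OpEx comparison above can be sketched as a simple break-even calculation. This is a minimal sketch, not a financial model: the function name and the second set of figures (a $30,000 server with $500 per month of running costs) are hypothetical; only the $8,000 per month cloud figure comes from the text.

```python
import math


def breakeven_months(capex_usd: float, cloud_monthly_usd: float,
                     onprem_monthly_usd: float = 0.0) -> float:
    """Months until owning the hardware becomes cheaper than renting cloud compute."""
    monthly_saving = cloud_monthly_usd - onprem_monthly_usd
    if monthly_saving <= 0:
        # On-prem running costs match or exceed the cloud bill: it never pays off.
        return math.inf
    return capex_usd / monthly_saving


# Case from the text: hardware bought for roughly one month of cloud spend.
print(breakeven_months(capex_usd=8_000, cloud_monthly_usd=8_000))

# Hypothetical larger build with power and maintenance costs included.
print(breakeven_months(capex_usd=30_000, cloud_monthly_usd=8_000,
                       onprem_monthly_usd=500))
```

Plugging in your own quotes for hardware, power and staff time is what makes the number meaningful; the sketch only shows the shape of the calculation.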

Companies want you to rent compute from them and then charge you a premium for access to your own data. Doing so locks you into their ecosystem, which is potentially inflexible and can see prices skyrocket as a result. Your AI-powered app may be cheap initially, but issuing thousands of token requests every day across multiple team members will quickly add up to a premium that can cripple your business. With on-premises hosting you do not have this worry.

Hardware

In most cases the hardware for an LLM is in fact commodity level. The main demand, as with many AI workloads, is a large amount of memory. To achieve reliability levels similar to a normal cloud-hosted LLM, you design in multiple layers of redundancy. In return, tokens are NOT limited when you have access to your own hardware: you can use as many tokens as the hardware can process. So what hardware do you need to run a locally hosted LLM?

CPU - A high-performance server CPU is typically desired, but you can easily get around that requirement by using a standard desktop CPU. Desktop parts can do most of what a server-grade part can do. They are often cut down, but can still provide acceptable levels of generative performance.
RAM - This comes in two parts: the GPU's VRAM and the system RAM. In an LLM, memory is consumed by the model weights and by the context memory, which holds the current conversation with the LLM. The overall number of "chats" in use matters too: more chats take up more space. A production locally hosted system may need considerable memory (>2TB). While this may be seen as a problem and the initial CapEx might be high, the benefits of hosting your own LLM are considerable.
Storage - LLMs can be costly in terms of space and resource utilization. In a shared "context", where clients share a single "chat space", this kind of overuse is possibly mitigated, but giving each user their own "context space" can quickly consume resources.
Redundant power delivery - Robust power delivery is critically important to a local LLM host platform in the event of loss of mains power.
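To make the RAM point concrete, here is a rough sizing sketch. It assumes a standard transformer where each token stores one key and one value vector per layer (the KV cache), and it ignores activations and runtime overhead. The model shape used in the example (32 layers, 8 KV heads, head dimension 128, 8 billion parameters) is hypothetical, chosen only to show the arithmetic.

```python
def weights_gib(n_params_billion: float, bits_per_param: int) -> float:
    """Approximate memory needed for the model weights, in GiB."""
    return n_params_billion * 1e9 * bits_per_param / 8 / 2**30


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, n_chats: int = 1,
                 bytes_per_value: int = 2) -> float:
    """Approximate context (KV cache) memory, in GiB.

    Each token stores two tensors (key and value) per layer.
    bytes_per_value=2 assumes 16-bit cache entries.
    """
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens * n_chats / 2**30


# Hypothetical 8B-parameter model quantized to 4 bits per weight.
print(f"weights:  {weights_gib(8, 4):.2f} GiB")

# One chat with an 8192-token context, then sixteen concurrent chats.
print(f"1 chat:   {kv_cache_gib(32, 8, 128, 8192):.2f} GiB")
print(f"16 chats: {kv_cache_gib(32, 8, 128, 8192, n_chats=16):.2f} GiB")
```

The context memory scales linearly with both context length and the number of simultaneous chats, which is why per-user context space adds up so quickly on a shared server.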

Custom LLM Server unit specifications

So what would an actual custom LLM or AI inference server look like in terms of hardware? It depends. If you purchase a system direct from Nvidia the cost could be considerable, but you can start getting reliable local LLM performance with as little as 64GB of DDR4 or DDR5 RAM. A single consumer GPU with 24GB of VRAM can be used as well, which provides a fairly large context window for working with tasks. If you decide to use server-grade GPUs they will require considerable cooling: Nvidia server GPUs do not come "loaded" with their own cooling, and servers typically rely on ultra-high-volume, high-RPM fans. (Careful, they bite!)

A full scale LLM server

CPU - AMD Ryzen Threadripper PRO 7985WX 64-Core sTR5 Unlocked CPU Processor
RAM - Corsair WS 256GB (8x 32GB) DDR5 ECC RDIMM 5600MHz CL40 AMD Desktop Memory
GPU - NVIDIA RTX 6000 96GB GDDR7 Professional Video Card - Blackwell Server Edition
Motherboard - ASUS Pro WS TRX50-SAGE WIFI sTR5 CEB Motherboard
NIC - QNAP 2-Port (10GBASE-T) 10GbE Network Expansion Card (bond the two ports together to provide up to 20Gbps of aggregate raw bandwidth)

Software

One of the biggest benefits of local LLMs is that you have sole authority over how they run and operate. You can tweak every setting and every configuration to improve performance and manage resources, and you can apply your own security engineering and other protective measures. In the most basic sense you can use off-the-shelf solutions such as Ollama and llama.cpp, but you can also leverage the standard APIs provided by many LLM and AI model hosting platforms, one of those being Hugging Face.
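As a minimal sketch of what "off the shelf" looks like in practice, the snippet below talks to a locally running Ollama server over its HTTP API using only the Python standard library. It assumes Ollama is listening on its default port (11434) and that a model has already been pulled; the model name "llama3" is an assumption, so substitute whatever you have installed.

```python
import json
import urllib.request

# Default address of a local Ollama server; adjust if yours differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a non-streaming generate request for the local server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3") -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Summarise why data sovereignty matters in one sentence.")` returns the model's reply as a string. Nothing in the request leaves your own machine, which is the whole point.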

Single agents

This is mostly what ChatGPT is: a single instance with a context memory for a particular user. Single agents can perform numerous tasks such as programming, poem writing, debate training and more. They can often have a "knowledge" database, and they can also employ augmentation capabilities such as DAG (Document Augmented Generation) and RAG (Retrieval Augmented Generation).

Single agents are "usually" single-user systems. On the backend, though, less expensive queries are sometimes "routed" to lower-power instances to save compute resources. Agents can be orchestrated by routing users via Docker containers or other containerization infrastructure. Again, all of this can be set up locally on a server: Docker instances are fired up and passed to connecting endpoints such as a business's workstations.
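The routing idea can be sketched as a small function that decides which backend a query goes to. This is an illustrative heuristic only: the backend names, local URLs, keyword list and word limit are all hypothetical, and a real deployment might route on token counts or a classifier instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str
    url: str


# Hypothetical local endpoints, e.g. two containers on the same server.
SMALL = Backend("small-model", "http://localhost:8001")
LARGE = Backend("large-model", "http://localhost:8002")

# Words that hint a query needs the more capable (and more expensive) instance.
HEAVY_KEYWORDS = ("refactor", "analyse", "analyze", "prove", "debug")


def route(query: str, max_cheap_words: int = 40) -> Backend:
    """Send short, simple queries to the small instance and the rest to the large one."""
    text = query.lower()
    too_long = len(text.split()) > max_cheap_words
    has_heavy_keyword = any(keyword in text for keyword in HEAVY_KEYWORDS)
    return LARGE if (too_long or has_heavy_keyword) else SMALL


print(route("What time zone is the Sydney office in?").name)
print(route("Please debug this stack trace from the billing service").name)
```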

Cooperative Agent use (Multi-agents)

Cooperative multi-agents are those that can work together: individual agents "create instances" of themselves to serve particular tasks, and those tasks allow them to solve problems. Uses include "code inspection" such as unit-test analysis, or a dual-task system in which one agent pushes updates and another checks those updates for consistency. Multi-agents can also include "team cooperative members", that is, people in a team who speak to an agent as a group, bouncing ideas back and forth. This is also known as a "shared context environment".

Agents of this category also share traits with human social systems. Too many agents operating on a specific task can cause "noise" and run the risk of a non-convergent solution, especially if planning and executing the task requires a specific set of steps and operations to complete it correctly. These agents also have to take into consideration latency, reliability, performance and efficiency. If an agent spends a large amount of time taking up vital resources needed by a "workflow", other tasks running on the same instance may suffer as a result.
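One simple guard against a non-convergent pair of agents is a hard cap on the number of rounds. The sketch below shows a dual-task loop in which one agent proposes and another checks. The `propose` and `check` callables are hypothetical stand-ins for calls to your local models; here they are plain functions so the control flow is easy to follow.

```python
from typing import Callable, Optional


def run_with_checker(propose: Callable[[str, Optional[str]], str],
                     check: Callable[[str], Optional[str]],
                     task: str,
                     max_rounds: int = 3) -> tuple[str, bool, int]:
    """Alternate between a proposing agent and a checking agent.

    check() returns None when it accepts the draft, otherwise feedback text.
    Returns (last draft, accepted?, rounds used).
    """
    feedback: Optional[str] = None
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = propose(task, feedback)
        feedback = check(draft)
        if feedback is None:
            return draft, True, round_no
    # Budget exhausted: stop instead of looping forever and hogging the instance.
    return draft, False, max_rounds


# Stand-in agents: the proposer only gets it right after receiving feedback.
def demo_propose(task: str, feedback: Optional[str]) -> str:
    return "draft v1" if feedback is None else "draft v2"


def demo_check(draft: str) -> Optional[str]:
    return None if draft == "draft v2" else "missing unit tests"


print(run_with_checker(demo_propose, demo_check, "push config update"))
```

When the cap is hit, the caller can escalate to a person rather than letting the agents keep consuming resources that other workflows need.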

Agent Resources

LLMs and AI agents in general can be used to access resources, and one of the predominant ways this is done is via tool usage. Tools allow LLMs to go beyond text and generative capabilities and interact directly with disparate systems, including databases, file shares, documents, systems and cloud platforms. LLMs can be used to orchestrate and control many of these elements. For example, a financial "analysis agent" may be asked to project a company's costs, which it can only do by calling tools to fetch and process the data.

Tools can be made reusable, just as in all good software design paradigms. Making tools reusable, and even setting up cached quick-access resources, can be an efficient way of working: why not temporarily store cached data in the easiest-to-access region of storage if it is constantly used?
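A reusable, cached tool can be sketched with nothing more than the standard library. The registry, the decorator name and the `lookup_customer` tool are all hypothetical; the sleep stands in for a slow database or file-share read.

```python
import functools
import time

# Hypothetical registry mapping tool names to callables an agent may invoke.
TOOLS = {}


def tool(fn):
    """Register a function as a tool under its own name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
@functools.lru_cache(maxsize=128)
def lookup_customer(customer_id: int) -> str:
    """Pretend to fetch a record; repeated calls are served from memory."""
    time.sleep(0.01)  # stands in for a slow backend read
    return f"customer-{customer_id}"


# The agent runtime would dispatch by name, like this:
print(TOOLS["lookup_customer"](7))
print(TOOLS["lookup_customer"](7))  # second call hits the cache
print(lookup_customer.cache_info())
```

An in-memory cache like this only suits data that changes rarely; for anything that must stay fresh you would add an expiry or skip the cache entirely.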

Common agent resources

Databases - Graph, SQL, NoSQL, Document, Time Series, and others.
Document Knowledge bases - Used in RAG retrieval for some tasks.
Knowledge repositories and other resources
Tool use and access
Augmented browsing - Downloading data from websites and analysing that data. This is also a type of retrieval augmentation.
File systems, code bases, and other resources
Audio, video and images - If an agent is capable of handling these, it can also access this information.

The amount of resources an agent may have access to can be considerable, which is why it is imperative that security be a deeply considered factor. Data should not be completely "open to full access"; graduated levels of access should be employed at all times. Otherwise, what occurred recently, where a database was deleted by an LLM, can happen to you. To prevent unauthorized changes you can employ a number of protection measures, including dedicated LLM roles in databases and access control lists that permit or deny specific operations.

For example, a database should never be directly exposed to operations that lead to destruction of application data. A "layered" approach, borrowed from the OSI model, builds a set of layered controls and accesses on the system and ensures that the risks of operations a model may take are both constrained and logged.

Access Control Structure

For example, an agent managing Infrastructure as Code (IaC) should, for any instance of an IaC platform, perform checks and verification at each phase. The risk levels would look like a stacked diagram:

Low risk: Operations that would lead to no harm or minimal harm if performed incorrectly, perhaps minor annoyances.
Medium risk: Operations that can lead to data modification or corruption.
High risk: Operations that could lead to complete data destruction, deletion, or failure of critical infrastructure. In IaC this may include modifying existing rules in ways that reduce security or the protection of data.
Very high risk: Operations that could lead to physical damage, privacy violations or mass data compromise. These would likely be disallowed by any sensible company. In the context of IaC this might be automatic deployment of firewall rules that could lead to complete network exposure.

Configurations done by agents should be highly restricted. Depending on the field of application, permissions and policy structures should be hardened to disallow violations of data or practice, ensuring agents remain stable, reliable and secure. These are problems many companies do not realize they have, and there are many ways to enforce such controls, including creating agents with varying access control policies enforced on a per-system, per-service or per-application basis. This enforcement allows tight, granular control over what can and cannot be accessed and how it must be accessed, including agents that must request higher access to perform certain operations.
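The four risk tiers and the "request higher access" idea can be sketched as a small policy gate that every agent operation must pass through, with each decision logged. The operation names, the agent name and the approval flag are hypothetical; in practice this would sit alongside database roles and ACLs rather than replace them.

```python
import logging
from enum import IntEnum

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-policy")


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    VERY_HIGH = 4


# Hypothetical operations mapped onto the tiers described above.
OPERATION_RISK = {
    "read_config": Risk.LOW,
    "update_record": Risk.MEDIUM,
    "drop_table": Risk.HIGH,
    "deploy_firewall_rules": Risk.VERY_HIGH,
}


def authorize(agent: str, operation: str, agent_max: Risk,
              human_approved: bool = False) -> bool:
    """Decide whether an agent may run an operation, and log the decision."""
    # Unknown operations are treated as the strictest tier.
    risk = OPERATION_RISK.get(operation, Risk.VERY_HIGH)
    if risk == Risk.VERY_HIGH:
        allowed = False            # never automated, regardless of approval
    elif risk <= agent_max:
        allowed = True             # within the agent's standing permissions
    else:
        allowed = human_approved   # elevated access must be granted explicitly
    log.info("agent=%s op=%s risk=%s allowed=%s", agent, operation, risk.name, allowed)
    return allowed


print(authorize("iac-agent", "read_config", Risk.MEDIUM))
print(authorize("iac-agent", "drop_table", Risk.MEDIUM))
print(authorize("iac-agent", "drop_table", Risk.MEDIUM, human_approved=True))
```

Defaulting unknown operations to the highest tier means a newly added tool is blocked until someone classifies it, which is the safer failure mode.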

Here is an example of what a query might look like and how tools might be used.

Query from user: "Create a projected analysis of the current costs with CMPYN (Company Name), project the next possible cost cycle using fractional differentiation and calculate the linear regression on that fraction analysis."

Response from LLM agent:

The LLM reasons: "Okay, I need to get the current stock value for CMPYN, then I need to use a tool to perform fractional differentiation on the rate of change." This will likely require a call to a tool that does this; in the case of Python, for example, a library such as fractdiff together with matplotlib. The agent will go ahead and look up the company name, get the current data points, and run linear regression over the time period. The LLM's response also includes a graph of the projected line showing where the value of the stock is likely to be the next day.
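The calculation behind such a tool can be sketched in plain Python. This is not the library named above: it is a hand-rolled fixed-window fractional difference using the standard binomial-expansion weights of (1 - B)^d, followed by an ordinary least-squares line fit. The price series is synthetic, and the order d = 0.5 and window of 4 are arbitrary choices for illustration.

```python
def fracdiff_weights(d: float, n: int) -> list[float]:
    """First n weights of the fractional difference operator (1 - B)^d."""
    weights = [1.0]
    for k in range(1, n):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


def fracdiff(series: list[float], d: float, window: int) -> list[float]:
    """Fixed-window fractional differencing; output is window-1 points shorter."""
    w = fracdiff_weights(d, window)
    return [sum(w[k] * series[t - k] for k in range(window))
            for t in range(window - 1, len(series))]


def linear_fit(ys: list[float]) -> tuple[float, float]:
    """Least-squares (slope, intercept) for ys against x = 0, 1, 2, ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Synthetic closing prices, invented for this example.
prices = [100.0, 101.5, 101.0, 102.2, 103.1, 102.8, 104.0, 105.3]

diffed = fracdiff(prices, d=0.5, window=4)
slope, intercept = linear_fit(diffed)
next_value = slope * len(diffed) + intercept  # extrapolate one step ahead
print(f"slope={slope:.4f} intercept={intercept:.4f} next={next_value:.4f}")
```

With d = 1 the weights reduce to an ordinary first difference, which is a handy sanity check. The agent's job is to pick the tool, feed it the retrieved data and plot the fitted line, not to do the arithmetic itself.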

CPU based Inference

CPU-based inference is the simplest to set up, as it typically does not require any special libraries. However, it can severely limit the performance of your inference workloads, so CPUs are mostly good for developers testing out LLMs, or for LLMs that currently lack GPU acceleration. CPUs are not well suited to LLM inference, and technically GPUs are not either: both use a considerable amount of power for what they do. In CPU-based inference you can also only use system RAM as the storage medium for context memory, which reduces overall performance.

GPU based inference

GPU-powered inference is often superior because it does not require as much system RAM: the GPU has its own large pool of fast memory (VRAM) already available. This allows faster inference by running the MAC (multiply-accumulate) operations at scale, taking advantage of the GPU's parallel performance.
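To show what a MAC operation actually is, here is a deliberately slow, pure-Python matrix-vector product, the core operation inside every layer of an LLM. Each pass through the inner loop is one multiply-accumulate. The tiny matrix is made up for the example; a GPU runs enormous numbers of these in parallel instead of one at a time.

```python
def matvec(weights: list[list[float]], x: list[float]) -> list[float]:
    """Multiply a weight matrix by an input vector, one MAC at a time."""
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            acc += w * xi  # one multiply-accumulate (MAC)
        out.append(acc)
    return out


def mac_count(rows: int, cols: int) -> int:
    """Number of MACs needed for one rows x cols matrix-vector product."""
    return rows * cols


print(matvec([[1.0, 2.0], [3.0, 4.0]], [10.0, 1.0]))
print(mac_count(4096, 4096))
```

A single 4096 by 4096 weight matrix already needs 16,777,216 MACs per token, and a model has many such matrices per layer, which is why hardware built for parallel MACs makes such a difference.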

Agnostic Hardware operation

We do not always need specific GPUs or CPUs to do hardware-accelerated inference. There are a number of platforms that enable hardware-agnostic operation. One of those is Apache TVM, a machine learning compiler platform that allows ML workloads from commonly used libraries like TensorFlow and PyTorch to be compiled into a hardware-agnostic form, opening up wider ML applications beyond some of the restrictive choices given by GPUs and CPUs.

LLMs were meant to be a revolution, and they still can be, if done in the right manner: one that augments people but doesn't replace them. This is the future envisioned in great science fiction stories like Star Trek, and that's how AI should be used, not as a replacement but as a system that makes us better workers, people, thinkers and team members. The local LLM revolution is now. Learn about it and deploy it today: your data, your privacy, your control, your choice.

Ironhide Rhino Systems