The technology behind local AI.

This subpage is for developers, IT decision-makers and technically interested readers. It explains the building blocks behind local AI: models, GPUs, inference, Whisper, RAG, dashboards, process management and hosting.

Looking for the shorter business explanation? Go back to Local AI for small businesses.

A local AI solution is not just a model, but a stack.

Reliability does not come from the language model alone. It comes from the combination of hardware, drivers, inference runtime, application layer, vector store, logging, process management, security and maintainability.

OS

Ubuntu Server LTS (22.04 / 24.04)

For AI workloads, stability matters more than the newest feature. Ubuntu LTS has the driver support and long lifecycle that fit infrastructure intended to run for years.

apt · systemd · ufw / nftables · unattended-upgrades
GPU stack

NVIDIA driver + CUDA + cuDNN — matched

The most common cause of issues in local AI stacks is a mismatch between the driver, CUDA version and the version expected by PyTorch, vLLM or llama.cpp. We define one verified combination and pin it.

By default, we enable nvidia-persistenced to avoid driver startup latency during inference calls.

nvidia-driver · CUDA 12.x · cuDNN · persistence-mode
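
As an illustration of how such a combination can be pinned on Ubuntu (the package versions below are placeholders, not a recommendation):

    # Illustrative only: exact package names depend on the verified combination.
    sudo apt-mark hold nvidia-driver-550 cuda-toolkit-12-4
    # Keep the driver initialised between requests to avoid cold-start latency.
    sudo systemctl enable --now nvidia-persistenced
    # Verify that the driver and CUDA runtime the stack sees actually match.
    nvidia-smi
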
Inference

vLLM for throughput, llama.cpp for flexibility

vLLM is suitable for many parallel requests, continuous batching and tensor parallelism across multiple GPUs. llama.cpp is strong for GGUF models, flexible quantisation and smaller servers.

vLLM · llama.cpp · AWQ · GPTQ · GGUF
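
A sketch of how each runtime is typically started; the model names, paths and ports are illustrative:

    # vLLM: OpenAI-compatible server with continuous batching; the model name
    # and tensor-parallel degree are illustrative.
    python -m vllm.entrypoints.openai.api_server \
        --model Qwen/Qwen2.5-14B-Instruct-AWQ --tensor-parallel-size 2

    # llama.cpp: a GGUF model on a smaller server; the path is a placeholder.
    ./llama-server -m ./models/model-q4_k_m.gguf --ctx-size 8192 --port 8080
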
Models

Open models selected per task

For Dutch documents and translation, Qwen, Mistral and Gemma often work well. The choice depends on context length, VRAM, desired latency and output quality.

For speech we use Whisper. On regional Dutch audio, a different Whisper variant or fine-tune can sometimes outperform what standard benchmarks suggest.

Qwen · Mistral · Gemma · Whisper
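
As a rough illustration of the VRAM side of that trade-off, a sketch with shape numbers modelled on a Llama-2-13B-like architecture (assumptions, not measurements):

    # Back-of-the-envelope VRAM estimate; shape numbers are assumptions and
    # real runtimes add overhead on top.
    params = 13e9
    bits_per_weight = 4.5                     # Q4-ish quantisation, mixed precision
    weights_gb = params * bits_per_weight / 8 / 1e9           # ~7.3 GB

    layers, kv_heads, head_dim, dtype_bytes = 40, 40, 128, 2  # fp16 KV cache
    kv_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes   # K and V
    kv_gb = kv_per_token * 8192 / 1e9                         # ~6.7 GB at 8k context

    print(f"weights = {weights_gb:.1f} GB, KV cache = {kv_gb:.1f} GB")
    # Together roughly 14 GB: why "up to 13B Q4/Q6" fits the 1x 24 GB entry level below.
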
Web layer

nginx reverse proxy + Python/Node.js services

Inference servers are not exposed directly to the public internet. nginx handles TLS, routing, buffering and streaming. Behind it run FastAPI, Flask or Node.js services.

Where public reachability is needed, we prefer tunnels or reverse proxies over open ports.

nginx · FastAPI · Node.js · SSE streaming
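
A minimal sketch of the streaming part of such an nginx configuration; the upstream address and path are placeholders:

    # nginx location for a streaming (SSE) inference endpoint.
    location /v1/ {
        proxy_pass http://127.0.0.1:8000;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;        # stream tokens instead of buffering the reply
        proxy_read_timeout 300s;    # allow long-running generations
    }
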
Management

systemd services with restart policies

Every component gets its own systemd service with clear dependencies, restart conditions and logging. After a reboot, the stack comes back in the right order.

systemd units · journalctl · Restart=on-failure
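
An illustrative unit file for an inference service; the names and paths are placeholders:

    # /etc/systemd/system/inference.service (illustrative; names are placeholders)
    [Unit]
    Description=Local inference server
    After=network-online.target nvidia-persistenced.service
    Wants=network-online.target

    [Service]
    User=ai
    ExecStart=/opt/ai/venv/bin/python -m vllm.entrypoints.openai.api_server --model /opt/ai/models/current
    Restart=on-failure
    RestartSec=5

    [Install]
    WantedBy=multi-user.target
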
Tuning

Small details make a big difference

Checking PCIe lanes, stabilising GPU clocks, tuning batch size to real VRAM usage, setting watchdogs and determining hallucination thresholds for transcription. These are project details that make the difference between a demo and production.

nvidia-smi · pcie gen check · batch tuning
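
A few of the checks we mean, as plain nvidia-smi invocations (the clock values are examples):

    # Did the GPU negotiate the PCIe generation and width the slot should provide?
    nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv

    # Lock GPU clocks to a fixed range for predictable latency.
    sudo nvidia-smi -lgc 1500,1800

    # Watch real VRAM usage while tuning batch size, refreshing every second.
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
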

Choosing hardware based on task, volume and latency.

Hardware is a means, not the goal. First we determine what the AI must do, how many users are active at the same time, how large the context must be and which response time is acceptable.

Level I

Entry level

One specific task, low to medium volume: transcription, local chatbot for a small team, document summarisation.
  • GPU: 1× 24 GB
  • Model: up to 13B Q4/Q6
  • Users: 1–3
  • Platform: Workstation
Level III

Cluster

Multiple nodes behind a load balancer. Redundancy, failover, horizontal scalability and critical throughput.
  • GPU: datacenter-class
  • Model: 70B+ multi-instance
  • Users: 20+ / HA
  • Platform: Rack cluster

Document questions require retrieval, not just a larger model.

For company knowledge we often use RAG: documents are split, embedded, indexed and relevant passages are retrieved when a question is asked. The model then answers based on your own sources instead of general model knowledge.

01

Ingest

PDF, DOCX, TXT, EML and internal documentation are cleaned and split into usable chunks.

02

Embeddings

Text blocks receive vector representations that make semantic search possible, even with different wording.

03

Retrieval

When a question is asked, the application retrieves relevant passages and gives them to the language model as context.

04

Sources

Answers can refer to a document, paragraph or internal source, so verification remains possible.
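
A minimal retrieval sketch with ChromaDB (one of the vector stores mentioned on this page); the collection, chunk and question are placeholders:

    import chromadb

    client = chromadb.PersistentClient(path="./rag-index")   # placeholder path
    docs = client.get_or_create_collection("manuals")

    # Ingest: add a cleaned chunk; ChromaDB embeds it with its default model.
    docs.add(
        ids=["manual-1#p4"],
        documents=["Export to PDF via File > Export. Requires version 3.2 or later."],
        metadatas=[{"source": "manual-1.pdf", "paragraph": 4}],
    )

    # Retrieval: fetch the most relevant passages for a question; the metadata
    # travels with each hit, which is what makes source references possible.
    hits = docs.query(query_texts=["How do I export a PDF?"], n_results=3)
    for doc, meta in zip(hits["documents"][0], hits["metadatas"][0]):
        print(meta["source"], "par.", meta["paragraph"], "->", doc)
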

Technical choices from real projects.

Forensic audio transcription — Dutch dialects

● In production

A transcription service for sensitive audio material where Dutch dialect and accent recognition is crucial. The pipeline runs fully locally and processes recordings in batch.

Model selection was not driven by benchmarks alone. On regional Dutch audio, another Whisper variant performed better in practice. Fine-tuning and post-processing made the difference.

Platform: Ubuntu LTS
GPU: 24 GB class
Model: Whisper + tuning
Runtime: faster-whisper
Mode: offline batch
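
A hedged sketch of what the transcription step can look like with faster-whisper; the model size, language setting and threshold value are assumptions, not the production configuration:

    from faster_whisper import WhisperModel

    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, info = model.transcribe(
        "recording.wav",            # placeholder file
        language="nl",
        vad_filter=True,            # skip long silences in field recordings
        no_speech_threshold=0.6,    # one knob behind "hallucination thresholds"
    )
    for seg in segments:
        print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text}")
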

Bulk document translation — multilingual to Dutch

● In production

A translation service that periodically checks a network share for new documents. Non-Dutch documents are automatically translated and prepared.

Each document type requires a different extraction and reconstruction strategy. Those edge cases determine whether a solution becomes usable for end users.

Formats: eml / pdf / docx / txt / xlsx
Runtime: llama.cpp / vLLM
Service: systemd
Interface: dashboard
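
The watch-and-translate pattern in miniature; the paths, interval and translate() helper are hypothetical placeholders:

    import time
    from pathlib import Path

    INBOX = Path("/mnt/share/inbox")          # placeholder mount point
    ALLOWED = {".eml", ".pdf", ".docx", ".txt", ".xlsx"}
    seen: set[str] = set()

    def translate(path: Path) -> None:
        # Hypothetical placeholder: extract text, translate to Dutch, rebuild layout.
        print("would translate", path.name)

    while True:
        for f in INBOX.glob("*"):
            if f.suffix.lower() in ALLOWED and f.name not in seen:
                seen.add(f.name)
                translate(f)
        time.sleep(60)                        # polling interval is a placeholder
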

E-discovery assistant — RAG chatbot on manuals

● In production

A RAG-based chatbot that helps legal-technical end users with questions about software manuals, release notes and internal work instructions.

In addition to retrieval, the persona layer matters: answers must match the language, workflows and knowledge level of the organisation.

Type: RAG + persona
Vector store: ChromaDB / alternative
Interface: Web chat + API
Mode: on-prem

An AI service without a dashboard is difficult to manage.

Every project includes a custom dashboard for what matters in that project: queue, progress, errors, uptime, GPU load, storage, latency, volumes and user actions.

A

Project-specific

A transcription pipeline needs different metrics than a RAG chatbot or translation service.

B

Operationally useful

Not just graphs, but concrete information a user or administrator can act on.

C

Integrations

Existing tooling can be integrated via APIs, so not everything runs separately.

D

Own environment

The dashboard runs on your infrastructure or on private Gold IT infrastructure, not as generic SaaS.
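
As a hedged sketch of what "operationally useful" can mean in code: a small FastAPI status endpoint a dashboard could poll. The fields and zero values are illustrative, not a fixed Gold IT Services API:

    import subprocess
    from fastapi import FastAPI

    app = FastAPI()

    @app.get("/status")
    def status() -> dict:
        # GPU load straight from nvidia-smi; queue and error counts would come
        # from the project's own pipeline state (illustrative zeros here).
        gpu = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
             "--format=csv,noheader"],
            capture_output=True, text=True,
        ).stdout.strip()
        return {"queue_depth": 0, "errors_last_hour": 0, "gpu": gpu}
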

Enough technology?

The main page summarises local AI for managers and decision-makers. Use that page for the business case, benefits and concrete next step.

Back to Local AI for small businesses

From AI service to manageable solution

That is why Gold IT Services builds a dashboard for every AI project, covering queues, errors, uptime, GPU load, latency, volumes and user actions.

View a live example of an operational Gold IT Services dashboard.

Do you want the technical feasibility assessed?

Mark
Gold IT Services • Linschoten
E-mail info@golditservices.nl
Phone 06 49 75 54 50
Location Linschoten, Utrecht
Response within one working day