The technology behind local AI.

This subpage is for developers, IT decision-makers and technically interested readers. It explains the building blocks behind local AI: models, GPUs, inference, Whisper, RAG, dashboards, process management and hosting.

Looking for the shorter business explanation? Go back to Local AI for small businesses.

A local AI solution is not just a model, but a stack.

Reliability does not come from the language model alone. It comes from the combination of hardware, drivers, inference runtime, application layer, vector store, logging, process management, security and maintainability.

OS

Ubuntu Server LTS (22.04 / 24.04)

For AI workloads, stability matters more than the newest feature. Ubuntu LTS has the driver support and long lifecycle that fit infrastructure intended to run for years.

apt · systemd · ufw / nftables · unattended-upgrades
GPU stack

NVIDIA driver + CUDA + cuDNN — matched

The most common cause of issues in local AI stacks is a mismatch between the driver, CUDA version and the version expected by PyTorch, vLLM or llama.cpp. We define one verified combination and pin it.

By default, we enable nvidia-persistenced to avoid driver startup latency during inference calls.

nvidia-driver · CUDA 12.x · cuDNN · persistence-mode
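
As an illustration of how such a combination can be pinned on Ubuntu (the package versions below are placeholders, not a recommendation):

    # Illustrative only: exact package names depend on the verified combination.
    sudo apt-mark hold nvidia-driver-550 cuda-toolkit-12-4
    # Keep the driver initialised between requests to avoid cold-start latency.
    sudo systemctl enable --now nvidia-persistenced
    # Verify that the driver and CUDA runtime the stack sees actually match.
    nvidia-smi
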
Inference

vLLM for throughput, llama.cpp for flexibility

vLLM is suitable for many parallel requests, continuous batching and tensor parallelism across multiple GPUs. llama.cpp is strong for GGUF models, flexible quantisation and smaller servers.

vLLM · llama.cpp · AWQ · GPTQ · GGUF
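
A sketch of how each runtime is typically started; the model names, paths and ports are illustrative:

    # vLLM: OpenAI-compatible server with continuous batching; the model name
    # and tensor-parallel degree are illustrative.
    python -m vllm.entrypoints.openai.api_server \
        --model Qwen/Qwen2.5-14B-Instruct-AWQ --tensor-parallel-size 2

    # llama.cpp: a GGUF model on a smaller server; the path is a placeholder.
    ./llama-server -m ./models/model-q4_k_m.gguf --ctx-size 8192 --port 8080
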
Models

Open models selected per task

For Dutch documents and translation, Qwen, Mistral and Gemma often work well. The choice depends on context length, VRAM, desired latency and output quality.

For speech we use Whisper. On regional Dutch audio, a different Whisper variant or fine-tune can sometimes outperform what standard benchmarks suggest.

Qwen · Mistral · Gemma · Whisper
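
As a rough illustration of the VRAM side of that trade-off, a sketch with shape numbers modelled on a Llama-2-13B-like architecture (assumptions, not measurements):

    # Back-of-the-envelope VRAM estimate; shape numbers are assumptions and
    # real runtimes add overhead on top.
    params = 13e9
    bits_per_weight = 4.5                     # Q4-ish quantisation, mixed precision
    weights_gb = params * bits_per_weight / 8 / 1e9           # ~7.3 GB

    layers, kv_heads, head_dim, dtype_bytes = 40, 40, 128, 2  # fp16 KV cache
    kv_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes   # K and V
    kv_gb = kv_per_token * 8192 / 1e9                         # ~6.7 GB at 8k context

    print(f"weights = {weights_gb:.1f} GB, KV cache = {kv_gb:.1f} GB")
    # Together roughly 14 GB: why "up to 13B Q4/Q6" fits the 1x 24 GB entry level below.
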
Web layer

nginx reverse proxy + Python/Node.js services

Inference servers are not exposed directly to the public internet. nginx handles TLS, routing, buffering and streaming. Behind it run FastAPI, Flask or Node.js services.

Where public reachability is needed, we prefer tunnels or reverse proxies over open ports.

nginx · FastAPI · Node.js · SSE streaming
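
A minimal sketch of the streaming part of such an nginx configuration; the upstream address and path are placeholders:

    # nginx location for a streaming (SSE) inference endpoint.
    location /v1/ {
        proxy_pass http://127.0.0.1:8000;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_buffering off;        # stream tokens instead of buffering the reply
        proxy_read_timeout 300s;    # allow long-running generations
    }
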
Management

systemd services with restart policies

Every component gets its own systemd service with clear dependencies, restart conditions and logging. After a reboot, the stack comes back in the right order.

systemd units · journalctl · Restart=on-failure
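
An illustrative unit file for an inference service; the names and paths are placeholders:

    # /etc/systemd/system/inference.service (illustrative; names are placeholders)
    [Unit]
    Description=Local inference server
    After=network-online.target nvidia-persistenced.service
    Wants=network-online.target

    [Service]
    User=ai
    ExecStart=/opt/ai/venv/bin/python -m vllm.entrypoints.openai.api_server --model /opt/ai/models/current
    Restart=on-failure
    RestartSec=5

    [Install]
    WantedBy=multi-user.target
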
Tuning

Small details make a big difference

Checking PCIe lanes, stabilising GPU clocks, tuning batch size to real VRAM usage, setting watchdogs and determining hallucination thresholds for transcription. These are project details that make the difference between a demo and production.

nvidia-smi · pcie gen check · batch tuning
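
A few of the checks we mean, as plain nvidia-smi invocations (the clock values are examples):

    # Did the GPU negotiate the PCIe generation and width the slot should provide?
    nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.width.current --format=csv

    # Lock GPU clocks to a fixed range for predictable latency.
    sudo nvidia-smi -lgc 1500,1800

    # Watch real VRAM usage while tuning batch size, refreshing every second.
    nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
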

Choosing hardware based on task, volume and latency.

Hardware is a means, not the goal. First we determine what the AI must do, how many users are active at the same time, how large the context must be and which response time is acceptable.

Level I

Entry level

One specific task, low to medium volume: transcription, local chatbot for a small team, document summarisation.
  • GPU: 1× 24 GB
  • Model: up to 13B Q4/Q6
  • Users: 1–3
  • Platform: Workstation
Level III

Cluster

Multiple nodes behind a load balancer. Redundancy, failover, horizontal scalability and critical throughput.
  • GPU: datacenter-class
  • Model: 70B+ multi-instance
  • Users: 20+ / HA
  • Platform: Rack cluster

Document questions require retrieval, not just a larger model.

For company knowledge we often use RAG: documents are split, embedded, indexed and relevant passages are retrieved when a question is asked. The model then answers based on your own sources instead of general model knowledge.

01

Ingest

PDF, DOCX, TXT, EML and internal documentation are cleaned and split into usable chunks.

02

Embeddings

Text blocks receive vector representations that make semantic search possible, even with different wording.

03

Retrieval

When a question is asked, the application retrieves relevant passages and gives them to the language model as context.

04

Sources

Answers can refer to a document, paragraph or internal source, so verification remains possible.
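
A minimal retrieval sketch with ChromaDB (one of the vector stores mentioned on this page); the collection, chunk and question are placeholders:

    import chromadb

    client = chromadb.PersistentClient(path="./rag-index")   # placeholder path
    docs = client.get_or_create_collection("manuals")

    # Ingest: add a cleaned chunk; ChromaDB embeds it with its default model.
    docs.add(
        ids=["manual-1#p4"],
        documents=["Export to PDF via File > Export. Requires version 3.2 or later."],
        metadatas=[{"source": "manual-1.pdf", "paragraph": 4}],
    )

    # Retrieval: fetch the most relevant passages for a question; the metadata
    # travels with each hit, which is what makes source references possible.
    hits = docs.query(query_texts=["How do I export a PDF?"], n_results=3)
    for doc, meta in zip(hits["documents"][0], hits["metadatas"][0]):
        print(meta["source"], "par.", meta["paragraph"], "->", doc)
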

Technical choices from real projects.

Forensic audio transcription — Dutch dialects

● In production

A transcription service for sensitive audio material where Dutch dialect and accent recognition is crucial. The pipeline runs fully locally and processes recordings in batch.

Model selection was not driven by benchmarks alone. On regional Dutch audio, another Whisper variant performed better in practice. Fine-tuning and post-processing made the difference.

Platform: Ubuntu LTS
GPU: 24 GB class
Model: Whisper + tuning
Runtime: faster-whisper
Mode: offline batch
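
A hedged sketch of what the transcription step can look like with faster-whisper; the model size, language setting and threshold value are assumptions, not the production configuration:

    from faster_whisper import WhisperModel

    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, info = model.transcribe(
        "recording.wav",            # placeholder file
        language="nl",
        vad_filter=True,            # skip long silences in field recordings
        no_speech_threshold=0.6,    # one knob behind "hallucination thresholds"
    )
    for seg in segments:
        print(f"[{seg.start:7.2f} -> {seg.end:7.2f}] {seg.text}")
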

Bulk document translation — multilingual to Dutch

● In production

A translation service that periodically checks a network share for new documents. Non-Dutch documents are automatically translated and prepared.

Each document type requires a different extraction and reconstruction strategy. Those edge cases determine whether a solution becomes usable for end users.

Formats: eml / pdf / docx / txt / xlsx
Runtime: llama.cpp / vLLM
Service: systemd
Interface: dashboard
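
The watch-and-translate pattern in miniature; the paths, interval and translate() helper are hypothetical placeholders:

    import time
    from pathlib import Path

    INBOX = Path("/mnt/share/inbox")          # placeholder mount point
    ALLOWED = {".eml", ".pdf", ".docx", ".txt", ".xlsx"}
    seen: set[str] = set()

    def translate(path: Path) -> None:
        # Hypothetical placeholder: extract text, translate to Dutch, rebuild layout.
        print("would translate", path.name)

    while True:
        for f in INBOX.glob("*"):
            if f.suffix.lower() in ALLOWED and f.name not in seen:
                seen.add(f.name)
                translate(f)
        time.sleep(60)                        # polling interval is a placeholder
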

E-discovery assistant — RAG chatbot on manuals

● In production

A RAG-based chatbot that helps legal-technical end users with questions about software manuals, release notes and internal work instructions.

In addition to retrieval, the persona layer matters: answers must match the language, workflows and knowledge level of the organisation.

Type: RAG + persona
Vector store: ChromaDB / alternative
Interface: Web chat + API
Mode: on-prem

An AI service without a dashboard is difficult to manage.

Every project includes a custom dashboard for what matters in that project: queue, progress, errors, uptime, GPU load, storage, latency, volumes and user actions.

A

Project-specific

A transcription pipeline needs different metrics than a RAG chatbot or translation service.

B

Operationally useful

Not just graphs, but concrete information a user or administrator can act on.

C

Integrations

Existing tooling can be integrated via APIs, so not everything runs separately.

D

Own environment

The dashboard runs on your infrastructure or on private Gold IT infrastructure, not as generic SaaS.
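
As a hedged sketch of what "operationally useful" can mean in code: a small FastAPI status endpoint a dashboard could poll. The fields and zero values are illustrative, not a fixed Gold IT Services API:

    import subprocess
    from fastapi import FastAPI

    app = FastAPI()

    @app.get("/status")
    def status() -> dict:
        # GPU load straight from nvidia-smi; queue and error counts would come
        # from the project's own pipeline state (illustrative zeros here).
        gpu = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
             "--format=csv,noheader"],
            capture_output=True, text=True,
        ).stdout.strip()
        return {"queue_depth": 0, "errors_last_hour": 0, "gpu": gpu}
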

Enough technology?

The main page summarises local AI for managers and decision-makers. Use that page for the business case, benefits and concrete next step.

Back to Local AI for small businesses

From AI service to manageable solution

That is why Gold IT Services builds a dashboard for every AI project, covering queues, errors, uptime, GPU load, latency, volumes and user actions.

View a live example of an operational Gold IT Services dashboard.

Do you want the technical feasibility assessed?

Mark
Gold IT Services • Linschoten
E-mail info@golditservices.nl
Phone 06 49 75 54 50
Location Linschoten, Utrecht
Response within one working day