The M4 Mac Mini as a Local LLM Workhorse
A month with a quiet box
I promised to document my experience — here it is. I ran an M4 Mac Mini (base M4 chip, 512 GB internal SSD, 32 GB unified memory) with a 4 TB external SSD as both a desktop and an OpenAI-compatible inference server. Spoiler: it is quiet, frugal, and strong enough for day-to-day private AI workloads.
Apple has not published granular GPU memory limits, but in practice macOS caps GPU-wired memory at roughly two-thirds to three-quarters of unified memory; Activity Monitor shows about 22 GB addressable by the GPU on this machine. Your mileage may vary.
Why local LLMs?
Three reasons, in order of importance:
- Data sovereignty and GDPR — nothing leaves the building. No data processing agreements, no third-party sub-processors, no transfer impact assessments. The data stays on a box you own.
- Cost certainty — no usage-based billing. One hardware purchase, then electricity. At 30 watts sustained load, that is roughly the cost of a desk lamp.
- Air-gap option — disable outbound traffic entirely and you have an inference server that cannot leak data even if the software tries.
Telemetry caveat: Ollama and LM Studio collect anonymous usage statistics unless you explicitly disable them. For Ollama, set OLLAMA_DISABLE_TELEMETRY=1. For LM Studio, go to Settings and turn off telemetry. If you are running this for data sovereignty reasons, do this first.
Boot options: internal vs external macOS
I installed macOS on a 4 TB USB4 SSD to hot-swap configurations — one for production, one for experiments.
- Pros: instant rollback, sandboxed experiments, no risk to the production environment
- Cons: macOS updates can block the external drive if Secure Boot is set to Full Security
Fix: shut down, then press and hold the power button until "Loading startup options" appears (Apple Silicon Macs do not use Command-R), open Startup Security Utility from the Utilities menu in Recovery, set Reduced Security, and allow booting from external media. Booting internally avoids this entirely.
The toolchain
Ollama — fast CLI, good model library. Set OLLAMA_MODELS=/Volumes/LLMRepo to store models on the external SSD and OLLAMA_HOST=0.0.0.0 to expose the API to your LAN. Watch for silent num_ctx mismatches when you change context size — if the loaded model was started with a different value, your setting can be ignored without any warning.
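One way to sidestep the num_ctx problem is to pin the context size per request instead of relying on the model's Modelfile default — Ollama's /api/generate accepts an options.num_ctx override. A minimal sketch (model name and prompt are placeholders):

```python
import json

def build_generate_payload(model: str, prompt: str, num_ctx: int = 8192) -> str:
    """Build an Ollama /api/generate request body that sets num_ctx
    explicitly, so the Modelfile default cannot silently win."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # per-request override
    })

body = build_generate_payload("mistral:7b-instruct", "Summarise this clause: ...")
# POST body to http://<your-server>:11434/api/generate
```

Setting it per request also keeps the context size visible in your client code, where a mismatch is easy to spot.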
LM Studio — GUI plus OpenAI-compatible API in one application. My preferred choice for experimentation and interactive use. To bind to all network interfaces, edit ~/.cache/lm-studio/.internal/http-server-config.json and set "networkInterface": "0.0.0.0".
llama.cpp — lowest overhead, most scriptable. On Apple Silicon, Metal GPU acceleration is enabled by default in a standard CMake build (cmake -B build && cmake --build build --config Release); older Makefile builds used make LLAMA_METAL=1 LLAMA_METAL_EMBED=1.
All three expose an OpenAI-compatible API, which means any tool expecting an OpenAI endpoint — CrewAI, LangChain, custom scripts — works without modification.
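Because the API surface is identical, switching backends is a one-line change of base URL. A sketch using each server's default port ("brainbox" is a placeholder hostname):

```python
# Default ports for each server's OpenAI-compatible endpoint
BACKENDS = {
    "ollama":    "http://brainbox:11434/v1",
    "lm_studio": "http://brainbox:1234/v1",
    "llama_cpp": "http://brainbox:8080/v1",  # llama-server default
}

def chat_payload(model: str, user_msg: str) -> dict:
    """Build a /chat/completions request body accepted by all three backends."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "stream": False,
    }

# Same payload, any backend: POST chat_payload(...) to
# BACKENDS[name] + "/chat/completions"
```

The payload never changes — only the base URL does, which is exactly why downstream tools work unmodified.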
Performance: what actually runs
Measured with llama-bench on the base M4, batch size 1, 4096 context, Metal backend:
| Model | Quantisation | VRAM | Tokens/s | Power |
|---|---|---|---|---|
| Mistral 7B Instruct | Q4_K_M | ~10 GB | 54 | 28 W |
| Qwen 14B Chat | Q5_K_S | ~19 GB | 25 | 29 W |
| Llama 3 24B | Q4_K_M | ~22 GB | 12 | 30 W |
The sweet spot on the base M4 is 7B to 14B models. At 7B you get conversational speed; at 14B, noticeably better reasoning at still-usable throughput. At 24B and beyond, the weights plus KV cache push past the ~22 GB the GPU can address, layers fall back to the CPU, and responsiveness drops sharply.
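A rough back-of-envelope shows why the ceiling sits where it does: quantised weights plus the KV cache have to fit inside the ~22 GB GPU budget. The layer count and KV dimensions below are illustrative placeholders, and real runtimes add buffer overhead on top:

```python
def estimate_gb(params_b: float, bits_per_weight: float = 4.5,
                n_layers: int = 40, kv_dim: int = 1024, ctx: int = 4096) -> float:
    """Very rough GGUF memory estimate: quantised weights + fp16 KV cache.
    Ignores runtime buffers, which add several GB in practice."""
    weights = params_b * 1e9 * bits_per_weight / 8   # bytes for the weights
    kv_cache = 2 * n_layers * kv_dim * ctx * 2       # K and V, 2 bytes each (fp16)
    return (weights + kv_cache) / 1e9

for size in (7, 14, 24):
    print(f"{size}B @ ~Q4 ≈ {estimate_gb(size):.1f} GB before runtime overhead")
```

The estimates run well below the measured figures in the table, which is expected — compute buffers, the context at full depth, and quantisation metadata all add real gigabytes.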
For the M4 Pro (48 GB or 64 GB configurations), the ceiling moves up significantly — 30B+ models become practical, and the extra memory bandwidth makes a real difference for longer contexts.
System characteristics
- Power: 6 W idle, 30 W sustained load
- Noise: below 20 dB(A) — the fan stays under 2000 RPM. In practice, silent
- Footprint: 12.7 cm square — fits under a monitor stand
- Cost: under 1500 EUR for the base configuration with external SSD
This is less power than most laptop chargers draw. It runs 24/7 without anyone noticing it is there.
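The desk-lamp claim is easy to sanity-check: annual cost at a worst-case 24/7 sustained load. The €0.30/kWh rate is an assumption — plug in your own tariff:

```python
def annual_cost_eur(watts: float, eur_per_kwh: float = 0.30) -> float:
    """Electricity cost of running a constant load for one year."""
    kwh = watts * 24 * 365 / 1000
    return kwh * eur_per_kwh

print(f"Idle (6 W):       ~{annual_cost_eur(6):.0f} EUR/year")   # ~16 EUR
print(f"Sustained (30 W): ~{annual_cost_eur(30):.0f} EUR/year")  # ~79 EUR
```

Even assuming the box never idles, a year of inference costs less than a single month of a typical cloud API bill.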
What I actually use it for
- Privacy-first chat sessions — anything involving client data, contracts, internal documents
- Virtual assistant “Kim” — an agentic system that monitors email and watched folders, drafts replies, and flags items for attention
- Batch embedding for a local semantic search index across project documentation
- Document conversion and watermarking — automated pipeline triggered by file drops
- RAG (Retrieval-Augmented Generation) — local knowledge base queries without sending documents to external APIs
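The retrieval step of the RAG pipeline is plain vector maths once embeddings exist — a minimal cosine-similarity ranker over pre-computed embeddings. The vectors and filenames here are toy values; in practice the embeddings come from the local endpoint:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], doc_vecs: dict, k: int = 2) -> list[str]:
    """Rank documents by similarity to the query embedding, return the top k."""
    ranked = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

docs = {
    "contract.md": [0.9, 0.1, 0.0],
    "minutes.md":  [0.1, 0.9, 0.1],
    "spec.md":     [0.8, 0.2, 0.1],
}
print(top_k([1.0, 0.0, 0.0], docs))  # → ['contract.md', 'spec.md']
```

The retrieved chunks then go into the prompt of a local 7B or 14B model — the whole loop stays on the box.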
Connecting clients: the networking minimum
Assign the Mac Mini a static IP (e.g., 192.168.0.42). On client machines, add 192.168.0.42 brainbox to /etc/hosts (avoid a .local suffix there — that domain is reserved for Bonjour/mDNS). Bonjour works on the same subnet, but a static mapping survives VLANs and VPNs.
Any OpenAI-compatible client configuration looks like this:
```python
import os

# Point any OpenAI-compatible client at the local server
os.environ["OPENAI_API_BASE"] = "http://brainbox:1234/v1"
os.environ["OPENAI_MODEL_NAME"] = "openai/qwen2.5-coder-7b-instruct"
os.environ["OPENAI_API_KEY"] = "lmstudio_placeholder"  # local servers ignore the key, but clients require one
```
Swap brainbox for your Mac Mini’s IP or hostname.
What it is not
This is not a replacement for cloud AI when you need frontier-model capability. GPT-4-class reasoning, 128K+ context windows, or real-time multi-modal processing still need bigger hardware or cloud APIs.
But for the 80% of tasks that involve processing private data with a competent model — summarisation, classification, drafting, code assistance, embedding — a silent box under the desk handles it without sending a single byte off-premises.
The takeaway
The base M4 Mac Mini is not a GPU monster. But for 7B to 14B models it feels like a dedicated inference appliance — drawing less power than a desk lamp, making no noise, and keeping your data exactly where it belongs. If your data cannot leave the premises, or you are simply done paying per token, this box is worth a spot on the desk.
This article was originally written in December 2024 after a month of daily use. The toolchain and model ecosystem continue to evolve — newer quantisation methods and models have expanded what is practical on Apple Silicon since then, but the core architecture and workflow remain the same.