Local AI for Personal Notes
When It Makes Sense and How to Set It Up
The tradeoffs of self-hosting LLMs with your Obsidian vault, and a practical Mac Studio setup guide if you decide it's worth it.
I use Claude Code paired with my Obsidian vault—not for coding, but for thinking. Trip planning, life admin, drafting essays, organising projects. The AI reads my files, understands my context, and helps me work through problems.
Recently a friend and I were wondering: could you self-host this? Run your own model, keep everything local?
The answer is yes, but it’s worth understanding what you’re actually buying before you spend the money.
What Cloud AI Actually Sees
When I say “have a look at my travel planning notes” and Claude reads those files, the contents are sent to Anthropic’s servers. Not just my prompt—the actual file contents. Every context file I share becomes part of the API request.
Over a long session, that might include: health notes, financial records, relationship reflections, friend profiles, travel itineraries, spiritual practice logs. The provider’s privacy policy is my only protection.
Anthropic’s stated policy is that API and Pro data isn’t used for training. But it’s still processed on their infrastructure, logged for abuse detection, and potentially accessible to employees. A data breach or sufficiently motivated adversary could expose it.
If that’s an acceptable tradeoff for you, stop here. The capability of frontier models is genuinely better, and the convenience is hard to beat.
If it’s not acceptable, keep reading.
Why “Encrypted Cloud GPU” Doesn’t Work
The obvious question: can’t you encrypt your data, send it to a cloud GPU, and get encrypted results back?
No. The GPU needs to see your data in plaintext to process it. Matrix multiplications don’t work on ciphertext.
The flow looks like this:
You (encrypt) → Cloud Server (decrypt → GPU → encrypt) → You
Whoever controls that middle step sees everything.
The server can decrypt—that’s not the problem. The problem is that “end-to-end encryption” means only endpoints you control see plaintext. If you don’t own the hardware doing the decryption, it’s not your endpoint.
Confidential computing (AMD SEV, Intel TDX) encrypts VM memory so the hypervisor can’t read it. But GPU memory is separate—data has to leave the CPU enclave to reach the GPU. NVIDIA’s H100 extends the trust boundary to GPU memory, but it’s only on enterprise hardware, not consumer GPU clouds, and you’re still trusting NVIDIA’s attestation.
The practical reality: truly private cloud inference doesn’t exist yet at consumer prices. You’re choosing who to trust, not whether to trust.
The Actual Options
| Setup | Who Sees Your Data |
|---|---|
| OpenAI / Anthropic API | The provider |
| Consumer GPU cloud (RunPod, Vast.ai) | Cloud provider + you |
| Enterprise confidential computing | You trust attestation |
| Local hardware | You |
For genuine privacy, local hardware is the only clean answer.
When Local Makes Sense
Local inference makes sense if:
- You have genuine privacy requirements (legal, medical, personal)
- You’re philosophically opposed to the cloud trust model
- You want offline access (travel, unreliable internet)
- You enjoy tinkering and self-hosting
It doesn’t make sense if:
- You need frontier-level reasoning for complex problems
- Your time is expensive and you’d rather not maintain infrastructure
- Your threat model is actually fine with trusting Anthropic’s policies
Be honest about which bucket you’re in. The capability gap between open models and frontier models is real: on the tasks I care about, open models get me roughly 70-80% of the way there. For some use cases that’s fine. For others it’s not.
The Vault as Abstraction Layer
Here’s an underappreciated advantage of using an Obsidian vault as your AI interface: it’s just a folder of markdown files. Any LLM can read and write to it. This creates a clean abstraction that enables:
Frictionless migration. Switching from Claude to a local model (or vice versa) requires no data migration. Your vault stays the same. Point a different tool at it and keep working.
Sensitivity-based partitioning. Use folder structure or tags to control access. Health notes in one folder, travel planning in another. Point Claude Code at the non-sensitive folders, use your local model for the rest.
Hybrid workflows. Run both simultaneously. Use Claude for complex reasoning on non-sensitive material. Use a local model for anything you’d rather keep private. The vault is the common ground.
A practical setup might look like:
Vault/
├── 01 Now/ # → Either (not sensitive)
├── 02 Inbox/ # → Either
├── 03 Projects/ # → Claude (needs reasoning)
├── 04 Areas/
│ ├── Health/ # → Local only
│ ├── Finance/ # → Local only
│ └── Travel/ # → Either
├── 05 Resources/ # → Either
└── 07 System/ # → Local only (contains context files)
You could implement this with a tagging convention (#local-only, #cloud-ok) and a wrapper script that filters which files the cloud model can see. Or simply be deliberate about which folders you reference in each session.
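If you go the script route, it doesn’t need to be clever. Here’s a minimal sketch in Python; the vault path, folder names, and #local-only tag are just the conventions from above, so adjust to taste:
import pathlib

VAULT = pathlib.Path.home() / "Documents" / "ObsidianVault"  # adjust to your vault
LOCAL_ONLY_DIRS = {"04 Areas/Health", "04 Areas/Finance", "07 System"}

def cloud_ok(note: pathlib.Path) -> bool:
    """True if this note may be referenced in a cloud session."""
    rel = str(note.relative_to(VAULT))
    if any(rel.startswith(d) for d in LOCAL_ONLY_DIRS):
        return False
    text = note.read_text(encoding="utf-8", errors="ignore")
    return "#local-only" not in text  # an explicit tag always wins

shareable = [p for p in VAULT.rglob("*.md") if cloud_ok(p)]
print(f"{len(shareable)} notes are safe to reference in a cloud session")
Feed that list to whatever launches your cloud session, or just run it as a pre-flight check before you start referencing folders.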
The point: because your knowledge base is plain files, you’re not locked into any provider. The vault is yours. The models are interchangeable.
Hardware
If you’ve decided local is right for you, the Mac Studio is the cleanest option.
Apple Silicon’s unified memory means the CPU and GPU share the same RAM pool. No copying tensors between devices, no separate VRAM limitations. A 70B model loads once and stays loaded.
Recommended configs (AUD pricing as of January 2026):
| Use case | Chip | Memory | Cost | Per person |
|---|---|---|---|---|
| Solo / 2 friends | Base (60-core GPU) | 96GB | $6,999 | $3,500 |
| 5-friend co-op | 80-core GPU | 256GB | $11,649 | $2,330 |
| 20-user service | 80-core GPU | 256GB | $11,649 | $582 |
Why these specs:
GPU cores matter for speed. The 80-core GPU is ~33% faster than the 60-core. That’s the difference between 20 tokens/sec and 27 tokens/sec—noticeable over a long session, and it compounds in a shared setup where faster inference means shorter queues.
96GB is enough for solo use. A 70B model at Q4 quantization needs ~40GB. You’ll have headroom for the OS and apps.
256GB for shared setups. Lets you run two 70B instances simultaneously, or experiment with larger models. Worth the upgrade when splitting costs.
Skip 512GB. It’s +$3,600 for memory you won’t use. The best open models you can realistically run at home today are 70B class. By the time 400B+ models are good enough to matter, the hardware will have moved on.
Skip extra storage. Models can live on an external drive or NAS.
Why M3 Ultra over M4 Max: The Ultra supports up to 512GB unified memory if you ever want to upgrade, and has more GPU cores. The Max tops out at 128GB.
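For a rough sense of where the memory figures come from: parameter count times bits per weight, plus an allowance for the KV cache and runtime buffers. A back-of-envelope sketch (the 1.2 overhead factor is a rough assumption, not a measured number):
def model_footprint_gb(params_b: float, bits: int = 4, overhead: float = 1.2) -> float:
    """Rough RAM estimate for a quantised model, in GB."""
    weights_gb = params_b * bits / 8   # 70B at 4 bits ≈ 35 GB of raw weights
    return weights_gb * overhead       # allowance for KV cache and runtime buffers

print(round(model_footprint_gb(70)))   # ≈ 42 GB: fits in 96GB with room for the OS
print(round(model_footprint_gb(30)))   # ≈ 18 GB: two of these coexist comfortably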
Setup
Inference: Ollama
# Install
brew install ollama
# Start the server
ollama serve
# Pull a model
ollama pull qwen2.5:72b-instruct-q4_K_M
# Chat
ollama run qwen2.5:72b-instruct-q4_K_M
Ollama handles model management and serves an OpenAI-compatible API on localhost:11434.
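Because the endpoint speaks the OpenAI protocol, anything built on the OpenAI Python SDK can point at it. A minimal sketch (Ollama ignores the API key, but the client library requires one):
from openai import OpenAI

# Ollama's OpenAI-compatible routes live under /v1 on its default port
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="qwen2.5:72b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Summarise the tradeoffs of local inference."}],
)
print(resp.choices[0].message.content)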
Alternatives: LM Studio for a GUI, or llama.cpp for fine-grained control.
Models
| Model | Size (Q4) | Strengths |
|---|---|---|
| Qwen 2.5 72B Instruct | ~40GB | Best open model for coding and reasoning |
| Llama 3.3 70B Instruct | ~40GB | Strong all-rounder, good instruction-following |
| DeepSeek-R1 70B | ~40GB | Reasoning-focused |
Stick to 70B-class models with 96GB RAM. Larger models exist but leave less headroom.
Agentic Tooling
Running a model is one thing. Getting the Claude Code experience—where the AI reads files, runs commands, and edits your vault—requires an agentic wrapper.
Aider is the closest to Claude Code:
pip install aider-chat
# Use with Ollama
aider --model ollama/qwen2.5:72b-instruct-q4_K_M
# Run from your vault
cd /path/to/obsidian/vault
aider
Aider reads and edits files, understands git, and maintains conversation context.
Open Interpreter is more general-purpose and runs shell commands:
pip install open-interpreter
interpreter --local
Connecting to Obsidian
Your vault is just a folder of markdown files. Point your agentic tool at it:
cd ~/Documents/ObsidianVault
aider --model ollama/qwen2.5:72b-instruct-q4_K_M
Now you can say “read my Projects folder and summarise what’s in flight” or “update my travel notes.”
For tighter integration, the Obsidian Copilot and Text Generator plugins can use local Ollama endpoints directly within Obsidian.
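For simple jobs you can also skip the agentic layer entirely and hit Ollama’s native API with a short script. A sketch, assuming the vault path and folder names from earlier:
import pathlib
import requests

VAULT = pathlib.Path.home() / "Documents" / "ObsidianVault"  # adjust to your vault
notes = "\n\n".join(p.read_text() for p in sorted((VAULT / "03 Projects").glob("*.md")))

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:72b-instruct-q4_K_M",
        "prompt": f"Summarise what's currently in flight across these project notes:\n\n{notes}",
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["response"])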
What to Expect
Works well:
- Reading and summarising files
- Answering questions about your notes
- Simple edits and formatting
- Brainstorming and ideation
Noticeable gap:
- Complex multi-step reasoning
- Following nuanced, lengthy instructions
- Maintaining coherence over long conversations
- Catching subtle errors
The gap is real. Whether it matters depends on your use case.
Sharing the Cost
A Mac Studio is expensive for one person. But the economics change quickly when you share.
Two Friends
If you have a friend you trust with sensitive information anyway, you can split the cost and share the hardware.
The setup:
- One person hosts the Mac Studio at their place
- The other connects via Tailscale or WireGuard (secure tunnel, no port forwarding needed)
- SSH in, run aider in tmux, or expose the Ollama API over the tunnel
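That last option is just the earlier client code pointed at the host’s machine instead of localhost. A sketch; mac-studio stands in for the host’s Tailscale hostname or IP, and the host needs to start Ollama with OLLAMA_HOST=0.0.0.0 so it listens beyond localhost:
from openai import OpenAI

# Same OpenAI-compatible endpoint, reached over the Tailscale tunnel
client = OpenAI(base_url="http://mac-studio:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="qwen2.5:72b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Hello from the other end of the tunnel."}],
)
print(resp.choices[0].message.content)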
Cost: $3.5k each instead of $7k. Suddenly much more accessible.
Trust model: The host can technically inspect traffic if they want to—they control the hardware. But if you already share personal information with each other, you’re trading “trust Anthropic” for “trust my friend.” For most people, that’s an upgrade.
Concurrent usage: A 70B model needs ~40GB RAM. With 96GB total, you can comfortably run one instance. If you both want to use it simultaneously, you’d either coordinate, run two smaller models (two 30B models fit easily), or upgrade to 256GB for two full-size instances.
Things to figure out:
- Who handles maintenance and model updates?
- What’s the exit strategy if one person wants out?
- Is the host’s internet reliable enough?
These are solvable problems. The harder part is finding a friend who’s both technically inclined and trustworthy with your personal notes. If you have that, this is a genuinely good option.
Five Friends: A Small Co-op
Take this further: a well-specced Mac Studio shared among 4-5 close friends.
Economics:
| Config | Total | Per person (5 people) |
|---|---|---|
| M3 Ultra 96GB | ~$7k | ~$1,400 |
| M3 Ultra 256GB (80-core) | ~$11.6k | ~$2,330 |
That’s genuinely cheap for unlimited local inference.
The coordination problem: The honour system works until everyone wants the machine at once for their sacred Saturday-morning Obsidian marathon. You need usage tracking.
LiteLLM as a metering layer:
LiteLLM can sit in front of Ollama and track usage per user:
pip install litellm
litellm --model ollama/qwen2.5:72b-instruct-q4_K_M --port 4000
Each friend gets an API key. LiteLLM logs tokens in/out per key and has a built-in dashboard.
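From a member’s side, the proxy looks like any other OpenAI-compatible endpoint, just with a key that actually matters. A sketch; the hostname and key are placeholders:
from openai import OpenAI

# Each member points at the shared LiteLLM proxy with their own key
client = OpenAI(base_url="http://mac-studio:4000", api_key="sk-alice-prepaid-key")

resp = client.chat.completions.create(
    model="ollama/qwen2.5:72b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Help me plan this week."}],
)
print(resp.choices[0].message.content)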
Prepaid accounts: The cleanest billing model is prepaid. Everyone funds their account upfront—say $50 or $100 to start. Usage deducts from the balance. When you’re low, top up. No chasing friends for money, no awkward conversations. If someone’s balance hits zero, their API key stops working until they reload.
This also creates natural coordination: if you’re burning through credits, you’re incentivised to use it thoughtfully.
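LiteLLM’s virtual-key feature can enforce this directly: keys are issued with a spending cap and stop working once it’s exhausted. A sketch, assuming you’ve enabled key management on the proxy (it needs a master key and a small database) and that the /key/generate route with a max_budget field behaves as documented:
import requests

# Issue a prepaid key for one member; run against the LiteLLM proxy
resp = requests.post(
    "http://mac-studio:4000/key/generate",
    headers={"Authorization": "Bearer sk-litellm-master-key"},  # placeholder master key
    json={"key_alias": "alice", "max_budget": 100.0},           # roughly $100 of credit
)
print(resp.json()["key"])  # hand this key to the member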
Practical setup:
- Tailscale for secure access (free for 3 users, $6/user/month beyond)
- One person runs sysadmin (maybe gets free usage or a larger share)
- Shared Signal/Discord channel for “heads up, running something heavy”
- Monthly settlement based on LiteLLM logs
The vibe is a private compute co-op. Everyone chips in, everyone benefits, no one’s sending their notes through a cloud provider.
Twenty Users: Service Mode
Scale further and the trust model changes. At 20 users, you’re not sharing with friends—you’re running a small inference service. Users don’t need access to your network. They just need an API.
Architecture:
Users → Cloudflare Tunnel → Caddy → LiteLLM → Ollama
The tunnel means no open ports on your router; LiteLLM handles API keys, metering, and prepaid balances.
Key differences from the friend co-op:
| Aspect | 5 friends | 20 users |
|---|---|---|
| Network access | Tailscale VPN | API-only via HTTPS |
| SSH access | Yes | No |
| Trust required | High | Minimal |
| User isolation | Relaxed | Full—users can’t see each other |
| Ingress | Tailscale | Cloudflare Tunnel |
Components:
- Cloudflare Tunnel: Exposes your API to the internet without opening router ports. Free tier works fine.
- Caddy: HTTPS termination, reverse proxy.
- LiteLLM: API key authentication, per-user usage tracking, prepaid balance enforcement. Requests fail when balance hits zero.
- Ollama: Actual inference.
Users hit an HTTPS endpoint, authenticate with their API key, and their balance decrements with each request. Top-ups via PayPal, Wise, crypto, whatever works for your group.
Economics:
| Approach | Numbers |
|---|---|
| Buy-in model | $11,649 hardware ÷ 20 ≈ $582/person |
| Usage-based | Charge per million tokens, recoup hardware over time |
| Hybrid | Small buy-in + usage fees |
At this scale you could run it as a micro-business. Not life-changing money, but the Mac Studio pays for itself and you get free inference forever.
Who’s this for?
A niche community that values privacy and has moderate technical literacy. A Discord server of AI-curious professionals. A group of writers, researchers, or consultants who don’t want their work going through OpenAI. Twenty people who each value local inference at $582 but not at $11,649.
You’re the operator. They’re the users. The Mac Studio sits in your apartment, they get an API endpoint, and everyone’s notes stay off the cloud.
Congestion Pricing
A Mac Studio can only run one inference at a time. If 20 users all want their Saturday morning Obsidian session simultaneously, someone’s waiting in a queue. Price can allocate scarce capacity.
Dynamic surge pricing:
Price floats based on queue depth:
def get_multiplier():
    # Scale price with demand: 1x when the queue is short, up to 5x when it's deep.
    # get_queue_depth() is whatever hook you use to count pending requests.
    queue = get_queue_depth()
    if queue < 3:
        return 1.0
    if queue < 6:
        return 2.0
    return 5.0
Expose a /current-price endpoint so users can check before submitting. Maybe 100-150 lines of Python in front of LiteLLM.
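A sketch of that endpoint, with the multiplier logic repeated so it runs standalone; the Flask choice and queue-depth stub are illustrative, not prescriptive:
from flask import Flask, jsonify

app = Flask(__name__)

def get_queue_depth() -> int:
    # Placeholder: in practice, count pending requests from LiteLLM or your job queue
    return 0

def get_multiplier() -> float:
    queue = get_queue_depth()
    if queue < 3:
        return 1.0
    if queue < 6:
        return 2.0
    return 5.0

@app.route("/current-price")
def current_price():
    # Users check the surge multiplier before submitting a heavy job
    return jsonify({"multiplier": get_multiplier()})

if __name__ == "__main__":
    app.run(port=8080)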
What this achieves:
- Heavy users subsidise the service (and their own capacity hogging)
- Light users get cheaper access
- Peak demand spreads out naturally
- No one’s stuck in a 30-minute queue unless they choose to be
Summary
Solo setup:
| Component | Recommendation |
|---|---|
| Hardware | Mac Studio M3 Ultra, 96GB (base) |
| Cost | $6,999 AUD |
| Inference | Ollama |
| Model | Qwen 2.5 72B Instruct Q4 |
| Agentic wrapper | Aider |
| Obsidian integration | CLI in vault folder, or Copilot plugin |
Shared setups:
| Scale | Hardware | Total cost | Per person | Trust model |
|---|---|---|---|---|
| 2 friends | 96GB | $6,999 | $3,500 | High (SSH) |
| 5 friends | 80-core, 256GB | $11,649 | $2,330 | High (Tailscale) |
| 20 users | 80-core, 256GB | $11,649 | $582 | Low (API-only) |
Setup time: an afternoon for solo, a weekend for the co-op infrastructure.
The GPU still needs to see your data. The difference is it’s your GPU, on your desk, and nothing leaves your network.