I use Claude Code paired with my Obsidian vault—not for coding, but for thinking. Trip planning, life admin, drafting essays, organising projects. The AI reads my files, understands my context, and helps me work through problems.

Recently a friend and I got to wondering: could you self-host this? Run your own model, keep everything local?

The answer is yes, but it’s worth understanding what you’re actually buying before you spend the money.

What Cloud AI Actually Sees

When I say “have a look at my travel planning notes” and Claude reads those files, the contents are sent to Anthropic’s servers. Not just my prompt—the actual file contents. Every context file I share becomes part of the API request.

Over a long session, that might include: health notes, financial records, relationship reflections, friend profiles, travel itineraries, spiritual practice logs. The provider’s privacy policy is my only protection.

Anthropic’s stated policy is that API and Pro data isn’t used for training. But it’s still processed on their infrastructure, logged for abuse detection, and potentially accessible to employees. A data breach or sufficiently motivated adversary could expose it.

If that’s an acceptable tradeoff for you, stop here. The capability of frontier models is genuinely better, and the convenience is hard to beat.

If it’s not acceptable, keep reading.

Why “Encrypted Cloud GPU” Doesn’t Work

The obvious question: can’t you encrypt your data, send it to a cloud GPU, and get encrypted results back?

No. The GPU needs to see your data in plaintext to process it. Matrix multiplications don’t work on ciphertext (fully homomorphic encryption can do it in principle, but it’s orders of magnitude too slow for LLM inference).

The flow looks like this:

You (encrypt) → Cloud Server (decrypt → GPU → encrypt) → You
                        ↑
                Whoever controls this sees everything

Encryption in transit isn’t the problem. The problem is that “end-to-end encryption” means only endpoints you control ever see plaintext. If you don’t own the hardware doing the decryption, it’s not your endpoint.

Confidential computing (AMD SEV, Intel TDX) encrypts VM memory so the hypervisor can’t read it. But GPU memory is separate—data has to leave the CPU enclave to reach the GPU. NVIDIA’s H100 extends the trust boundary to GPU memory, but it’s only on enterprise hardware, not consumer GPU clouds, and you’re still trusting NVIDIA’s attestation.

The practical reality: truly private cloud inference doesn’t exist yet at consumer prices. You’re choosing who to trust, not whether to trust.

The Actual Options

| Setup | Who Sees Your Data |
|---|---|
| OpenAI / Anthropic API | The provider |
| Consumer GPU cloud (RunPod, Vast.ai) | Cloud provider + you |
| Enterprise confidential computing | You, if you trust the attestation |
| Local hardware | You |

For genuine privacy, local hardware is the only clean answer.

When Local Makes Sense

Local inference makes sense if:

  • You have genuine privacy requirements (legal, medical, personal)
  • You’re philosophically opposed to the cloud trust model
  • You want offline access (travel, unreliable internet)
  • You enjoy tinkering and self-hosting

It doesn’t make sense if:

  • You need frontier-level reasoning for complex problems
  • Your time is expensive and you’d rather not maintain infrastructure
  • Your threat model is actually fine with trusting Anthropic’s policies

Be honest about which bucket you’re in. The capability gap between open models and frontier models is real: for the tasks I care about, open models get me roughly 70-80% of the way there. For some use cases that’s fine. For others it’s not.

The Vault as Abstraction Layer

Here’s an underappreciated advantage of using an Obsidian vault as your AI interface: it’s just a folder of markdown files. Any LLM can read and write to it. This creates a clean abstraction that enables:

Frictionless migration. Switching from Claude to a local model (or vice versa) requires no data migration. Your vault stays the same. Point a different tool at it and keep working.

Sensitivity-based partitioning. Use folder structure or tags to control access. Health notes in one folder, travel planning in another. Point Claude Code at the non-sensitive folders, use your local model for the rest.

Hybrid workflows. Run both simultaneously. Use Claude for complex reasoning on non-sensitive material. Use a local model for anything you’d rather keep private. The vault is the common ground.

A practical setup might look like:

Vault/
├── 01 Now/           # → Either (not sensitive)
├── 02 Inbox/         # → Either
├── 03 Projects/      # → Claude (needs reasoning)
├── 04 Areas/
│   ├── Health/       # → Local only
│   ├── Finance/      # → Local only
│   └── Travel/       # → Either
├── 05 Resources/     # → Either
└── 07 System/        # → Local only (contains context files)

You could implement this with a tagging convention (#local-only, #cloud-ok) and a wrapper script that filters which files the cloud model can see. Or simply be deliberate about which folders you reference in each session.
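
A minimal sketch of that wrapper, assuming the #local-only tag appears somewhere in the note body; the vault path and function name are illustrative, not part of any existing tool:

import pathlib

VAULT = pathlib.Path.home() / "Documents" / "ObsidianVault"  # adjust to your vault

def cloud_safe_files(vault: pathlib.Path) -> list[pathlib.Path]:
    """List markdown files that have not been marked local-only."""
    safe = []
    for path in vault.rglob("*.md"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if "#local-only" in text:  # skip anything explicitly marked sensitive
            continue
        safe.append(path)
    return safe

if __name__ == "__main__":
    for path in cloud_safe_files(VAULT):
        print(path)

You would feed that list to whatever launches the cloud session; the local model gets everything.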

The point: because your knowledge base is plain files, you’re not locked into any provider. The vault is yours. The models are interchangeable.

Hardware

If you’ve decided local is right for you, the Mac Studio is the cleanest option.

Apple Silicon’s unified memory means the CPU and GPU share the same RAM pool. No copying tensors between devices, no separate VRAM limitations. A 70B model loads once and stays loaded.

Recommended configs (AUD pricing as of January 2026):

| Use case | Chip | Memory | Cost | Per person |
|---|---|---|---|---|
| Solo / 2 friends | Base (60-core GPU) | 96GB | $6,999 | $3,500 |
| 5-friend co-op | 80-core GPU | 256GB | $11,649 | $2,330 |
| 20-user service | 80-core GPU | 256GB | $11,649 | $582 |

Why these specs:

  • GPU cores matter for speed. The 80-core GPU is ~33% faster than the 60-core. That’s the difference between 20 tokens/sec and 27 tokens/sec—noticeable over a long session, and it compounds in a shared setup where faster inference means shorter queues.

  • 96GB is enough for solo use. A 70B model at Q4 quantization needs ~40GB (a back-of-envelope sketch follows this list). You’ll have headroom for the OS and apps.

  • 256GB for shared setups. Lets you run two 70B instances simultaneously, or experiment with larger models. Worth the upgrade when splitting costs.

  • Skip 512GB. It’s +$3,600 for memory you won’t use. Current best open models are 70B class. By the time 400B+ models are good enough to matter, the hardware will have moved on.

  • Skip extra storage. Models can live on an external drive or NAS.
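
A rough back-of-envelope behind that ~40GB figure; a sketch only, since real Q4 files run a little larger because some tensors stay at higher precision:

def rough_weight_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    """Weight-only footprint; KV cache and runtime overhead come on top."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

print(round(rough_weight_gb(70), 1))  # ~39 GiB for a 70B model at ~4.8 bits per weight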

Why M3 Ultra over M4 Max: The Ultra supports up to 512GB unified memory if you ever want to upgrade, and has more GPU cores. The Max tops out at 128GB.

Setup

Inference: Ollama

# Install
brew install ollama

# Start the server
ollama serve

# Pull a model
ollama pull qwen2.5:72b-instruct-q4_K_M

# Chat
ollama run qwen2.5:72b-instruct-q4_K_M

Ollama handles model management and serves an OpenAI-compatible API on localhost:11434.
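
A quick way to sanity-check the server from Python is to point the official openai client at that endpoint; a sketch, where /v1 is Ollama’s OpenAI-compatible route and the api_key value is a required placeholder rather than a real key:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored by Ollama

response = client.chat.completions.create(
    model="qwen2.5:72b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
)
print(response.choices[0].message.content)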

Alternatives: LM Studio for a GUI, or llama.cpp for fine-grained control.

Models

| Model | Size (Q4) | Strengths |
|---|---|---|
| Qwen 2.5 72B Instruct | ~40GB | Best open model for coding and reasoning |
| Llama 3.3 70B Instruct | ~40GB | Strong all-rounder, good instruction-following |
| DeepSeek-R1 70B | ~40GB | Reasoning-focused |

Stick to 70B-class models with 96GB RAM. Larger models exist but leave less headroom.

Agentic Tooling

Running a model is one thing. Getting the Claude Code experience—where the AI reads files, runs commands, and edits your vault—requires an agentic wrapper.

Aider is the closest to Claude Code:

pip install aider-chat

# Use with Ollama
aider --model ollama/qwen2.5:72b-instruct-q4_K_M

# Run from your vault
cd /path/to/obsidian/vault
aider

Aider reads and edits files, understands git, and maintains conversation context.

Open Interpreter is more general-purpose and runs shell commands:

pip install open-interpreter
interpreter --local

Connecting to Obsidian

Your vault is just a folder of markdown files. Point your agentic tool at it:

cd ~/Documents/ObsidianVault
aider --model ollama/qwen2.5:72b-instruct-q4_K_M

Now you can say “read my Projects folder and summarise what’s in flight” or “update my travel notes.”

For tighter integration, the Obsidian Copilot and Text Generator plugins can use local Ollama endpoints directly within Obsidian.

What to Expect

Works well:

  • Reading and summarising files
  • Answering questions about your notes
  • Simple edits and formatting
  • Brainstorming and ideation

Noticeable gap:

  • Complex multi-step reasoning
  • Following nuanced, lengthy instructions
  • Maintaining coherence over long conversations
  • Catching subtle errors

The gap is real. Whether it matters depends on your use case.

Sharing the Cost

A Mac Studio is expensive for one person. But the economics change quickly when you share.

Two Friends

A Mac Studio doesn’t have to be a solo purchase. If you have a friend you trust with sensitive information anyway, you can split the cost and share the hardware.

The setup:

  • One person hosts the Mac Studio at their place
  • The other connects over Tailscale or a WireGuard tunnel (with Tailscale, no port forwarding needed)
  • SSH in, run aider in tmux, or expose the Ollama API over the tunnel

Cost: $3.5k each instead of $7k. Suddenly much more accessible.

Trust model: The host can technically inspect traffic if they want to—they control the hardware. But if you already share personal information with each other, you’re trading “trust Anthropic” for “trust my friend.” For most people, that’s an upgrade.

Concurrent usage: A 70B model needs ~40GB RAM. With 96GB total, you can comfortably run one instance. If you both want to use it simultaneously, you’d either coordinate, run two smaller models (two 30B models fit easily), or step up to the 256GB configuration for two full-size instances.

Things to figure out:

  • Who handles maintenance and model updates?
  • What’s the exit strategy if one person wants out?
  • Is the host’s internet reliable enough?

These are solvable problems. The harder part is finding a friend who’s both technically inclined and trustworthy with your personal notes. If you have that, this is a genuinely good option.

Five Friends: A Small Co-op

Take this further: a top-spec Mac Studio shared among 4-5 close friends.

Economics:

| Config | Total | Per person (5 people) |
|---|---|---|
| M3 Ultra, 96GB | $6,999 | ~$1,400 |
| M3 Ultra, 256GB (80-core GPU) | $11,649 | ~$2,330 |

That’s genuinely cheap for unlimited local inference.

The coordination problem: the honour system works until everyone wants the machine at the same time for their sacred Saturday-morning Obsidian marathon. You need usage tracking.

LiteLLM as a metering layer:

LiteLLM can sit in front of Ollama and track usage per user:

pip install litellm
litellm --model ollama/qwen2.5:72b-instruct-q4_K_M --port 4000

Each friend gets an API key. LiteLLM logs tokens in/out per key and has a built-in dashboard.
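
From each friend’s side, the proxy then looks like any OpenAI-compatible endpoint. A sketch, where the mac-studio hostname and the key value are placeholders for whatever your Tailscale network and key setup actually use:

from openai import OpenAI

client = OpenAI(
    base_url="http://mac-studio:4000",  # LiteLLM proxy reached over Tailscale (placeholder hostname)
    api_key="sk-alice-example",         # per-person key issued by whoever runs the proxy (placeholder)
)

response = client.chat.completions.create(
    model="ollama/qwen2.5:72b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Summarise my inbox notes."}],
)
print(response.choices[0].message.content)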

Prepaid accounts: The cleanest billing model is prepaid. Everyone funds their account upfront—say $50 or $100 to start. Usage deducts from the balance. When you’re low, top up. No chasing friends for money, no awkward conversations. If someone’s balance hits zero, their API key stops working until they reload.

This also creates natural coordination: if you’re burning through credits, you’re incentivised to use it thoughtfully.

Practical setup:

  • Tailscale for secure access (free for 3 users, $6/user/month beyond)
  • One person runs sysadmin (maybe gets free usage or a larger share)
  • Shared Signal/Discord channel for “heads up, running something heavy”
  • Monthly settlement based on LiteLLM logs (sketch below)
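
The settlement script can stay trivial. A sketch, assuming you export per-key usage to a CSV with api_key and total_tokens columns; the export format and the rate are placeholders, not LiteLLM specifics:

import csv
from collections import defaultdict

RATE_PER_MTOK = 2.00  # whatever the group agrees per million tokens (placeholder)

def settle(usage_csv: str) -> dict[str, float]:
    """Sum tokens per key for the month and turn them into dollars owed."""
    totals: dict[str, int] = defaultdict(int)
    with open(usage_csv, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["api_key"]] += int(row["total_tokens"])
    return {key: tokens / 1_000_000 * RATE_PER_MTOK for key, tokens in totals.items()}

if __name__ == "__main__":
    for key, owed in settle("usage_export.csv").items():
        print(f"{key}: ${owed:.2f}")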

The vibe is a private compute co-op. Everyone chips in, everyone benefits, no one’s sending their notes through a cloud provider.

Twenty Users: Service Mode

Scale further and the trust model changes. At 20 users, you’re not sharing with friends—you’re running a small inference service. Users don’t need access to your network. They just need an API.

Architecture:

Users → Cloudflare Tunnel → Caddy → LiteLLM → Ollama
              ↓                ↓
        (no open ports)   (API keys, metering, prepaid balance)

Key differences from the friend co-op:

| Aspect | 5 friends | 20 users |
|---|---|---|
| Network access | Tailscale VPN | API-only via HTTPS |
| SSH access | Yes | No |
| Trust required | High | Minimal |
| User isolation | Relaxed | Full (users can’t see each other) |
| Ingress | Tailscale | Cloudflare Tunnel |

Components:

  • Cloudflare Tunnel: Exposes your API to the internet without opening router ports. Free tier works fine.
  • Caddy: HTTPS termination, reverse proxy.
  • LiteLLM: API key authentication, per-user usage tracking, prepaid balance enforcement. Requests fail when balance hits zero.
  • Ollama: Actual inference.

Users hit an HTTPS endpoint, authenticate with their API key, and their balance decrements with each request. Top-ups via PayPal, Wise, crypto, whatever works for your group.

Economics:

| Approach | Numbers |
|---|---|
| Buy-in model | $11,649 hardware ÷ 20 ≈ $582/person |
| Usage-based | Charge per million tokens, recoup hardware over time |
| Hybrid | Small buy-in + usage fees |

At this scale you could run it as a micro-business. Not life-changing money, but the Mac Studio pays for itself and you get free inference forever.

Who’s this for?

A niche community that values privacy and has moderate technical literacy. A Discord server of AI-curious professionals. A group of writers, researchers, or consultants who don’t want their work going through OpenAI. Twenty people who each value local inference at $582, but not at $11,649.

You’re the operator. They’re the users. The Mac Studio sits in your apartment, they get an API endpoint, and everyone’s notes stay off the cloud.

Congestion Pricing

A Mac Studio can only run one inference at a time. If 20 users all want their Saturday morning Obsidian session simultaneously, someone’s waiting in a queue. Price can allocate scarce capacity.

Dynamic surge pricing:

Price floats based on queue depth:

def get_multiplier():
    # get_queue_depth() is assumed to return the number of requests currently waiting
    queue = get_queue_depth()
    if queue < 3:
        return 1.0   # quiet: base price
    if queue < 6:
        return 2.0   # busy: double the rate
    return 5.0       # heavy queue: surge pricing

Expose a /current-price endpoint so users can check before submitting. Maybe 100-150 lines of Python in front of LiteLLM.
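
A sketch of what that endpoint might look like, using FastAPI (my choice, not a requirement) with a stubbed queue check; the base rate and module name are placeholders:

from fastapi import FastAPI

app = FastAPI()
BASE_PRICE_PER_MTOK = 2.00  # placeholder base rate, AUD per million tokens

def get_queue_depth() -> int:
    return 0  # stub: report the real number of queued requests here

def get_multiplier() -> float:
    queue = get_queue_depth()
    if queue < 3:
        return 1.0
    if queue < 6:
        return 2.0
    return 5.0

@app.get("/current-price")
def current_price():
    multiplier = get_multiplier()
    return {
        "multiplier": multiplier,
        "price_per_million_tokens": BASE_PRICE_PER_MTOK * multiplier,
    }

# run with: uvicorn pricing:app --port 8080  (module name is a placeholder)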

What this achieves:

  • Heavy users subsidise the service (and their own capacity hogging)
  • Light users get cheaper access
  • Peak demand spreads out naturally
  • No one’s stuck in a 30-minute queue unless they choose to be

Summary

Solo setup:

| Component | Recommendation |
|---|---|
| Hardware | Mac Studio M3 Ultra, 96GB (base) |
| Cost | $6,999 AUD |
| Inference | Ollama |
| Model | Qwen 2.5 72B Instruct Q4 |
| Agentic wrapper | Aider |
| Obsidian integration | CLI in vault folder, or Copilot plugin |

Shared setups:

| Scale | Hardware | Total cost | Per person | Trust model |
|---|---|---|---|---|
| 2 friends | 96GB | $6,999 | $3,500 | High (SSH) |
| 5 friends | 80-core, 256GB | $11,649 | $2,330 | High (Tailscale) |
| 20 users | 80-core, 256GB | $11,649 | $582 | Low (API-only) |

Setup time: an afternoon for solo, a weekend for the co-op infrastructure.

The GPU still needs to see your data. The difference is it’s your GPU, on your desk, and nothing leaves your network.