Frontier coding assistants are capable, but they arrive with constraint layers baked in. For security researchers, low-level systems work, or any task touching edge-case logic, those constraints become a productivity tax rather than a safety net. The model declines; the sprint stalls.
The alternative is architectural. DigitalOcean’s H100 and B300 GPU Droplets, billed per second, let you run large open-weight coding models such as DeepSeek-Coder-V2 or Qwen2.5-Coder at throughput that rivals hosted frontier APIs. You provision hardware for a focused sprint, snapshot when done, and pay only for compute consumed: no idle GPU, no opaque rate-limit policies, no safety-layer updates you did not approve.
The bridge into your IDE is a single configuration change. Ollama serves your chosen weights over an OpenAI-compatible endpoint. Cursor’s API override points at that endpoint through an SSH tunnel or a firewall-scoped IP. Every subsequent generation request routes to your hardware—prompts, context, and IP stay on your infrastructure, not inside a third-party inference cluster.
Uncensored does not mean ungoverned. Models that skip refusal layers also skip some defensive checks, shifting output-validation responsibility entirely to the operator. That trade is reasonable for teams with mature review processes and a clear mandate; it is a liability for teams without them. Verify the output. Own the risk you accepted.
The deeper return is not the absent guardrails—it is sovereignty: reproducible builds, auditable inference paths, and a stack you can operate, migrate, or air-gap on your own schedule. In an environment where vendor policies change quarterly and export controls are tightening, owning your inference layer is an increasingly defensible engineering decision. Does your current toolchain give you the same guarantee?
What You Need
- A DigitalOcean account
- An H100 80GB GPU Droplet (B300 works too; anything smaller will bottleneck on models of this size)
- Cursor installed locally
- SSH access from your workstation to the Droplet
Step 1: Provision the GPU Droplet
In the DigitalOcean control panel, create a new Droplet and select the GPU Droplets tier. Choose the NVIDIA AI/ML Ready base image (Ubuntu 22.04 with CUDA pre-installed), pick the H100 80GB node, and select the datacenter region closest to you. Add your SSH key and create the Droplet.
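If you prefer the CLI, the same Droplet can be created with doctl. The size and image slugs below are illustrative assumptions; list the real ones available to your account first and substitute them.

# Run from your workstation with doctl authenticated
# Find the GPU size slug and the AI/ML Ready image slug for your account
doctl compute size list | grep -i gpu
doctl compute image list --public | grep -i ml
# Create the Droplet (name, region, and slugs are placeholders; substitute your own)
doctl compute droplet create coder-sprint --region nyc2 --size gpu-h100x1-80gb --image YOUR_AI_ML_IMAGE_SLUG --ssh-keys YOUR_SSH_KEY_ID --wait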
Once the Droplet is running, SSH in:
ssh root@YOUR_DROPLET_IP
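Before installing anything, confirm the GPU is visible to the driver stack that ships with the AI/ML Ready image:

# Run on the Droplet; it should report a single H100 with roughly 80GB of memory
nvidia-smi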
Step 2: Install Ollama
Ollama handles model download, quantisation selection, and serving in a single binary.
curl -fsSL https://ollama.com/install.sh | sh
Verify the service is running:
ollama list
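An empty table is fine at this point; it simply confirms the daemon is answering. On most Linux installs the script also registers a systemd unit, so you can check the service directly (assuming the default unit name):

# Assumes the installer created the standard "ollama" service
systemctl status ollama --no-pager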
Step 3: Pull Your Model
For maximum coding capability, pull the larger DeepSeek-Coder-V2 variant:
ollama pull deepseek-coder-v2:236b
If you want a lighter option that still performs strongly and fits comfortably in 80GB of VRAM:
ollama pull qwen2.5-coder:32b
These pulls are tens of gigabytes, so the first download takes a while. Once complete, Ollama exposes a local API on port 11434.
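Before wiring up Cursor, a quick smoke test on the Droplet confirms the model loads and runs on the GPU; substitute whichever tag you pulled.

# One-off generation forces the model to load
ollama run deepseek-coder-v2:236b "Write a binary search in Python."
# Confirm the model is resident and which processor it is using
ollama ps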
Step 4: Open a Secure Tunnel
Do not expose port 11434 publicly. Forward it to your local machine over SSH instead:
# Run this on your local workstation, not the Droplet
ssh -N -L 11434:localhost:11434 root@YOUR_DROPLET_IP
Leave this terminal open. All traffic to localhost:11434 on your machine now routes through the encrypted tunnel to the Droplet.
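If you reconnect often, a host entry in your SSH config saves retyping the forward. A minimal sketch, using an illustrative alias:

# ~/.ssh/config on your workstation ("gpu-droplet" is an arbitrary alias)
Host gpu-droplet
    HostName YOUR_DROPLET_IP
    User root
    LocalForward 11434 localhost:11434
# The tunnel then becomes:
ssh -N gpu-droplet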
Step 5: Point Cursor at Your Droplet
Open Cursor → Settings → Models → Add custom model.
| Field | Value |
|---|---|
| Base URL | http://localhost:11434/v1 |
| API Key | ollama (any non-empty string) |
| Model name | deepseek-coder-v2:236b or whichever you pulled |
Save, then open any file and trigger a generation. The request routes to your hardware. You will notice the difference in latency immediately on an H100 versus a congested API endpoint.
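If a generation fails in Cursor, test the tunneled endpoint directly from your workstation. Ollama’s OpenAI-compatible route should answer a plain chat completion; adjust the model name to whichever tag you pulled.

# Run on your local workstation while the tunnel is open
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "deepseek-coder-v2:236b", "messages": [{"role": "user", "content": "Say hello in C."}]}'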
Step 6: Snapshot and Suspend
Per-second billing means you pay only while the Droplet is running. When you finish a session (each step can also be scripted with doctl; see the sketch after this list):
- Power off the Droplet from the control panel (or run shutdown -h now over SSH).
- Take a snapshot so the model weights are preserved.
- Restore from snapshot at the start of your next session—no re-download required.
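A sketch of the scripted teardown with doctl; the subcommands are standard, but substitute your own Droplet ID (find it with doctl compute droplet list) and snapshot name.

# Power off gracefully, then snapshot the disk so the weights survive
doctl compute droplet-action power-off YOUR_DROPLET_ID --wait
doctl compute droplet-action snapshot YOUR_DROPLET_ID --snapshot-name coder-sprint --wait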
Note: These models will complete requests that hosted APIs would decline. Review all generated output before use. You are the only line of defence.
