Real benchmarks of every remote execution environment for AI agents: the same standardized task on each platform, side by side.
Platforms are ranked by total end-to-end time.
Total time = network + cold start + execution. Measured from GCE us-central1-a. Lower is better.
Code snippets and real outputs from each platform.
export default {
  async fetch(request) {
    const start = Date.now();
    const result = {
      runtime: "Cloudflare-Workers",
      math_test: Array.from({ length: 100 }, (_, i) => i + 1)
        .reduce((a, b) => a + b),
      dns_works: (await fetch("https://1.1.1.1")).ok,
      exec_ms: Date.now() - start,
    };
    return Response.json(result);
  },
};
{
"runtime": "Cloudflare-Workers",
"math_test": 5050,
"dns_works": true,
"file_io_works": false,
"exec_ms": 3
}
from k8s_agent_sandbox import SandboxClient

with SandboxClient(
    template_name="python-runtime-template",
    namespace="default",
) as sandbox:
    sandbox.write("task.py", code)  # `code` holds the standardized task script
    result = sandbox.run("python3 task.py")
    print(result.stdout)
{
"python_version": "3.11.15",
"platform": "Linux-4.4.0 (gVisor)",
"math_test": 5050,
"file_io_works": true,
"dns_works": true,
"pip_available": true,
"execution_time_ms": 1240.66
}
const result = {
  runtime: `Deno ${Deno.version.deno}`,
  v8: Deno.version.v8,
  typescript: Deno.version.typescript,
  math_test: Array.from({ length: 100 }, (_, i) => i + 1)
    .reduce((a, b) => a + b),
};
await Deno.writeTextFile("/tmp/test", "hello");
console.log(JSON.stringify(result));
{
"runtime": "Deno 2.7.9",
"v8": "14.7.173.7-rusty",
"typescript": "5.9.2",
"math_test": 5050,
"file_io_works": true,
"dns_works": true
}
$ curl -s https://api.codapi.org/v1/exec \
  -d '{
    "sandbox": "python",
    "command": "run",
    "files": {
      "": "import json, platform\nprint(json.dumps({\n  \"python\": platform.python_version(),\n  \"math_test\": sum(range(1,101)),\n  \"file_io\": True\n}))"
    }
  }'
{
"python_version": "3.14.2",
"platform": "Linux-6.1.0-amd64",
"math_test": 5050,
"file_io_works": true,
"dns_works": false,
"execution_time_ms": 11.92
}
# Build + deploy container
$ gcloud builds submit \
    --tag gcr.io/project/sandrun-test

# Execute as a Cloud Run Job
$ gcloud run jobs create test \
    --image gcr.io/project/sandrun-test
$ gcloud run jobs execute test --wait
{
"python_version": "3.12.13",
"platform": "Linux-6.9.12 (gVisor)",
"math_test": 5050,
"file_io_works": true,
"dns_works": true,
"pip_available": true
}
$ curl -s https://godbolt.org/api/compiler/python312/compile \
  -H "Content-Type: application/json" \
  -d '{
    "source": "import json, platform\nprint(json.dumps({\n  \"version\": platform.python_version(),\n  \"math\": sum(range(1,101))\n}))",
    "options": { "executeParameters": { "args": [] } }
  }'
{
"python_version": "3.12.1",
"platform": "Linux-6.8.0-AWS",
"math_test": 5050,
"file_io_works": true,
"dns_works": false,
"execution_time_ms": 59.03
}
{"runtime":"go1.26.1","math_test":5050,"cpus":8}
{"runtime":"Rust stable","math_test":5050}
{"python":"3.10.15","math_test":5050,"dns":true}
All 12 benchmarked platforms at a glance.
| # | Platform | Isolation | Total | Exec | File I/O | Network | Auth |
|---|---|---|---|---|---|---|---|
| 1 | Cloudflare Workers | V8 isolate | 105ms | 3ms | No | Yes | API key |
| 2 | Miniflare (local) | V8 (workerd) | 157ms | 132ms | No | Yes | None |
| 3 | Godbolt (JS) | Container | 349ms | 13ms | No | No | None |
| 4 | Deno | V8 + perms | 555ms | 469ms | Yes | Yes | None |
| 5 | Godbolt (Python) | Container | 633ms | 59ms | Yes | No | None |
| 6 | Codapi | Container | 974ms | 12ms | Yes | No | None |
| 7 | Go Playground | Container | 1.1s | ~0ms | No | No | None |
| 8 | Codapi (Go) | Container | 1.4s | ~0ms | No | No | None |
| 9 | Rust Playground | Container | 1.5s | ~0ms | No | No | None |
| 10 | GKE Agent Sandbox | gVisor | 3.2s | 1.5s | Yes | Yes | K8s |
| 11 | Wandbox | Container | 5.0s | 172ms | Yes | Yes | None |
| 12 | Cloud Run (Job) | gVisor 2-layer | ~90s | 1.5s | Yes | Yes | gcloud |
Every platform runs the same standardized task: compute sum(1..100), test file I/O, test DNS resolution, and report platform info. Results are returned as JSON.
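The exact task script each platform ran isn't reproduced in this post, but a minimal Python sketch of it could look like the following (field names mirror the JSON outputs above; the hostname used for the DNS probe is an assumption):

```python
import json
import platform
import socket
import tempfile

def run_task():
    result = {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        # sum(1..100) sanity check; should always be 5050
        "math_test": sum(range(1, 101)),
    }
    # File I/O probe: write a temp file and read it back
    try:
        with tempfile.NamedTemporaryFile(mode="w+") as f:
            f.write("hello")
            f.seek(0)
            result["file_io_works"] = f.read() == "hello"
    except OSError:
        result["file_io_works"] = False
    # DNS probe: try to resolve a well-known hostname
    try:
        socket.getaddrinfo("one.one.one.one", 443)
        result["dns_works"] = True
    except OSError:
        result["dns_works"] = False
    return result

print(json.dumps(run_task(), indent=2))
```

On locked-down platforms (Cloudflare Workers, the playgrounds), the file and DNS probes simply fail and report `false` rather than crashing the task.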
All benchmarks run from a single GCE VM (e2-standard-4) in us-central1-a on March 29, 2026. Total time includes network latency.
Total time = wall clock from request to response. Exec time = self-reported code execution. Cold start = total minus warm average (where applicable).
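As a rough sketch of how those three numbers relate (the actual harness isn't published here; `call_platform` and the warm-run count are assumptions):

```python
import time
from statistics import mean

def measure(call_platform, warm_runs=5):
    """Time one cold request, then several warm ones.

    call_platform() submits the standardized task and returns the
    platform's self-reported execution time in milliseconds.
    """
    # Cold request: wall clock from request to response
    t0 = time.monotonic()
    exec_ms = call_platform()
    total_ms = (time.monotonic() - t0) * 1000

    # Warm requests: used to estimate steady-state latency
    warm_totals = []
    for _ in range(warm_runs):
        t0 = time.monotonic()
        call_platform()
        warm_totals.append((time.monotonic() - t0) * 1000)

    return {
        "total_ms": total_ms,        # network + cold start + execution
        "exec_ms": exec_ms,          # self-reported by the sandbox
        # cold start estimated as total minus the warm average
        "cold_start_ms": total_ms - mean(warm_totals),
    }
```

Note that the cold-start estimate only makes sense for platforms that keep instances warm between requests; for one-shot jobs (e.g. Cloud Run Jobs) every run is cold.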
Single-region, single-run benchmarks. Doesn't test concurrent load, GPU workloads, or long-running sessions. Platforms requiring OAuth aren't benchmarked yet.