All API requests require an API key with the inf_ prefix. Pass it in either the Authorization: Bearer inf_... header or the X-API-Key: inf_... header.
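As a minimal sketch of the two header styles, the helper below builds either one from a key. The function name `auth_headers` and its `use_x_api_key` flag are illustrative and not part of the API; the prefix check simply mirrors the inf_ rule stated above.

```python
def auth_headers(api_key: str, use_x_api_key: bool = False) -> dict:
    """Build request headers for either supported authentication style.

    Hypothetical helper: only the header names and the inf_ prefix come
    from the documentation; everything else is an illustration.
    """
    if not api_key.startswith("inf_"):
        raise ValueError("API keys are expected to start with 'inf_'")
    if use_x_api_key:
        return {"X-API-Key": api_key}
    return {"Authorization": f"Bearer {api_key}"}
```

Either dictionary can be merged into the headers of any HTTP client call.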
POST /v1/chat/completions takes a standard OpenAI-compatible chat completion body. For non-streaming requests, the proxy queues the request and returns immediately.
{
  "model": "llama-3.3-70b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": 300
}
1. POST returns {"id": "uuid", "status": "queued", "poll": "/v1/results/uuid"}
2. Poll GET /v1/results/:id until status is complete or dead
3. Complete results include the full OpenAI-compatible response in the result field
# Submit a request
curl -X POST https://queue.inference.drm3.network/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-3.3-70b","messages":[{"role":"user","content":"Hello!"}]}'
# Poll for result
curl https://queue.inference.drm3.network/v1/results/JOB_ID \
  -H "Authorization: Bearer YOUR_API_KEY"
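The same submit-and-poll flow can be sketched in Python with only the standard library. The helper names (`http_json`, `wait_for_result`, `chat`), the 2-second poll interval, and the 10-minute timeout are assumptions chosen for illustration. Any status other than complete or dead is treated as "still pending", since the documentation names only queued, complete, and dead.

```python
import json
import time
import urllib.request

BASE_URL = "https://queue.inference.drm3.network"
TERMINAL_STATUSES = {"complete", "dead"}


def http_json(method, path, api_key, body=None):
    """Send one JSON request to the proxy and decode the JSON reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_result(fetch, poll_path, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Call fetch(poll_path) until the job reports complete or dead.

    interval and timeout are illustrative defaults, not documented values.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(poll_path)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {job.get('status')!r} after {timeout}s")
        sleep(interval)


def chat(api_key, payload):
    """Submit a chat completion, poll it, and return the OpenAI-style result."""
    queued = http_json("POST", "/v1/chat/completions", api_key, payload)
    job = wait_for_result(
        lambda path: http_json("GET", path, api_key),
        queued["poll"],  # e.g. "/v1/results/<id>", as returned by the POST
    )
    if job["status"] == "dead":
        raise RuntimeError(f"job failed: {job.get('error')}")
    return job["result"]
```

Calling `chat("YOUR_API_KEY", {"model": "llama-3.3-70b", "messages": [{"role": "user", "content": "Hello!"}]})` would then return the contents of the result field. Passing `fetch` as a parameter keeps the polling loop testable without network access.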
from openai import OpenAI

client = OpenAI(
    base_url="https://queue.inference.drm3.network/v1",
    api_key="YOUR_API_KEY",
)
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
Set "stream": true in the request body to get a real-time SSE stream. Streaming bypasses the queue and proxies directly to the backend — no polling needed.
curl -N https://queue.inference.drm3.network/v1/chat/completions \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"llama-3.3-70b","messages":[{"role":"user","content":"Hello!"}],"stream":true}'
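For clients that read the stream without an SDK, the sketch below extracts text from the SSE lines. It assumes the stream follows the usual OpenAI chunk format (`data: {...}` lines carrying `choices[].delta.content`, terminated by `data: [DONE]`), which the documentation implies but does not spell out. The function name `iter_stream_text` is illustrative.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style SSE lines (str or bytes)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream sentinel, as in the OpenAI API
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

The `lines` argument can be any iterable of lines, for example an HTTP response object that is iterated line by line.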
Default limits: 20 requests per minute, 100K tokens per day, and 2 concurrent jobs. Limits are per-tenant and configurable.
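A client can stay under the per-minute limit by pacing itself. The class below is a minimal sliding-window sketch. Whether the server counts requests in a sliding or a fixed window is not documented, so treat this as a conservative client-side approximation; `SlidingWindowLimiter` is an illustrative name.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Track recent request times and report how long to wait before the next."""

    def __init__(self, max_requests=20, window=60.0, clock=time.monotonic):
        self.max_requests = max_requests  # default mirrors 20 requests/minute
        self.window = window
        self.clock = clock
        self.sent = deque()

    def wait_time(self):
        """Seconds to wait before another request fits in the window."""
        now = self.clock()
        # Forget requests that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.max_requests:
            return 0.0
        return self.window - (now - self.sent[0])

    def record(self):
        """Note that a request was just sent."""
        self.sent.append(self.clock())
```

Before each request, sleep for `limiter.wait_time()` seconds and then call `limiter.record()`. The 2-concurrent-job limit could be handled separately, for instance with a `threading.BoundedSemaphore(2)` around each submit-and-poll cycle.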
Non-streaming requests retry with exponential backoff (10s × 2^n, max 300s). After 50 retries the job moves to dead letter. Check the retries and error fields on the result. Streaming requests do not retry — errors return immediately.
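To get a feel for the retry schedule, the sketch below evaluates the stated formula. It assumes n starts at 0 for the first retry, which the documentation does not specify; if n starts at 1, every uncapped delay doubles.

```python
def retry_delays(max_retries=50, base=10, cap=300):
    """Delay in seconds before each retry: base * 2^n, capped at cap.

    Assumes n = 0 for the first retry (not confirmed by the documentation).
    """
    return [min(base * 2 ** n, cap) for n in range(max_retries)]


delays = retry_delays()
print(delays[:6])   # [10, 20, 40, 80, 160, 300]
print(sum(delays))  # 13810 seconds in total
```

Under that assumption the cap is reached on the sixth retry, and a job that exhausts all 50 retries spends about 3 hours 50 minutes in backoff before it is marked dead, so a polling client should budget its timeout accordingly.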