Authentication

Every endpoint except GET /health requires an API key with the inf_ prefix. Pass it in either an Authorization: Bearer inf_... header or an X-API-Key: inf_... header.
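As a minimal sketch of the two header styles described above, the helper below builds either one. The function name auth_headers and its use_x_api_key flag are hypothetical, and the prefix check is only a client-side sanity check, not a statement about how the server validates keys.

```python
def auth_headers(api_key: str, use_x_api_key: bool = False) -> dict:
    """Build the auth header for one request (hypothetical helper)."""
    # Client-side sanity check only: keys are documented to start with "inf_".
    if not api_key.startswith("inf_"):
        raise ValueError("API keys are expected to start with 'inf_'")
    if use_x_api_key:
        return {"X-API-Key": api_key}
    return {"Authorization": f"Bearer {api_key}"}
```

Either dictionary can be passed as the headers argument of whichever HTTP client you use.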

Endpoints

POST /v1/chat/completions
Enqueue a chat completion request. Returns a job ID for polling.
Auth: Bearer inf_*
GET /v1/results/:id
Poll for job result. Returns status, result, error, and retry count.
Auth: Bearer inf_*
DELETE /v1/results/:id
Cancel a pending or processing job.
Auth: Bearer inf_*
GET /v1/models
List available models from the staked Inference Bus (and Pistachio when requested).
Auth: Bearer inf_*
GET /health
Health check with queue statistics. No auth required.

Request format

The request body is a standard OpenAI-compatible chat completion body. For non-streaming requests, the proxy queues the request and returns immediately.

{
  "model": "llama-3.3-70b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": 300
}

Response flow

1. POST returns {"id": "uuid", "status": "queued", "poll": "/v1/results/uuid"}

2. Poll GET /v1/results/:id until status is complete or dead (a dead job has exhausted its retries)

3. Complete results include the full OpenAI-compatible response in the result field
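The three steps above can be sketched as a small polling loop. This is an illustration under stated assumptions, not the service's official client: the names fetch_result and wait_for_result, the 2-second interval, and the 10-minute timeout are all hypothetical choices, and only the status values complete and dead are treated as terminal because those are the ones named above.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "https://queue.inference.drm3.network"


def fetch_result(job_id: str, api_key: str) -> dict:
    """GET /v1/results/:id and decode the JSON body."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/results/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def wait_for_result(
    fetch: Callable[[], dict],
    interval: float = 2.0,   # assumed polling interval, not a documented value
    timeout: float = 600.0,  # assumed client-side give-up time
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until status is "complete" or "dead", or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch()
        if job.get("status") in ("complete", "dead"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {job.get('status')!r} after {timeout}s")
        sleep(interval)

# Usage (performs real HTTP calls):
#   job = wait_for_result(lambda: fetch_result("JOB_ID", "YOUR_API_KEY"))
#   if job["status"] == "complete":
#       print(job["result"]["choices"][0]["message"]["content"])
#   else:
#       print("failed:", job.get("error"), "retries:", job.get("retries"))
```

Passing fetch and sleep as callables keeps the loop independent of any particular HTTP library and makes it easy to exercise without network access.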

Quick start

curl -X POST https://queue.inference.drm3.network/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-3.3-70b","messages":[{"role":"user","content":"Hello!"}]}'

# Poll for result
curl https://queue.inference.drm3.network/v1/results/JOB_ID \
  -H "Authorization: Bearer YOUR_API_KEY"

Python

from openai import OpenAI

client = OpenAI(
    base_url="https://queue.inference.drm3.network/v1",
    api_key="YOUR_API_KEY",
)

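# Non-streaming calls return a job envelope to poll (see "Response flow"),
# not a completion, so this example streams to get the reply directly.
stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    # Some chunks carry no text (for example the role-only first chunk).
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()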

Streaming

Set "stream": true in the request body to receive a real-time SSE stream. Streaming requests bypass the queue and are proxied directly to the backend, so no polling is needed.

curl -N https://queue.inference.drm3.network/v1/chat/completions \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-3.3-70b","messages":[{"role":"user","content":"Hello!"}],"stream":true}'
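If you consume the stream without an SDK, the SSE lines need to be decoded by hand. The sketch below assumes the stream follows the usual OpenAI streaming layout (data: lines carrying JSON chunks with choices[].delta.content, ended by data: [DONE]); that layout is an assumption based on the API being OpenAI-compatible, and iter_stream_text is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines (assumed format)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel in the OpenAI format
        chunk = json.loads(payload)
        for choice in chunk.get("choices") or []:
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

The function accepts any iterable of decoded text lines, so it can be fed from an HTTP response read line by line or from a captured transcript.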

Rate limits

Default limits: 20 requests per minute, 100,000 tokens per day, and 2 concurrent jobs. Limits apply per tenant and are configurable.
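A client can stay under the requests-per-minute limit by pacing itself. The following is a minimal client-side sketch that assumes the limit behaves like a sliding 60-second window; how the server actually counts requests is not specified here, and the class name SlidingWindowLimiter is hypothetical. It does not cover the token or concurrency limits.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most max_calls within any period-second window (client side)."""

    def __init__(self, max_calls: int = 20, period: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self._sent = deque()  # timestamps of recorded requests, oldest first

    def wait_time(self) -> float:
        """Seconds to wait before the next request is allowed (0.0 if none)."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.period:
            self._sent.popleft()
        if len(self._sent) < self.max_calls:
            return 0.0
        return self.period - (now - self._sent[0])

    def record(self) -> None:
        """Note that a request was just sent."""
        self._sent.append(self.clock())
```

Before each request, sleep for wait_time() seconds if it is non-zero, then send the request and call record().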

Retry behavior

Non-streaming requests are retried with exponential backoff: the delay is 10 s × 2^n, capped at 300 s. After 50 retries the job moves to dead letter. Check the retries and error fields on the result to see how many attempts were made and why they failed. Streaming requests are not retried; errors are returned immediately.
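To get a feel for the backoff formula, the sketch below computes the delay schedule. It assumes n starts at 0 for the first retry and that all 50 retries each wait one delay; neither detail is stated above, so treat the numbers as an estimate. The function names are hypothetical.

```python
def retry_delay(n: int, base: float = 10.0, cap: float = 300.0) -> float:
    """Delay in seconds before retry n, assuming n starts at 0."""
    return min(base * 2 ** n, cap)


def total_wait(max_retries: int = 50) -> float:
    """Sum of all backoff delays if every retry fails."""
    return sum(retry_delay(n) for n in range(max_retries))
```

Under these assumptions the delays run 10, 20, 40, 80, 160 s and then stay at 300 s, so a job that fails every attempt spends 13,810 s (about 3 hours 50 minutes) in backoff, excluding processing time, before it is moved to dead letter.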