API reference
Ocular runs as a standard HTTP service on the host/port you deploy it to. All endpoints accept and return JSON unless otherwise noted. No authentication is applied at the network layer — the customer's network (VPC, firewall, reverse proxy) is the trust boundary. The license token enforces that the container is authorised to run; it does not gate individual API calls.
Content type: application/json for request bodies. Responses are
application/json; charset=utf-8.
Endpoints
| Method | Path | Purpose |
|---|---|---|
| POST | /classify | Score a conversation. The core endpoint. |
| GET | /health | Readiness + mode + GPU info. |
| GET | /manifest | Heads manifest + service introspection. |
POST /classify
Score a conversation. Returns a verdict, per-axis risk scores, matched signals, and (optionally) a per-turn trajectory or full detail vectors.
Request
```json
{
  "messages": [
    {"role": "user", "content": "I've been feeling really down"}
  ],
  "per_turn": false,
  "trajectory_stride": 3,
  "effort": 1,
  "detail": false,
  "log": false
}
```

Either messages (OpenAI-style, canonical) or text (User: ...\n\nAssistant: ... format, curl-friendly) must be provided. If both are given, messages wins.
| Field | Type | Default | Meaning |
|---|---|---|---|
| messages | array of {role, content} | null | Canonical input. role is "user" / "assistant" / "system". System messages are ignored for scoring. |
| text | string | "" | User: ...\n\nAssistant: ... format. Double-newline separates turns. Equivalent to messages; use whichever is easier. |
| per_turn | bool | false | Compute a trajectory: score at every Nth turn boundary. Adds latency proportional to turn count. |
| trajectory_stride | int | 3 | Compute per-turn score every Nth boundary. 1 = every turn; 3 = every third turn (default). |
| effort | int [1, 3] | 1 | 1 (default) scores the input once, no stability. 2 and 3 are opt-in: additional passes over formatting-perturbed variants that emit a per-axis stability score. Variants travel through a shared GPU pass, so overhead is ~25-40%, not 2× or 3×. See Stability. |
| detail | bool | false | When true, adds a detail object to the response with raw/calibrated head dicts and corroboration/contributors/heuristics_debug. Roughly doubles response size. |
| log | bool | false | When true, Ocular pushes a summary to the configured Console (OCULAR_CONSOLE_URL) after scoring. Requires session_id + user_id; returns 400 if either is missing or if OCULAR_CONSOLE_URL isn't configured. Use messages[] (not text:) on logged requests — only messages gives Console the per-turn transcript; text: lands a scored session with empty turns[], silently. |
| session_id | string | null | Metadata. Correlates this call to a conversation. Only consumed when log=true. |
| user_id | string | null | Metadata. Identifies the end-user. Only consumed when log=true. |
| agent_id | string | null | Metadata. Optional identifier for which AI agent produced the assistant turns. Only consumed when log=true. |
Response
Top-level fields (always present on a 200 response):
```json
{
  "verdict": "watch",
  "subject": "self",
  "imminence": {"level": "moderate", "score": 0.23},
  "fiction": 0.05,
  "authenticity": 0.78,
  "risks": {
    "suicide": {"level": "critical", "score": 0.73},
    "self_harm": {"level": "moderate", "score": 0.12},
    "harm_to_others": {"level": "minimal", "score": 0.01},
    "abuse": {"level": "minimal", "score": 0.02},
    "sexual_violence": {"level": "minimal", "score": 0.00},
    "exploitation": {"level": "minimal", "score": 0.01},
    "stalking": {"level": "minimal", "score": 0.00},
    "self_neglect": {"level": "minimal", "score": 0.04}
  },
  "ai_concerns": {
    "harm_provision": {"level": "minimal", "score": 0.00},
    "emotional_failure": {"level": "minimal", "score": 0.00},
    "manipulation": {"level": "minimal", "score": 0.01},
    "safeguarding_failure": {"level": "minimal", "score": 0.02}
  },
  "signals": [
    {"code": "signal_0001", "score": 0.81},
    {"code": "signal_0010", "score": 0.44}
  ],
  "effort": 1,
  "stability": null,
  "meta": {
    "version": "3ba4207",
    "inference_ms": 28,
    "windowed": false,
    "windows": 1,
    "request_id": "ab12cd34ef56"
  }
}
```

verdict — the aggregate classification
| Value | What Ocular classified |
|---|---|
| "clear" | No elevated signals, or signals were descoped by fiction framing. |
| "watch" | Elevated signals below Ocular's danger thresholds. |
| "danger" | Elevated signals at or above Ocular's danger thresholds. |
Use verdict as the single field your rules engine keys off. Per-axis
levels give detail for UI; the verdict is the authoritative aggregate.
Exact per-axis thresholds are internal and fiction-modulated — see
risk-interpretation.md for the verdict shape table.
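A sketch of keying a rules engine off verdict alone (function and action names are illustrative, not part of any Ocular SDK):

```python
def route(response):
    """Route a /classify response on the verdict field alone.

    Per-axis levels are for display; verdict is the authoritative aggregate.
    """
    verdict = response["verdict"]
    if verdict == "danger":
        return "escalate"       # e.g. page an on-call reviewer
    if verdict == "watch":
        return "queue_review"   # human review, non-urgent
    return "pass"               # "clear"
```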
subject
"self" (speaker is the party-at-risk described by the signals), "other"
(speaker is reporting someone else's situation), or "unknown" (ambiguous).
Only "self" can drive verdict above "clear" on user-side axes —
third-person disclosure changes what the signals mean.
risks — 8 user-side axes
Every axis always appears with {level, score}. score is [0, 1].
level is one of:
| Value | Score range |
|---|---|
| "minimal" | < 0.05 |
| "low" | [0.05, 0.12) |
| "moderate" | [0.12, 0.25) |
| "high" | [0.25, 0.45) |
| "critical" | ≥ 0.45 |
Axes: suicide, self_harm, harm_to_others, abuse, sexual_violence,
exploitation, stalking, self_neglect.
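The level bands above are easy to reproduce locally, e.g. for display logic on raw scores. A sketch (Ocular always returns level itself, so this is purely client-side convenience):

```python
def level_for(score):
    """Map a [0, 1] axis score to its level band, per the table above."""
    if score < 0.05:
        return "minimal"
    if score < 0.12:
        return "low"
    if score < 0.25:
        return "moderate"
    if score < 0.45:
        return "high"
    return "critical"
```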
ai_concerns — 4 AI-side axes
Same {level, score} shape as risks. Axes cover the assistant side:
harm_provision (providing harmful info/methods), emotional_failure
(missed empathic cues), manipulation (coercive or exploitative patterns),
safeguarding_failure (absence of boundary-setting / redirect behavior —
not fiction-descoped for minor-involving content; fires regardless of
framing).
imminence — {level, score}
Temporal acuity marker. "high" means the conversation carries near-term
acuity markers (plan, means, timeline, preparatory language) rather than
chronic-pattern signal. Same level domain as risks.
fiction / authenticity
Two scalars ([0, 1]):
- fiction — how much the conversation reads as fiction/roleplay. Drives verdict suppression: high fiction with no corroborating distress signals won't escalate to watch/danger.
- authenticity — counter-signal. Markers of register-authentic distress (genuine emotional disclosure, frame breaks out of RP).
Both are modifiers, not alerts. They're always present in the response so your UI can surface "suppressed in RP context" explanations.
signals[]
Ranked list of head firings above each head's screening operating
point — a per-head calibrated threshold. Sorted by calibrated score
descending. Use for building "what fired" UIs; don't threshold on score
yourself — the screening operating point already filtered the noise
floor out.
Each entry:
```json
{"code": "signal_0001", "score": 0.81}
```

code is an opaque identifier (signal_NNNN). The mapping from signal_NNNN to our internal head taxonomy is frozen — a given ID means the same head forever, even if the underlying taxonomy is renamed. Referenced in support tickets for diagnostics; not intended for customer rules.
effort / stability
effort is always present (1, 2, or 3), echoing what was requested.
stability is null at effort=1 and a dict at effort > 1:
```json
"stability": {
  "suicide": 1.0,
  "self_harm": 0.98,
  "harm_to_others": 1.0,
  "abuse": 1.0,
  "sexual_violence": 1.0,
  "exploitation": 1.0,
  "stalking": 1.0,
  "self_neglect": 1.0,
  "ai_harm_provision": 1.0,
  "ai_emotional_failure": 1.0,
  "ai_manipulation": 1.0,
  "ai_safeguarding_failure": 1.0,
  "imminence": 0.91
}
```

Each value is 1 - (stddev / mean) clamped to [0, 1] across per-variant risk scores for that axis. Axes with no signal to be unstable about default to 1.0. The imminence entry in particular often reads 1.0 — imminence only fires on a few axes, so on conversations without those markers the default kicks in. That's expected, not a bug.
Key-naming note. AI-side axes appear as ai_harm_provision / ai_emotional_failure / ai_manipulation / ai_safeguarding_failure in this flat dict — the ai_ prefix is needed because the user-risk axes (suicide, self_harm, etc.) share the same keyspace. In the nested ai_concerns object above, the prefix is dropped (scoped disambiguation).
See Stability for how to use it.
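The stability formula is simple enough to sanity-check locally. A sketch, assuming population standard deviation (Ocular's exact estimator isn't specified here) and the mean-below-0.005 default described above:

```python
from statistics import mean, pstdev

def stability(variant_scores, floor=0.005):
    """1 - (stddev / mean), clamped to [0, 1], over per-variant axis scores.

    Axes whose mean score is below `floor` default to 1.0, matching the
    "mean < 0.005" default described in this section.
    """
    mu = mean(variant_scores)
    if mu < floor:
        return 1.0  # nothing to be unstable about
    return max(0.0, min(1.0, 1.0 - pstdev(variant_scores) / mu))
```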
meta
Diagnostic block. version = model + heads version. inference_ms =
wall-clock for the scoring pass (excludes queue / HTTP). request_id =
correlation ID; also in server logs. windowed / windows = set when the
input was too long and had to be split across overlapping windows.
Per-turn trajectory (per_turn: true)
Adds a trajectory[] array. One entry per sampled turn boundary:
```json
{
  "trajectory": [
    {"role": "user", "turn": 0, "verdict": "clear", "signals": []},
    {"role": "assistant", "turn": 1, "verdict": "clear", "signals": []},
    {"role": "user", "turn": 2, "verdict": "watch",
     "signals": [{"code": "signal_0001", "score": 0.81}]}
  ]
}
```

Each entry carries its own verdict (computed through the full fusion layer per turn, not copied from the top-level) and filtered signals[]. Use for rendering trajectory sparklines and detecting regime shifts like clear → watch transitions.
With trajectory_stride: N, scoring runs every Nth turn boundary (default
3). Stride 1 is slowest but most granular; 3 is recommended for long
conversations.
When detail: true, each trajectory entry also includes raw_scores
(filtered to codes above 0.01) and calibrated (filtered to codes above
0.05). Significantly increases response size on long conversations.
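Detecting regime shifts in a trajectory takes a few lines; a sketch (helper name is illustrative):

```python
ORDER = {"clear": 0, "watch": 1, "danger": 2}

def escalations(trajectory):
    """Return (turn, from_verdict, to_verdict) for each upward transition."""
    shifts = []
    for prev, cur in zip(trajectory, trajectory[1:]):
        if ORDER[cur["verdict"]] > ORDER[prev["verdict"]]:
            shifts.append((cur["turn"], prev["verdict"], cur["verdict"]))
    return shifts
```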
Detail mode (detail: true)
Adds a detail object with the full raw data:
```json
{
  "detail": {
    "scores": {"signal_0001": 0.81, "signal_0002": 0.12, "...": "..."},
    "calibrated": {"signal_0001": 0.88, "signal_0002": 0.15, "...": "..."},
    "corroboration": {
      "suicide": {"strength": "strong", "score": 0.82},
      "self_harm": {"strength": "limited", "score": 0.21},
      "harm_to_others": {"strength": "absent", "score": 0.00},
      "abuse": {"strength": "absent", "score": 0.00}
    },
    "contributors": {
      "suicide": [{"code": "signal_0001", "label": "Suicidal ideation", "score": 0.82, "role": "base"}, "..."],
      "harm_to_others": ["..."],
      "ecosystem": ["..."]
    },
    "heuristics_debug": {
      "headline_eco": 0.25,
      "fiction_gate": false,
      "...": "..."
    }
  }
}
```

| Field | Meaning |
|---|---|
| scores | Raw head probability per code ([0, 1]), every head. |
| calibrated | Calibrated probability per code, comparable across codes. Use this when comparing scores between different signals. |
| corroboration | Cross-head corroboration per axis — how many independent corroborator codes co-fired alongside the primary axis signal. strength is absent / limited / moderate / strong. The strength key is used here (not level) to avoid collision with risks.<axis>.level (which uses a different domain). |
| contributors | Per-axis breakdown of which heads contributed, with scores and weights. For explanation UIs. |
| heuristics_debug | Raw inputs to the verdict calculation. For debugging the fusion layer; field stability is not guaranteed — don't rely on specific keys. |
Use detail=true sparingly — the raw vectors are large. A production
integration that wants to key off verdict + signals[] never needs
detail.
Stability (effort > 1)
A safety classification shouldn't depend on whether the user typed two spaces or one. Ocular exposes an opt-in diagnostic that measures this.
Mechanism. At effort=2 or effort=3, Ocular scores the conversation
multiple times against formatting-perturbed copies:
- effort=1 — score the input as given. Default.
- effort=2 — two passes: (a) input as given, (b) a copy with internal whitespace collapsed to single spaces. Both preserve turn boundaries and semantic content; only intra-turn whitespace changes.
- effort=3 — three passes: the two above plus a copy with a newline inserted after each sentence-ending punctuation mark.
The primary response fields (verdict, risks, signals) always reflect
the original input — they don't change based on effort. What you get
additionally is a per-axis stability score:
- 1.0 — scored identically across every variant. Maximally stable.
- ≥ 0.9 — wobbles by ~10% relative to the mean. Still robust.
- 0.5 - 0.8 — meaningful variance. Treat the axis as soft; consider asking for more context before acting.
- < 0.5 — classification is formatting-sensitive; the signal is fragile.
When to use it. Auditing borderline cases (verdict == "watch" or
risks.<axis>.level == "high"), tuning thresholds against your own data,
or high-stakes pipelines where a false positive has real operational cost.
When not to. Anything already at verdict=clear with near-zero scores
— stability is uninformative at low magnitude (hence the 1.0 default for
axes with mean < 0.005).
Latency cost. All variants for a single request travel through one GPU
forward pass together, so effort=3 adds roughly 25-40% latency over
effort=1, not 3×. Under high concurrent load the overhead is slightly
larger because variants take up batch-queue slots that would otherwise
hold other requests' variant-0 only.
Signal identifiers
Entries in signals[] (and keys under detail.* when detail: true) are
opaque IDs of the form signal_NNNN. They're intentionally anonymous and
not a supported decision surface — use verdict + risks.<axis>.level
instead. IDs are durable across releases (same ID → same concept), so
dashboards and support tickets that quote them stay valid over time.
Status codes
| Code | When |
|---|---|
| 200 | Normal response. |
| 400 | Malformed request (missing both text and messages, bad JSON, log=true without OCULAR_CONSOLE_URL configured, or log=true without both session_id and user_id). |
| 413 | Request body exceeded MAX_REQUEST_BYTES (default 1 MiB). |
| 429 | Batch queue saturated. Retry with backoff. |
| 502 | Remote scoring failed. Only emitted when the Ocular container is configured to proxy inference to an upstream SCORING_URL (Ocular-side env; distinct from Console's OCULAR_URL). |
| 503 | Model still warming up (first ~25 s after container start). Retry with backoff. |
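A retry policy for the transient codes (429, 503) can be sketched as follows; send is a hypothetical callable wrapping your HTTP client of choice:

```python
import time

RETRYABLE = {429, 503}  # queue saturated / model warming up

def classify_with_retry(send, max_attempts=5, base_delay=0.5):
    """Retry transient statuses with exponential backoff.

    `send` performs one POST /classify and returns (status_code, body).
    """
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE:
            return status, body
        time.sleep(base_delay * (2 ** attempt))
    return status, body  # last transient result after exhausting attempts
```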
Examples
Minimal. One user turn, curl-friendly text format:
```bash
curl -s -X POST http://localhost:8080/classify \
  -H 'Content-Type: application/json' \
  -d '{"text":"User: I have been feeling really down lately"}'
```

Canonical. Same request via messages:
```bash
curl -s -X POST http://localhost:8080/classify \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [
      {"role": "user", "content": "I have been feeling really down lately"}
    ]
  }'
```

With logging to Console. Requires Console configured via OCULAR_CONSOLE_URL:
```bash
curl -s -X POST http://localhost:8080/classify \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [{"role": "user", "content": "I havent felt like myself lately"}],
    "session_id": "conv-42",
    "user_id": "u-1234",
    "log": true
  }'
```

(messages[], not text:, so Console receives the per-turn transcript — see the log field above.)

Trajectory over a multi-turn conversation. stride: 1 scores every turn:
```bash
curl -s -X POST http://localhost:8080/classify \
  -H 'Content-Type: application/json' \
  -d '{
    "text": "User: hi\n\nAssistant: hello\n\nUser: i need help\n\nAssistant: with what?",
    "per_turn": true,
    "trajectory_stride": 1
  }'
```

High-confidence borderline audit. effort=2 adds stability:
```bash
curl -s -X POST http://localhost:8080/classify \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role":"user","content":"..."}], "effort": 2}'
```

GET /health
Readiness probe. No request body.
```json
{
  "status": "ok",
  "mode": "local",
  "version": "3ba4207",
  "queue_depth": 0
}
```

| Field | Meaning |
|---|---|
| status | "ok" when the model is loaded and accepting requests. "starting" during warmup (~25 s after container start). |
| mode | "local" (GPU inference — the default) or "remote" (scoring proxied via an external URL, set by SCORING_URL in the container env). |
| version | Deployed release tag. |
| queue_depth | Current inference queue depth. 0 means idle. |
Use status == "ok" as your readiness gate.
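A deployment script can poll that gate before sending traffic. A sketch, with get_health a hypothetical callable returning the parsed /health body:

```python
import time

def wait_ready(get_health, timeout=60.0, interval=2.0):
    """Poll /health until status == "ok" (the documented readiness gate)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_health().get("status") == "ok":
            return True
        time.sleep(interval)  # still "starting"; wait and re-poll
    return False
```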
GET /manifest
Service manifest — confirms the deployed release.
```json
{
  "version": "3ba4207",
  "mode": "local",
  "heads": 126
}
```

| Field | Meaning |
|---|---|
| version | Deployed release tag. Matches OCULAR_VERSION in your .env. |
| mode | "local" / "remote" / "stub". |
| heads | Number of behavioral probe heads loaded. Omitted in remote mode. |
Use this to verify the deployed release version matches what you expect.
The manifest is pinned to the release you deployed — it doesn't change
at runtime. Build rules against verdict / risks.<axis>.level /
risks.<axis>.score, not against individual signal_NNNN IDs (watchlist
rules against specific signals are rejected at write-time).
Response headers
Every /classify response (and a few others) includes license-status headers
so you can surface expiration warnings in your own UI:
```
X-Ocular-License-Status: ok | expiring | grace
X-Ocular-License-Days-Remaining: 364
X-Ocular-License-Expires: 1808053439
X-Ocular-Token-Expiring: true   (only when status ≠ ok)
```

| Status | When | What to do |
|---|---|---|
| ok | License valid, > 14 days remaining. | Nothing. |
| expiring | License valid, ≤ 14 days remaining. | Contact NOPE for a renewal. |
| grace | License expired, within 72-hour grace period. Still serving. | Urgent: replace the token. After grace ends the container refuses to start on next restart; a running container keeps serving until it's restarted. |
The license check runs at startup. A running container with an expired
grace period keeps serving until the next restart — the gate is at
container boot, not per-request. On restart, an expired-past-grace token
exits with LicenseError. See deployment.md §troubleshooting.
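Surfacing the headers in your own UI takes a small amount of glue; a sketch (helper name is illustrative):

```python
def license_action(headers):
    """Decide what to surface from the X-Ocular-License-* response headers."""
    status = headers.get("X-Ocular-License-Status", "ok")
    days = int(headers.get("X-Ocular-License-Days-Remaining", "0"))
    if status == "grace":
        return "replace token now"  # container won't restart after grace ends
    if status == "expiring":
        return f"renew within {days} days"
    return None  # "ok": nothing to surface
```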
Rate limits
Ocular itself doesn't rate-limit — it will score as fast as your hardware
allows; under sustained overload, /classify returns 429 once the
in-flight queue saturates. If you need ingress protection (e.g. against
runaway loops in your own app), put a reverse proxy with rate limiting in
front.
Practical throughput on a 20 GB datacenter-class GPU (RTX 4000 SFF Ada reference; production target for the on-prem image):
- Single-turn /classify: ~188 ms p50, ~3.76 req/s sustained at concurrency=4 (zero-fail under 4× burst). Measured on the production image on gex44-prod, 2026-05-05 — see internal benchmark for full numbers.
- Trajectory (stride=3, ~68 turns): ~880 ms p50 at concurrency=1.
- Cold-start adds ~28 s on first request after container start.
A10G (24 GB) and H100 (80 GB) produce comparable trajectory latency — the workload is memory-bandwidth bound. For higher throughput, run multiple containers behind a load balancer. License tokens don't encode a container limit — any tier can technically run a fleet — but container count is a contractual matter. Check your contract before scaling horizontally.
What /classify is NOT
- Not predictive. Scores reflect what's present in the conversation, not what will happen next. A high suicide score does not mean the user will attempt self-harm; it means the conversation carries signals Ocular associates with that axis.
- Not diagnostic. Ocular does not diagnose. Treat output as a classification signal, not a diagnostic assessment.
- Not a replacement for clinical judgment. Humans make the call.
- Not deterministic across inference engines. We pin dependency versions, but small numerical drift (<5% on raw scores) between our internal reference endpoints and the Docker deployment is expected. Relative rankings are stable; absolute thresholds should be tuned against your own baseline.