# Edge Self-Hosted

Run NOPE's safety classifier on your own infrastructure using our fine-tuned models. No data leaves your network.
## Model Variants
| Model | Parameters | Latency | Use Case |
|---|---|---|---|
| nope-edge | 4B | ~750ms | Maximum accuracy |
| nope-edge-mini | 1.7B | ~260ms | High-volume, cost-sensitive |
## Requirements
- GPU: Any NVIDIA GPU with bfloat16 support (RTX 3060+, L4, A10G, A100, H100)
- VRAM: ~4GB (mini) or ~8GB (full)
- Python: 3.10+ with PyTorch and transformers
Note: T4 GPUs are not supported (bfloat16 requires compute capability 8.0+).
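Before loading the model, you can verify that the local GPU actually supports bfloat16 by checking its compute capability. A minimal sketch (the helper name is ours; it assumes PyTorch and returns `False` when PyTorch or CUDA is absent):

```python
def bf16_supported() -> bool:
    """True only for CUDA GPUs with compute capability >= 8.0 (Ampere or newer)."""
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability(0)
    return major >= 8
```

On a T4 (compute capability 7.5) this returns `False`, consistent with the note above.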
## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "nopenet/nope-edge"  # or "nopenet/nope-edge-mini"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

def classify(message: str) -> str:
    """Returns XML with reflection and risk classification."""
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": message}],
        tokenize=True,
        return_tensors="pt",
        add_generation_prompt=True
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=300, do_sample=False)
    return tokenizer.decode(
        output[0][input_ids.shape[1]:],
        skip_special_tokens=True
    ).strip()

# Example
result = classify("I want to end it all tonight")
print(result)
# Output:
# <reflection>The user directly expresses intent to end their life...</reflection>
# <risks>
#   <risk subject="self" type="suicide" severity="high" imminence="urgent"/>
# </risks>
```

## Output Format
The model outputs XML with two components: a `<reflection>` explaining the reasoning, and `<risks>` containing structured risk elements (or a self-closing `<risks/>` if no crisis is detected).
```xml
<!-- No risk detected -->
<reflection>The user is sharing a positive update about work...</reflection>
<risks/>

<!-- Risk detected -->
<reflection>The user directly expresses intent to end their life...</reflection>
<risks>
  <risk subject="self" type="suicide" severity="high" imminence="urgent"/>
</risks>

<!-- Third-party concern -->
<reflection>The user is reporting concern about a friend...</reflection>
<risks>
  <risk subject="other" type="self_harm" severity="moderate" imminence="chronic"/>
</risks>
```

## Risk Attributes
| Attribute | Description |
|---|---|
| subject | Who is at risk (self or other) |
| type | Risk category (see below) |
| severity | Urgency level (see below) |
| imminence | Time sensitivity (see below) |
| features | Comma-separated specific indicators (optional) |
### Risk Types
| Type | Description |
|---|---|
| suicide | Suicidal ideation, plans, or intent |
| self_harm | Non-suicidal self-injury |
| self_neglect | Eating disorders, medical neglect |
| violence | Threats or plans of violence toward others |
| abuse | Domestic/intimate partner violence |
| sexual_violence | Rape, sexual assault, coercion |
| exploitation | Trafficking, grooming, sextortion |
| stalking | Persistent unwanted contact |
| neglect | Child or elder neglect |
### Severity Levels
| Level | Meaning |
|---|---|
| mild | Concerning but not immediate |
| moderate | Elevated risk, warrants attention |
| high | Serious or imminent risk |
| critical | Immediate intervention needed |
### Imminence
| Level | Meaning |
|---|---|
| chronic | Ongoing pattern, no immediate timeline |
| acute | Current crisis episode |
| urgent | Within hours or days |
| emergency | Immediate (minutes) |
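Because both scales above are ordered, simple threshold checks are enough for routing decisions. A minimal sketch; the `high`/`urgent` cutoffs are illustrative policy choices, not part of the model:

```python
# Ordered scales; "none" is a sentinel for "no risk element present"
SEVERITY = ["none", "mild", "moderate", "high", "critical"]
IMMINENCE = ["none", "chronic", "acute", "urgent", "emergency"]

def should_escalate(severity: str, imminence: str) -> bool:
    """Escalate to immediate review at high+ severity or urgent+ imminence.

    The cutoffs here are example policy, not model behavior.
    """
    return (SEVERITY.index(severity) >= SEVERITY.index("high")
            or IMMINENCE.index(imminence) >= IMMINENCE.index("urgent"))
```

For example, `should_escalate("moderate", "emergency")` escalates on imminence alone, while `should_escalate("mild", "chronic")` does not.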
### Subject Attribution
| Subject | Meaning | Example |
|---|---|---|
| self | Speaker is at risk or is the victim | "I want to kill myself", "My partner hits me" |
| other | Speaker reporting concern about someone else | "My friend said she wants to die" |
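How you act on a flag usually depends on this attribute: `self` risks call for direct crisis resources, while `other` reports call for guidance to the concerned party. A hypothetical routing sketch (the queue names are invented for illustration):

```python
# Hypothetical review-queue names; adapt to your own escalation system
ROUTES = {
    "self": "direct_crisis_review",
    "other": "third_party_concern_review",
}

def route_for(subject: str) -> str:
    # Fail safe: treat unknown subject values as first-party risk
    return ROUTES.get(subject, "direct_crisis_review")
```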
## Parsing Output

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Risk:
    subject: str
    type: str
    severity: str
    imminence: Optional[str] = None
    features: Optional[list] = None

def parse_output(output: str) -> dict:
    """Parse model output into structured data."""
    result = {"reflection": None, "risks": [], "is_crisis": False}
    # Extract reflection
    reflection_match = re.search(r'<reflection>(.*?)</reflection>', output, re.DOTALL)
    if reflection_match:
        result["reflection"] = reflection_match.group(1).strip()
    # Check for empty risks (no crisis)
    if '<risks/>' in output or '<risks />' in output:
        return result
    # Extract risk elements (lazy group so a trailing "/" stays out of the attributes)
    risk_pattern = r'<risk\s+([^>]+?)/?\s*>'
    for match in re.finditer(risk_pattern, output):
        attrs = {}
        for attr_match in re.finditer(r'(\w+)="([^"]*)"', match.group(1)):
            attrs[attr_match.group(1)] = attr_match.group(2)
        if attrs:
            risk = Risk(
                subject=attrs.get("subject", "self"),
                type=attrs.get("type"),
                severity=attrs.get("severity"),
                imminence=attrs.get("imminence"),
                features=attrs.get("features", "").split(",") if attrs.get("features") else None
            )
            result["risks"].append(risk)
            result["is_crisis"] = True
    return result
```

## Input Best Practices
### Text Preprocessing
Preserve natural prose. The model was trained on real conversations with authentic expression. Emotional signals matter:

- Emojis: 💀 in "kms 💀" signals irony; 😭 signals distress intensity
- Punctuation: "I can't do this!!!" conveys more urgency than "I can't do this"
- Casual spelling: "im so done" vs "I'm so done" — both valid, don't normalize
- Slang: "kms", "unalive", "catch the bus" — model understands these

Only remove: zero-width/invisible Unicode characters, decorative Unicode fonts, and newlines (single messages only).
```python
import re
import unicodedata

def preprocess(text: str) -> str:
    # Normalize decorative Unicode fonts to ASCII (NFKC)
    text = unicodedata.normalize('NFKC', text)
    # Remove zero-width and invisible characters
    text = re.sub(r'[\u200b-\u200f\u2028-\u202f\u2060-\u206f\ufeff]', '', text)
    # Flatten newlines to spaces (for single messages only)
    text = re.sub(r'\n+', ' ', text)
    # Collapse multiple spaces
    text = re.sub(r' +', ' ', text)
    return text.strip()

# NOTE: Do NOT remove emojis, punctuation, or "normalize" spelling
```

### Multi-Turn Conversations
The model was trained on pre-serialized transcripts, not the native multi-turn chat format. When classifying conversations, serialize the turns into a single user message.
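If your history is stored in the native role/content format, a small helper can produce the serialized transcript used in the example that follows. This is a stdlib-only sketch (the function name and label mapping are ours):

```python
def serialize_conversation(messages: list) -> str:
    """Join role/content chat messages into a 'User:'/'Assistant:' transcript."""
    labels = {"user": "User", "assistant": "Assistant"}
    return "\n".join(
        f"{labels.get(m['role'], m['role'].title())}: {m['content']}"
        for m in messages
    )
```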
```python
# For multi-turn conversations, serialize into a single message
conversation = """User: How are you?
Assistant: I'm here to help. How are you feeling?
User: Not great. I've been thinking about ending it all."""

# CORRECT - single user message with the serialized conversation
messages = [{"role": "user", "content": conversation}]

# The model classifies the entire conversation
result = classify(conversation)
# → <reflection>...</reflection><risks><risk .../></risks>
```

## Local Inference (Ollama)
For local development or low-volume use, run with Ollama using our GGUF models:
```shell
# Download the GGUF and Modelfile from HuggingFace
huggingface-cli download nopenet/nope-edge-mini-GGUF \
  nope-edge-mini-q8_0.gguf Modelfile --local-dir .

# Create the model (the Modelfile has the correct template)
ollama create nope-edge-mini -f Modelfile

# Run
ollama run nope-edge-mini "I want to end it all"
# → <reflection>...</reflection><risks><risk .../></risks>

# Or call via the API
curl http://localhost:11434/api/generate -d '{
  "model": "nope-edge-mini",
  "prompt": "I want to end it all",
  "stream": false
}'
```

### Hardware Requirements
| Model | Q8_0 Size | RAM/VRAM | Notes |
|---|---|---|---|
| nope-edge-mini-GGUF | ~2GB | 2GB | Runs on CPU or any GPU |
| nope-edge-GGUF | ~5GB | 5GB | GPU recommended |
CPU inference works but is slower (~1-2s vs ~100ms on GPU). Q8_0 is effectively lossless; Q4_K_M is available for a smaller footprint (~8% accuracy loss).
Note: Ollama lacks continuous batching. For high-throughput production (50+ req/s), use vLLM or SGLang below.
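For programmatic use, the same `/api/generate` endpoint can be called from Python with only the standard library. A sketch, assuming the default model name and host from the commands above (the helper names are ours):

```python
import json
import urllib.request

def build_payload(message: str, model: str = "nope-edge-mini") -> bytes:
    """JSON body for Ollama's non-streaming /api/generate endpoint."""
    return json.dumps({
        "model": model,
        "prompt": message,
        "stream": False,
    }).encode("utf-8")

def ollama_classify(message: str, host: str = "http://localhost:11434") -> str:
    """Send one message to a local Ollama server and return the raw XML output."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=build_payload(message),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```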
## Production Deployment

For high throughput (50+ requests/second), use vLLM or SGLang:

```shell
# vLLM
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model nopenet/nope-edge \
  --dtype bfloat16 \
  --max-model-len 2048 \
  --port 8000

# Or SGLang
pip install sglang
python -m sglang.launch_server \
  --model-path nopenet/nope-edge \
  --dtype bfloat16 \
  --port 8000
```

Then call the OpenAI-compatible API:
```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nopenet/nope-edge",
    "messages": [{"role": "user", "content": "I want to end it all"}],
    "max_tokens": 300,
    "temperature": 0
  }'
```

## Performance
Benchmarks for nope-edge (4B) across deployment setups:
| Setup | Hardware | Throughput | Latency (p50) |
|---|---|---|---|
| vLLM / SGLang | A10G (24GB) | 50-100+ req/sec | ~50ms |
| transformers | A10G (24GB) | ~8 req/sec | ~200ms |
| Ollama (GGUF) | CPU / Consumer GPU | 1-5 req/sec | ~200ms–2s |
Latency = server-side inference time (excludes network). Throughput assumes continuous batching where applicable.
## Licensing
NOPE Edge is available under the NOPE Edge Community License v1.0:
- Free: Research, academic, nonprofit, personal, and evaluation use
- Commercial: Production deployment in revenue-generating products requires a separate license
See the full license terms on HuggingFace. For commercial licensing, contact [email protected].
## Important Notes

- Precision: Use `torch.bfloat16`; float16 may cause numerical instability.
- Classification only: Edge provides raw classification. Crisis resources, rationale, and audit logging are only available via the Evaluate API.
- Not diagnostic: This is a triage tool, not a clinical assessment. Use it to flag messages for human review.
- False positives and negatives will occur. Never use Edge as the sole basis for intervention decisions.
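Tying these notes together, a minimal human-review gate might look like this. A standalone sketch (the function name is ours; `raw_output` stands in for the model's XML string) that fails open, sending malformed output to review rather than silently dropping it:

```python
import re

def needs_human_review(raw_output: str) -> bool:
    """Flag any output without an explicit empty <risks/> marker for review."""
    if re.search(r'<risks\s*/>', raw_output):
        return False  # explicit no-crisis marker
    if '<risk ' in raw_output:
        return True   # at least one structured risk element
    return True       # malformed or missing <risks> block: escalate anyway
```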
## Ensemble Strategies
LLM-based classifiers exhibit non-determinism — the same input can produce different outputs across runs. Ensemble strategies average out this variance to improve reliability. There are two main approaches:
### Panel Consensus
Run N judges in parallel and aggregate results. We recommend hybrid consensus:
- Majority vote for crisis detection — >50% must flag to reduce false positives from variance
- MAX severity within confirmed crises — if majority detects, take highest severity
- Union imminence — take most urgent timeline
Note: Ensembles work best with the 4B model. The 1.7B mini model has higher per-call variance, and ensembles can amplify false positives. For mini, single-call is often preferred.
```python
import asyncio
from dataclasses import dataclass

@dataclass
class RiskOutput:
    flagged: bool
    severity: str  # none, mild, moderate, high, critical
    risks: list

def severity_index(sev: str) -> int:
    return ["none", "mild", "moderate", "high", "critical"].index(sev)

def output_severity(parsed: dict) -> str:
    """Highest severity among the risks in one parsed output."""
    if not parsed["risks"]:
        return "none"
    return max((r.severity for r in parsed["risks"]), key=severity_index)

async def panel_classify(message: str, judges: int = 3) -> RiskOutput:
    """Run multiple judges in parallel, aggregate with hybrid consensus."""
    # Run judges concurrently
    tasks = [asyncio.to_thread(classify, message) for _ in range(judges)]
    outputs = [parse_output(out) for out in await asyncio.gather(*tasks)]
    # MAX severity (fail-safe: if any judge sees critical, report critical)
    max_severity = max((output_severity(o) for o in outputs), key=severity_index)
    # 2/3 majority for risk types (reduces noise)
    risk_counts = {}
    for o in outputs:
        for risk in o["risks"]:
            key = f"{risk.type}:{risk.subject}"
            risk_counts[key] = risk_counts.get(key, 0) + 1
    threshold = judges * 2 / 3
    consensus_risks = [r for r, count in risk_counts.items() if count >= threshold]
    return RiskOutput(
        flagged=max_severity != "none",
        severity=max_severity,
        risks=consensus_risks
    )
```

Tradeoffs: 3 judges = 3x inference cost, but they run in parallel, so latency ≈ single call + overhead (~30%).
### Dual Triage Cascade
For cost-sensitive deployments with high benign traffic, use a cascade: run 2 triage calls in parallel, only trigger the full panel if either flags. Most real-world traffic is benign and exits after 2 calls.
Tradeoff: Cascade saves ~33% on benign traffic but adds latency for flagged cases (sequential phases). Use 3-panel majority if latency matters more than cost.
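The ~33% figure can be sanity-checked with expected call counts. A back-of-envelope sketch (the function name is ours; `benign_rate` is the fraction of traffic that exits at triage):

```python
def expected_calls(benign_rate: float) -> float:
    """Expected model calls per message for the dual triage cascade:
    2 triage calls always, plus 1 baseline follow-up when either flags."""
    return 2 + (1 - benign_rate) * 1
```

With fully benign traffic this is 2 calls per message versus the panel's fixed 3, i.e. a ~33% saving; with fully flagged traffic both strategies cost 3 calls.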
```python
async def dual_triage_cascade(message: str) -> RiskOutput:
    """
    Dual triage cascade: 2 sensitive seeds in parallel at triage,
    trigger the panel only if either flags. Balances recall with cost.
    """
    # Clinical patterns + behavioral signals (orthogonal coverage)
    seeds = [
        "(Watch for: farewell language, giving away possessions, sudden calm, method references)",
        "(Watch for: behavioral changes, withdrawal, hopelessness, burden statements)"
    ]
    # Step 1: Run both triage judges in parallel
    triage_tasks = [
        asyncio.to_thread(classify, f"{seed}\n\n{message}")
        for seed in seeds
    ]
    triage_outputs = [parse_output(out) for out in await asyncio.gather(*triage_tasks)]
    # Step 2: If NEITHER flags, fast exit (most traffic is benign)
    if not any(o["is_crisis"] for o in triage_outputs):
        return RiskOutput(flagged=False, severity="none", risks=[])
    # Step 3: At least one flagged - run a baseline follow-up
    baseline = parse_output(classify(message))
    # Step 4: Aggregate all 3 with hybrid consensus
    all_outputs = triage_outputs + [baseline]
    # ... same aggregation as panel_classify above
```

### Performance Comparison
Benchmarks on nope-edge (4B) from our public test suite:
| Strategy | Recall (TPR) | Specificity (TNR) | Accuracy | Calls |
|---|---|---|---|---|
| Single call | ~92% | ~75% | ~88% | 1 |
| 3-panel (majority) | ~100% | ~83% | ~92% | 3 |
| 3-panel (2/3 majority) | ~92% | ~92% | ~96% | 3 |
Use majority voting for maximum recall, 2/3 majority for balanced precision/recall. Panels run in parallel, so latency ≈ single call + ~20% overhead.
### Seed Variation
For fine-tuned models without system prompts, "seeding" means varying the random seed parameter to get diverse samples from the model's distribution. Different seeds produce different outputs due to sampling variance, which panels then aggregate.
Note: Orthogonal prompt-based approaches (clinical vs behavioral focus) require system prompt customization, which is not supported by the base Edge models. The fine-tuned models learn detection patterns from training data, not runtime instructions.
## Quantization
AWQ 4-bit quantized versions are available for experimentation but not recommended for production. Testing shows ~15-20% accuracy degradation that ensembles cannot compensate for.
- Use FP16 for production — the quality tradeoff isn't worth the latency savings
- If cost-constrained, use 1.7B FP16 — better accuracy than 4B quantized
- SGLang requires symmetric quantization — use `symmetric=True` in llm-compressor
## Get Access
Download directly from HuggingFace:
- nopenet/nope-edge (4B, maximum accuracy)
- nopenet/nope-edge-mini (1.7B, faster)
For Ollama/llama.cpp, use the GGUF versions:
- nopenet/nope-edge-GGUF (4B)
- nopenet/nope-edge-mini-GGUF (1.7B)
For commercial licensing or technical support, contact [email protected].