
Edge Self-Hosted

Run NOPE's safety classifier on your own infrastructure using our fine-tuned models. No data leaves your network.

Model Variants

Model          | Parameters | Latency | Use Case
nope-edge      | 4B         | ~750ms  | Maximum accuracy
nope-edge-mini | 1.7B       | ~260ms  | High-volume, cost-sensitive

Requirements

  • GPU: Any NVIDIA GPU with bfloat16 support (RTX 3060+, L4, A10G, A100, H100)
  • VRAM: ~4GB (mini) or ~8GB (full)
  • Python: 3.10+ with PyTorch and transformers

Note: T4 GPUs are not supported (bfloat16 requires compute capability 8.0+).
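
To confirm a GPU meets this requirement before downloading weights, here is a quick check using standard PyTorch calls (illustrative only; it prints the compute capability and bf16 support):

import torch

# bfloat16 kernels require compute capability 8.0+ (Ampere or newer)
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"{torch.cuda.get_device_name()}: compute capability {major}.{minor}")
    print("bf16 supported:", torch.cuda.is_bf16_supported())
else:
    print("No CUDA GPU detected")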

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "nopenet/nope-edge"  # or "nopenet/nope-edge-mini"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

def classify(message: str) -> str:
    """Returns XML with reflection and risk classification."""
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": message}],
        tokenize=True,
        return_tensors="pt",
        add_generation_prompt=True
    ).to(model.device)

    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=300, do_sample=False)

    return tokenizer.decode(
        output[0][input_ids.shape[1]:],
        skip_special_tokens=True
    ).strip()

# Example
result = classify("I want to end it all tonight")
print(result)
# Output:
# <reflection>The user directly expresses intent to end their life...</reflection>
# <risks>
#   <risk subject="self" type="suicide" severity="high" imminence="urgent"/>
# </risks>

Output Format

The model outputs XML with two components: a <reflection> explaining the reasoning, and <risks> containing structured risk elements (or <risks/> if no crisis detected).

# No risk detected
<reflection>The user is sharing a positive update about work...</reflection>
<risks/>

# Risk detected
<reflection>The user directly expresses intent to end their life...</reflection>
<risks>
  <risk subject="self" type="suicide" severity="high" imminence="urgent"/>
</risks>

# Third-party concern
<reflection>The user is reporting concern about a friend...</reflection>
<risks>
  <risk subject="other" type="self_harm" severity="moderate" imminence="chronic"/>
</risks>

Risk Attributes

Attribute | Description
subject   | Who is at risk (self or other)
type      | Risk category (see below)
severity  | Urgency level (see below)
imminence | Time sensitivity (see below)
features  | Comma-separated specific indicators (optional)

Risk Types

Type            | Description
suicide         | Suicidal ideation, plans, or intent
self_harm       | Non-suicidal self-injury
self_neglect    | Eating disorders, medical neglect
violence        | Threats or plans of violence toward others
abuse           | Domestic/intimate partner violence
sexual_violence | Rape, sexual assault, coercion
exploitation    | Trafficking, grooming, sextortion
stalking        | Persistent unwanted contact
neglect         | Child or elder neglect

Severity Levels

Level    | Meaning
mild     | Concerning but not immediate
moderate | Elevated risk, warrants attention
high     | Serious or imminent risk
critical | Immediate intervention needed

Imminence

Level     | Meaning
chronic   | Ongoing pattern, no immediate timeline
acute     | Current crisis episode
urgent    | Within hours or days
emergency | Immediate (minutes)

Subject Attribution

Subject | Meaning                                      | Example
self    | Speaker is at risk or is the victim          | "I want to kill myself", "My partner hits me"
other   | Speaker reporting concern about someone else | "My friend said she wants to die"

Parsing Output

import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Risk:
    subject: str
    type: str
    severity: str
    imminence: Optional[str] = None
    features: Optional[list] = None

def parse_output(output: str) -> dict:
    """Parse model output into structured data."""
    result = {"reflection": None, "risks": [], "is_crisis": False}

    # Extract reflection
    reflection_match = re.search(r'<reflection>(.*?)</reflection>', output, re.DOTALL)
    if reflection_match:
        result["reflection"] = reflection_match.group(1).strip()

    # Check for empty risks (no crisis)
    if '<risks/>' in output or '<risks />' in output:
        return result

    # Extract risk elements
    risk_pattern = r'<risk\s+([^>]+)/?\s*>'
    for match in re.finditer(risk_pattern, output):
        attrs = {}
        for attr_match in re.finditer(r'(\w+)="([^"]*)"', match.group(1)):
            attrs[attr_match.group(1)] = attr_match.group(2)
        if attrs:
            risk = Risk(
                subject=attrs.get("subject", "self"),
                type=attrs.get("type"),
                severity=attrs.get("severity"),
                imminence=attrs.get("imminence"),
                features=attrs.get("features", "").split(",") if attrs.get("features") else None
            )
            result["risks"].append(risk)
            result["is_crisis"] = True

    return result
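
Chaining the Quick Start classify() with parse_output() ties the two snippets together (the printed values follow the example output shown earlier):

parsed = parse_output(classify("I want to end it all tonight"))

if parsed["is_crisis"]:
    for risk in parsed["risks"]:
        print(risk.type, risk.severity, risk.imminence)
# suicide high urgent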

Input Best Practices

Text Preprocessing

Preserve natural prose. The model was trained on real conversations with authentic expression. Emotional signals matter:

  • Emojis: 💀 in "kms 💀" signals irony; 😭 signals distress intensity
  • Punctuation: "I can't do this!!!" conveys more urgency than "I can't do this"
  • Casual spelling: "im so done" vs "I'm so done" — both valid, don't normalize
  • Slang: "kms", "unalive", "catch the bus" — model understands these

Only strip what carries no signal: zero-width/invisible Unicode characters, decorative Unicode fonts (normalized to ASCII via NFKC), and newlines (single messages only).

import re
import unicodedata

def preprocess(text: str) -> str:
    # Normalize decorative Unicode fonts to ASCII (NFKC)
    text = unicodedata.normalize('NFKC', text)

    # Remove zero-width and invisible characters
    text = re.sub(r'[\u200b-\u200f\u2028-\u202f\u2060-\u206f\ufeff]', '', text)

    # Flatten newlines to spaces (for single messages only)
    text = re.sub(r'\n+', ' ', text)

    # Collapse multiple spaces
    text = re.sub(r' +', ' ', text)

    return text.strip()

# NOTE: Do NOT remove emojis, punctuation, or "normalize" spelling

Multi-Turn Conversations

The model was trained on pre-serialized transcripts, not native multi-turn chat format. When classifying conversations, serialize into a single user message:

# For multi-turn conversations, serialize into single message
conversation = """User: How are you?
Assistant: I'm here to help. How are you feeling?
User: Not great. I've been thinking about ending it all."""

# CORRECT - the entire transcript goes into a single user message
messages = [{"role": "user", "content": conversation}]

# classify() wraps its input in exactly this structure, so pass the
# serialized conversation directly
result = classify(conversation)
# → <reflection>...</reflection><risks><risk .../></risks>

Local Inference (Ollama)

For local development or low-volume use, run with Ollama using our GGUF models:

# Download GGUF and Modelfile from HuggingFace
huggingface-cli download nopenet/nope-edge-mini-GGUF \
    nope-edge-mini-q8_0.gguf Modelfile --local-dir .

# Create the model (Modelfile has correct template)
ollama create nope-edge-mini -f Modelfile

# Run
ollama run nope-edge-mini "I want to end it all"
# → <reflection>...</reflection><risks><risk .../></risks>
# Or call via API
curl http://localhost:11434/api/generate -d '{
  "model": "nope-edge-mini",
  "prompt": "I want to end it all",
  "stream": false
}'
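
The generate endpoint can also be called from Python and fed into the parser defined earlier; a minimal sketch using requests (with stream disabled, the "response" field holds the model's raw XML output):

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "nope-edge-mini", "prompt": "I want to end it all", "stream": False},
    timeout=30,
)
raw = resp.json()["response"]   # raw XML from the model
parsed = parse_output(raw)      # parse_output() from "Parsing Output" above
print(parsed["is_crisis"], [r.type for r in parsed["risks"]])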

Hardware Requirements

Model               | Q8_0 Size | RAM/VRAM | Notes
nope-edge-mini-GGUF | ~2GB      | 2GB      | Runs on CPU or any GPU
nope-edge-GGUF      | ~5GB      | 5GB      | GPU recommended

CPU inference works but is slower (~1-2s vs ~100ms on GPU). Q8_0 is effectively lossless; Q4_K_M is available for a smaller footprint at roughly 8% accuracy loss.

Note: Ollama lacks continuous batching. For high-throughput production (50+ req/s), use vLLM or SGLang below.

Production Deployment

For high throughput (50+ requests/second), use vLLM or SGLang:

# vLLM
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model nopenet/nope-edge \
    --dtype bfloat16 \
    --max-model-len 2048 \
    --port 8000

# Or SGLang
pip install sglang
python -m sglang.launch_server \
    --model-path nopenet/nope-edge \
    --dtype bfloat16 \
    --port 8000

Then call the OpenAI-compatible API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nopenet/nope-edge",
    "messages": [{"role": "user", "content": "I want to end it all"}],
    "max_tokens": 300,
    "temperature": 0
  }'
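
The same endpoint also works with the openai Python client; a minimal sketch (the api_key value is a placeholder, which local vLLM/SGLang servers generally ignore):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nopenet/nope-edge",
    messages=[{"role": "user", "content": "I want to end it all"}],
    max_tokens=300,
    temperature=0,
)
print(response.choices[0].message.content)
# → <reflection>...</reflection><risks><risk .../></risks>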

Performance

Benchmarks for nope-edge (4B) on NVIDIA A10G:

Setup         | Hardware           | Throughput      | Latency (p50)
vLLM / SGLang | A10G (24GB)        | 50-100+ req/sec | ~50ms
transformers  | A10G (24GB)        | ~8 req/sec      | ~200ms
Ollama (GGUF) | CPU / Consumer GPU | 1-5 req/sec     | ~200ms–2s

Latency = server-side inference time (excludes network). Throughput assumes continuous batching where applicable.

Licensing

NOPE Edge is available under the NOPE Edge Community License v1.0:

  • Free: Research, academic, nonprofit, personal, and evaluation use
  • Commercial: Production deployment in revenue-generating products requires a separate license

See the full license terms on HuggingFace. For commercial licensing, contact [email protected].

Important Notes

  • Precision: Use torch.bfloat16 — float16 may cause numerical instability
  • Classification only: Edge provides raw classification. Crisis resources, rationale, and audit logging are only available via the Evaluate API.
  • Not diagnostic: This is a triage tool, not a clinical assessment. Use for flagging messages for human review.
  • False positives/negatives will occur. Never use as the sole basis for intervention decisions.

Ensemble Strategies

LLM-based classifiers exhibit non-determinism — the same input can produce different outputs across runs. Ensemble strategies average out this variance to improve reliability. There are two main approaches:

Panel Consensus

Run N judges in parallel and aggregate results. We recommend hybrid consensus:

  • Majority vote for crisis detection — >50% must flag to reduce false positives from variance
  • MAX severity within confirmed crises — if majority detects, take highest severity
  • Union imminence — take most urgent timeline

Note: Ensembles work best with the 4B model. The 1.7B mini model has higher per-call variance, and ensembles can amplify false positives. For mini, single-call is often preferred.

import asyncio
from dataclasses import dataclass

@dataclass
class RiskOutput:
    flagged: bool
    severity: str  # none, mild, moderate, high, critical
    risks: list

def severity_index(sev: str) -> int:
    return ["none", "mild", "moderate", "high", "critical"].index(sev)

async def panel_classify(message: str, judges: int = 3) -> RiskOutput:
    """Run multiple judges in parallel, aggregate with hybrid consensus."""
    # Run judges concurrently.
    # NOTE: for judge diversity, classify() should sample (do_sample=True with
    # varying seeds); greedy decoding returns the same output on every call.
    # See "Seed Variation" below.
    tasks = [asyncio.to_thread(classify, message) for _ in range(judges)]
    outputs = [parse_output(out) for out in await asyncio.gather(*tasks)]

    # Majority vote for crisis detection (>50% of judges must flag)
    flag_votes = sum(1 for o in outputs if o["is_crisis"])
    flagged = flag_votes > judges / 2

    # MAX severity within confirmed crises (fail-safe: if any judge sees
    # critical, report critical)
    max_severity = "none"
    if flagged:
        max_severity = max(
            (r.severity or "none" for o in outputs for r in o["risks"]),
            key=severity_index
        )

    # 2/3 majority for risk types (reduces noise)
    risk_counts = {}
    for o in outputs:
        for risk in o["risks"]:
            key = f"{risk.type}:{risk.subject}"
            risk_counts[key] = risk_counts.get(key, 0) + 1

    threshold = judges * 2 / 3
    consensus_risks = [r for r, count in risk_counts.items() if count >= threshold]

    return RiskOutput(
        flagged=flagged,
        severity=max_severity,
        risks=consensus_risks
    )

Tradeoffs: 3 judges = 3x inference cost, but runs in parallel so latency ≈ single call + overhead (~30%).

Dual Triage Cascade

For cost-sensitive deployments with high benign traffic, use a cascade: run 2 triage calls in parallel, only trigger the full panel if either flags. Most real-world traffic is benign and exits after 2 calls.

Tradeoff: Cascade saves ~33% on benign traffic but adds latency for flagged cases (sequential phases). Use 3-panel majority if latency matters more than cost.

async def dual_triage_cascade(message: str) -> RiskOutput:
    """
    Dual triage cascade: 2 sensitive seeds in parallel at triage,
    trigger panel only if either flags. Balances recall with cost.
    """
    # Clinical patterns + behavioral signals (orthogonal coverage)
    seeds = [
        "(Watch for: farewell language, giving away possessions, sudden calm, method references)",
        "(Watch for: behavioral changes, withdrawal, hopelessness, burden statements)"
    ]

    # Step 1: Run both triage judges in parallel
    triage_tasks = [
        asyncio.to_thread(classify, f"{seed}\n\n{message}")
        for seed in seeds
    ]
    triage_outputs = [parse_output(out) for out in await asyncio.gather(*triage_tasks)]

    # Step 2: If NEITHER flags, fast exit (most traffic is benign)
    # parse_output() sets is_crisis when any <risk> element is present
    if not any(o["is_crisis"] for o in triage_outputs):
        return RiskOutput(flagged=False, severity="none", risks=[])

    # Step 3: At least one flagged - run baseline follow-up
    baseline = parse_output(classify(message))

    # Step 4: Aggregate all 3 with hybrid consensus
    all_outputs = triage_outputs + [baseline]
    # ... same aggregation as panel_classify above

Performance Comparison

Benchmarks on nope-edge (4B) from our public test suite:

Strategy               | Recall (TPR) | Precision | Accuracy | Calls
Single call            | ~92%         | ~75%      | ~88%     | 1
3-panel (majority)     | ~100%        | ~83%      | ~92%     | 3
3-panel (2/3 majority) | ~92%         | ~92%      | ~96%     | 3

Use majority voting for maximum recall, 2/3 majority for balanced precision/recall. Panels run in parallel, so latency ≈ single call + ~20% overhead.

Seed Variation

For fine-tuned models without system prompts, "seeding" means varying the random seed parameter to get diverse samples from the model's distribution. Different seeds produce different outputs due to sampling variance, which panels then aggregate.

Note: Orthogonal prompt-based approaches (clinical vs behavioral focus) require system prompt customization, which is not supported by the base Edge models. The fine-tuned models learn detection patterns from training data, not runtime instructions.
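
A minimal sketch of seed-varied sampling using the tokenizer and model from Quick Start (the temperature, top_p, and seed values here are illustrative, not tuned recommendations):

import torch

def classify_with_seed(message: str, seed: int) -> str:
    """Like classify(), but samples with a fixed seed so each judge differs reproducibly."""
    torch.manual_seed(seed)
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": message}],
        tokenize=True,
        return_tensors="pt",
        add_generation_prompt=True
    ).to(model.device)

    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=300,
            do_sample=True,    # sampling is what makes the seed matter
            temperature=0.7,   # illustrative
            top_p=0.9          # illustrative
        )

    return tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True).strip()

# One judge per seed; the differing outputs are then aggregated by the panel code above
outputs = [classify_with_seed("I want to end it all", seed) for seed in (0, 1, 2)]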

Quantization

AWQ 4-bit quantized versions are available for experimentation but not recommended for production. Testing shows ~15-20% accuracy degradation that ensembles cannot compensate for.

  • Use the unquantized bf16 weights for production; the quality tradeoff isn't worth the latency savings
  • If cost-constrained, use the 1.7B model in bf16, which is more accurate than the quantized 4B
  • SGLang requires symmetric quantization — use symmetric=True in llm-compressor

Get Access

Download the models directly from HuggingFace: nopenet/nope-edge and nopenet/nope-edge-mini.

For Ollama/llama.cpp, use the GGUF versions: nopenet/nope-edge-mini-GGUF and nopenet/nope-edge-GGUF.

For commercial licensing or technical support, contact [email protected].