
Edge Self-Hosted

Run NOPE's safety classifier on your own infrastructure using our fine-tuned models. No data leaves your network.

Model Variants

Model          | Parameters | Latency | Use Case
nope-edge      | 4B         | ~750ms  | Maximum accuracy
nope-edge-mini | 1.7B       | ~260ms  | High-volume, cost-sensitive

Requirements

  • GPU: Any NVIDIA GPU with bfloat16 support (RTX 3060+, L4, A10G, A100, H100)
  • VRAM: ~4GB (mini) or ~8GB (full)
  • Python: 3.10+ with PyTorch and transformers

Note: T4 GPUs are not supported (bfloat16 requires compute capability 8.0+).
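
To confirm a GPU meets this requirement before downloading weights, here is a quick check using standard PyTorch calls (illustrative only; it prints the compute capability and bf16 support):

import torch

# bfloat16 kernels require compute capability 8.0+ (Ampere or newer)
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"{torch.cuda.get_device_name()}: compute capability {major}.{minor}")
    print("bf16 supported:", torch.cuda.is_bf16_supported())
else:
    print("No CUDA GPU detected")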

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "nopenet/nope-edge"  # or "nopenet/nope-edge-mini"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

def classify(message: str) -> str:
    """Returns XML with reflection and risk classification."""
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": message}],
        tokenize=True,
        return_tensors="pt",
        add_generation_prompt=True
    ).to(model.device)

    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=300, do_sample=False)

    return tokenizer.decode(
        output[0][input_ids.shape[1]:],
        skip_special_tokens=True
    ).strip()

# Example
result = classify("I want to end it all tonight")
print(result)
# Output:
# <reflection>The user directly expresses intent to end their life...</reflection>
# <risks>
#   <risk subject="self" type="suicide" severity="high" imminence="urgent"/>
# </risks>

Output Format

The model outputs XML with two components: a <reflection> explaining the reasoning, and <risks> containing structured risk elements (or <risks/> if no crisis detected).

# No risk detected
<reflection>The user is sharing a positive update about work...</reflection>
<risks/>

# Risk detected
<reflection>The user directly expresses intent to end their life...</reflection>
<risks>
  <risk subject="self" type="suicide" severity="high" imminence="urgent"/>
</risks>

# Third-party concern
<reflection>The user is reporting concern about a friend...</reflection>
<risks>
  <risk subject="other" type="self_harm" severity="moderate" imminence="chronic"/>
</risks>

Risk Attributes

Attribute | Description
subject   | Who is at risk (self or other)
type      | Risk category (see below)
severity  | Urgency level (see below)
imminence | Time sensitivity (see below)
features  | Comma-separated specific indicators (optional)

Risk Types

Type            | Description
suicide         | Suicidal ideation, plans, or intent
self_harm       | Non-suicidal self-injury
self_neglect    | Eating disorders, medical neglect
violence        | Threats or plans of violence toward others
abuse           | Domestic/intimate partner violence
sexual_violence | Rape, sexual assault, coercion
exploitation    | Trafficking, grooming, sextortion
stalking        | Persistent unwanted contact
neglect         | Child or elder neglect

Severity Levels

Level    | Meaning
mild     | Concerning but not immediate
moderate | Elevated risk, warrants attention
high     | Serious or imminent risk
critical | Immediate intervention needed

Imminence

Level     | Meaning
chronic   | Ongoing pattern, no immediate timeline
acute     | Current crisis episode
urgent    | Within hours or days
emergency | Immediate (minutes)

Subject Attribution

Subject | Meaning                                      | Example
self    | Speaker is at risk or is the victim          | "I want to kill myself", "My partner hits me"
other   | Speaker reporting concern about someone else | "My friend said she wants to die"

Parsing Output

import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Risk:
    subject: str
    type: str
    severity: str
    imminence: Optional[str] = None
    features: Optional[list] = None

def parse_output(output: str) -> dict:
    """Parse model output into structured data."""
    result = {"reflection": None, "risks": [], "is_crisis": False}

    # Extract reflection
    reflection_match = re.search(r'<reflection>(.*?)</reflection>', output, re.DOTALL)
    if reflection_match:
        result["reflection"] = reflection_match.group(1).strip()

    # Check for empty risks (no crisis)
    if '<risks/>' in output or '<risks />' in output:
        return result

    # Extract risk elements
    risk_pattern = r'<risk\s+([^>]+)/?\s*>'
    for match in re.finditer(risk_pattern, output):
        attrs = {}
        for attr_match in re.finditer(r'(\w+)="([^"]*)"', match.group(1)):
            attrs[attr_match.group(1)] = attr_match.group(2)
        if attrs:
            risk = Risk(
                subject=attrs.get("subject", "self"),
                type=attrs.get("type"),
                severity=attrs.get("severity"),
                imminence=attrs.get("imminence"),
                features=attrs.get("features", "").split(",") if attrs.get("features") else None
            )
            result["risks"].append(risk)
            result["is_crisis"] = True

    return result
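
Chaining the Quick Start classify() with parse_output() ties the two snippets together (the printed values follow the example output shown earlier):

parsed = parse_output(classify("I want to end it all tonight"))

if parsed["is_crisis"]:
    for risk in parsed["risks"]:
        print(risk.type, risk.severity, risk.imminence)
# suicide high urgent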

Input Best Practices

Text Preprocessing

Preserve natural prose. The model was trained on real conversations with authentic expression. Emotional signals matter:

  • Emojis: 💀 in "kms 💀" signals irony; 😭 signals distress intensity
  • Punctuation: "I can't do this!!!" conveys more urgency than "I can't do this"
  • Casual spelling: "im so done" vs "I'm so done" — both valid, don't normalize
  • Slang: "kms", "unalive", "catch the bus" — model understands these

Only strip what carries no signal: zero-width/invisible Unicode characters, decorative Unicode fonts (normalized to ASCII via NFKC), and newlines (single messages only).

import re
import unicodedata

def preprocess(text: str) -> str:
    # Normalize decorative Unicode fonts to ASCII (NFKC)
    text = unicodedata.normalize('NFKC', text)

    # Remove zero-width and invisible characters
    text = re.sub(r'[\u200b-\u200f\u2028-\u202f\u2060-\u206f\ufeff]', '', text)

    # Flatten newlines to spaces (for single messages only)
    text = re.sub(r'\n+', ' ', text)

    # Collapse multiple spaces
    text = re.sub(r' +', ' ', text)

    return text.strip()

# NOTE: Do NOT remove emojis, punctuation, or "normalize" spelling

Multi-Turn Conversations

The model was trained on pre-serialized transcripts, not native multi-turn chat format. When classifying conversations, serialize into a single user message:

# For multi-turn conversations, serialize into single message
conversation = """User: How are you?
Assistant: I'm here to help. How are you feeling?
User: Not great. I've been thinking about ending it all."""

# CORRECT - the entire transcript goes into a single user message
messages = [{"role": "user", "content": conversation}]

# classify() wraps its input in exactly this structure, so pass the
# serialized conversation directly
result = classify(conversation)
# → <reflection>...</reflection><risks><risk .../></risks>

Local Inference (Ollama)

For local development or low-volume use, run with Ollama using our GGUF models:

# Download GGUF and Modelfile from HuggingFace
huggingface-cli download nopenet/nope-edge-mini-GGUF \
    nope-edge-mini-q8_0.gguf Modelfile --local-dir .

# Create the model (Modelfile has correct template)
ollama create nope-edge-mini -f Modelfile

# Run
ollama run nope-edge-mini "I want to end it all"
# → <reflection>...</reflection><risks><risk .../></risks>
# Or call via API
curl http://localhost:11434/api/generate -d '{
  "model": "nope-edge-mini",
  "prompt": "I want to end it all",
  "stream": false
}'
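
The generate endpoint can also be called from Python and fed into the parser defined earlier; a minimal sketch using requests (with stream disabled, the "response" field holds the model's raw XML output):

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "nope-edge-mini", "prompt": "I want to end it all", "stream": False},
    timeout=30,
)
raw = resp.json()["response"]   # raw XML from the model
parsed = parse_output(raw)      # parse_output() from "Parsing Output" above
print(parsed["is_crisis"], [r.type for r in parsed["risks"]])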

Hardware Requirements

Model               | Q8_0 Size | RAM/VRAM | Notes
nope-edge-mini-GGUF | ~2GB      | 2GB      | Runs on CPU or any GPU
nope-edge-GGUF      | ~5GB      | 5GB      | GPU recommended

CPU inference works but is slower (~1-2s vs ~100ms on GPU). Q8_0 is effectively lossless; Q4_K_M is available for a smaller footprint at roughly 8% accuracy loss.

Note: Ollama lacks continuous batching. For high-throughput production (50+ req/s), use vLLM or SGLang below.

Production Deployment

For high throughput (50+ requests/second), use vLLM or SGLang:

# vLLM
pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model nopenet/nope-edge \
    --dtype bfloat16 \
    --max-model-len 2048 \
    --port 8000

# Or SGLang
pip install sglang
python -m sglang.launch_server \
    --model-path nopenet/nope-edge \
    --dtype bfloat16 \
    --port 8000

Then call the OpenAI-compatible API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nopenet/nope-edge",
    "messages": [{"role": "user", "content": "I want to end it all"}],
    "max_tokens": 300,
    "temperature": 0
  }'
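
The same endpoint also works with the openai Python client; a minimal sketch (the api_key value is a placeholder, which local vLLM/SGLang servers generally ignore):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nopenet/nope-edge",
    messages=[{"role": "user", "content": "I want to end it all"}],
    max_tokens=300,
    temperature=0,
)
print(response.choices[0].message.content)
# → <reflection>...</reflection><risks><risk .../></risks>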

Performance

Benchmarks for nope-edge (4B) on NVIDIA A10G:

Setup         | Hardware           | Throughput      | Latency (p50)
vLLM / SGLang | A10G (24GB)        | 50-100+ req/sec | ~50ms
transformers  | A10G (24GB)        | ~8 req/sec      | ~200ms
Ollama (GGUF) | CPU / Consumer GPU | 1-5 req/sec     | ~200ms–2s

Latency = server-side inference time (excludes network). Throughput assumes continuous batching where applicable.

Licensing

NOPE Edge is available under the NOPE Edge Community License v1.0:

  • Free: Research, academic, nonprofit, personal, and evaluation use
  • Commercial: Production deployment in revenue-generating products requires a separate license

See the full license terms on HuggingFace. For commercial licensing, contact [email protected].

Important Notes

  • Precision: Use torch.bfloat16 — float16 may cause numerical instability
  • Classification only: Edge provides raw classification. Crisis resources, rationale, and audit logging are only available via the Evaluate API.
  • Not diagnostic: This is a triage tool, not a clinical assessment. Use for flagging messages for human review.
  • False positives/negatives will occur. Never use as the sole basis for intervention decisions.

Ensemble Strategies

LLM-based classifiers exhibit non-determinism — the same input can produce different outputs across runs. Ensemble strategies average out this variance to improve reliability. There are two main approaches:

Panel Consensus

Run N judges in parallel and aggregate results. We recommend hybrid consensus:

  • Majority vote for crisis detection — >50% must flag to reduce false positives from variance
  • MAX severity within confirmed crises — if majority detects, take highest severity
  • Union imminence — take most urgent timeline

Note: Ensembles work best with the 4B model. The 1.7B mini model has higher per-call variance, and ensembles can amplify false positives. For mini, single-call is often preferred.

import asyncio
from dataclasses import dataclass

@dataclass
class RiskOutput:
    flagged: bool
    severity: str  # none, mild, moderate, high, critical
    risks: list

def severity_index(sev: str) -> int:
    return ["none", "mild", "moderate", "high", "critical"].index(sev)

async def panel_classify(message: str, judges: int = 3) -> RiskOutput:
    """Run multiple judges in parallel, aggregate with hybrid consensus."""
    # Run judges concurrently.
    # NOTE: for judge diversity, classify() should sample (do_sample=True with
    # varying seeds); greedy decoding returns the same output on every call.
    # See "Seed Variation" below.
    tasks = [asyncio.to_thread(classify, message) for _ in range(judges)]
    outputs = [parse_output(out) for out in await asyncio.gather(*tasks)]

    # Majority vote for crisis detection (>50% of judges must flag)
    flag_votes = sum(1 for o in outputs if o["is_crisis"])
    flagged = flag_votes > judges / 2

    # MAX severity within confirmed crises (fail-safe: if any judge sees
    # critical, report critical)
    max_severity = "none"
    if flagged:
        max_severity = max(
            (r.severity or "none" for o in outputs for r in o["risks"]),
            key=severity_index
        )

    # 2/3 majority for risk types (reduces noise)
    risk_counts = {}
    for o in outputs:
        for risk in o["risks"]:
            key = f"{risk.type}:{risk.subject}"
            risk_counts[key] = risk_counts.get(key, 0) + 1

    threshold = judges * 2 / 3
    consensus_risks = [r for r, count in risk_counts.items() if count >= threshold]

    return RiskOutput(
        flagged=flagged,
        severity=max_severity,
        risks=consensus_risks
    )

Tradeoffs: 3 judges = 3x inference cost, but runs in parallel so latency ≈ single call + overhead (~30%).

Dual Triage Cascade

For cost-sensitive deployments with high benign traffic, use a cascade: run 2 triage calls in parallel, only trigger the full panel if either flags. Most real-world traffic is benign and exits after 2 calls.

Tradeoff: Cascade saves ~33% on benign traffic but adds latency for flagged cases (sequential phases). Use 3-panel majority if latency matters more than cost.

async def dual_triage_cascade(message: str) -> RiskOutput:
    """
    Dual triage cascade: 2 sensitive seeds in parallel at triage,
    trigger panel only if either flags. Balances recall with cost.
    """
    # Clinical patterns + behavioral signals (orthogonal coverage)
    seeds = [
        "(Watch for: farewell language, giving away possessions, sudden calm, method references)",
        "(Watch for: behavioral changes, withdrawal, hopelessness, burden statements)"
    ]

    # Step 1: Run both triage judges in parallel
    triage_tasks = [
        asyncio.to_thread(classify, f"{seed}\n\n{message}")
        for seed in seeds
    ]
    triage_outputs = [parse_output(out) for out in await asyncio.gather(*triage_tasks)]

    # Step 2: If NEITHER flags, fast exit (most traffic is benign)
    # parse_output() sets is_crisis when any <risk> element is present
    if not any(o["is_crisis"] for o in triage_outputs):
        return RiskOutput(flagged=False, severity="none", risks=[])

    # Step 3: At least one flagged - run baseline follow-up
    baseline = parse_output(classify(message))

    # Step 4: Aggregate all 3 with hybrid consensus
    all_outputs = triage_outputs + [baseline]
    # ... same aggregation as panel_classify above

Performance Comparison

Benchmarks on nope-edge (4B) from our public test suite:

Strategy               | Recall (TPR) | Precision | Accuracy | Calls
Single call            | ~92%         | ~75%      | ~88%     | 1
3-panel (majority)     | ~100%        | ~83%      | ~92%     | 3
3-panel (2/3 majority) | ~92%         | ~92%      | ~96%     | 3

Use majority voting for maximum recall, 2/3 majority for balanced precision/recall. Panels run in parallel, so latency ≈ single call + ~20% overhead.

Seed Variation

For fine-tuned models without system prompts, "seeding" means varying the random seed parameter to get diverse samples from the model's distribution. Different seeds produce different outputs due to sampling variance, which panels then aggregate.

Note: Orthogonal prompt-based approaches (clinical vs behavioral focus) require system prompt customization, which is not supported by the base Edge models. The fine-tuned models learn detection patterns from training data, not runtime instructions.
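
A minimal sketch of seed-varied sampling using the tokenizer and model from Quick Start (the temperature, top_p, and seed values here are illustrative, not tuned recommendations):

import torch

def classify_with_seed(message: str, seed: int) -> str:
    """Like classify(), but samples with a fixed seed so each judge differs reproducibly."""
    torch.manual_seed(seed)
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": message}],
        tokenize=True,
        return_tensors="pt",
        add_generation_prompt=True
    ).to(model.device)

    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=300,
            do_sample=True,    # sampling is what makes the seed matter
            temperature=0.7,   # illustrative
            top_p=0.9          # illustrative
        )

    return tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True).strip()

# One judge per seed; the differing outputs are then aggregated by the panel code above
outputs = [classify_with_seed("I want to end it all", seed) for seed in (0, 1, 2)]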

Quantization

AWQ 4-bit quantized versions are available for experimentation but not recommended for production. Testing shows ~15-20% accuracy degradation that ensembles cannot compensate for.

  • Use the unquantized bf16 weights for production; the quality tradeoff isn't worth the latency savings
  • If cost-constrained, use the 1.7B model in bf16, which is more accurate than the quantized 4B
  • SGLang requires symmetric quantization — use symmetric=True in llm-compressor

Get Access

Download the models directly from HuggingFace: nopenet/nope-edge and nopenet/nope-edge-mini.

For Ollama/llama.cpp, use the GGUF versions: nopenet/nope-edge-mini-GGUF and nopenet/nope-edge-GGUF.

For commercial licensing or technical support, contact [email protected].