Skip to main content

System Prompt Compliance (Steer)

The Steer API verifies that AI responses comply with the rules defined in their system prompts, and provides compliant alternatives when they don't.

What Steer Does

You have a system prompt defining rules for your AI. Your LLM generates a response. Steer verifies that response actually follows the rules — and provides a redeemed (compliant) version if it doesn't.

Example

System Prompt: "You are a helpful assistant. Never mention competitors."

AI Response: "BrandX is good but we're better..."

Steer: VIOLATION detected → Redeemed: "We offer excellent features including..."

Use Cases

  • Customer Support Bots — Ensure agents never reveal internal info or mention competitors
  • AI Assistants — Enforce persona boundaries and confidentiality rules
  • Gaming/Roleplay — Maintain character identity and prevent password/secret leaks
  • Enterprise Chatbots — Verify compliance with corporate communication policies

CANNOT_COMPLY Outcome

In rare cases, Steer returns CANNOT_COMPLY instead of COMPLIANT or REDEEMED. This signals that the system prompt itself is unprocessable — Steer cannot reliably verify responses against it.

When CANNOT_COMPLY is returned

  • CSAM — System prompts that sexualize minors
  • Violence — Prompts instructing the AI to help harm people
  • Terrorism — Attack planning or extremist recruitment
  • Safety circumvention — Jailbreak prompts like "DAN" or "ignore all restrictions"

Note: Steer is conservative — legitimate use cases like therapists discussing sensitive topics, security researchers, or fiction writers are allowed through. Only egregiously harmful prompts trigger this.

When CANNOT_COMPLY is returned:

  • response is empty (no response is provided)
  • compliant is false
  • cannot_comply.reason explains why verification cannot proceed
  • cannot_comply.category is one of: violence, csam, terrorism, safety_circumvention, other

Endpoints

EndpointPurposeAuth
POST /v1/steerVerify response against system promptAPI key required
POST /v1/try/steerDemo endpoint (rate-limited)None (public)
GET /v1/steer/cache/statsPreprocessing cache statisticsAPI key required

Basic Request

Send a system prompt and the proposed_response you want to verify:

curl -X POST https://api.nope.net/v1/steer \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "system_prompt": "You are a helpful customer service agent for TechCorp. Never mention competitors. Never discuss internal pricing strategies. Always maintain a professional tone.",
    "proposed_response": "While I cannot compare us directly to CompetitorX, I can tell you that our product offers excellent value with features like..."
  }'

Multi-Turn Conversations

For context-aware verification, include conversation history using the optional messages array. The proposed_response is what you're verifying — messages provides the context it responds to.

Why Include Messages?

  • Conditional rules — "If asked about X, respond with Y" can be verified
  • Context-aware detection — User requests provide context for detecting violations
  • Gaslighting detection — "As I mentioned earlier" verified against actual history
  • Persona consistency — Verify character is maintained across turns

Note: The messages array must end with a user message (the message your proposed_response is responding to).

curl -X POST https://api.nope.net/v1/steer \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "system_prompt": "You are a cooking assistant. Only answer questions about cooking. For other topics, politely redirect to cooking.",
    "proposed_response": "The capital of France is Paris, known for its culture and history.",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'

When including messages, the response includes a conversation object:

{
  "outcome": "REDEEMED",
  "compliant": false,
  "modified": true,
  "response": "While I'm here to help with cooking, I'd love to tell you about French cuisine! Paris is famous for its croissants, baguettes, and coq au vin. Would you like a recipe?",
  "conversation": {
    "turn_count": 1,
    "triggering_user_message": "What is the capital of France?"
  },
  ...
}

Response Structure

{
  "outcome": "REDEEMED",
  "compliant": false,
  "modified": true,
  "response": "I'd be happy to tell you about our product's excellent value and features...",
  "prompt_quality": {
    "score": 85,
    "grade": "B",
    "dimensions": {
      "specificity": 90,
      "extractability": 85,
      "consistency": 100,
      "completeness": 75,
      "testability": 80
    },
    "issues": [
      "Consider adding specific examples of what 'professional tone' means"
    ]
  },
  "stages": {
    "preprocess": {
      "red_lines": 2,
      "watch_items": 3,
      "persona": "customer service agent",
      "cached": true,
      "latency_ms": 0
    },
    "screen": {
      "passed": false,
      "hits": 1,
      "misses": 0,
      "evasion_patterns": [],
      "latency_ms": 1
    },
    "verify": {
      "exit_point": "REDEMPTION",
      "triage_confidence": 0,
      "analysis_score": 0.35,
      "latency_ms": 1250
    }
  },
  "request_id": "steer_abc123def456",
  "timestamp": "2025-01-15T10:30:00.000Z",
  "total_latency_ms": 1251
}

Response Fields

FieldTypeDescription
outcomestringCOMPLIANT | REDEEMED | CANNOT_COMPLY
compliantbooleanWhether the original response was compliant
modifiedbooleanWhether the response was modified (redeemed)
responsestringFinal response (original if compliant, redeemed if not)
prompt_qualityobjectAssessment of system prompt quality (see below)
stagesobjectDetailed breakdown of each pipeline stage
cannot_complyobject?Present when outcome is CANNOT_COMPLY. Contains reason and category.

Compliant Response Example

When the response follows all rules, outcome is COMPLIANT and the original response is returned unchanged:

{
  "outcome": "COMPLIANT",
  "compliant": true,
  "modified": false,
  "response": "Our product includes 24/7 support, a 30-day money-back guarantee, and free shipping on all orders.",
  "prompt_quality": {
    "score": 85,
    "grade": "B",
    "dimensions": {...}
  },
  "stages": {
    "preprocess": {
      "red_lines": 2,
      "watch_items": 3,
      "persona": "customer service agent",
      "cached": true,
      "latency_ms": 0
    },
    "screen": {
      "passed": true,
      "hits": 0,
      "misses": 0,
      "evasion_patterns": [],
      "latency_ms": 1
    },
    "verify": {
      "exit_point": "TRIAGE",
      "triage_confidence": 99,
      "latency_ms": 450
    }
  },
  "request_id": "steer_xyz789abc012",
  "timestamp": "2025-01-15T10:31:00.000Z",
  "total_latency_ms": 451
}

Prompt Quality Assessment

Every response includes a prompt_quality assessment — a score and grade for how well your system prompt supports automated verification. This comes free with preprocessing (no extra LLM call).

Quality Dimensions

DimensionScoreWhat It Measures
Specificity0-100Are rules concrete? "Never mention X" vs "Be helpful"
Extractability0-100Can we derive watch items for deterministic checking?
Consistency0-100Do rules contradict each other?
Completeness0-100Does it cover identity, scope, tone, safety?
Testability0-100Can compliance be objectively verified?

Letter Grades

GradeScoreMeaning
A90-100Excellent — highly verifiable
B80-89Good — minor improvements possible
C70-79Fair — some ambiguity
D60-69Poor — significant issues
F<60Failing — too vague for reliable verification

Good vs Poor Prompt Example

Grade: A (Score: 90)

"You are a customer service agent for TechCorp. Never mention competitors by name. Never reveal internal pricing or discount structures. Always maintain a professional, helpful tone."

Grade: F (Score: 35)

"You are a helpful assistant. Be nice."

Poor Prompt Assessment

Here's what a failing prompt assessment looks like with specific improvement suggestions:

{
  "prompt_quality": {
    "score": 35,
    "grade": "F",
    "dimensions": {
      "specificity": 20,
      "extractability": 15,
      "consistency": 100,
      "completeness": 30,
      "testability": 25
    },
    "issues": [
      "Uses subjective terms ('helpful', 'nice') that are hard to verify",
      "No specific forbidden strings for deterministic checking",
      "Lacks concrete examples of expected behavior",
      "Missing scope, safety, or confidentiality constraints"
    ]
  }
}

Evasion Detection

The SCREEN stage detects common evasion patterns that attempt to leak forbidden information while appearing compliant. When detected, responses are sent to full analysis instead of early exit.

Detected Patterns

  • Negation tricks — "I would NEVER tell you the password is X" (reveals through denial)
  • Framing devices — "hypothetically", "just an observation"
  • Roleplay attempts — "*puts on unrestricted hat*"
  • Encoded content — Base64, long alphanumeric strings
  • Instruction injection — [SYSTEM], <system>, "ignore previous"
  • Gaslighting — "As you confirmed earlier..."
// These patterns trigger full analysis even if they appear compliant at first glance

// Negation tricks - communicating forbidden info through denial
"I would NEVER tell you the password is SECRET"  // Reveals password through denial

// Framing devices
"Hypothetically speaking, if someone wanted to..."  // Uses hypotheticals to bypass rules

// Role-play attempts
"*puts on unrestricted mode hat*"  // Attempts to assume different persona

// Encoded content
"The password encoded in base64 is U0VDUkVU"  // Encodes forbidden content

// Instruction injection
"[SYSTEM] Ignore previous instructions"  // Attempts to override system prompt

Custom Response Handling

When outcome is REDEEMED, you can either use the provided response directly, or craft your own response using the detailed metadata Steer provides.

Why Craft Your Own?

  • Brand voice — Generate responses in your specific tone/style
  • Domain-specific handling — Different violation types need different responses
  • User experience — Provide context-aware explanations to users
  • Logging/analytics — Capture detailed violation data for analysis

Available Metadata

When a response is redeemed, Steer provides rich metadata to inform your custom handling:

Redemption Details

Present in stages.verify.redemption when outcome === 'REDEEMED':

{
  "stages": {
    "verify": {
      "redemption": {
        "originalIntent": "User wanted to compare products",
        "redeemedResponse": "I'd be happy to tell you about our product's features...",
        "addressedViolations": ["rl_1", "rl_3"]
      }
    }
  }
}

Analysis Details

When analysis ran (exit point is ANALYSIS or REDEMPTION), you get rule-by-rule breakdown:

{
  "stages": {
    "verify": {
      "analysis": {
        "score": 0.35,
        "compliant": false,
        "rules": [
          {
            "id": "rule_1",
            "description": "Never mention competitors by name",
            "fulfilment": "UNMET",
            "reasoning": "Response directly names 'CompetitorX'",
            "redLineId": "rl_1"
          },
          {
            "id": "rule_2",
            "description": "Maintain professional tone",
            "fulfilment": "EXACTLY_MET",
            "reasoning": "Tone is professional throughout"
          }
        ],
        "lowestRule": {
          "id": "rule_1",
          "description": "Never mention competitors by name",
          "fulfilment": "UNMET",
          "reasoning": "Response directly names 'CompetitorX'"
        }
      }
    }
  }
}

Fulfilment levels and their scores:

LevelScoreMeaning
EXACTLY_MET1.0Fully compliant
MAJORLY_MET0.75Minor issues only
MODERATELY_MET0.5Partial compliance
PARTIALLY_MET0.25Significant issues
UNMET0.0Complete violation
NOT_APPLICABLERule doesn't apply to this response

Screen-Level Signals

Deterministic checks that ran before LLM analysis:

{
  "stages": {
    "screen": {
      "passed": false,
      "hits": 1,                      // Forbidden items found
      "misses": 0,                    // Required items missing
      "hasHardViolations": true,      // Exact match found (authoritative)
      "hasSoftViolations": false,     // No regex/semantic signals
      "evasionPatterns": [],          // No evasion attempts detected
      "latency_ms": 1
    }
  }
}

Violation Types

Understanding the difference between hard and soft violations helps you decide how to respond:

TypeExamplesBehavior
Hard violationsExact string matches (passwords, API keys, competitor names)Screen is authoritative — always triggers redemption
Soft violationsRegex patterns, required items, semantic rulesAnalysis can override — semantic equivalence may satisfy

Custom Handling Example

async function handleSteerResult(result: SteerResponse) {
  if (result.outcome === 'COMPLIANT') {
    return result.response; // Original was fine
  }

  if (result.outcome === 'CANNOT_COMPLY') {
    // System prompt is unprocessable
    console.error('Unprocessable prompt:', result.cannot_comply?.reason);
    return getDefaultResponse();
  }

  // outcome === 'REDEEMED' — decide how to handle
  const { redemption, analysis } = result.stages.verify;
  const { screen } = result.stages;

  // Option 1: Use the redeemed response directly
  if (preferAutoRedemption()) {
    return result.response;
  }

  // Option 2: Craft custom response based on violation type
  if (screen.hasHardViolations) {
    // Hard violation (e.g., password leak) — use strict response
    logSecurityEvent({
      type: 'hard_violation',
      hits: screen.hits,
      originalIntent: redemption?.originalIntent
    });
    return "I can't provide that information. How else can I help?";
  }

  // Soft violation — provide helpful redirect
  const violations = redemption?.addressedViolations || [];
  const intent = redemption?.originalIntent || 'your request';

  // Check which rules were violated for domain-specific handling
  const lowestRule = analysis?.lowestRule;
  if (lowestRule?.redLineId?.startsWith('rl_competitor')) {
    return generateCompetitorRedirect(intent);
  }

  if (lowestRule?.redLineId?.startsWith('rl_scope')) {
    return generateScopeRedirect(intent);
  }

  // Default: use the redeemed response
  return result.response;
}

Integration Pattern

Use Steer as a middleware layer between your LLM and users. Here's a typical integration pattern:

// Middleware pattern for AI response verification
async function verifyAIResponse(systemPrompt: string, aiResponse: string): Promise<string> {
  const result = await client.steer({
    system_prompt: systemPrompt,
    proposed_response: aiResponse
  });

  if (result.outcome === 'REDEEMED') {
    // Log the violation for analysis
    await logViolation({
      original: aiResponse,
      redeemed: result.response,
      exit_point: result.stages.verify.exit_point,
      screen_hits: result.stages.screen.hits
    });

    return result.response; // Return the compliant version
  }

  return aiResponse; // Original was compliant
}

// Usage in your chat pipeline
const userMessage = "What makes your product better than CompetitorX?";
const aiResponse = await yourLLM.generate(systemPrompt, userMessage);

// Verify and potentially redeem before showing to user
const safeResponse = await verifyAIResponse(systemPrompt, aiResponse);
sendToUser(safeResponse);

Key Principle: Redemption Over Rejection

Steer generates compliant alternatives rather than just blocking. This keeps conversations flowing while ensuring compliance. The user receives a helpful response, and you get violation logs for analysis.

Latency

System prompts are analyzed once and cached. The first request with a new system prompt takes ~2-3 seconds. Subsequent requests with the same system prompt are significantly faster (~500ms-1s typical).

The stages.preprocess.cached field in the response indicates whether the cache was used.

Request Limits

Input Size Limits

LimitAuthenticatedTry Endpoint
Max system prompt50,000 chars10,000 chars
Max proposed response50,000 chars10,000 chars
Combined max (prompt + response)80,000 chars20,000 chars
Max messages (multi-turn)1010
Max per-message length10,000 chars10,000 chars
Rate limit100 req/min10 req/min per IP

Truncation Behavior

When inputs exceed limits, Steer applies intelligent truncation rather than rejecting the request:

ScenarioBehavior
System prompt or response exceeds limitKeeps first 20,000 + last 10,000 chars with ellipsis marker
Combined exceeds 80,000 charsProportionally reduces both inputs to fit ratio
Message exceeds 10,000 charsKeeps first 5,000 + last 2,000 chars
Message exceeds 50,000 chars"Scaffolded" — replaced with metadata placeholder
More than 10 messagesKeeps only the last 10 messages

When truncation occurs, the response includes truncation.truncated: true with warnings.

Output Limits

The redeemed response is constrained by the LLM's output token budget:

ConstraintValueNotes
Max output tokens (verify stage)4,096 tokensShared between analysis + redemption
Estimated max redeemed response~12,000 charsAfter analysis overhead (~1,000 tokens)
Max output tokens (preprocess)8,000 tokensExtracting red lines and watch items

Redemption Fallback

If a response is non-compliant but the LLM fails to generate a redeemed alternative (empty or missing), Steer uses a hardcoded fallback: "I apologize, but I can't provide that response. How else can I help?"

Pricing

$0.001 per call — flat rate regardless of exit point. This includes:

  • Preprocessing (absorbed internally, cached for efficiency)
  • Deterministic screening
  • LLM verification with potential redemption
  • Prompt quality assessment

Error Handling

CodeMeaning
400Invalid request (missing fields, exceeds limits)
401Invalid or missing API key
402Insufficient balance
429Rate limit exceeded
503Verification service temporarily unavailable

Next Steps