LLM Eval Router OpenClaw Skill

Shadow-test local Ollama models against a cloud baseline with a multi-judge ensemble. Automatically promotes models when statistically proven equivalent, reducing inference costs with evidence, not hope.

v1.2.2

Installation

clawhub install llm-eval-router

Requires npm i -g clawhub


llm-eval-router

Set up a production-quality shadow evaluation pipeline that automatically
promotes local Ollama models when they statistically prove they match cloud
model quality — reducing inference costs with evidence, not hope.

The core idea

Run every task through your best local model (shadow) in parallel with your
cloud baseline (ground truth). A lightweight judge ensemble scores the local
output. After 200+ runs, if the local model holds a rolling mean score of at
least 0.95, promote it to handle that task type in production. Demote it
automatically if quality drops.
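As a sketch, the whole loop fits in a few lines. The `run_cloud`, `run_local`, and `judge` callables here are hypothetical stand-ins for the real clients, not this skill's actual API:

```python
PROMOTE_MIN_RUNS = 200   # minimum scored runs before promotion
PROMOTE_MEAN = 0.95      # rolling mean gate

def shadow_cycle(task, history, run_cloud, run_local, judge):
    """One shadow cycle: score the local model against the cloud baseline
    and report whether the rolling stats now clear the promotion gate."""
    baseline = run_cloud(task)          # ground truth; never served to users
    candidate = run_local(task)         # shadow run on the local Ollama model
    score = judge(candidate, baseline)  # ensemble score in [0, 1]
    history.append(score)

    mean = sum(history) / len(history)
    promoted = len(history) >= PROMOTE_MIN_RUNS and mean >= PROMOTE_MEAN
    return score, promoted
```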

When to use

  • You're paying for Claude/GPT API calls on tasks that don't need that quality
  • You have Ollama running locally with capable models (qwen2.5, phi4, mistral, etc.)
  • You want evidence-based cost reduction, not blind routing
  • You have defined task types: summarize, classify, extract, format, analyze, RAG

When NOT to use

  • Tasks that require real-time web knowledge (use cloud)
  • Tasks with strict latency requirements < 2 seconds (local models on CPU are slow)
  • Tasks with high safety stakes (always use cloud with safety filters)
  • You don't have Ollama or a Mac/Linux machine with enough RAM (8GB+ per model)

Prerequisites

  • Ollama installed and running (ollama.com)
  • At least one capable model: ollama pull qwen2.5 or ollama pull phi4
  • Python 3.10+
  • API keys: Anthropic (ground truth) + OpenAI (judge) — Gemini optional (tiebreaker)
  • Langfuse for observability (self-hosted or cloud) — optional but strongly recommended

Network & Privacy

This skill makes outbound API calls to:

  • Anthropic API — to generate ground truth baseline responses (every accumulation cycle)
  • OpenAI API — for judge scoring (sampled at 15% of runs)
  • Google Gemini API — tiebreaker judge only (when primary judges disagree by ≥0.20)

What stays local:

  • All Ollama model inference runs entirely on your device
  • Scored run data is stored on disk in data/scores/*.json
  • No telemetry, analytics, or data collection of any kind
  • No data is sent anywhere other than the explicit API calls above

Langfuse (optional) can be self-hosted or cloud. If self-hosted, all observability data stays on your network.

Core concepts

6-Dimension Evaluation

Every response is scored on:

Dimension     Default weight   Analyze weight   What it measures
Structural         25%              10%         Format compliance, required keys present
Semantic           25%              40%         Meaning equivalence to ground truth
Factual            20%              25%         No hallucinated facts/numbers/entities
Completion         15%              18%         Task fully addressed
Tool use           10%               4%         Correct tool/format selection
Latency             5%               3%         Within acceptable bounds

Important: Use per-task weight overrides. The default 25/25 split treats structural
accuracy equally with semantic similarity, which works for extract/classify/format tasks
(where exact format matters) but is wrong for open-ended analysis. difflib.SequenceMatcher
on two prose analyses of the same question scores ~0.29 even when they're semantically
identical. With structural weight at 25%, this penalty, compounded by Layer 2 drift
firing on differently-phrased but correct prose, capped observed analyze scores at ~0.59.

# src/evaluator.py — per-task weight profiles
TASK_WEIGHT_OVERRIDES = {
    "analyze": {
        "structural_accuracy": 0.10,   # difflib is NOT meaningful for prose
        "semantic_similarity": 0.40,   # cosine over embeddings captures meaning
        "factual_drift": 0.25,
        "task_completion": 0.18,
        "tool_correctness": 0.04,
        "latency_score": 0.03,
    },
    "code_transform": {
        "structural_accuracy": 0.15,
        "semantic_similarity": 0.35,
        "factual_drift": 0.20,
        "task_completion": 0.20,
        "tool_correctness": 0.07,
        "latency_score": 0.03,
    },
}
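A minimal composite scorer that applies such profiles could look like this. `composite_score` and `DEFAULT_WEIGHTS` are illustrative names, not the skill's actual API; the default weights mirror the table above:

```python
# Illustrative per-dimension default weights (mirrors the table above).
DEFAULT_WEIGHTS = {
    "structural_accuracy": 0.25,
    "semantic_similarity": 0.25,
    "factual_drift": 0.20,
    "task_completion": 0.15,
    "tool_correctness": 0.10,
    "latency_score": 0.05,
}

def composite_score(dim_scores: dict[str, float], task_type: str,
                    overrides: dict[str, dict[str, float]]) -> float:
    """Weighted sum of per-dimension scores under the task's weight profile."""
    weights = overrides.get(task_type, DEFAULT_WEIGHTS)
    return sum(weights[d] * dim_scores[d] for d in weights)
```

With structural at 0.29 and all other dimensions perfect, the default profile yields ~0.82 while an analyze-style profile yields ~0.93, which is the lift the override buys.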

Also: For analyze tasks, constrain output structure via system_prompt so GT and
candidates produce comparably-formatted responses (Finding/Recommendation/Confidence/Reasoning).
This reduces Layer 2 drift and improves difflib scores even at reduced weight.

Judge ensemble

  • Primary judges (15% sampling rate): Claude Sonnet + gpt-4o-mini score independently
  • Tiebreaker (only when |score_A - score_B| ≥ 0.20): Gemini 2.5-flash
  • Unsampled runs (85%): Layer 1+2 validators only (deterministic, free)
  • Promotion gates always trigger full judge evaluation regardless of sampling rate

Layer 1+2 validators (free, deterministic)

  • Layer 1: JSON validity, required key presence, forbidden pattern check
  • Layer 2: Drift detection — novel entities/numbers/URLs not in ground truth

These run on every response at zero cost. Judges only run when L1+L2 pass and
the sampling rate triggers.
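A Layer 2 drift check along these lines can be sketched with plain regexes. This covers numbers and URLs only; real entity extraction would need NER, and `layer2_drift` is a hypothetical name:

```python
import re

_NUM = re.compile(r"\b\d+(?:\.\d+)?\b")   # integers and decimals
_URL = re.compile(r"https?://\S+")        # crude URL matcher

def layer2_drift(candidate: str, ground_truth: str) -> set[str]:
    """Return numbers/URLs present in the candidate but absent from
    the ground truth — the 'novel token' drift signal."""
    def extract(text: str) -> set[str]:
        return set(_NUM.findall(text)) | set(_URL.findall(text))
    return extract(candidate) - extract(ground_truth)
```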

Promotion / Demotion

  • Promote: 200+ runs, rolling mean ≥ 0.95 for a model/task pair
  • Demote: rolling 7-day pass rate < 0.92
  • Control floor: one model (phi4, granite4, or similar) serves as the measured floor —
    any model scoring below it should be flagged, not promoted

Implementation steps

Step 1 — Define your task types

Create config/task_types.yaml:

tasks:
  - id: summarize
    description: "Summarize a document in N sentences"
    require_json: false
    judge_dimensions: [semantic, factual, completion]

  - id: classify
    description: "Classify text into one of N categories"
    require_json: true    # response must be valid JSON
    judge_dimensions: [structural, semantic, completion]

  - id: extract
    description: "Extract structured data from unstructured text"
    require_json: true
    judge_dimensions: [structural, factual, completion]

  - id: format
    description: "Reformat content to match a template"
    require_json: false
    judge_dimensions: [structural, semantic, completion]

Step 2 — Set up the router

The router assigns each task to a model using a round-robin strategy during
burn-in (building sample size n per model/task pair), then switches to
confidence-weighted routing after promotion.

# src/router.py — simplified version
from collections import defaultdict

class Router:
    def __init__(self, candidates: list[str], control_floor: str):
        self.candidates = candidates
        self.control_floor = control_floor
        self._rr_counters = defaultdict(int)

    def route(self, task_type: str, confidence_tracker: ConfidenceTracker) -> str:
        """Return the best model for this task type."""
        promoted = confidence_tracker.get_promoted(task_type)
        if promoted:
            return promoted  # use promoted model directly

        # Round-robin during burn-in for fair exposure
        idx = self._rr_counters[task_type] % len(self.candidates)
        self._rr_counters[task_type] += 1
        return self.candidates[idx]

Step 3 — Ground truth comparison

For each task, run it through BOTH the local model (candidate) and the cloud
baseline (ground truth). Never use the ground truth response in production —
it's only for evaluation.

import random
from statistics import median

JUDGE_SAMPLE_RATE = 0.15  # 15% judge sampling rate (see above)

async def evaluate_pair(prompt: str, local_response: str, gt_response: str,
                        task_type: str) -> float:
    # Layer 1: deterministic
    l1_score = validators.layer1(local_response, task_type)
    if l1_score == 0.0:
        return 0.0  # hard fail — safety or format violation

    # Layer 2: heuristic drift
    l2_score = validators.layer2(local_response, gt_response)

    # Sample judges (15%)
    if random.random() < JUDGE_SAMPLE_RATE:
        sonnet_score = await judge_sonnet(prompt, local_response, gt_response)
        mini_score = await judge_gpt4o_mini(prompt, local_response, gt_response)
        if abs(sonnet_score - mini_score) >= 0.20:
            gemini_score = await judge_gemini(prompt, local_response, gt_response)
            final = median([sonnet_score, mini_score, gemini_score])
        else:
            final = (sonnet_score + mini_score) / 2
        return weighted_score(l1_score, l2_score, final)
    else:
        return weighted_score(l1_score, l2_score, judge_score=None)

Step 4 — Confidence tracker

Track scores per model/task pair on disk (so restarts don't lose data):

# src/scoring/confidence.py — simplified
from dataclasses import dataclass

@dataclass
class ModelStats:
    model_id: str
    task_type: str
    scores: list[float]   # all scores (None excluded)
    promoted: bool = False
    demoted: bool = False

    @property
    def mean(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    @property
    def n(self) -> int:
        return len(self.scores)

    def should_promote(self) -> bool:
        return self.n >= 200 and self.mean >= 0.95 and not self.promoted

    def should_demote(self) -> bool:
        recent = self.scores[-50:]  # last 50 runs (proxy for the 7-day window)
        if not recent:
            return False  # no scored runs yet; nothing to judge
        pass_rate = sum(1 for s in recent if s >= 0.85) / len(recent)
        return pass_rate < 0.92 and not self.demoted
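The on-disk persistence could be as simple as one JSON file per model/task pair under data/scores/. The file naming here is an assumption, not the skill's exact layout:

```python
import json
from pathlib import Path

def save_stats(root: Path, model_id: str, task_type: str, scores: list) -> Path:
    """Persist one model/task score list so restarts don't lose data.
    ':' in Ollama tags is replaced to keep filenames portable."""
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{model_id.replace(':', '_')}__{task_type}.json"
    path.write_text(json.dumps({"model": model_id, "task": task_type,
                                "scores": scores}))
    return path

def load_scores(path: Path) -> list:
    """Reload scores; JSON null round-trips to None (unsampled runs)."""
    return json.loads(path.read_text())["scores"]
```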

Step 5 — Accumulator loop

Run this on a cron (every 10-20 minutes via launchd/systemd):

# run_accumulate.py
async def accumulate():
    task_type = pick_next_task()  # round-robin across task types
    prompt, gt_response = generate_task(task_type)  # call cloud baseline

    for candidate in router.get_candidates(task_type):
        local_response = await ollama_client.complete(candidate, prompt)
        score = await evaluate_pair(prompt, local_response, gt_response, task_type)
        confidence_tracker.record(candidate, task_type, score)

        if confidence_tracker.should_promote(candidate, task_type):
            router.promote(candidate, task_type)
            langfuse.log_promotion(candidate, task_type, confidence_tracker.stats(candidate, task_type))
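On systemd, the cron cadence above might be expressed as a oneshot service plus timer. Unit names and paths here are hypothetical:

```ini
# /etc/systemd/system/llm-eval-accumulate.service  (hypothetical paths)
[Unit]
Description=llm-eval-router accumulation cycle

[Service]
Type=oneshot
WorkingDirectory=/opt/llm-eval-router
ExecStart=/usr/bin/python3 run_accumulate.py

# /etc/systemd/system/llm-eval-accumulate.timer
[Unit]
Description=Run the accumulator every 15 minutes

[Timer]
OnBootSec=2min
OnUnitActiveSec=15min

[Install]
WantedBy=timers.target
```

Enable with `systemctl enable --now llm-eval-accumulate.timer`; launchd users would use a LaunchAgent plist with a StartInterval instead.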

Step 6 — Routing policy

# config/routing_policy.yaml
control_floor_model: phi4:latest   # never promote below this model's score

task_policies:
  policy_check_high_risk:
    never_local: true              # these tasks always use cloud model

  summarize:
    min_score_for_routing: 0.85
    fallback_chain: [qwen2.5, llama3.1, phi4]

  classify:
    min_score_for_routing: 0.90   # higher bar for classification
    fallback_chain: [qwen2.5, granite4, llama3.1]

Step 7 — API

Expose a simple HTTP API (FastAPI):

POST /run          — route a task through the best available model
GET  /health       — service status + promoted models + ollama connectivity
GET  /status       — full scoreboard (model × task × mean × n)
GET  /report       — cost heatmap + efficiency analysis
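The scoreboard behind GET /status reduces to flattening per-pair score lists into rows. A sketch with assumed field names (the skill's actual schema may differ):

```python
def status_payload(stats: dict[tuple[str, str], list]) -> list[dict]:
    """Flatten {(model, task): scores} into model x task x mean x n rows."""
    rows = []
    for (model, task), scores in sorted(stats.items()):
        valid = [s for s in scores if s is not None]  # None = unsampled run
        rows.append({
            "model": model,
            "task": task,
            "n": len(valid),
            "mean": round(sum(valid) / len(valid), 4) if valid else None,
        })
    return rows
```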

Key lessons learned (from 900+ production runs)

What worked:

  • phi4 as control floor: a measured floor model prevents "promoted because everyone
    else is also bad" errors. If the floor model beats a candidate, flag it — don't promote.
  • Thinking token stripping: CoT models (deepseek-r1, qwen2.5-coder with reasoning)
    must have <think>...</think> blocks stripped before evaluation. Otherwise Layer 2
    drift detection flags the reasoning chain as hallucinated content.
  • None ≠ 0.0 for unsampled runs: a run where no judge scored is not a failing run.
    Store None, exclude from mean. Mixing None with 0.0 poisons the mean.
  • require_json: False for plain-text tasks: classify and extract tasks that return
    formatted text (not JSON objects) will fail Layer 1 if you require JSON. Separate
    the "is the format correct" check from "is it valid JSON."
  • Per-task weight overrides: do not use one weight profile for all task types.
    Structural accuracy (difflib) is wrong for prose analysis — use semantic similarity as
    the primary signal for open-ended tasks. This lifted analyze mean from 0.44–0.59 to 0.70.
  • Structured output prompts for analyze tasks: add a system_prompt that specifies
    an exact output format (Finding/Recommendation/Confidence/Reasoning). Both GT and
    candidates follow the same template, improving structural alignment and reducing drift
    penalty. Without this, Layer 2 drift fires on differently-phrased but correct analyses.
  • MCP server for agentic access: expose the pipeline as MCP tools (run_task, get_status,
    get_champions, get_promotion_timeline, get_cost_heatmap). Lets an LLM agent
    query evaluation state without bespoke integration work.
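The thinking-token stripping above is essentially one regex pass before any validator or judge sees the response (a sketch; `strip_thinking` is an illustrative name):

```python
import re

# Remove <think>...</think> blocks (DOTALL so multi-line chains match),
# so CoT reasoning isn't flagged as hallucinated content by Layer 2.
_THINK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thinking(response: str) -> str:
    return _THINK.sub("", response).strip()
```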

What didn't work:

  • Large models (>9GB): gpt-oss:20b and similar required 39+ second inference —
    the latency dimension alone tanks the composite score. Practical ceiling is ~9GB models
    on 24GB unified memory to avoid GPU memory swapping.
  • 100% judge sampling: running every evaluation through the full Claude+GPT+Gemini
    panel costs more in judge API fees than you save by routing locally. Sample at 15%.
  • Chroma 1.5.1 with Python 3.14: Pydantic V1 BaseSettings incompatibility. Use
    Qdrant or a NumPy cosine store instead.
  • One-size-fits-all weight profiles: defining global weights at system init and never
    overriding per task type led to all analyze evals silently failing for 112+ runs.
    Lesson: evaluate your evaluator's scores by task type early — if a whole task type
    caps at a suspicious ceiling (e.g. 0.59), the metric is wrong, not the models.

Expected timeline

With a 20-minute accumulator cadence and 9 candidates × 7 task types:

  • First 50 runs per model: ~5 hours
  • First promotions (200 runs): ~1-2 days per model/task pair
  • Stable routing layer: 1-2 weeks

Cost estimate

Per accumulation cycle (one task, one model):

  • Ground truth: ~$0.002 (Claude Sonnet, ~500 input + 200 output tokens)
  • Judge sample (15%): ~$0.003 (Sonnet + GPT-4o-mini)
  • Local model: $0 (Ollama, on-device)

At 6 runs/hour × 24 hours: ~$0.70/day during burn-in.
After first promotions: drops to ~$0.10/day (90%+ of task volume local).

Statistics

Downloads 407
Stars 0
Current installs 2
All-time installs 2
Versions 6
Comments 0
Created Feb 26, 2026
Updated Apr 5, 2026

Latest Changes

v1.2.2 · Mar 28, 2026

Add security_notes: no telemetry, all API calls use user's own keys, local Ollama never sends data externally
