
Evals and Tracing

The AI stack has two feedback loops: deterministic, offline evals for repeatable quality checks, and tracing for understanding live behavior.

Eval Harness

The eval harness lives under server/evals/. See EVAL_HARNESS.md for the full operating guide.

Current slices include:

  • Citation accuracy for rules questions.
  • Encounter difficulty math.
  • NPC generation quality with an LLM judge.

Run from server/ unless the harness docs say otherwise:

uv run pytest tests/test_evals_cases.py
uv run pytest tests/test_evals_citation_metrics.py

Some evals require a configured campaign or model provider. Keep judge-free metrics judge-free; do not add model calls to deterministic score paths.
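As a minimal sketch of what a judge-free score path can look like, the helper below computes citation precision and recall from plain sets. The names CitationScores, citation_scores, cited, and expected are hypothetical and are not taken from the harness; EVAL_HARNESS.md describes the real metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitationScores:
    precision: float
    recall: float


def citation_scores(cited: set[str], expected: set[str]) -> CitationScores:
    """Score cited source IDs against the expected ones using set math only.

    Hypothetical helper: it makes no model or network calls, so the same
    inputs always produce the same score.
    """
    hits = len(cited & expected)
    # Guard the empty cases so an answer with no citations scores 0.0
    # instead of raising ZeroDivisionError.
    precision = hits / len(cited) if cited else 0.0
    recall = hits / len(expected) if expected else 0.0
    return CitationScores(precision=precision, recall=recall)
```

Keeping the metric a pure function of its inputs is what makes it safe to run in CI without a model provider; any judge-based score would live in a separate code path.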

Langfuse

Langfuse tracing is optional and is enabled only when both the public and secret keys are present. The app sets SDK environment variables during startup, instruments the Anthropic client, and flushes on shutdown.
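The "both keys or nothing" rule can be sketched as a small startup helper. This is an illustration under assumptions, not the app's actual startup code: configure_langfuse_env and its parameters are hypothetical, and the variable names LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY should be checked against the Langfuse SDK version in use.

```python
import os
from typing import MutableMapping, Optional


def configure_langfuse_env(
    public_key: Optional[str],
    secret_key: Optional[str],
    environ: MutableMapping[str, str] = os.environ,
) -> bool:
    """Export Langfuse SDK variables only when both keys are set.

    Hypothetical helper. Returns True when tracing should be enabled
    and False when it should stay off.
    """
    if not (public_key and secret_key):
        # A missing or empty key leaves the environment untouched,
        # so the SDK never sees a half-configured state.
        return False
    environ["LANGFUSE_PUBLIC_KEY"] = public_key
    environ["LANGFUSE_SECRET_KEY"] = secret_key
    return True
```

Passing environ explicitly keeps the helper testable with a plain dict; the caller would use the returned flag to decide whether to instrument the client and register a flush on shutdown.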

Important trace names and tags include:

Operation         Why It Matters
chat.answer       Non-streaming rules answers.
chat.stream       Streaming chat turns, tagged by mode.
ingest.document   Document extraction/chunk/embed processing.
process.document  Reprocessing saved extracted Markdown.
npc.generate      Live NPC generation, useful for evaluators.

Tracing must degrade to a no-op when unconfigured. Avoid changes that log scary SDK authentication errors in normal local runs.
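One way to get the no-op behavior is to route every span through a wrapper that does nothing when no tracer is configured. The sketch below is an assumption about shape, not the app's implementation: traced and SpanFactory are hypothetical names, and the factory stands in for whatever the tracing SDK provides to open a span.

```python
from contextlib import contextmanager
from typing import Any, Callable, ContextManager, Iterator, Optional

# Hypothetical type: given a trace name, returns a context manager
# that opens and closes a real span.
SpanFactory = Callable[[str], ContextManager[Any]]


@contextmanager
def traced(name: str, span_factory: Optional[SpanFactory] = None) -> Iterator[Any]:
    """Open a span when a factory is configured; otherwise do nothing."""
    if span_factory is None:
        # Unconfigured: no SDK call, no network request, no log output.
        yield None
        return
    with span_factory(name) as span:
        yield span
```

Call sites then look the same in both modes, for example `with traced("chat.answer", factory): ...`, and a local run without keys never touches the SDK, so it has no opportunity to log authentication errors.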

LangSmith and LangGraph Studio

LangSmith tracing is configured through LANGSMITH_* settings. The graph is registered in server/langgraph.json.

Start Studio from server/:

uv run langgraph dev

Use it to inspect the campaign chat graph, step through mode routing, and see node-level behavior against local data.

When to Add an Eval

Add an eval when a behavior is important enough that a future model, prompt, or retrieval change could quietly regress it. Prefer deterministic measurements where possible. Use LLM judges only when the target behavior is qualitative and the rubric is explicit.