Evals and Tracing
The AI stack has two feedback loops: offline evals (deterministic where possible) for repeatable quality checks, and tracing for understanding live behavior.
Eval Harness
The eval harness lives under server/evals/. See EVAL_HARNESS.md for the
full operating guide.
Current slices include:
- Citation accuracy for rules questions.
- Encounter difficulty math.
- NPC generation quality with an LLM judge.
Run from server/ unless the harness docs say otherwise:
uv run pytest tests/test_evals_cases.py
uv run pytest tests/test_evals_citation_metrics.py
Some evals require a configured campaign or model provider. Keep judge-free metrics judge-free; do not add model calls to deterministic score paths.
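As a rough illustration of a judge-free score path, the sketch below computes citation precision and recall from plain sets of source IDs. The names `CitationScore` and `score_citations` are hypothetical and are not taken from the harness; the point is only that the function makes no model or network calls, so the same inputs always produce the same score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitationScore:
    precision: float
    recall: float


def score_citations(expected: set[str], cited: set[str]) -> CitationScore:
    """Compare cited source IDs against the expected set (hypothetical helper).

    Pure set arithmetic: no model calls, no I/O, no randomness.
    """
    hits = len(expected & cited)
    # With no citations at all, precision is defined here as 0.0.
    precision = hits / len(cited) if cited else 0.0
    # With nothing expected, there is nothing to miss, so recall is 1.0.
    recall = hits / len(expected) if expected else 1.0
    return CitationScore(precision=precision, recall=recall)
```

If a metric like this ever needs a qualitative judgment, that judgment likely belongs in a separate judged slice rather than inside the deterministic function.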
Langfuse
Langfuse tracing is optional and enabled only when both public and secret keys are present. The app sets SDK environment variables during startup, instruments Anthropic, and flushes on shutdown.
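The "both keys or nothing" rule could be sketched roughly as follows. This is not the app's actual startup code: the function name and the settings keys are hypothetical, and the sketch assumes the SDK reads the `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, and `LANGFUSE_HOST` environment variables.

```python
import os
from collections.abc import Mapping, MutableMapping
from typing import Optional


def configure_langfuse_env(
    settings: Mapping[str, Optional[str]],
    env: MutableMapping[str, str] = os.environ,
) -> bool:
    """Export Langfuse settings to the environment only when both keys exist.

    Hypothetical helper: returns True if tracing should be enabled.
    """
    public_key = settings.get("langfuse_public_key")
    secret_key = settings.get("langfuse_secret_key")
    if not (public_key and secret_key):
        # A missing or empty key means tracing stays off and env is untouched.
        return False
    env["LANGFUSE_PUBLIC_KEY"] = public_key
    env["LANGFUSE_SECRET_KEY"] = secret_key
    host = settings.get("langfuse_host")
    if host:
        env["LANGFUSE_HOST"] = host
    return True
```

Returning a boolean lets the caller skip instrumentation and the shutdown flush entirely when tracing is disabled.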
Important trace names and tags include:
| Operation | Why It Matters |
|---|---|
| chat.answer | Non-streaming rules answers. |
| chat.stream | Streaming chat turns, tagged by mode. |
| ingest.document | Document extraction/chunk/embed processing. |
| process.document | Reprocessing saved extracted Markdown. |
| npc.generate | Live NPC generation, useful for evaluators. |
Tracing must degrade to a no-op when unconfigured. Avoid changes that cause the SDK to log alarming authentication errors during normal local runs.
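One way to picture the no-op requirement is a thin wrapper that returns a do-nothing context manager when no tracer is configured. The names `trace_span` and `recording_span_factory` below are hypothetical, and the factory is a stand-in for whatever the real SDK provides; this is a sketch of the pattern, not the app's tracing code.

```python
from collections.abc import Callable, Iterator
from contextlib import AbstractContextManager, contextmanager, nullcontext
from typing import Optional

# Hypothetical type: anything that turns a trace name into a context manager.
SpanFactory = Callable[[str], AbstractContextManager]


def trace_span(
    name: str, factory: Optional[SpanFactory] = None
) -> AbstractContextManager:
    """Open a span if tracing is configured; otherwise do nothing."""
    if factory is None:
        # Unconfigured: no SDK import, no network call, no log output.
        return nullcontext()
    return factory(name)


def recording_span_factory(log: list[str]) -> SpanFactory:
    """Stand-in for a real tracer: records span names in a list."""

    @contextmanager
    def span(name: str) -> Iterator[str]:
        log.append(name)
        yield name

    return span
```

Call sites can then write `with trace_span("chat.answer", factory):` unconditionally, and local runs without keys never touch the SDK.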
LangSmith and LangGraph Studio
LangSmith tracing is configured through LANGSMITH_* settings. The graph is
registered in server/langgraph.json.
Start Studio from server/:
uv run langgraph dev
Use it to inspect the campaign chat graph, step through mode routing, and see node-level behavior against local data.
When to Add an Eval
Add an eval when a behavior is important enough that a future model, prompt, or retrieval change could quietly regress it. Prefer deterministic measurements where possible. Use LLM judges only when the target behavior is qualitative and the rubric is explicit.