The first benchmark track focuses on agent memory: what should be remembered, what should be forgotten, how recall behaves under change, and whether live runtime context injection works.
Core principles
Local-first: benchmark suites and reports are JSON artifacts that can run locally.
Deterministic by default: scoring uses deterministic rules unless an external judge is explicitly configured.
Agent-realistic: tests cover memory lifecycle, stale tasks, source isolation, runtime hooks, and tool-adjacent behavior.
Privacy-gated: direct-chat memories, do-not-remember instructions, and credential-like values are explicitly tested.
Runtime-aware: controlled harness tests are separate from live Gateway/plugin/runtime tests.
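The deterministic-by-default principle can be sketched as a small scorer that needs no external judge. This is a minimal illustration, not the benchmark's actual code: the fixture fields `id`, `expected`, and `forbidden` are assumed names for this sketch.

```python
def score_case(transcript: str, case: dict) -> dict:
    """Deterministically score one benchmark case against a transcript.

    `case` is a hypothetical suite-fixture entry:
      expected  - substrings that must appear in the transcript
      forbidden - private/secret substrings that must not appear
    """
    missing = [e for e in case.get("expected", []) if e not in transcript]
    leaked = [f for f in case.get("forbidden", []) if f in transcript]
    return {
        "id": case.get("id"),
        "pass": not missing and not leaked,
        # Evidence is kept in the result so reports can be reviewed later.
        "missing": missing,
        "leaked": leaked,
    }
```

Because the check is pure string matching over a fixed fixture, two runs on the same transcript always produce the same result, which is what makes local, judge-free scoring possible.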
Scoring
What a pass means
Expected observations or answers are produced.
Forbidden/private/secret content is excluded.
Reports preserve evidence for later review.
Category scores expose where a system fails.
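The last criterion, per-category failure visibility, can be sketched as a simple aggregation over case results. The `category` and `pass` field names are assumptions for this sketch, not the report schema itself.

```python
from collections import defaultdict

def category_scores(results: list[dict]) -> dict[str, float]:
    """Aggregate per-case pass/fail results into per-category pass rates.

    Each result is a hypothetical dict with a `category` label and a
    boolean `pass` flag; the output maps category -> fraction passed.
    """
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for r in results:
        cat = r.get("category", "uncategorized")
        totals[cat][0] += 1 if r["pass"] else 0  # passes
        totals[cat][1] += 1                      # attempts
    return {cat: passed / total for cat, (passed, total) in totals.items()}
```

A dashboard built on these rates immediately shows, for example, that a system passes recall cases but fails privacy cases.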
Artifacts
What each run should keep
Suite JSON fixture.
Report JSON with pass/fail and category scores.
Generated dashboard/diff/prompt artifacts when available.
Future: checksums, runner version, and verified-run signature.
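The artifact list above, including the future checksum idea, might look something like the following sketch. The report layout and the `write_report` helper are illustrative assumptions, not the runner's real schema.

```python
import hashlib
import json

def write_report(path: str, suite_id: str, results: list, scores: dict) -> str:
    """Write a run report as a JSON artifact and return its checksum.

    A hypothetical layout: suite id, per-case results with evidence,
    and category scores. Sorting keys keeps the serialization stable,
    so the SHA-256 checksum is reproducible across identical runs.
    """
    report = {
        "suite": suite_id,
        "results": results,
        "category_scores": scores,
    }
    blob = json.dumps(report, sort_keys=True).encode("utf-8")
    report["sha256"] = hashlib.sha256(blob).hexdigest()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(report, f, indent=2, sort_keys=True)
    return report["sha256"]
```

Keeping the checksum inside the report is one possible stepping stone toward the verified-run signature mentioned above: a signer would sign the digest rather than the whole file.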