The first benchmark track focuses on agent memory: what should be remembered, what should be forgotten, how recall behaves under change, and whether live runtime context injection works.
Core principles
Local-first: benchmark suites and reports are JSON artifacts that can run locally.
Deterministic by default: scoring uses deterministic rules unless an external judge is explicitly configured.
Agent-realistic: tests cover memory lifecycle, stale tasks, source isolation, runtime hooks, and tool-adjacent behavior.
Privacy-gated: direct-chat memories, do-not-remember instructions, and credential-like values are explicitly tested.
Runtime-aware: controlled harness tests are separate from live Gateway/plugin/runtime tests.
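The deterministic-by-default principle can be sketched as a small scorer that needs no external judge. This is a minimal illustration, not the benchmark's actual code: the fixture fields `id`, `expected`, and `forbidden` are assumed names for this sketch.

```python
def score_case(transcript: str, case: dict) -> dict:
    """Deterministically score one benchmark case against a transcript.

    `case` is a hypothetical suite-fixture entry:
      expected  - substrings that must appear in the transcript
      forbidden - private/secret substrings that must not appear
    """
    missing = [e for e in case.get("expected", []) if e not in transcript]
    leaked = [f for f in case.get("forbidden", []) if f in transcript]
    return {
        "id": case.get("id"),
        "pass": not missing and not leaked,
        # Evidence is kept in the result so reports can be reviewed later.
        "missing": missing,
        "leaked": leaked,
    }
```

Because the check is pure string matching over a fixed fixture, two runs on the same transcript always produce the same result, which is what makes local, judge-free scoring possible.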
Scoring
What a pass means
Expected observations or answers are produced.
Forbidden/private/secret content is excluded.
Reports preserve evidence for later review.
Category scores expose where a system fails.
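The last criterion, per-category failure visibility, can be sketched as a simple aggregation over case results. The `category` and `pass` field names are assumptions for this sketch, not the report schema itself.

```python
from collections import defaultdict

def category_scores(results: list[dict]) -> dict[str, float]:
    """Aggregate per-case pass/fail results into per-category pass rates.

    Each result is a hypothetical dict with a `category` label and a
    boolean `pass` flag; the output maps category -> fraction passed.
    """
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for r in results:
        cat = r.get("category", "uncategorized")
        totals[cat][0] += 1 if r["pass"] else 0  # passes
        totals[cat][1] += 1                      # attempts
    return {cat: passed / total for cat, (passed, total) in totals.items()}
```

A dashboard built on these rates immediately shows, for example, that a system passes recall cases but fails privacy cases.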
Artifacts
What each run should keep
Suite JSON fixture.
Report JSON with pass/fail and category scores.
Generated dashboard/diff/prompt artifacts when available.
Future: checksums, runner version, and verified-run signature.
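The artifact list above, including the future checksum idea, might look something like the following sketch. The report layout and the `write_report` helper are illustrative assumptions, not the runner's real schema.

```python
import hashlib
import json

def write_report(path: str, suite_id: str, results: list, scores: dict) -> str:
    """Write a run report as a JSON artifact and return its checksum.

    A hypothetical layout: suite id, per-case results with evidence,
    and category scores. Sorting keys keeps the serialization stable,
    so the SHA-256 checksum is reproducible across identical runs.
    """
    report = {
        "suite": suite_id,
        "results": results,
        "category_scores": scores,
    }
    blob = json.dumps(report, sort_keys=True).encode("utf-8")
    report["sha256"] = hashlib.sha256(blob).hexdigest()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(report, f, indent=2, sort_keys=True)
    return report["sha256"]
```

Keeping the checksum inside the report is one possible stepping stone toward the verified-run signature mentioned above: a signer would sign the digest rather than the whole file.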