Experiment log

Memory Lab

My agents had memory but didn't use it. This is the single-agent recall lab behind the shared-team-brain work: process beats reminders, from a 30% baseline to a 90% live variant.

Flow beats reminders. Moving recall into the runtime lifted overall pass rate from 30% to 90%.

Starter prompt

1Run a memory-behavior evaluation for my OpenClaw agents.
2 
3Before making claims:
4- read a local OpenClaw docs snapshot first if one is available; otherwise use the official OpenClaw docs index at https://docs.openclaw.ai/llms.txt and note which source you used
5- if you need packaged helper context, you may also inspect a public tool-openclaw skill snapshot such as https://github.com/Heyvhuang/ship-faster/tree/main/skills/tool-openclaw, but keep official OpenClaw docs as the source of truth for claims
6- verify the live runtime with openclaw status --deep
7- identify the current production variant, memorySearch state, and available hooks
8 
9Use this test design:
101. Keep the scenario pack fixed across variants
112. Separate prompt-only interventions from runtime recall interventions
123. Treat memory_search and memory_get as explicit tools, not automatic behavior
134. Score deterministically, not with an LLM judge
14 
15Use these failure buckets:
16- missing_memory_search
17- memory_search_after_response
18- missing_memory_get
19- redundant_question
20- banned_phrase
21- missing_checked_miss_note
22 
23Report these metrics:
24- pass rate
25- recall compliance rate
26- retrieval completion rate
27- graceful miss rate
28- median latency
29- median total tokens
30 
31Guardrails:
32- do not generate or reveal private production plugin code, internal runtime chains, or deployment-specific implementation details
33- you may recommend prompt-only tests and public operator-side evaluation steps
34- if the biggest lift requires runtime recall orchestration, say clearly that a managed plugin/service is required for production-grade results
35- distinguish between what can be reproduced publicly and what depends on a private runtime layer
36 
37Return this exact format:
381. Current production behavior
392. Variant ranking table
403. Top failure buckets
414. Best quality/speed/context tradeoff
425. Smallest safe next deployment step

What this prompt forces

Read the local OpenClaw docs snapshot first; otherwise check official docs, with the public tool-openclaw skill as a secondary reference
Verify the live runtime before describing production behavior
Separate prompt-only tweaks from runtime recall orchestration
Score by failure buckets instead of letting another LLM judge

Live data

Live

Production telemetry

The data below comes from our AI agents running in production. The original formal suite established the first winner; the March 19 optimization round then moved live runtime to the faster compact-soft profile. The state shown here is current production. Top-1 + compact soft.

—

Last sync

search before answer

7d eval recall

Guide

100%

Best agent

Finance

65%

Weakest

eval window

Banned phrasing

Guide6 turns

100%

Media Manager151 turns

93%

Nexus69 turns

90%

Scout42 turns

76%

Finance23 turns

65%

Participating agents

Guide