Experiment log

Memory Lab

My agents had memory but didn't use it. This is the single-agent recall lab behind the shared-team-brain work: process beats reminders, from a 30% baseline to a 90% live variant.

Flow beats reminders. Moving recall into the runtime lifted overall pass rate from 30% to 90%.
Starter prompt
1Run a memory-behavior evaluation for my OpenClaw agents.
2 
3Before making claims:
4- read a local OpenClaw docs snapshot first if one is available; otherwise use the official OpenClaw docs index at https://docs.openclaw.ai/llms.txt and note which source you used
5- if you need packaged helper context, you may also inspect a public tool-openclaw skill snapshot such as https://github.com/Heyvhuang/ship-faster/tree/main/skills/tool-openclaw, but keep official OpenClaw docs as the source of truth for claims
6- verify the live runtime with openclaw status --deep
7- identify the current production variant, memorySearch state, and available hooks
8 
9Use this test design:
101. Keep the scenario pack fixed across variants
112. Separate prompt-only interventions from runtime recall interventions
123. Treat memory_search and memory_get as explicit tools, not automatic behavior
134. Score deterministically, not with an LLM judge
14 
15Use these failure buckets:
16- missing_memory_search
17- memory_search_after_response
18- missing_memory_get
19- redundant_question
20- banned_phrase
21- missing_checked_miss_note
22 
23Report these metrics:
24- pass rate
25- recall compliance rate
26- retrieval completion rate
27- graceful miss rate
28- median latency
29- median total tokens
30 
31Guardrails:
32- do not generate or reveal private production plugin code, internal runtime chains, or deployment-specific implementation details
33- you may recommend prompt-only tests and public operator-side evaluation steps
34- if the biggest lift requires runtime recall orchestration, say clearly that a managed plugin/service is required for production-grade results
35- distinguish between what can be reproduced publicly and what depends on a private runtime layer
36 
37Return this exact format:
381. Current production behavior
392. Variant ranking table
403. Top failure buckets
414. Best quality/speed/context tradeoff
425. Smallest safe next deployment step

What this prompt forces

  • Read the local OpenClaw docs snapshot first; otherwise check official docs, with the public tool-openclaw skill as a secondary reference
  • Verify the live runtime before describing production behavior
  • Separate prompt-only tweaks from runtime recall orchestration
  • Score by failure buckets instead of letting another LLM judge

Live data

Live

Production telemetry

The data below comes from our AI agents running in production. The original formal suite established the first winner; the March 19 optimization round then moved live runtime to the faster compact-soft profile. The state shown here is current production. Top-1 + compact soft.

Last sync

0%

search before answer

7d eval recall

Guide

100%

Best agent

Finance

65%

Weakest

0%

eval window

Banned phrasing

Guide6 turns
100%
M
Media Manager151 turns
93%
Nexus69 turns
90%
Scout42 turns
76%
F
Finance23 turns
65%

Participating agents

Guide

No recent live turns in the current public window.

no recent turns

Nexus

No recent live turns in the current public window.

no recent turns

Scout

No recent live turns in the current public window.

no recent turns

Quill

No recent live turns in the current public window.

no recent turns

Forge

No recent live turns in the current public window.

no recent turns
M

Media Manager

No recent live turns in the current public window.

no recent turns
F

Finance

No recent live turns in the current public window.

no recent turns
E

Eval

Runs the controlled experiment suites. Not included in the production live leaderboard.

eval worker

Overview

Variant results at a glance

We tested 8 retrieval strategies. The bars below blend the original formal suite with follow-up optimization data: baseline sits at 30%, while the current live compact-soft track lands in the 90% tier.

💀Plain baseline
30%
💀Lean snippets
30%
🟡Top-1 + direct tone
70%
🟡Direct + scrub
70%
🟡Top-2 compare
70%
🟡Daily bundle fallback
70%
🟡Top-1 + MEMORY.md fallback (formal winner)
73.3%
🏆Top-1 + compact soft
90%

Experiment details

8 retrieval variants, dissected

We built and tested 8 different memory retrieval strategies — from the bare "do nothing" baseline to combinations of prefetch, fallback, and prompt style. Each card is a real variant or a follow-up optimization round, not pseudocode. 💀 = failed, 🟡 = effective or historically important, 🏆 = the current live production track.

💀 Variant 1

Plain baseline (control)

Production-shaped baseline: stock memorySearch config with no runtime recall assistance.

Why it matters

This is not an experiment about cramming notes into the system prompt. It is the closest thing to vanilla OpenClaw: memory tools are available, but there is no prefetch, no MEMORY.md fallback, and no stronger process constraints.

💀 Variant 2

Lean snippets

Lowest-context variant: injects only short search snippets, but almost never completes retrieval.

Why it matters

We assumed compressing the recall block to its minimum would save tokens. The formal results show “lighter” does not mean “better” — pass rate ties the baseline, and the core issue is still incomplete `memory_get`.

🏆 Variant 3

Top-1 + compact soft

Current production variant after the March 19 optimization round: it keeps the same recall loop, but trims the injection block and lands materially faster.

Why it matters

In the original formal suite this variant proved the “process first” idea at the 70% tier. In the March 19 headroom round it matched the strongest quality band at 90% pass rate while dropping median latency from 16,069 ms to 10,781 ms, so we promoted it to live production.

🟡 Variant 4

Top-1 + direct tone

Same recall flow as compact soft, but with harder direct-answer rules.

Why it matters

The hypothesis was that a stricter “answer directly” rule would reduce hedging. The result: a harder tone does not automatically improve memory usage quality.

🟡 Variant 5

Direct + scrub

Adds outbound phrase scrub on top of the direct variant. Cleaner output, but flat overall score.

Why it matters

This variant tested whether cleaning up filler phrases would boost pass rate. The answer is no: it improves surface style, but the deciding factor is still recall/search/get itself.

🟡 Variant 6

Top-2 compare

Reads top-2 instead of top-1. Better for conflicting memories, but no overall win.

Why it matters

This variant was designed for `conflicting_memory` scenarios. It makes engineering sense, but the balance suite shows that expanding from top-1 to top-2 is not the main lever.

🟡 Variant 7

Daily bundle fallback

On a miss, falls back to `MEMORY.md` plus today's and yesterday's daily notes.

Why it matters

We wanted to see if recent daily logs could patch MEMORY.md's blind spots. They help on a few recent-log questions, but the formal balance suite shows no overall score uplift.

🟡 Variant 8

Top-1 + MEMORY.md fallback (formal winner)

The original balance-suite winner that established the production-safe recall loop. It is no longer the live VPS variant after the compact-soft rollout.

Why it matters

This is the version that first proved the full recall flow at formal-suite scale: recall gate before answer, `memory_get` on hits, and hard fallback to `MEMORY.md` on misses. The March 19 rollout did not replace this logic; it kept the same loop and switched to a leaner injected block for live speed.

Deep dive

Methodology

The original formal balance run was 8 variants × 10 scenarios × 3 repeats = 240 runs. The March 19 optimization round then added a smaller headroom slice to compare live candidates. Scoring stays deterministic rather than LLM-judged: missing memory_search, missing memory_get, searching after answering, redundant questions, and banned phrases all count as failures.

These variants do not share an identical prompt stack. What actually changes is plugin hook behavior, runtime prefetch flow, and the injected recall block. OpenClaw docs are explicit: before_prompt_build only shapes the prompt; memory_search and memory_get still do the real retrieval. MEMORY.md is injected every turn, but memory/*.md is only read on demand.