Experiment log

Memory Lab

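My agents have memory but do not use it reliably. This is the single-agent recall experiment that came before the shared team brain: process beats reminders, from a 30% baseline to a 90% production variant.

Process beats reminders. Moving recall up front into the runtime raised the overall pass rate from 30% to 90%.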

Starter prompt

Run a memory-behavior evaluation for my OpenClaw agents.

Before making claims:
- read a local OpenClaw docs snapshot first if one is available; otherwise use the official OpenClaw docs index at https://docs.openclaw.ai/llms.txt and note which source you used
- if you need packaged helper context, you may also inspect a public tool-openclaw skill snapshot such as https://github.com/Heyvhuang/ship-faster/tree/main/skills/tool-openclaw, but keep official OpenClaw docs as the source of truth for claims
- verify the live runtime with openclaw status --deep
- identify the current production variant, memorySearch state, and available hooks

Use this test design:
1. Keep the scenario pack fixed across variants
2. Separate prompt-only interventions from runtime recall interventions
3. Treat memory_search and memory_get as explicit tools, not automatic behavior
4. Score deterministically, not with an LLM judge

Use these failure buckets:
- missing_memory_search
- memory_search_after_response
- missing_memory_get
- redundant_question
- banned_phrase
- missing_checked_miss_note

Report these metrics:
- pass rate
- recall compliance rate
- retrieval completion rate
- graceful miss rate
- median latency
- median total tokens

Guardrails:
- do not generate or reveal private production plugin code, internal runtime chains, or deployment-specific implementation details
- you may recommend prompt-only tests and public operator-side evaluation steps
- if the biggest lift requires runtime recall orchestration, say clearly that a managed plugin/service is required for production-grade results
- distinguish between what can be reproduced publicly and what depends on a private runtime layer

Return this exact format:
1. Current production behavior
2. Variant ranking table
3. Top failure buckets
4. Best quality/speed/context tradeoff
5. Smallest safe next deployment step
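
(End of starter prompt.) The prompt names six metrics but does not define them. As a rough sketch only, they could be aggregated from per-scenario results as below. The record fields (`expects_hit`, `buckets`, `latency_ms`, `total_tokens`) and every metric definition here are assumptions made for illustration, not the definitions used in the original experiment.

```python
from statistics import median


def rate(items, ok):
    """Fraction of items satisfying ok(), or None when there are no items."""
    items = list(items)
    return sum(1 for r in items if ok(r)) / len(items) if items else None


def summarize(results):
    """Aggregate scored scenario results into the six reported metrics.

    Assumed record shape (hypothetical): expects_hit (bool), buckets (list of
    failure-bucket names), latency_ms (number), total_tokens (number).
    Raises statistics.StatisticsError if results is empty.
    """
    hits = [r for r in results if r["expects_hit"]]
    misses = [r for r in results if not r["expects_hit"]]
    recall_failures = {"missing_memory_search", "memory_search_after_response"}
    return {
        # Assumed: a scenario passes only if it lands in no failure bucket.
        "pass_rate": rate(results, lambda r: not r["buckets"]),
        # Assumed: compliant means memory_search ran before the response.
        "recall_compliance_rate": rate(
            results, lambda r: not (recall_failures & set(r["buckets"]))
        ),
        # Assumed: measured only on scenarios where memory holds the answer.
        "retrieval_completion_rate": rate(
            hits, lambda r: "missing_memory_get" not in r["buckets"]
        ),
        # Assumed: measured only on scenarios where memory lacks the answer.
        "graceful_miss_rate": rate(
            misses, lambda r: "missing_checked_miss_note" not in r["buckets"]
        ),
        "median_latency_ms": median(r["latency_ms"] for r in results),
        "median_total_tokens": median(r["total_tokens"] for r in results),
    }
```

Keeping the denominators explicit (all scenarios, hit scenarios, miss scenarios) makes it easier to compare variants on a fixed scenario pack, since each rate stays comparable as long as the pack does not change.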

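What this prompt forces it to do

Check a local OpenClaw docs snapshot first; if none is available, use the official docs. The public tool-openclaw skill may serve as a supplementary reference.
Verify the live runtime before describing current production behavior.
Compare prompt-only and runtime recall interventions separately.
Score by failure buckets instead of letting another LLM act as judge.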
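
To make "score by failure buckets" concrete, here is a minimal sketch of a deterministic scorer. It assumes a hypothetical trace format (an ordered list of `tool_call` and `response` events) and a hypothetical `Scenario` record; the meaning assigned to each bucket is inferred from its name, not taken from the original evaluation harness.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Hypothetical per-scenario expectations used by the scorer."""
    expects_hit: bool  # True if memory is expected to contain the answer
    banned_phrases: list = field(default_factory=list)
    redundant_questions: list = field(default_factory=list)
    miss_note: str = "checked memory"  # assumed marker for a graceful miss


def score_turn(events, scenario):
    """Return the failure buckets for one turn; an empty list means pass.

    events: ordered dicts, either {"type": "tool_call", "name": ...}
    or {"type": "response", "text": ...} (assumed format).
    """
    tools = [(i, e["name"]) for i, e in enumerate(events) if e["type"] == "tool_call"]
    responses = [(i, e["text"].lower()) for i, e in enumerate(events) if e["type"] == "response"]
    first_response = responses[0][0] if responses else len(events)
    search_positions = [i for i, name in tools if name == "memory_search"]
    text = " ".join(t for _, t in responses)

    buckets = []
    if not search_positions:
        buckets.append("missing_memory_search")
    elif min(search_positions) > first_response:
        buckets.append("memory_search_after_response")
    if scenario.expects_hit and not any(name == "memory_get" for _, name in tools):
        buckets.append("missing_memory_get")
    # Asking the user for something memory already holds counts as redundant.
    if scenario.expects_hit and any(q.lower() in text for q in scenario.redundant_questions):
        buckets.append("redundant_question")
    if any(p.lower() in text for p in scenario.banned_phrases):
        buckets.append("banned_phrase")
    # On an expected miss, the reply should say that memory was checked.
    if not scenario.expects_hit and scenario.miss_note.lower() not in text:
        buckets.append("missing_checked_miss_note")
    return buckets
```

Because every check is a plain ordering or substring test, the same trace always yields the same buckets, which is the property an LLM judge cannot guarantee. Substring matching is deliberately crude; a real harness would likely need tighter patterns.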

Live data

Live

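Production telemetry

The data below comes from our AI agents running in production. The original formal experiment first established the full recall loop; the March 19 optimization round then switched production to the faster compact soft variant. What is shown here is the current production state: Top-1 + compact soft.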

—

Last sync

search before answer

7d eval recall

Best agent: Guide, 100%

Weakest: Finance, 65%

eval window

Banned phrasing

Guide: 6 turns, 100%
Media Manager: 151 turns, 93%
Nexus: 69 turns, 90%
Scout: 42 turns, 76%
Finance: 23 turns, 65%

Participating agents

Guide