An AI That Confidently Quotes the Wrong Note Is Scarier Than One That Admits It's Lost
I came across a tweet from Garry: he ran ZeroEntropy against his own 120k markdown gbrain, switching the embedding / reranker path over for a head-to-head. ZE won 11/20 queries, faster and cheaper. I
Written by
Vox

An AI That Confidently Quotes the Wrong Note Is Scarier Than One That Admits It's Lost
I came across a tweet from Garry: he ran ZeroEntropy against his own 120k markdown gbrain, switching the embedding / reranker path over for a head-to-head. ZE won 11/20 queries, faster and cheaper.
I stared at that tweet for a while. ZE winning the head-to-head is one story. The bigger story is what the tweet points to: once a personal AI brain crosses a certain size, the AI says "I don't know" less often, and confidently surfaces the wrong note more often.
What's retrieval? The AI can't read all your notes in one shot. Before it answers, it picks maybe a dozen cards from your note pool and lays them on the desk, then uses those dozen to answer. Which ones get picked, in what order: that's retrieval's job.
ZeroEntropy sells specialized tools for this layer: rerankers (models that re-order results), embedders (models that turn text into vectors so they can be searched), plus an API you can plug into your retrieval pipeline.
As a personal AI brain grows from a few thousand notes to 10k, 100k, this layer's default settings decide how reliable its answers will be.
1. What Stopped Me Wasn't the Size
When most people hear "120k markdown brain," their first reaction is about how much stuff someone stored.
But if those files are actually going to be used by an agent, storage is the easy part. The real question: when the agent only gets a small slice of context at a time, who decides which few pages it reads first?
That's where the retrieval layer sits.
It looks like plumbing. Chunk the notes into pieces, build an index, search candidates, rerank, filter out noise, then drop a dozen cards in front of the AI.
In a large brain, that layer becomes the memory itself.
2. Small Brains and Large Brains Break Differently
A small brain with 1k files can usually survive mediocre retrieval.
It runs slow, plays dumb, misses a note. But the failure is usually still visible: you can sense the answer is off. Annoying, but findable.
After 10k files, the failure mode changes.
With more files, retrieval's first pass also surfaces more "semantically similar but factually outdated" candidates. Without a second pass that ranks by time and authority, similar-but-old notes can easily beat the rule that should actually win.
The agent might surface a daily note from six months ago and sound confident about it. The answer feels complete enough that you believe it. That's worse than "I don't know."
False confidence > confused silence.
A confused intern asks where the file is. A bad librarian pulls an expired card from an old drawer and tells you, with full conviction, that this is the current policy.
3. The Real Problem Isn't Similarity
Say I ask an agent: what's the current posting rule?
It might surface all of these candidates:
a daily note from six months ago (old log-style note)
MEMORY.md updated last week (long-term memory doc)
today's AGENTS.md policy (latest rule doc)
a chat summary (conversation recap)
an unpublished draft note (draft)
If retrieval only looks at text similarity, all five look relevant.
A good answer goes one step further: who is more authoritative, who is more recent, what kind of source should win.
Today's AGENTS.md policy should outrank an old daily note. A canonical doc should outrank a chat summary. A published rule should outrank a draft. Stale facts should be demoted, or removed entirely.
That's actually the hard part of personal AI memory. Get this layer right first, and your AI tells half as many lies as the competition.
4. Retrieval Defaults Checklist
When I look at any personal AI brain now, the seven defaults I check first:
What gets embedded? Before retrieval can do anything, your notes get cut into small pieces and dropped into a searchable pool. That step is called embedding. If complete documents, scratch drafts, and quick logs all get cut into chunks of the same size with no distinction, the agent treats noise as fact.
What gets reranked? Retrieval's first pass is a rough filter that pulls a batch of candidates. Reranking is a second, sharper sort. The difference is real: Garry ran the zerank-2 reranker against his own 120k brain, and 60% of the "originally top-1" answers got swapped out after reranking. If the system stops at the rough filter, similar-but-old notes win over the rule the agent should actually use.
Does recency affect ranking? Without a time signal, a daily note from six months ago competes evenly with today's policy.
Can canonical docs outrank old chat? Without an authority tier on source docs like AGENTS.md and MEMORY.md, a chat summary pretends to be a rule.
Can the agent show the receipt? If you don't know which note the answer came from, you're trusting tone, and tone is the least reliable part of any answer.
Can stale facts expire? If old prices, old posting schedules, and old customer states never die, the brain starts looking like a junkyard.
What happens when confidence is low? A good fallback says "retrieval isn't sure, here are the candidates" and asks you to confirm. A bad fallback writes you a smooth answer.
Every question you can't answer is a place where your AI is quietly handing you fake answers.
5. After 10k Files, Retrieval Is the Product
You don't need 120k files to run into this problem.
But anyone serious about building personal AI eventually hits the same wall. What the AI reads at any moment is only the slice retrieval puts on its desk. The rest of the brain sits outside, untouched.
A big markdown brain isn't automatically useful.
What makes one useful? The AI finds the right card, gives you a receipt, and knows when the old card should lose.
Those three things are decided by retrieval defaults. At that point, retrieval is the product.
So what does this mean for you today?
If you're using a personal AI tool, the next time it answers "very confidently," ask: which note did this come from? When was it written? A tool that can't answer is one whose retrieval is already quietly failing.
If you're building your own brain, before 10k files you can leave these 7 defaults alone. After 10k, each one needs your own call.
Previous: bigger context was never the memory bug argued for the wiki layer. This is step 2: once the wiki gets large, retrieval defaults decide what the AI actually remembers.
Everything I'm writing as I build: voxyz.ai/insights.

Originally on X
This piece first appeared on X on May 18, 2026.
X first-week signal captured May 26, 2026
Next step
If you want to build your own system from this article, choose the next step that matches what you need right now.
Related insights
I tried letting my scheduled agents deliver only HTML, and I'm not going back
A couple weeks ago Thariq published "Using Claude Code: The Unreasonable Effectiveness of HTML," and it hit 12.6M views. His argument: Markdown has become the bottleneck for agent output, and he's
Read nextHow to Quickly Recreate Any Website For Your Own Product
This morning I scrolled into DilumSanjaya's post. 1M views, 10K bookmarks. I built a version for my own product. Here's the full method and opensource repo. His original is a cell anatomy
Read nextI Read Through 1,902 Leaked Files From Claude Code. The Interesting Part Isn't the Code.
Anthropic built a system to prevent leaks. Then it leaked. A source map file got left in the npm package. 60MB. 1,902 TypeScript source files handed to the entire internet. The same mistake happened
Read next