Agents are smart inside a small box. The box is the context window, and almost every design decision in a long-running agent is really a decision about what to throw out of it. Scroll is a bet that you shouldn't have to throw anything out at all — that an agent's history can live in the environment as a durable log it reads with code, the way it reads a file or queries a database.
Making room always means losing something
When the context window fills, an agent has two standard ways to cope. The first is compaction: older turns are summarized and the originals are dropped. The second is memory — the agent writes select facts to a store, so they survive into later sessions. Each is useful, and each rests on the same fragile bet: guessing in advance what a future turn will need, then compressing or discarding the rest.
This is quietly corrosive for any long-running agent. For example, a coding agent loses the early decisions: why a module was shaped a certain way, what an abandoned approach actually did, the exact error that forced a refactor. A research or support agent loses the thread of what it already tried, ruled out, or promised a user three sessions ago. The agent keeps going, fluent and confident, but it is now working without its own history — and it has no way to know which forgotten detail was the one that mattered.
The same gap opens when the agent is operating something over a long horizon. In Vending-Bench-style settings, an agent runs a small business, taking in operational data, supplier quotes, sales figures, and demand signals day after day. Within days that stream dwarfs any context window. Compact it away, or fail to write the right note, and the agent can no longer answer the questions that make it good at the job: which supplier was cheapest last quarter, which SKU keeps selling out, what it already tried and how it turned out.
Recall bounded by a pipeline you fixed in advance
A second line of work treats memory as infrastructure. Systems like Mem0, Supermemory, and A-Mem extract information from the conversation and persist it in a dedicated structure — a knowledge graph, a vector store, an interconnected memory network — then recall it with carefully engineered hybrid retrieval. Often the agent itself helps with the extraction and summarization along the way.
This genuinely extends an agent's reach. But it pre-commits to two choices at design time: what gets extracted, and what the retrieval pipeline can surface. Recall quality is capped by both. A precise count, a multi-hop join across sessions, a temporal pattern that nobody anticipated when the schema was drawn — these are exactly the queries a fixed pipeline struggles with, because the relevant facts were either never extracted or can't be assembled by similarity search.
The log is the only source of truth — and it's queryable
Scroll reframes the problem. The context window stops being something you fight to fit history into. Instead, the full event history lives in an append-only event log in the environment, durable across every session the agent ever runs. The agent keeps only a small working set in its actual context; everything else stays in the log as the single source of truth.
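To make the idea of an append-only log concrete, here is a minimal sketch, assuming a JSON Lines file on disk. The `EventLog` class, its method names, and the event fields (`seq`, `session`, `kind`, `payload`) are hypothetical and chosen for illustration; the text does not describe Scroll's actual storage format.

```python
import json
import tempfile
from pathlib import Path


class EventLog:
    """Append-only JSON Lines log: events are added, never edited or removed."""

    def __init__(self, path):
        self.path = Path(path)
        self.path.touch(exist_ok=True)

    def append(self, session, kind, payload):
        # Hypothetical event shape, not Scroll's real schema.
        event = {
            "seq": self._next_seq(),
            "session": session,
            "kind": kind,
            "payload": payload,
        }
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
        return event["seq"]

    def read(self, since=0):
        # Yield every event with seq >= since, in the order it was written.
        with self.path.open(encoding="utf-8") as f:
            for line in f:
                event = json.loads(line)
                if event["seq"] >= since:
                    yield event

    def _next_seq(self):
        # One line per event, so the line count is the next sequence number.
        with self.path.open(encoding="utf-8") as f:
            return sum(1 for _ in f)


log_path = Path(tempfile.mkdtemp()) / "events.jsonl"

# Session 1 writes two events.
log = EventLog(log_path)
log.append("s1", "user_message", {"text": "Restock cola on Fridays."})
log.append("s1", "tool_result", {"tool": "order", "sku": "cola", "units": 48})

# A later session reopens the same file and sees the full history.
later = EventLog(log_path)
later.append("s2", "user_message", {"text": "What did I ask about cola?"})
history = list(later.read())
```

The only operations are append and read, which is what lets the log act as a durable record: a later session can add to it, but nothing a previous session wrote is summarized away or overwritten.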
To use anything off-context, the agent doesn't wait for a retriever — it writes code. A Python REPL turns the log into an object the agent can read, filter, join, aggregate, and analyze on demand. Retrieval stops being a fixed pipeline and becomes arbitrary computation the agent authors itself.
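As an illustration of the kind of code an agent might write in that REPL, the sketch below computes per-SKU profit from raw events. The event list, its field names, and the figures are made up for the example, and amounts are kept in integer cents to avoid floating-point noise; Scroll's real log interface is not specified here.

```python
from collections import defaultdict

# Hypothetical events, shaped as the agent might find them in the log.
events = [
    {"day": 1, "kind": "quote", "supplier": "A", "sku": "cola", "unit_cost": 80},
    {"day": 2, "kind": "quote", "supplier": "B", "sku": "cola", "unit_cost": 70},
    {"day": 2, "kind": "quote", "supplier": "A", "sku": "chips", "unit_cost": 50},
    {"day": 3, "kind": "sale", "sku": "cola", "units": 10, "price": 200},
    {"day": 3, "kind": "sale", "sku": "chips", "units": 4, "price": 150},
    {"day": 4, "kind": "sale", "sku": "cola", "units": 5, "price": 200},
]

# Step 1: cheapest quote ever received for each SKU, across all suppliers.
cheapest = {}
for e in events:
    if e["kind"] == "quote":
        sku = e["sku"]
        cheapest[sku] = min(e["unit_cost"], cheapest.get(sku, float("inf")))

# Step 2: join sales against those quotes and accumulate profit per SKU.
profit = defaultdict(int)
for e in events:
    if e["kind"] == "sale":
        margin = e["price"] - cheapest[e["sku"]]
        profit[e["sku"]] += e["units"] * margin

# Step 3: rank SKUs by cumulative profit, highest first.
ranked = sorted(profit.items(), key=lambda kv: kv[1], reverse=True)
best_sku = ranked[0][0]
```

With this sample data, cola comes out ahead at 1950 cents against 400 for chips. The point is that the filter, the join, and the aggregation are all decided at query time by whoever writes the code.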
No retrieval pipeline returns per-SKU profit directly. The agent reconstructs sales and supplier quotes from the raw log, joins them, and computes the figure on the spot, which is exactly the kind of analysis no schema was designed for.
Because the log is just an environmental object, the agent can do anything a programmer could do with months of structured history: recall what a user said a year ago, reconstruct exactly what it tried on day three, or run a full business analysis — cheapest supplier quote for a product, best-selling item, the SKU that most consistently sells out by season. Nothing was anticipated; everything is recoverable.
A durable log, a synced memory space, and a code-running harness
Scroll has three parts. The append-only event log is the durable record of every turn and event across all sessions — the single source of truth. The memory space holds derived episodic, semantic, and procedural memories, plus the physical storage and indexes that make them fast to reach. The agent harness is an execute-python loop operating over a small working context.
The log and the memory space stay aligned through a synchronization loop that is fully mechanical: the harness updates the memory space from the log with no model calls, so it costs nothing in tokens, and no LLM is quietly summarizing in the background every turn, the way most memory systems do. When the agent actually wants to use memory (look something up, store a fact, reshape an index), it does so by writing code that calls the memory space's tools, spending tokens only when it chooses to.
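One way to picture a mechanical sync is a cursor that walks new log entries and updates plain indexes. The sketch below assumes the log is an in-memory list of dicts; `MemorySpace`, `sync`, and `lookup_sku` are hypothetical names, not Scroll's API, and the indexes shown are far simpler than the episodic, semantic, and procedural memories the text describes.

```python
from collections import defaultdict


class MemorySpace:
    """Derived view over the log that can always be rebuilt from it."""

    def __init__(self):
        self.cursor = 0                    # position of the next log event to absorb
        self.by_kind = defaultdict(list)   # event kind -> positions in the log
        self.by_sku = defaultdict(list)    # SKU -> positions in the log

    def sync(self, log):
        # Purely mechanical: index every event added since the last sync.
        # No model is called, so this step spends no tokens.
        absorbed = 0
        for pos in range(self.cursor, len(log)):
            event = log[pos]
            self.by_kind[event["kind"]].append(pos)
            sku = event.get("sku")
            if sku is not None:
                self.by_sku[sku].append(pos)
            absorbed += 1
        self.cursor = len(log)
        return absorbed

    def lookup_sku(self, log, sku):
        # A tool the agent's own code could call when it chooses to use memory.
        return [log[pos] for pos in self.by_sku.get(sku, [])]


log = [
    {"kind": "sale", "sku": "cola", "units": 3},
    {"kind": "user_message", "text": "How is cola doing?"},
]
memory = MemorySpace()
first = memory.sync(log)       # absorbs both existing events

log.append({"kind": "sale", "sku": "cola", "units": 2})
second = memory.sync(log)      # absorbs only the newly appended event

cola_units = sum(e["units"] for e in memory.lookup_sku(log, "cola"))
```

Because the indexes store positions in the log rather than copies or summaries, the log stays the source of truth and the memory space only makes it faster to reach.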
Code at query time, not a pipeline at design time
This is also why we expect Scroll to age well. Recall and analysis are just code the model writes, so Scroll's ceiling rises with the model's coding ability — no harness changes required. A fixed pipeline, by contrast, needs new engineering to exploit every gain in capability.
Long-horizon recall and on-the-job analysis
We evaluate Scroll with a CodeAct-style agent on the LongMemEval _s and _m splits, which test recall across many sessions, multi-session reasoning, preference and assistant memory, and temporal questions.
| Agent model | Overall | Knowledge update | Multi-session | Single-session (assistant) | Single-session (preference) | Single-session (user) | Temporal reasoning |
|---|---|---|---|---|---|---|---|
| scroll · deepseek-v4-pro | 0.918 | 0.910 | 0.835 | 0.946 | 0.967 | 0.971 | 0.955 |
| scroll · qwen3.7-max | 0.908 | 0.936 | 0.812 | 0.982 | 0.967 | 0.957 | 0.917 |
On the _s split Scroll tops out at 0.918 overall; on the harder _m split it reaches 0.789. And it gets there cheaply: average completion is roughly 0.8K–2.8K tokens per question on the _s split, because the agent only pulls into context what its own code decides is relevant, rather than carrying the entire history along with it.
We don't think recall from a conversation log is the whole test. Memory has to be evaluated along more than one axis: real agents don't just answer questions about a chat history, they act in an environment and accumulate state that was never spoken aloud. That motivates the Vending-Bench-style experiments below, where memory is tested under interaction rather than recall. We also plan to run BEAM, which stresses long-term memory across diverse abilities at up to 10M tokens, well past any context window, to see how Scroll holds up as the log grows without bound.
Vending-Bench, with probe questions
Static recall isn't the whole story for an operating agent, so we built a simulation modeled on Vending-Bench 2 — a long-horizon benchmark in which an agent runs a simulated vending-machine business, ordering from suppliers and managing inventory and pricing over many simulated months. Our version interleaves probe questions on certain days — realistic business-analytical asks, phrased the way an investor update, an ops debrief, or a finance audit would be, each with a precise computable answer the agent has to derive from its accumulated history.
- Ops debrief: from day 1 to 18, did any SKU deplete completely — zero units in both the machine and storage on the same day — and how many distinct stockout events has each actively-sold SKU had?
- Margin analysis: which SKU has contributed the most cumulative profit so far, where profit per unit is its selling price minus the cheapest wholesale quote we've received for it across all suppliers?
- Pitch deck: across every 7-consecutive-day window so far, which week had the highest total sales revenue, and what was it?
- Finance review: total spend with each supplier, summing the cost field across every Order confirmation email, grouped by supplier and ranked highest to lowest.
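Two of the probes above, the pitch-deck and finance-review questions, can be sketched as a few lines of code over reconstructed history. The data below is invented for illustration, with amounts in integer cents, and the field names are assumptions rather than the simulation's actual email or event format.

```python
from collections import Counter

# Hypothetical daily sales revenue (day -> cents), rebuilt from sale events.
daily_revenue = {1: 1200, 2: 900, 3: 0, 4: 1500, 5: 1100,
                 6: 700, 7: 1300, 8: 1600, 9: 400, 10: 1000}


def best_week(revenue_by_day, window=7):
    """Return (start_day, total) for the best run of `window` consecutive days."""
    first, last = min(revenue_by_day), max(revenue_by_day)
    best_start, best_total = None, -1
    for start in range(first, last - window + 2):
        days = range(start, start + window)
        total = sum(revenue_by_day.get(d, 0) for d in days)
        if total > best_total:
            best_start, best_total = start, total
    return best_start, best_total


week_start, week_total = best_week(daily_revenue)

# Hypothetical order-confirmation records extracted from emails in the log.
orders = [
    {"supplier": "A", "cost": 4800},
    {"supplier": "B", "cost": 2500},
    {"supplier": "A", "cost": 1200},
]

# Sum the cost field per supplier and rank highest to lowest.
spend = Counter()
for order in orders:
    spend[order["supplier"]] += order["cost"]
ranked_spend = spend.most_common()
```

Each answer is precise and computable, which is what makes these probes gradable: either the agent can still reach the underlying events or it cannot.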
Across a 180-day run with 60 probe questions, the Scroll agent answered with an accuracy of 0.950, versus 0.683 for a baseline agent limited to context compaction. The gap widens precisely where it should: on questions that require reaching back past where compaction would have erased the evidence.
Designing the store, and the API around it
These results lean mostly on the log as the source of truth, and two directions follow from there. The first is a better physical storage design for the memory space — more efficient ways to lay out, index, and reach its semantic and procedural memories so common queries stay cheap as the log grows without bound. The second is the agent-facing API: the interfaces and primitives the model uses to read, reshape, and compute over its own history, designed to unlock more of what the agent can do rather than fence it in.