Scroll — Context as an Environment, Not a Window

Agents are smart inside a small box. The box is the context window, and almost every design decision in a long-running agent is really a decision about what to throw out of it. Scroll is a bet that you shouldn't have to throw anything out at all — that an agent's history can live in the environment as a durable log it reads with code, the way it reads a file or queries a database.

§01 — The window keeps closing

Making room always means losing something

When the context window fills, an agent has two standard ways to cope. The first is compaction: older turns are summarized and the originals are dropped. The second is memory — the agent writes select facts to a store, so they survive into later sessions. Each is useful, and each rests on the same fragile bet: guessing in advance what a future turn will need, then compressing or discarding the rest.
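To make the loss concrete, here is a minimal sketch of compaction. It is not any particular system's implementation: `compact` and `summarize` are hypothetical names, `summarize` is a stand-in for what would normally be a model call, and the history lines are made up for illustration.

```python
from typing import Callable

def summarize(turns: list[str], max_chars: int = 60) -> str:
    """Stand-in for a model-written summary: keeps only a truncated gist."""
    gist = " / ".join(turns)
    return "[summary] " + gist[:max_chars]

def compact(history: list[str], keep_last: int,
            summarizer: Callable[[list[str]], str] = summarize) -> list[str]:
    """Replace all but the newest `keep_last` turns with a single summary."""
    if keep_last <= 0:
        raise ValueError("keep_last must be positive")
    if len(history) <= keep_last:
        return list(history)
    older, recent = history[:-keep_last], history[-keep_last:]
    return [summarizer(older)] + recent

# Made-up history, one line per turn.
history = [
    "day 1: supplier A quoted cola at $0.85/unit",
    "day 2: tried bundle pricing, sales dropped 12%",
    "day 3: refactor forced by KeyError: 'sku_id' in inventory.py",
    "day 4: restocked water",
    "day 5: raised energy price to $3.00",
]
compacted = compact(history, keep_last=2)
for line in compacted:
    print(line)

# The exact error text from day 3 did not survive the summary.
print(any("KeyError" in line for line in compacted))  # False
```

Whatever the summarizer drops is gone for every later turn, which is the bet described above.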

This is quietly corrosive for any long-running agent. For example, a coding agent loses the early decisions: why a module was shaped a certain way, what an abandoned approach actually did, the exact error that forced a refactor. A research or support agent loses the thread of what it already tried, ruled out, or promised a user three sessions ago. The agent keeps going, fluent and confident, but it is now working without its own history — and it has no way to know which forgotten detail was the one that mattered.

The same gap opens when the agent is operating something over a long horizon. In Vending-Bench-style settings, an agent runs a small business, taking in operational data, supplier quotes, sales figures, and demand signals day after day. Within days that stream dwarfs any context window. Compact it away, or fail to write the right note, and the agent can no longer answer the questions that make it good at the job: which supplier was cheapest last quarter, which SKU keeps selling out, what it already tried and how it turned out.

§02 — Memory systems help, but they pre-commit

Recall bounded by a pipeline you fixed in advance

A second line of work treats memory as infrastructure. Systems like Mem0, Supermemory, and A-Mem extract information from the conversation and persist it in a dedicated structure — a knowledge graph, a vector store, an interconnected memory network — then recall it with carefully engineered hybrid retrieval. Often the agent itself helps with the extraction and summarization along the way.

This genuinely extends an agent's reach. But it pre-commits to two choices at design time: what gets extracted, and what the retrieval pipeline can surface. Recall quality is capped by both. A precise count, a multi-hop join across sessions, a temporal pattern that nobody anticipated when the schema was drawn — these are exactly the queries a fixed pipeline struggles with, because the relevant facts were either never extracted or can't be assembled by similarity search.
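The precise-count case can be illustrated with a toy example. This is a sketch under loud assumptions: `top_k` is a hypothetical keyword-overlap retriever standing in for similarity search, not the pipeline of any system named above, and the events are made up.

```python
def top_k(query: str, docs: list[str], k: int = 3) -> list[str]:
    """Toy retriever: rank docs by words shared with the query, keep the top k."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]

# Made-up log: 7 stockout events mixed with 3 restock events.
events = [f"day {d}: stockout of cola" for d in (3, 9, 14, 21, 30, 41, 52)]
events += [f"day {d}: restocked cola" for d in (4, 10, 15)]

retrieved = top_k("how many cola stockout events", events, k=3)
count_from_retrieval = sum("stockout" in e for e in retrieved)  # cannot exceed k
count_from_log = sum("stockout" in e for e in events)           # exact

print(count_from_retrieval, count_from_log)  # 3 7
```

Every retrieved item is relevant, yet the count is still wrong, because a top-k interface returns a sample of the matches rather than all of them.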

What if retrieval weren't a pipeline at all, but a program the agent writes on the spot?

§03 — The idea

The log is the only source of truth — and it's queryable

Scroll reframes the problem. There is no context window you fight to fit things into. Instead, the full event history lives in an append-only event log in the environment, durable across every session the agent ever runs. The agent keeps only a small working set in its actual context; everything else stays in the log as the single source of truth.

To use anything off-context, the agent doesn't wait for a retriever — it writes code. A Python REPL turns the log into an object the agent can read, filter, join, aggregate, and analyze on demand. Retrieval stops being a fixed pipeline and becomes arbitrary computation the agent authors itself.

agent · python repl

    # probe (day 55): which SKU has earned the most *profit* so far?
    # profit/unit = sale price minus the cheapest wholesale quote we've been offered.
    import pandas as pd, re

    # pull every sale event out of the log into a frame
    sales = pd.DataFrame(log.events(kind="sale"))  # sku, units, unit_price

    # parse each supplier's price-list reply for the lowest wholesale per SKU
    quotes = {}
    for e in log.search("Re: Price List", role="email"):
        for sku, p in re.findall(r"(\w+):\s*\$?([\d.]+)", e.body):
            quotes[sku] = min(float(p), quotes.get(sku, float("inf")))

    # join, weight per-unit margin by units sold, rank by total profit
    sales["cost"] = sales.sku.map(quotes)
    sales["profit"] = (sales.unit_price - sales.cost) * sales.units
    ranked = sales.groupby("sku").profit.sum().sort_values(ascending=False)
    print(ranked.round(2).head(3).to_dict())

    {'energy': 88.4, 'cola': 71.15, 'water': 33.6}

No retrieval pipeline returns this. The agent reconstructs sales and supplier quotes from the raw log, joins them, and computes per-SKU profit on the spot — exactly the analysis no schema was designed for.

Because the log is just an environmental object, the agent can do anything a programmer could do with months of structured history: recall what a user said a year ago, reconstruct exactly what it tried on day three, or run a full business analysis — cheapest supplier quote for a product, best-selling item, the SKU that most consistently sells out by season. Nothing was anticipated; everything is recoverable.

▍ lineage

Scroll builds on a fast-moving line of work that moves context out of the window. Recursive Language Models treat the prompt as an external environment the model queries with code; Anthropic's Managed Agents make the session an append-only log that lives outside the context window. Scroll's angle: treat that log as the single source of truth, let the model write its own analysis over it, and keep memory in sync inside the harness with no model calls.

§04 — Architecture

A durable log, a synced memory space, and a code-running harness

Scroll has three parts. The append-only event log is the durable record of every turn and event across all sessions — the single source of truth. The memory space holds derived episodic, semantic, and procedural memories, plus the physical storage and indexes that make them fast to reach. The agent harness is an execute-python loop operating over a small working context.
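As a rough picture of the first part, here is a minimal in-memory sketch of an append-only event log. The names `Event`, `EventLog`, `append`, `events`, and `search` are assumptions chosen to echo the calls in the REPL example above; the actual Scroll API, storage format, and return types may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, Optional

@dataclass(frozen=True)
class Event:
    seq: int        # position in the log, assigned on append
    kind: str       # e.g. "sale", "message"
    role: str = ""  # e.g. "email", "user"
    body: str = ""  # free text, matched by EventLog.search
    data: dict[str, Any] = field(default_factory=dict)

class EventLog:
    """Append-only: events can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def append(self, kind: str, role: str = "", body: str = "", **data: Any) -> Event:
        event = Event(seq=len(self._events), kind=kind, role=role, body=body, data=data)
        self._events.append(event)
        return event

    def events(self, kind: Optional[str] = None) -> Iterator[Event]:
        """Yield events in append order, optionally filtered by kind."""
        return (e for e in self._events if kind is None or e.kind == kind)

    def search(self, text: str, role: Optional[str] = None) -> Iterator[Event]:
        """Case-insensitive substring match on the body, optionally by role."""
        needle = text.lower()
        return (e for e in self._events
                if needle in e.body.lower() and (role is None or e.role == role))

    def __len__(self) -> int:
        return len(self._events)

# Made-up events for illustration.
log = EventLog()
log.append("sale", sku="cola", units=4, unit_price=2.0)
log.append("message", role="email", body="Re: Price List cola: $0.85")
log.append("sale", sku="water", units=2, unit_price=1.5)

revenue = sum(e.data["units"] * e.data["unit_price"] for e in log.events(kind="sale"))
print(revenue)  # 11.0
```

The class exposes no update or delete method, which is what keeps the log usable as a single source of truth: anything derived from it can be rebuilt.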

The log and the memory space stay aligned through a synchronization loop that is fully mechanical: the harness updates the memory space from the log with no model calls, so synchronization costs nothing in tokens, and no LLM quietly summarizes in the background every turn the way most memory systems do. When the agent actually wants to use memory (look something up, store a fact, reshape an index), it writes code that calls the memory-space tools, spending tokens only when it chooses to.
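One way to picture a mechanical synchronization loop is a cursor over the log plus plain-code index updates. The sketch below is an assumption about the general shape, not Scroll's implementation: `MemorySpace`, `sync`, and both indexes are hypothetical, and the log is modeled as a simple list of dicts.

```python
from collections import defaultdict

class MemorySpace:
    """Derived indexes over the log, updated by plain code with no model calls."""

    def __init__(self) -> None:
        self.cursor = 0  # next log position that has not been indexed yet
        self.by_kind: dict[str, list[int]] = defaultdict(list)
        self.by_word: dict[str, set[int]] = defaultdict(set)

    def sync(self, log: list[dict]) -> int:
        """Fold events appended since the last sync into the indexes."""
        new_events = log[self.cursor:]
        for offset, event in enumerate(new_events):
            pos = self.cursor + offset
            self.by_kind[event["kind"]].append(pos)
            for word in event.get("body", "").lower().split():
                self.by_word[word].add(pos)
        self.cursor = len(log)
        return len(new_events)

# Made-up events for illustration.
log = [
    {"kind": "sale", "body": "sold 4 cola"},
    {"kind": "email", "body": "Re: Price List cola"},
]
memory = MemorySpace()
print(memory.sync(log))  # 2: the first pass indexes both events

log.append({"kind": "sale", "body": "sold 2 water"})
print(memory.sync(log))  # 1: the second pass indexes only the new event
print(memory.by_kind["sale"], sorted(memory.by_word["cola"]))  # [0, 2] [0, 1]
```

Because the indexes are a pure function of the log, they can be dropped and rebuilt at any time, and keeping them current never spends tokens.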

§05 — How this differs from memory systems

Code at query time, not a pipeline at design time

Aspect	Memory systems (Mem0 · Supermemory · …)	Scroll (log + repl)
Source of truth	Extracted & summarized store: facts the pipeline chose to keep	Verbatim append-only log: nothing is discarded
Retrieval	Fixed hybrid pipeline designed up front	Code the agent writes and runs in a CodeAct loop
Memory upkeep	Model-driven extraction & summarization	Harness synchronization · zero model calls
Bounded by	What was extracted + what the pipeline can surface	What the agent can compute over the full log

This is also why we expect Scroll to age well. Recall and analysis are just code the model writes, so Scroll's ceiling rises with the model's coding ability — no harness changes required. A fixed pipeline, by contrast, needs new engineering to exploit every gain in capability.

§06 — Experiments

Long-horizon recall and on-the-job analysis

We evaluate Scroll with a CodeAct-style agent on the LongMemEval _s and _m splits, which test recall across many sessions, multi-session reasoning, preference and assistant memory, and temporal questions.

LongMemEval · `_s` split · 500 QA (incl. abstention twins)
Agent model	Overall	knowledge-update	multi-session	single-session-assistant	single-session-preference	single-session-user	temporal-reasoning
scroll · deepseek-v4-pro	0.918	0.910	0.835	0.946	0.967	0.971	0.955
scroll · qwen3.7-max	0.908	0.936	0.812	0.982	0.967	0.957	0.917

_s split · overall (best): 0.918
_m split · overall: 0.789
_s split · completion tokens / QA: 0.8–2.8K

On the harder _m split Scroll reaches 0.789 overall, while the _s split tops out at 0.918. It gets there cheaply: completion length averages roughly 0.8–2.8K tokens per question on the _s split, because the agent pulls into context only what its own code decides is relevant rather than carrying the entire history along with it.

We don't think recall from a conversation log is the whole test. Memory has to be evaluated along more than one axis: real agents don't just answer questions about a chat history; they act in an environment and accumulate state that was never spoken aloud. That motivates the Vending-Bench-style experiments below, where memory is tested under interaction rather than recall alone. We also plan to run BEAM, which stresses long-term memory across diverse abilities at up to 10M tokens, well past any context window, to see how Scroll holds up as the log grows without bound.

Vending-Bench, with probe questions

Static recall isn't the whole story for an operating agent, so we built a simulation modeled on Vending-Bench 2 — a long-horizon benchmark in which an agent runs a simulated vending-machine business, ordering from suppliers and managing inventory and pricing over many simulated months. Our version interleaves probe questions on certain days — realistic business-analytical asks, phrased the way an investor update, an ops debrief, or a finance audit would be, each with a precise computable answer the agent has to derive from its accumulated history.

▍ example probes

Ops debrief: from day 1 to 18, did any SKU deplete completely — zero units in both the machine and storage on the same day — and how many distinct stockout events has each actively-sold SKU had?
Margin analysis: which SKU has contributed the most cumulative profit so far, where profit per unit is its selling price minus the cheapest wholesale quote we've received for it across all suppliers?
Pitch deck: across every 7-consecutive-day window so far, which week had the highest total sales revenue, and what was it?
Finance review: total spend with each supplier, summing the cost field across every Order confirmation email, grouped by supplier and ranked highest to lowest.
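Each probe above reduces to a small computation once the history is in hand. As one illustration, the "Pitch deck" probe can be sketched as a sliding-window sum. The sketch assumes revenue has already been aggregated per day from the log; `best_window` is a hypothetical helper and the revenue figures are made up.

```python
def best_window(daily_revenue: list[float], width: int = 7) -> tuple[int, float]:
    """Return (start_day, total) for the highest-revenue run of `width` days.

    Days are 1-indexed; ties go to the earliest window.
    """
    if len(daily_revenue) < width:
        raise ValueError("not enough days for one full window")
    window = sum(daily_revenue[:width])
    best_start, best_total = 1, window
    for end in range(width, len(daily_revenue)):
        # Slide by one day: add the new day, drop the day that fell out.
        window += daily_revenue[end] - daily_revenue[end - width]
        if window > best_total:
            best_start, best_total = end - width + 2, window
    return best_start, best_total

# Made-up daily revenue for days 1..10, in dollars.
revenue = [10, 12, 9, 15, 20, 18, 11, 25, 30, 8]
start, total = best_window(revenue)
print(start, total)  # 3 128
```

The answer depends on every day in the run, so it is only computable if none of the daily figures have been summarized away.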

Across a 180-day run with 60 probe questions, the Scroll agent answered with an accuracy of 0.950, versus 0.683 for a baseline agent limited to context compaction. The gap widens precisely where it should: on questions that require reaching back past where compaction would have erased the evidence.

Scroll · probe accuracy: 0.950
Compaction baseline: 0.683
Accuracy gain: +0.267

§07 — What's next

Designing the store, and the API around it

These results lean mostly on the log as the source of truth, and two directions follow from there. The first is a better physical storage design for the memory space — more efficient ways to lay out, index, and reach its semantic and procedural memories so common queries stay cheap as the log grows without bound. The second is the agent-facing API: the interfaces and primitives the model uses to read, reshape, and compute over its own history, designed to unlock more of what the agent can do rather than fence it in.

Code & reproduction · github.com/niceIrene/Scroll
