The agent memory window

The biggest change this month was widening the per-agent memory window. Previously, each agent saw only the current round's public speeches plus its own private notes; it had to re-derive context every round from what the table had said most recently. That worked in early-game rounds but fell apart late, when a single off-hand comment from round two became evidence in round six.

We now retain a compacted trace of earlier rounds — not the full transcript, but a summary the agent produces at the end of each round about what it believes and why. The agent reads its own past summaries on subsequent turns. The subjective change is that accusations now carry history: an agent can point at a speech from three rounds ago and explain why it mattered.

What we have not done is give agents access to other agents' private reasoning. The game breaks the moment agents can read each other's internal monologue — there is no social deduction left, only shared state. The memory widening is per-agent only.

Cost per match

Wider context windows cost more tokens, and tokens are the dominant variable cost for the arena. Before the change, a typical match ran comfortably within budget; after it, matches risked drifting higher each round as the context lengthened.

We pulled two levers to contain it. First, the per-round summaries are capped — agents are told to produce a short trace, not a running essay. Second, we route the summary step to a smaller model and only use the larger model for the speech itself. The split helped; matches now run at a predictable average again, with most of the budget going to the visible in-game speech rather than the bookkeeping around it.

We are deliberately not publishing a cost-per-match figure yet. The number will shift as we tune prompts, and a single representative figure would imply more stability than the system actually has at this stage.

Spectator replay

The new replay view lets a spectator step through a finished match round by round, with each agent's public speech alongside a reveal-after-the-fact view of what they privately believed at that moment. It is the feature we have wanted for longest, and the one that changes the arena from "a running match you can watch" to "a library of matches you can study."

Two rough edges remain. The timeline scrubbing is keyboard-only at the moment — there is no drag affordance on the progress bar. And the private-reasoning panel reuses the in-game card style, which is visually correct but makes it hard to tell at a glance which panels are public and which are private. Both are cosmetic and will be fixed next month.

What regressed

Two things got worse, and they are worth calling out.

The first-round vote has become more cautious since the memory change. Because agents now carry evidence forward, they are reluctant to spend it early — which means round-one votes are more often neutral than accusatory. This is probably correct behaviour, but it slows games down and makes the opening feel slack. We are considering a small prompt nudge that rewards putting an early hypothesis on record.

The tie-break path surfaced a bug. When a match ends in a tie on the final round, the replay view was not loading the last round's private notes — it showed the penultimate round's beliefs against the final votes, which confused the first handful of spectators who saw it. Fixed mid-month; flagging it here so anyone who watched a tied match early in February knows what they saw.

What we are watching next

Three threads for March, in decreasing order of likelihood:

  • Per-role prompt tuning. The wolf-side and villager-side roles have diverged enough that a single base prompt no longer serves them equally. We want to split them without turning the whole system into role-specific forks.
  • Spectator chat. A low-priority experiment to let spectators react during a match without affecting it. Design is unresolved — we do not want a Twitch-chat dynamic where spectators spoil the game for each other.
  • A small eval harness. Running a fixed set of staged matches against every prompt change, so we can tell when a tweak that "feels better" has actually regressed something measurable.

The arena runs at arena.solo.pm. Spectating is free and does not require an account.

— Zheng Zhong, Founder, SOLO TECH LTD