Keeping Long Chats Cheap and Coherent: Engineering Chat Compaction
Anyone who has shipped an LLM-powered product runs into the same wall: conversations grow, context windows do not. The first ten turns are fast and pleasant. By turn eighty the prompt is bloated, latency creeps up, every new message re-bills a longer history, and eventually the early context that mattered most gets truncated away or ignored. This post is about the engineering pattern that solves that — chat compaction — and a concrete implementation of it from a production Django/DRF backend.
The Problem
Every LLM call you make is stateless. Whatever "memory" the assistant appears to have is reconstructed on every request by replaying the conversation so far. That replay has three costs that all grow with conversation length:
- Token cost. Most providers charge per input token. A long chat re-bills its entire history on every turn.
- Latency. Bigger prompts mean longer time-to-first-token and slower end-to-end responses.
- Quality cliff. Once you approach the model's context limit, you either truncate (and lose information) or fail outright. Even before the hard limit, the well-known "lost in the middle" effect means models attend less reliably to material buried in a long prompt.
The naive fixes are unsatisfying. Hard truncation drops the oldest turns first, which is usually where the user stated their actual goals. A sliding window of the last *N* messages ages information out at a fixed rate regardless of importance. Neither preserves the durable, high-value context — goals, decisions, open questions, named entities — that a useful assistant needs to keep referring back to weeks later.
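For contrast, the sliding-window version of that fix is essentially a one-line slice. A minimal sketch (the helper name is illustrative):

def window_messages(messages: list, last_n: int = 20) -> list:
    # Keep only the most recent N turns. Everything older is silently dropped,
    # including the early messages where the user stated their goals.
    return messages[-last_n:]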
Compaction is the alternative: instead of throwing old turns away, periodically summarize them into a dense, priority-ordered representation that travels alongside recent messages. The model sees a short rolling summary plus the last few turns verbatim, and the prompt size stays bounded no matter how long the conversation runs.
Anatomy of a Compaction System
A working implementation needs a few moving parts:
- A place to store the running summary alongside the raw messages.
- A trigger that decides when to compact.
- A prompt that produces the summary with the right priorities.
- An assembly step that combines the summary with recent turns at inference time.
- A budget mechanism so the summary itself does not grow without bound.
The implementation has three main pieces: a few fields on the chat model, a utility that folds new messages into the running summary, and an async worker call site that runs compaction after the assistant response is persisted.
Data Model: Raw Messages Plus a Watermark
Each chat record carries three things relevant to compaction:
messages = models.JSONField(blank=True, null=True)
compacted_text = models.TextField(blank=True, null=True)
compacted_message_count = models.IntegerField(
default=0,
help_text="Number of messages that were compacted in compacted_text",
)

messages is the append-only transcript. compacted_text is the running summary. compacted_message_count is the watermark — the index up to which the summary already covers the transcript. That last field is doing more work than it looks: it's what makes compaction incremental. When the summarizer next runs, it only needs to consider messages[compacted_message_count:] - the turns that have happened since the last summary - and fold them into the existing summary. You never re-summarize the whole history, which keeps the operation O(new turns) rather than O(total turns).
This three-field shape is worth pausing on, because it's the design choice that everything else hangs off:
- The transcript stays intact. Compaction is additive, not destructive. Audit, analytics, debugging, and re-processing all still have the full record.
- The summary is a derived artifact. It can be regenerated, evolved, or thrown away without losing source-of-truth data.
- The watermark makes the operation idempotent. A retried worker job that compacts the same boundary twice produces the same result.
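To make the incremental step concrete, here is a minimal sketch of how the watermark bounds each run (the helper name is illustrative, and the message format is assumed to be a list of role/content dicts):

def messages_since_watermark(chat) -> list:
    # Only turns appended after the last compaction need to be summarized;
    # the existing summary already covers everything before the watermark.
    start = chat.compacted_message_count or 0
    return (chat.messages or [])[start:]

A retried job that runs against the same watermark sees the same slice and the same previous summary, which is what makes the double-run case safe.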
Trigger: Every N User Messages, Guarded
The trigger has to answer two questions: when to compact, and how to avoid compacting the same boundary twice.
def should_compact(messages, previous_messages, *, every_n_user_messages: int = 10) -> bool:
if every_n_user_messages <= 0:
return False
current_count = _count_user_messages(messages)
if current_count == 0:
return False
if previous_messages is not None:
previous_count = _count_user_messages(previous_messages)
if current_count <= previous_count:
return False
    return current_count % every_n_user_messages == 0

Three things are worth noticing here.
First, the cadence is based on user messages, not total messages. That keeps the schedule intuitive — “summarize every ten user turns” — and prevents assistant verbosity from accelerating compaction.
Second, the previous_messages guard prevents the trigger from firing if the chat has not actually advanced. That makes the worker safer to rerun after a partial failure.
Third, the modulo check gives clean deterministic boundaries: 10 user turns, 20 user turns, 30 user turns, and so on.
For simple chat flows, exact modulo boundaries are enough. In systems where workers may skip turns or messages can be appended in batches, threshold-crossing logic is more robust:
current_count // every_n_user_messages > previous_count // every_n_user_messages

That version triggers when the chat crosses a compaction boundary, even if it does not land exactly on the boundary during a worker run.
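As a sketch, the threshold-crossing variant could look like the following. Neither function is the exact production code: _count_user_messages is a plausible reconstruction of the helper the earlier snippet relies on, and it assumes messages are stored as dicts with a role key.

def _count_user_messages(messages) -> int:
    # Only user turns advance the compaction cadence.
    return sum(1 for m in (messages or []) if m.get("role") == "user")

def should_compact_threshold(messages, previous_messages, *, every_n_user_messages: int = 10) -> bool:
    if every_n_user_messages <= 0:
        return False
    current = _count_user_messages(messages)
    previous = _count_user_messages(previous_messages) if previous_messages is not None else 0
    if current <= previous:
        return False
    # Fire whenever a boundary was crossed since the last run, even if the
    # current count does not land exactly on a multiple of N.
    return current // every_n_user_messages > previous // every_n_user_messages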
The Compaction Prompt: Prioritize Before You Trim
You are a chat compaction engine. Produce an updated compacted_text that
preserves the conversation's essential context while minimizing tokens.
Output plain text only. Use dense keywords and short phrases separated by
semicolons.
PRIORITY ORDER — always preserve these first:
(1) user goals and constraints,
(2) confirmed decisions,
(3) open questions and next steps,
(4) key entities/names/dates/numbers,
(5) user preferences.
Drop resolved items before dropping open ones.
When updating from a previous compacted_text, retain all entries that are
still relevant and only remove items that have been explicitly resolved or
contradicted.
Omit filler, greetings, and AI suggestions the user did not confirm.

A few engineering choices are baked into that prompt:
- An explicit priority ordering turns the summary from a freeform recap into a forced ranking. When the budget is tight, the model has unambiguous guidance about what to keep and what to drop.
- A drop-resolved-before-open rule biases the summary toward forward-looking state — the things the assistant still needs to act on — instead of historical narrative.
- A dense semicolon-delimited format ("user wants X; deadline next Friday; budget $400; prefers email; awaiting confirmation on Y") packs roughly twice the information per token compared to prose.
- "Omit AI suggestions the user did not confirm" is the one rule most easily overlooked. Without it, summaries quickly fill up with hypothetical advice the assistant offered and the user ignored, crowding out actual user-confirmed state.
The summarizer should run with low temperature and a strict output budget. The goal is not creativity; it is stable state compression.
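In practice that can be a single low-temperature completion call. A minimal sketch using the OpenAI Python client; the model name, temperature, and token cap are placeholders rather than the production configuration:

from openai import OpenAI

client = OpenAI()

def summarize(system_prompt: str, user_prompt: str, *, max_output_tokens: int = 600) -> str:
    # Low temperature: the goal is stable state compression, not creative rephrasing.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        temperature=0.1,
        max_tokens=max_output_tokens,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return (response.choices[0].message.content or "").strip()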
Folding In, Not Replacing
The prompt sent to the summarizer contains the previous summary plus the new dialogue:
prompt_parts = []
if previous.strip():
prompt_parts.append(f"Previous compacted_text:\n{previous.strip()}")
prompt_parts.append(f"Conversation:\n{dialogue}")
prompt_parts.append(f"Return updated compacted_text only. Max {char_budget} characters.")The summary is therefore a running artifact. The model is asked to produce an updated version of the existing summary given the new turns, not to summarize the whole transcript from scratch.
This has two practical benefits.
First, the summarization input is bounded by the previous summary plus new turns, rather than the full conversation history.
Second, the summary is more stable. Context that was important on turn 30 does not randomly disappear on turn 60 just because the summarizer was given a different slice of the conversation.
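The dialogue string these snippets pass around can be produced by a simple role-labelled join. A plausible sketch of the _build_dialogue helper they reference (the labels and message format are assumptions):

def _build_dialogue(messages) -> str:
    # Render each turn as "User: ..." or "Assistant: ..." on its own line.
    lines = []
    for m in messages or []:
        label = "User" if m.get("role") == "user" else "Assistant"
        content = (m.get("content") or "").strip()
        if content:
            lines.append(f"{label}: {content}")
    return "\n".join(lines)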
A Budget That Scales With the Conversation
The summary itself needs a ceiling. Without one, the summary becomes the same runaway artifact the system was meant to avoid.
A fixed ceiling is not ideal either: a 200-turn conversation can legitimately have more durable context than a 20-turn conversation.
The implementation uses a stepped budget with a hard cap:
COMPACTION_BASE_CHARS = 1500
COMPACTION_CHARS_PER_20_MESSAGES = 300
COMPACTION_MAX_CHARS = 3000
def _compute_char_budget(total_messages: int) -> int:
return min(
COMPACTION_BASE_CHARS + (total_messages // 20) * COMPACTION_CHARS_PER_20_MESSAGES,
COMPACTION_MAX_CHARS,
    )

Short chats get a tight 1.5k-character budget. Longer chats get more headroom, but only up to a hard 3k cap.
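A few sample values show the shape of the curve:

assert _compute_char_budget(10) == 1500   # short chat: base budget only
assert _compute_char_budget(40) == 2100   # 1500 + 2 * 300
assert _compute_char_budget(100) == 3000  # hits the hard cap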
The cap is the important part. It means the worst-case prompt size is bounded by compacted_text plus the recent turns, rather than growing with the entire transcript.
If the model returns more than the budget, the output is trimmed at the last semicolon boundary rather than mid-phrase:
def _truncate_at_boundary(text: str, max_chars: int) -> str:
if len(text) <= max_chars:
return text
truncated = text[:max_chars]
last_semi = truncated.rfind(";")
if last_semi > 0:
return truncated[:last_semi].rstrip()
    return truncated.rstrip()

Boundary-aware truncation pairs naturally with the dense semicolon-delimited format. You drop whole entries instead of mangling the last one.
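Putting the pieces together, a compact_chat implementation can be sketched roughly as below. COMPACTION_SYSTEM_PROMPT stands in for the prompt shown earlier, and openai_util.complete is a placeholder for whatever summarizer helper the worker actually exposes; both names are assumptions, not the real interface.

def compact_chat(chat, *, openai_util):
    # Fold only the new turns into the existing summary, within the budget.
    new_messages = (chat.messages or [])[chat.compacted_message_count:]
    if not new_messages:
        return None

    dialogue = _build_dialogue(new_messages)
    previous = chat.compacted_text or ""
    char_budget = _compute_char_budget(len(chat.messages or []))

    prompt_parts = []
    if previous.strip():
        prompt_parts.append(f"Previous compacted_text:\n{previous.strip()}")
    prompt_parts.append(f"Conversation:\n{dialogue}")
    prompt_parts.append(f"Return updated compacted_text only. Max {char_budget} characters.")

    # Placeholder summarizer call; see the prompt section for the retention rules.
    summary = openai_util.complete(
        system=COMPACTION_SYSTEM_PROMPT,
        prompt="\n\n".join(prompt_parts),
    )
    if not summary or not summary.strip():
        return None
    return _truncate_at_boundary(summary.strip(), char_budget)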
Inference-Time Assembly
When the next user message comes in, prompt construction prefers the compacted form when it exists:
def _build_compacted_dialogue(instance) -> str:
if instance.compacted_text and instance.compacted_text.strip():
compacted_count = getattr(instance, "compacted_message_count", 0)
if compacted_count > 0:
messages_since_compaction = instance.messages[compacted_count:]
else:
messages_since_compaction = instance.messages
recent_dialogue = _build_dialogue(messages_since_compaction)
if recent_dialogue:
return (
f"[Previous conversation summary:\n"
f"{instance.compacted_text.strip()}]\n\n"
f"{recent_dialogue}"
)
return (
f"[Previous conversation summary:\n"
f"{instance.compacted_text.strip()}]"
)
    return _build_dialogue(instance.messages)

This is the payoff of the system: the model receives a short, dense summary of everything older than the last compaction boundary, plus the latest turns verbatim.
Older raw turns are still stored for analytics, audit, and debugging, but they no longer travel in the prompt on every request. Short conversations still use the full dialogue path, so compaction does not add unnecessary indirection before it is useful.
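As an invented example of what the model ends up seeing once a chat has been compacted:

# Suppose compacted_message_count == 20 and the stored summary reads:
#   user wants X; deadline next Friday; budget $400; prefers email; awaiting confirmation on Y
#
# With two turns after the watermark, _build_compacted_dialogue(chat) returns roughly:
#
#   [Previous conversation summary:
#   user wants X; deadline next Friday; budget $400; prefers email; awaiting confirmation on Y]
#
#   User: did they ever confirm Y?
#   Assistant: Not yet; it is still flagged as an open item.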
Where It Runs
A detail that matters: compaction is not on the request path.
In this implementation, an async worker generates and persists the assistant response first, then runs compaction afterward:
# inside the worker, after the assistant response is generated and persisted
if should_compact(chat.messages, previous_messages):
logger.info("Running compaction for chat %s", chat.id)
compacted = compact_chat(chat, openai_util=openai_util)
if compacted:
chat.compacted_text = compacted
        chat.compacted_message_count = len(chat.messages)

Two useful properties fall out of that placement.
First, compaction latency does not block the user-facing response. The user gets their reply immediately, and the summary is refreshed afterward for future turns.
Second, compaction is failure-isolated. If the summarizer fails, the chat should not fail with it. The system can log the error, keep the existing summary, and try again on a later compaction run.
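In the worker, that isolation is just a guard around the compaction step. A minimal sketch; the save call and its update_fields are an assumption about how the fields get persisted:

try:
    if should_compact(chat.messages, previous_messages):
        compacted = compact_chat(chat, openai_util=openai_util)
        if compacted:
            chat.compacted_text = compacted
            chat.compacted_message_count = len(chat.messages)
            chat.save(update_fields=["compacted_text", "compacted_message_count"])
except Exception:
    # The assistant reply is already persisted; a failed compaction just means
    # the summary stays a little staler until the next boundary.
    logger.exception("Compaction failed for chat %s", chat.id)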
Production Properties
The design is intentionally boring. The useful properties come from the constraints around it:
- The raw transcript remains append-only source-of-truth data; compaction is derived state.
- The watermark bounds each compaction job to new dialogue plus the existing summary.
- The prompt encodes a retention policy, not just a request to “summarize.”
- The summary has a hard ceiling, so prompt size stays predictable.
- Compaction runs outside the request path, so failure affects memory freshness rather than chat availability.
The goal is not to make the assistant remember everything. The goal is to preserve the small amount of state that still matters, while keeping cost and latency bounded as the conversation grows.