Context Window Manager
Why token tracking and conversation truncation matter for production AI apps.
Stack: Context Window Manager
Category: Production Infrastructure
The Problem
Every LLM has a context window limit:
- GPT-4o: 128K tokens
- Claude 3.5: 200K tokens
- Gemini 1.5 Pro: 2M tokens
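These limits can be kept in a simple lookup table. A minimal sketch (the model IDs are our shorthand, not necessarily the providers' exact API names; check current docs for up-to-date limits):

```ts
// Context window limits from the list above, keyed by illustrative model IDs.
const CONTEXT_LIMITS: Record<string, number> = {
  "gpt-4o": 128_000,
  "claude-3.5": 200_000,
  "gemini-1.5-pro": 2_000_000,
};

// A request is safe only if prompt + expected completion fit the window.
function fitsWindow(
  model: string,
  promptTokens: number,
  completionTokens: number,
): boolean {
  const limit = CONTEXT_LIMITS[model];
  if (limit === undefined) throw new Error(`Unknown model: ${model}`);
  return promptTokens + completionTokens <= limit;
}
```

Note that the completion budget counts against the window too, which is easy to forget when tracking only prompt size.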
When you exceed this limit, one of two things happens:
- Hard failure — The API returns an error, your app crashes
- Silent truncation — The model forgets earlier context without warning
Both are bad user experiences. And they happen faster than you'd expect — a conversation with code snippets can hit 128K tokens in 10-15 exchanges.
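A quick back-of-envelope check of that claim (the per-exchange size here is our assumption, not a measurement):

```ts
// Rough estimate of how many code-heavy exchanges fit in a 128K window.
// Assumptions (ours): ~4 characters per token, and ~40K characters added
// per exchange once prompts start echoing code and prior context.
const charsPerToken = 4;
const charsPerExchange = 40_000;
const tokensPerExchange = charsPerExchange / charsPerToken; // 10,000 tokens
const contextLimit = 128_000;
const exchangesUntilFull = Math.floor(contextLimit / tokensPerExchange); // 12
```

With smaller snippets the runway is longer, but the failure mode is the same: usage grows with every exchange because the full history is resent each time.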
Why Developers Struggle
- No visibility — You don't know how many tokens you're using until it crashes
- No warning — The app fails abruptly instead of degrading gracefully
- Complex counting — Tokens ≠ words, and different models tokenize differently
- No built-in solution — AI SDKs don't include token management
The Options
Option 1: Ignore It
Most common approach. Works until the conversation gets long enough.
Problem: Crashes in production. Bad user experience.
Option 2: Naive Truncation
```ts
// Keep only the last N messages
const recentMessages = messages.slice(-20)
```

Problem: Loses the system prompt. Loses important early context. Magic number (20) doesn't account for message length.
Option 3: Character Approximation
```ts
// Estimate: 1 token ≈ 4 characters
const estimatedTokens = text.length / 4
```

Problem: Inaccurate. Code, non-English text, and special characters throw off estimates. Still crashes.
Option 4: External Library
Use tiktoken or similar for accurate counting.
Problem: Extra dependency. Different libraries for different models. Still need to build the UI and truncation logic.
The Decision
We built a complete context window management system:
1. Real-time Token Tracking
```ts
const { tokenCount, percentage, isApproachingLimit } = useContextWindow({
  model: "gpt-4o",
  maxTokens: 128000,
  warningThreshold: 0.8,
})
```

2. Visual Indicator
```ts
<TokenMeter count={tokenCount} max={128000} />
```

Shows users (and developers) exactly where they stand.
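The derived state behind a hook like this is simple to sketch as a pure function (the names `computeContextState` and `ContextWindowState` are ours, for illustration):

```ts
// Derives display state from a raw token count, a limit, and a threshold.
interface ContextWindowState {
  tokenCount: number;
  percentage: number; // 0-100, rounded to one decimal place
  isApproachingLimit: boolean;
}

function computeContextState(
  tokenCount: number,
  maxTokens: number,
  warningThreshold = 0.8,
): ContextWindowState {
  const ratio = tokenCount / maxTokens;
  return {
    tokenCount,
    percentage: Math.round(ratio * 1000) / 10,
    isApproachingLimit: ratio >= warningThreshold,
  };
}
```

For example, `computeContextState(102_400, 128_000)` yields a percentage of 80 with `isApproachingLimit: true`, which is exactly when the meter should change color.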
3. Automatic Truncation
```ts
const safeMessages = truncateToFit(messages, {
  strategy: "keep-recent",
  preserveSystem: true,
  targetTokens: 100000,
})
```

Trade-offs
| Decision | Our Choice | Alternative | Why |
|---|---|---|---|
| Token counting | tiktoken | Approximation | Accuracy matters at limits |
| UI feedback | Percentage + count | Just percentage | Power users want exact counts |
| Truncation default | Drop oldest | Summarize | Simpler, no extra API calls |
| Warning threshold | 80% | 90% | Earlier warning = more time to react |
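The "drop oldest, keep system" default above can be sketched in a few lines. This is a minimal sketch, assuming a pluggable `countTokens` function; the `Message` shape and the `countTokens` option are ours for illustration, and the real `truncateToFit` signature may differ:

```ts
// Keep-recent truncation: always retain system messages, then walk from
// newest to oldest, keeping messages while they fit the token budget.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

interface TruncateOptions {
  preserveSystem: boolean;
  targetTokens: number;
  countTokens: (text: string) => number; // injected tokenizer (tiktoken, estimate, ...)
}

function truncateToFit(messages: Message[], opts: TruncateOptions): Message[] {
  const { preserveSystem, targetTokens, countTokens } = opts;
  const system = preserveSystem
    ? messages.filter((m) => m.role === "system")
    : [];
  const rest = messages.filter((m) => !system.includes(m));

  // Reserve budget for the preserved system messages first.
  let budget =
    targetTokens - system.reduce((n, m) => n + countTokens(m.content), 0);

  const kept: Message[] = [];
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = countTokens(rest[i].content);
    if (cost > budget) break; // stop at the first message that doesn't fit
    kept.unshift(rest[i]);
    budget -= cost;
  }
  return [...system, ...kept];
}
```

Injecting the tokenizer keeps the truncation logic model-agnostic: swap in an accurate counter at the limit, or a cheap character estimate for the UI meter.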
What We Excluded
- Prompt compression — Requires separate API calls, complex
- Semantic chunking — Too application-specific
- Automatic summarization — Adds cost and latency