What is a Context Window? — AI Glossary

A context window is the maximum amount of text — measured in tokens — that a language model can read and consider at one time. Think of it as the AI’s working memory: everything within the context window is “visible” to the model when it generates a response. Anything outside it — older parts of a conversation, documents you haven’t included — is invisible.

Context windows have grown dramatically in recent years. GPT-3 had a 4,096-token window in 2020. Today, Claude’s context window reaches 200,000 tokens (roughly 500 pages of text), and Gemini 1.5 Pro offers 1 million tokens. This expansion has transformed what AI can do — from answering single questions to analyzing entire codebases, legal contracts, or books in one session.

Table of Contents

How Context Windows Work

When you send a message to an AI assistant, the model doesn’t just see your latest message. It sees the entire conversation history, plus any documents or context you’ve included — all of it packed into a single large input called the context. The model generates its response based on everything in that context simultaneously.

The context window is filled with three types of content:

System prompt: Instructions that tell the model how to behave (set by developers or the platform).
Conversation history: All the previous messages in the current session — both your messages and the model’s responses.
User input: Your current message, plus any files, documents, or data you’ve attached.

When the conversation exceeds the context window limit, something has to give. Different systems handle this differently: some truncate the oldest messages, some summarize older context, and some return an error. In all cases, the model loses access to information that was earlier in the conversation.

The Transformer architecture that powers modern LLMs processes attention across all tokens in the context simultaneously — which is why larger context windows require dramatically more compute. Processing 1 million tokens requires roughly 100x the compute of processing 10,000 tokens (attention complexity scales quadratically with context length).

Why Context Windows Matter

Context windows matter because they define what you can ask an AI to work with in a single session. A small context window means the AI forgets earlier parts of your conversation quickly. A large one means you can analyze a full legal contract, a codebase, or an entire book in one shot.

For businesses, context windows determine the architecture of AI applications. Small context windows require complex RAG pipelines to manage large knowledge bases. Larger context windows make it possible to just load everything and let the model find what’s relevant — simpler and sometimes more accurate.

A 2024 study from Stanford’s HAI found that models with larger context windows scored significantly higher on tasks requiring comprehension of long documents, with the improvement most pronounced on tasks where key information appeared in the middle of long texts.

Context Windows Across Major Models (2025)

Here are the context windows for major AI models as of early 2025:

GPT-4o: 128,000 tokens (~96,000 words / ~400 pages)
Claude 3.5 Sonnet: 200,000 tokens (~150,000 words / ~600 pages)
Gemini 1.5 Pro: 1,000,000 tokens (~750,000 words / ~3,000 pages)
Llama 3.1 (70B): 128,000 tokens
Mistral Large: 128,000 tokens

Limitations and Best Practices

“Lost in the middle” problem: Research from Stanford (2023) showed that LLMs tend to pay most attention to content at the very beginning and very end of the context window. Information buried in the middle of a very long context is more likely to be overlooked. This means that even a 1 million token context window doesn’t guarantee perfect recall of all 1 million tokens.

Cost scaling: Larger context windows are more expensive to use. If you’re paying per token, loading 100,000 tokens of context for a simple question is wasteful. RAG remains the right architecture when you need to search a large knowledge base efficiently rather than loading everything at once.

Context management in long conversations: For multi-turn conversations, periodically summarizing earlier context (and replacing it with the summary) is a common technique to stay within context limits while preserving the thread of conversation.

For further reading, see the context window overview at Grokipedia, the Stanford “Lost in the Middle” paper at arXiv, or our article on what is a token for the foundational concept.

Key Takeaways

In one sentence: A context window is the maximum text an AI can consider at once — its working memory, measured in tokens.
Why it matters: Context window size determines what you can analyze in one session and shapes the architecture of AI applications.
Real example: Claude’s 200,000-token context window can fit a 600-page novel, allowing you to ask questions about the whole book at once.
Related terms: Token, RAG, LLM, Transformer

Frequently Asked Questions

What happens when I exceed the context window?

The model loses access to the oldest content in the context. In chat interfaces, older messages get silently truncated — you may notice the model “forgets” things you said early in a very long conversation. In API calls, you’ll receive an error if you try to send more tokens than the limit allows.

Is a larger context window always better?

Not necessarily. Larger context windows cost more to use (more tokens processed = higher API cost), can lead to “lost in the middle” issues where important information gets overlooked, and can slow down responses. The optimal context size depends on your task — use just enough context to give the model what it needs.

Does the context window reset between conversations?

Yes. Each new conversation starts with an empty context. The model has no memory of previous conversations unless you or the application explicitly provide that history in the new session’s context. This is different from long-term memory, which some products (like ChatGPT Memory) implement on top of the base model.

How do I know if my content will fit in the context window?

A rough calculation: words × 1.33 = approximate token count. For precise counts, use a tokenizer tool. Most consumer AI interfaces will display a warning when you’re approaching the context limit. For API users, checking token counts programmatically before sending requests is standard practice.

What is the difference between context window and memory?

The context window is the model’s immediate working memory — what it can see right now, which resets after each session. “Memory” features (like ChatGPT Memory) are a separate layer — the platform stores summaries of past conversations and injects them into new sessions’ context automatically. It’s a workaround for the context window’s session-limited nature.

What is a context window in AI?

The context window is the maximum amount of text — measured in tokens — that an AI model can read and consider at one time. Everything you put in a conversation, including previous messages, system instructions, and retrieved documents, must fit within this window. Claude 3’s context window is up to 200,000 tokens; GPT-4o supports up to 128,000. Larger windows let models handle book-length documents and longer conversations.

Why does AI forget things?

LLMs don’t have persistent memory between conversations by default — each session starts fresh. Within a single conversation, the model can only ‘see’ what fits in its context window. Once a conversation exceeds the window limit, the oldest messages get dropped, which is why the model appears to forget earlier parts of a long chat. Solutions include summarizing old context, using external memory stores, or choosing a model with a larger context window.

Want to learn more AI concepts?

Browse our complete AI Glossary for plain-English explanations of every AI term, or get our Beginners in AI Report for free updates.

Get free AI tips delivered daily → Subscribe to Beginners in AI

Learn Our Proven AI Frameworks

Beginners in AI created 6 branded frameworks to help you master AI: STACK for prompting, BUILD for business, ADAPT for learning, THINK for decisions, CRAFT for content, and CRON for automation.

Ollama vs LM Studio on My Mac

How to Turn Off Microsoft Copilot

Best AI Prompts for Insurance