What is Tokenization? — AI Glossary

Tokenization is the process of splitting text into smaller units called tokens that an AI model can process. Tokens can be words, word pieces, characters, or other text segments. Because AI models work with numbers, not text, tokenization is the essential first step that converts human language into a numerical format the model can handle.

You’ve likely encountered tokenization’s effects without knowing it — when an AI chat interface reports “token usage,” or when you notice that unusual words cost more tokens than common ones, or when a model seems to struggle with certain names or technical terms. All of these are downstream effects of how the tokenizer divides text.

How Tokenization Works

Modern language models use subword tokenization — splitting text into pieces that are smaller than words but larger than individual characters. The most common approaches are:

Byte Pair Encoding (BPE): starts with individual characters (or bytes) and iteratively merges the most frequent adjacent pair into a new token until the vocabulary reaches a target size. Used by GPT models.
WordPiece: similar to BPE, but chooses merges by how much they increase the likelihood of the training data rather than by raw frequency. Used by BERT.
SentencePiece: a language-agnostic tokenizer that works directly on raw text as a sequence of Unicode characters, treating whitespace as an ordinary symbol instead of assuming words are pre-split, which makes it useful for multilingual models.
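The BPE merge loop described above can be sketched in a few lines. This is a toy illustration under simplifying assumptions, not the implementation any particular model uses: the function name learn_bpe_merges is made up for this example, it works on characters rather than bytes, and it never merges across word boundaries.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Represent every word as a tuple of symbols, starting from single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties go to the pair that was counted first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


merges, corpus = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent stem "low" has become a single symbol, while the rarer endings "er" and "est" are still spelled out character by character. Real tokenizers run tens of thousands of merges over a much larger corpus.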

For example, the word “tokenization” might be split into [“token”, “ization”], which is two subword tokens. A rare word like “antidisestablishmentarianism” might split into many more pieces, while common words like “the” are a single token. Each token is assigned a unique numerical ID from the model’s vocabulary (typically about 30,000 to over 100,000 entries).
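To make the split-then-look-up step concrete, here is a small sketch that segments a word by greedy longest match against a toy vocabulary and then maps each piece to its ID. Greedy longest match is the style of lookup WordPiece uses at inference time; BPE tokenizers replay their learned merges instead. The vocabulary, its IDs, and the function names are all invented for illustration.

```python
# Toy vocabulary: the token strings and IDs are invented for illustration only.
TOY_VOCAB = {"the": 0, "token": 1, "ization": 2, "iz": 3, "ation": 4}
# Fall back to single characters so any lowercase word can be encoded.
for ch in "abcdefghijklmnopqrstuvwxyz":
    TOY_VOCAB.setdefault(ch, len(TOY_VOCAB))


def segment(word, vocab):
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {word[start]!r}")
    return pieces


def encode(word, vocab):
    """Map a word to the numerical IDs of its pieces."""
    return [vocab[piece] for piece in segment(word, vocab)]


print(segment("tokenization", TOY_VOCAB))  # ['token', 'ization']
print(encode("tokenization", TOY_VOCAB))   # [1, 2]
```

A word with no matching multi-character pieces, such as "that" in this toy vocabulary, falls back to one token per letter, which is why rare words cost more tokens than common ones.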

Why Tokenization Matters

Tokenization shapes everything about how a language model perceives and processes text:

Context window: the context window is measured in tokens, not words. A 128K-token context window holds roughly 90,000–100,000 English words, but this depends heavily on language and content type.
Cost: most commercial AI APIs charge by the token, so the tokenization of your input directly determines cost. Verbose prompts and prompts full of rare words cost more.
Performance on specialized domains: a tokenizer trained on general English text may split medical or legal terms into many small pieces, requiring more tokens and potentially hurting model understanding.
Non-English languages: many tokenizers are biased toward English. Languages with different scripts or morphology may require significantly more tokens per word, which leaves them a smaller effective context window and raises their cost per word.
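The context-window and cost points come down to simple arithmetic. The sketch below uses the common rule of thumb of about 0.75 English words per token. The function names are made up for this example, and the price is a placeholder rather than any provider's real rate.

```python
def words_that_fit(context_tokens, words_per_token=0.75):
    """Approximate how many words fit in a context window of `context_tokens`."""
    return int(context_tokens * words_per_token)


def estimate_cost(num_tokens, price_per_million_tokens):
    """Cost of `num_tokens` at a given price per one million tokens."""
    return num_tokens * price_per_million_tokens / 1_000_000


# English prose at roughly 0.75 words per token.
print(words_that_fit(128_000))  # 96000

# A language that needs twice as many tokens per word (an assumed ratio)
# fits fewer words in the same window.
print(words_that_fit(128_000, words_per_token=0.375))  # 48000

# Placeholder price of 3.00 per million tokens, not a real quote.
print(estimate_cost(2_000_000, 3.00))  # 6.0
```

Because both limits and bills are counted in tokens, anything that raises the tokens-per-word ratio shrinks the usable window and raises the price for the same text.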

Tokenization in Practice

Practical implications for developers and users:

Counting tokens: OpenAI’s tiktoken library and Hugging Face tokenizers let you count tokens before sending a request, enabling cost estimation and context window management.
Prompting efficiency: concise prompts use fewer tokens and cost less. Removing redundant words and whitespace reduces token count.
Code tokenization: code tokenizes differently from prose. Indentation, brackets, and keywords each consume tokens. Tokenizers trained on corpora that include a lot of code tend to encode common programming patterns, such as runs of whitespace, more compactly.
Splitting long documents: documents exceeding the context window must be split into chunks for RAG pipelines. Chunk boundaries should respect semantic units, not just token counts.
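As a rough sketch of chunking that respects semantic units, the example below packs whole paragraphs into chunks under a token budget instead of cutting at an arbitrary token offset. The count_tokens function is a stand-in that counts whitespace-separated words. In practice you would pass in a counter backed by your model's actual tokenizer. Both function names are hypothetical.

```python
def count_tokens(text):
    """Stand-in token counter: one token per whitespace-separated word."""
    return len(text.split())


def chunk_paragraphs(text, max_tokens, counter=count_tokens):
    """Group whole paragraphs into chunks of at most `max_tokens` tokens.

    A single paragraph larger than the budget is kept whole as its own
    oversized chunk; splitting it further is left to the caller.
    """
    chunks, current, current_tokens = [], [], 0
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip empty paragraphs
        n = counter(paragraph)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


document = "one two three\n\nfour five\n\nsix seven eight nine"
for chunk in chunk_paragraphs(document, max_tokens=5):
    print(repr(chunk))
```

Here the first two paragraphs share a chunk because together they stay within the budget of five, and the third starts a new chunk. Paragraphs are only one possible unit; sentences or section headings work the same way.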

Common Misconceptions

Misconception: One token equals one word. On average in English, one token is about 0.75 words (GPT models). Common words are one token; longer or rarer words may be multiple tokens. Non-English text typically requires more tokens per word.

Misconception: Tokenization is a solved problem. Current tokenization schemes have known limitations — they can make arithmetic harder (numbers may be split across tokens in ways that obscure their mathematical structure), struggle with certain languages, and create inconsistencies for names and technical terms.

Key Takeaways

Tokenization splits text into tokens — the basic units AI language models process.
Modern LLMs use subword tokenization (BPE, WordPiece) to handle any word with a fixed vocabulary.
Token count affects context window limits, API costs, and model performance on specialized text.
Roughly 1 token ≈ 0.75 English words (but varies by language and content type).
Non-English languages and technical domains often require more tokens per unit of meaning.

Frequently Asked Questions

What is a token in AI?

A token is the basic unit of text that an AI model processes. Depending on the tokenizer, a token may be a full word, a word piece, a punctuation mark, or a space. The model converts each token to a numerical ID, then to an embedding vector for processing.

Why do some words use more tokens than others?

Subword tokenizers assign single tokens to common patterns (frequent words, common prefixes/suffixes) and split rare patterns into multiple tokens. Rare words, technical jargon, names in non-Latin scripts, and very long compound words require more tokens than common English words.

Does tokenization affect AI output quality?

Yes. Models have been found to perform worse on tasks that require understanding of individual characters (like spelling, anagrams, or character counting) because the tokenization layer obscures character-level information. This is why models sometimes make surprising errors counting letters in words.

What is the vocabulary size of common models?

GPT-3.5 Turbo and GPT-4 use a vocabulary of 100,277 tokens (tiktoken cl100k_base), while the original GPT-3 models use 50,257 (r50k_base). LLaMA 3 uses 128,256 tokens. BERT uses 30,522 WordPiece tokens. Larger vocabularies reduce the average number of tokens per word but increase the model’s embedding table size.
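The embedding-table trade-off is easy to quantify: the table has one row per vocabulary entry and one column per hidden dimension. The sketch below uses the vocabulary sizes quoted above. The hidden sizes (768 for BERT-base, 4,096 for the 8B LLaMA 3 model) are assumptions brought in for this example, and the function name is made up.

```python
def embedding_parameters(vocab_size, hidden_size):
    """Number of weights in an embedding table with one row per token."""
    return vocab_size * hidden_size


# Hidden sizes are assumptions for illustration (BERT-base and LLaMA 3 8B).
print(embedding_parameters(30_522, 768))     # 23440896
print(embedding_parameters(128_256, 4_096))  # 525336576
```

Under these assumptions the larger vocabulary costs roughly half a billion weights in the input embedding alone, which is the price paid for encoding the same text in fewer tokens.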

How do I count tokens for OpenAI’s API?

Use OpenAI’s tiktoken Python library: import tiktoken; enc = tiktoken.encoding_for_model("gpt-4"); len(enc.encode(your_text)). Alternatively, use OpenAI’s online tokenizer tool at platform.openai.com/tokenizer to visualize how text is split and count tokens interactively.

Sources: Wikipedia — Tokenization · Hugging Face: Tokenizers Course · arXiv: Neural Machine Translation of Rare Words with Subword Units (BPE)
