The dominant architecture behind virtually every modern LLM is the transformer, introduced in the 2017 paper "Attention Is All You Need." Before transformers, language models relied on recurrent neural networks (RNNs) that processed text sequentially. Transformers process all tokens in a sequence in parallel using a mechanism called self-attention, which allows the model to weigh the relevance of every other token when representing any given token. This parallelism enabled training on vastly larger datasets and produced qualitatively stronger language understanding.
Tokens and Context Windows
Text is split into tokens before entering a language model. A token is roughly 0.75 words in English, so "understanding" becomes a single token while "supercalifragilistic" might split into several. The model never sees raw characters; it sees token IDs corresponding to its vocabulary, which typically contains 32,000 to 200,000 entries.
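The ~0.75-words-per-token figure can be turned into a rough token-count estimate. This is a back-of-the-envelope sketch only (the function name is illustrative); real tokenizers such as BPE split text on learned subword units, so exact counts vary by model and vocabulary:

```python
def estimate_tokens(text: str) -> int:
    """Rough token-count estimate using the ~0.75 words-per-token
    rule of thumb for English text. Real tokenizers operate on
    learned subword units, so exact counts differ by model."""
    words = len(text.split())
    return round(words / 0.75)

# 6 words -> roughly 8 tokens under the rule of thumb
print(estimate_tokens("Transformers process all tokens in parallel."))
```

In practice you would use the model's own tokenizer to count tokens exactly, since billing and context limits are both defined in the model's token units.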
The context window is the number of tokens the model can attend to at once. GPT-4o supports up to 128,000 tokens; Gemini 1.5 Pro supports up to 1 million tokens. A longer context window means the model can process and reason over more text in a single pass, which is critical for tasks like analysing long documents or maintaining coherent long conversations.
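When a document exceeds the context window, a common workaround is to split its token sequence into chunks that each fit, optionally overlapping so context carries across boundaries. A minimal sketch (the function name and parameters are illustrative, not any particular library's API):

```python
def chunk_for_context(token_ids: list[int], window: int,
                      overlap: int = 0) -> list[list[int]]:
    """Split a token sequence into chunks of at most `window` tokens.
    `overlap` repeats that many tokens at each chunk boundary so the
    model keeps some shared context between consecutive chunks."""
    step = window - overlap
    return [token_ids[i:i + window] for i in range(0, len(token_ids), step)]

# A 10-token document with a 4-token window and 1 token of overlap
chunks = chunk_for_context(list(range(10)), window=4, overlap=1)
```

Chunking trades away the long-range attention a large window provides, which is why million-token windows matter for tasks like whole-document analysis.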
How Inference Works
During inference (generating a response), the model takes the input tokens, processes them through dozens of transformer layers, and produces a probability distribution over its entire vocabulary for the next token. A decoding strategy then selects which token to emit. Greedy decoding always picks the highest-probability token. Sampling with temperature introduces randomness by scaling the logits before the softmax: a high temperature flattens the distribution, producing more diverse and creative output; a low temperature sharpens it, producing more deterministic, focused output. Top-p (nucleus) sampling restricts choices to the smallest set of tokens whose cumulative probability meets a threshold.
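The three decoding strategies above can be sketched over a toy set of logits. This is a minimal illustration, not production sampling code (real implementations work on tensors over vocabularies of tens of thousands of entries):

```python
import math
import random

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    # Temperature divides the logits before normalising: T < 1 sharpens
    # the distribution, T > 1 flattens it toward uniform.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits: list[float]) -> int:
    # Always emit the single highest-probability token.
    return max(range(len(logits)), key=lambda i: logits[i])

def top_p_sample(logits: list[float], p: float = 0.9,
                 temperature: float = 1.0) -> int:
    probs = softmax(logits, temperature)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Keep the smallest prefix of tokens whose cumulative probability
    # reaches p, then renormalise and sample from that nucleus.
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    weights = [probs[i] / cum for i in kept]
    return random.choices(kept, weights=weights)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # toy 4-token vocabulary
print(greedy(logits))           # index of the highest logit
print(top_p_sample(logits, p=0.9, temperature=0.8))
```

Note that greedy decoding is deterministic given the same input, while temperature and top-p sampling will produce different outputs across runs.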
The most powerful LLMs use tens of thousands of GPUs or TPUs during training. Inference is cheaper but still computationally intensive, which is why running large models locally requires high-end consumer hardware.
Why Language Models Hallucinate
Hallucination is the phenomenon where an LLM produces confident-sounding but factually incorrect information. The core reason is architectural: the model is optimised to produce plausible token sequences, not to verify claims against a reliable knowledge base. When the model encounters a query that probes the boundaries of its training data, it often generates a coherent-sounding but fabricated answer rather than admitting uncertainty.
Mitigation strategies include retrieval-augmented generation (RAG), which grounds the model in retrieved documents before answering; reinforcement learning from human feedback (RLHF), which trains the model to express uncertainty appropriately; and tool use, which lets the model delegate factual queries to a search engine or database.
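The RAG idea can be sketched in a few lines: retrieve the documents most relevant to a query, then instruct the model to answer only from that retrieved context. The retriever below is a toy word-overlap ranker for illustration (production systems use dense embeddings and a vector index), and the function names are hypothetical:

```python
def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Toy retriever: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_grounded_prompt(query: str, documents: list[str]) -> str:
    """Assemble a prompt that grounds the model in retrieved context
    and explicitly permits it to decline rather than fabricate."""
    context = retrieve(query, documents)
    return ("Answer using only the context below. If the context is "
            "insufficient, say so instead of guessing.\n\n"
            "Context:\n" + "\n".join(context) +
            "\n\nQuestion: " + query)
```

Grounding shifts the failure mode: instead of fabricating from parametric memory, the model can fail visibly when retrieval returns nothing useful, which is far easier to detect.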
Language Model Perplexity
Perplexity is a standard evaluation metric for language models. It measures how surprised the model is by a held-out test corpus: a lower perplexity means the model's probability estimates closely match the true distribution of text. A perplexity of 10 means the model is, on average, as uncertain as if it had 10 equally likely choices at each step. Perplexity is useful for comparing models trained on the same data distribution, but it does not directly measure task performance, which is why benchmark suites such as MMLU, HellaSwag, and HumanEval are used alongside it.
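Concretely, perplexity is the exponentiated average negative log-likelihood the model assigns to each token of a held-out sequence. A minimal sketch over per-token probabilities (the function name is illustrative):

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood) over the
    probabilities the model assigned to each held-out token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that assigns every token probability 0.1 behaves as if it
# faced 10 equally likely choices at each step: perplexity 10.
print(perplexity([0.1, 0.1, 0.1]))
```

This makes the "10 equally likely choices" reading exact: a uniform distribution over N options yields perplexity N, and better models concentrate probability on the correct token, driving the value down.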