Master Claude, Chapter 1: The Evolution of Large Language Models — From Markov Chains to Context Engineering

This is the first post in a chapter-by-chapter series on Master Claude Chat, Cowork and Code: From Prompting to Operational AI by Sho Shimoda. Each article previews one chapter — enough to shift how you think about working with AI, but only a fraction of what the full book covers.

Chapter 1 is the one I kept rewriting. Not because the material is complicated — it is, but that is not the issue. The issue is that most people skip it. They want the practical stuff: the prompting tricks, the automation workflows, the CI/CD integration. And I get that. But every single technique in the remaining nineteen chapters depends on understanding what a language model is actually doing when it generates text. If you skip this chapter, you will use Claude like a magic box. If you read it, you will use Claude like an engineer.


From Markov chains to Transformers

The chapter starts where most AI books do not: with Claude Shannon's experiments in the 1950s on the statistical structure of English text. Not because it is quaint history, but because the core idea has not changed. A language model predicts the next token based on the tokens that came before. Shannon did it with letter frequencies. N-gram models did it with word sequences. Modern Transformers do it with attention mechanisms across 100,000+ tokens of context.
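Shannon's core idea fits in a few lines of code. Here is a minimal sketch of a bigram (2-gram) model that predicts the next word purely from counts of what followed each word in training text. The toy corpus is invented for illustration:

```python
from collections import Counter, defaultdict

# Toy training corpus -- any text works
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count what word follows each word (bigram counts)
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def next_word_distribution(word):
    """Probability distribution over the next word, given the previous one."""
    counts = follows[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

dist = next_word_distribution("the")
# In this corpus, "the" is followed by cat (2x), mat (1x), fish (1x),
# so "cat" gets probability 0.5
```

Swap letter counts for word counts and this is essentially Shannon's experiment; swap the count table for a neural network and you are on the road to a Transformer.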

The breakthrough was the 2017 paper "Attention Is All You Need." Before Transformers, models had a fixed-size memory window: a 5-gram model only looks at the previous four words. If the relevant context is fifteen words back, the model is blind to it. Transformers changed that by letting the model attend to every part of the input simultaneously, weighting what matters most for each prediction.

I walk through this evolution in enough technical detail that you understand why things work, without drowning in linear algebra. If you have ever wondered why GPT-1's 117 million parameters could barely finish a sentence while Claude's architecture can orchestrate a multi-file refactor across a codebase — Chapter 1 explains the structural reasons.

💡 Key idea: A language model does not produce text. It produces probability distributions over possible next tokens. Everything about how these models succeed and fail follows from that single fact.

Probability, entropy, and perplexity — the math you actually need

Here is the part most AI practitioners skip and most AI books gloss over. When you use Claude and ask "What is the capital of France?", the model does not retrieve an answer. It generates a probability distribution: "Paris" at 85%, "Lyon" at 3%, "The" at 2%, and thousands of other tokens at fractions of a percent. Then it samples from that distribution.
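That two-step process, build a distribution, then sample from it, can be sketched directly. The probabilities below are invented for illustration, not real model outputs:

```python
import random

# Invented next-token distribution for the prompt
# "What is the capital of France?" -- numbers are illustrative only.
next_token_probs = {"Paris": 0.85, "It": 0.10, "Lyon": 0.03, "The": 0.02}

def sample_token(probs, rng=random):
    """Sample one token from a probability distribution over tokens."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

token = sample_token(next_token_probs)
# Usually "Paris" -- but 15% of the time, something else.
```

This is why the same prompt can produce different answers on different runs, and why a model can emit a wrong token even when the right one dominates the distribution.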

Entropy measures how uncertain that distribution is. Low entropy means the model is confident — one token dominates. High entropy means many tokens are roughly equally likely, and the model is essentially guessing.

Perplexity is the exponential of entropy. If a model's perplexity on some text is 100, it is as uncertain as if it had to choose uniformly among 100 options at each step. Competitive models achieve perplexity around 20-30 on benchmark datasets. That sounds good. But the remaining uncertainty is where hallucinations live.
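Both quantities are a few lines of code. A minimal sketch contrasting a confident distribution with a guessing one (the distributions are invented for illustration):

```python
import math

def entropy(probs):
    """Shannon entropy in nats: how uncertain a distribution is."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def perplexity(probs):
    """exp(entropy): the effective number of equally likely choices."""
    return math.exp(entropy(probs))

confident = [0.97, 0.01, 0.01, 0.01]   # one token dominates
guessing  = [0.25, 0.25, 0.25, 0.25]   # uniform over four tokens

# perplexity(guessing) is exactly 4.0 -- as uncertain as a 4-way draw.
# perplexity(confident) is close to 1 -- the model has essentially decided.
```

A good prompt moves the model from the second distribution toward the first.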

This is not academic trivia. It is the reason vague prompts produce bad outputs. Every time you write an ambiguous instruction, you are pushing the model into a high-entropy region where multiple completions are equally plausible. The model picks one. Sometimes it picks well. Often it does not. And you blame the AI.

⚠ Warning: Hallucinations are not random errors. They are confident outputs from high-entropy regions — the model generates fluent, authoritative text in exactly the situations where it is least certain. Understanding entropy is the first step to preventing them.

The scaling story — and why it is hitting a wall

GPT-1 had 117 million parameters. GPT-2 had 1.5 billion. GPT-3 had 175 billion. GPT-4 is estimated at a trillion. Each generation followed the same pattern: more parameters, more data, more compute, dramatically better results. Researchers formalized this as scaling laws:

Loss ∝ N^(−α), where N is the parameter count, with analogous power laws for data and compute
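A quick numeric sketch makes the diminishing returns concrete. The exponent below is roughly the parameter-scaling exponent reported in the Kaplan et al. scaling-law work, but treat it, and the baseline, as illustrative:

```python
# Diminishing returns under a power law: loss ~ N ** (-alpha).
# alpha = 0.076 is approximately the parameter-scaling exponent from
# the Kaplan et al. results; illustrative, not a precise claim.
alpha = 0.076

def relative_loss(n_params, baseline_params=1e8):
    """Loss relative to a baseline model, assuming loss ∝ N^-alpha."""
    return (n_params / baseline_params) ** (-alpha)

# Going from 100M to 1B parameters (10x) cuts loss by roughly 16%...
gain_10x = 1 - relative_loss(1e9)
# ...and the NEXT 10x (1B -> 10B) buys the same multiplicative factor
# on an already smaller loss, while training costs grow superlinearly.
```

Each order of magnitude of scale buys the same fraction of improvement at a steeper price, which is exactly the wall the chapter describes.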

The pattern was clear and it was intoxicating. In 2022-2023, there was widespread expectation that if scaling continued, artificial general intelligence was just a few doublings away.

Several things happened. High-quality training data became the bottleneck — models were already trained on most of the text available on the internet. The gains from scaling started showing diminishing returns for certain classes of problems. And training costs became superlinear: each doubling of model size costs more than twice as much, because you need more communication bandwidth between GPUs and longer training runs.

I spend several pages on why this matters practically, not just theoretically. The short version: if you are waiting for a smarter model to solve your AI productivity problems, you will be waiting a long time. The next leap does not come from bigger models. It comes from you.

💡 Key idea: The era of explosive scaling is plateauing. Claude, GPT-4, and other frontier models represent a local peak in what pure scale can achieve. Future improvements come from better context, better instructions, and better system design — not just more parameters.

The rise of context engineering

This is the concept the entire book is built on, and Chapter 1 is where it is introduced.

Early GPT-3 had a context window of 2,048 tokens. Modern Claude has 100,000+. That is a 50x increase in how much information the model can consider when generating a response. It means you can feed the model your entire codebase before asking it to write new code. You can provide a 50-page API specification before asking it to implement a client. You can give it the full history of a project, all previous decisions, and all existing patterns.

If you have a 100,000-token context window and you only use 5,000 tokens of it, you are wasting 95% of your available leverage. That is what most people do. They type a short prompt, get a mediocre answer, and conclude that AI is overhyped.

Context engineering is the discipline of structuring what goes into that window — the right examples, clear problem definitions, concrete specifications, relevant constraints — so the model operates in a low-entropy environment where the correct output is overwhelmingly probable. This is also why operational AI — agents that can read your actual systems, access your real codebase, see error messages and logs — is so much more powerful than a chat window. The agent has access to your actual context, not context you have to manually transcribe into a prompt.
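One way to picture context engineering is as deliberate prompt assembly. The section names and helper below are a hypothetical sketch, not a framework from the book; the point is that every section you fill in eliminates completions the model would otherwise consider plausible:

```python
def build_context(task, spec=None, examples=None, constraints=None):
    """Assemble a structured, low-ambiguity prompt from named sections.

    Hypothetical sketch: each populated section shrinks the space of
    plausible completions, pushing the model into low-entropy territory.
    """
    sections = []
    if spec:
        sections.append(f"## Specification\n{spec}")
    if examples:
        joined = "\n".join(f"- {e}" for e in examples)
        sections.append(f"## Examples of the desired output\n{joined}")
    if constraints:
        joined = "\n".join(f"- {c}" for c in constraints)
        sections.append(f"## Constraints\n{joined}")
    sections.append(f"## Task\n{task}")
    return "\n\n".join(sections)

prompt = build_context(
    task="Implement a client for the endpoints above.",
    spec="GET /users returns a JSON list of {id, name}.",
    examples=["client.get_users() -> [User(id=1, name='Ada')]"],
    constraints=["Python 3.11", "no third-party dependencies"],
)
```

An operational agent does this assembly implicitly, by reading your files and logs directly, which is exactly the advantage the paragraph above describes.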

I will not spoil the specific frameworks the book introduces for structuring context — that is where the remaining nineteen chapters earn their keep — but I will say that the gap between AI beginners and AI power users is not about which model they use. It is about how they shape context. And that is a learnable skill.


Chain-of-thought and structured reasoning

The chapter closes with a section on chain-of-thought prompting and Claude's extended thinking capability. The core insight: when you ask a model to show its reasoning step by step, it uses its own intermediate outputs as additional context for subsequent predictions. The problem gets broken into smaller pieces where the model is less likely to make errors.
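The mechanism can be caricatured in a few lines. This toy sketch decomposes one computation into small steps and records each intermediate result; the analogy (invented here, not from the book) is that a chain-of-thought model conditions on its own earlier outputs the same way the running state carries forward below:

```python
def solve_step_by_step(steps, state):
    """Toy chain-of-thought: apply small steps, recording each result.

    The mechanism, not the arithmetic, is the point -- every
    intermediate result is written down and carried into the next step,
    so no single step has to make a large, error-prone leap.
    """
    transcript = []
    for describe, apply in steps:
        state = apply(state)
        transcript.append(f"{describe}: {state}")
    return state, transcript

# Break "(3 + 4) * 2" into two small, low-error steps.
steps = [
    ("Add 3 and 4", lambda s: s + 4),
    ("Multiply by 2", lambda s: s * 2),
]
answer, transcript = solve_step_by_step(steps, 3)
# answer == 14; the transcript shows each intermediate result
```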

Claude's extended thinking feature lets you request that the model spend more computational resources reasoning before responding. It is worth it when the problem is genuinely difficult — analyzing a complex database schema, working through a multi-constraint architectural decision. For routine tasks, it is wasted resources and added latency.
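At the API level, extended thinking is a per-request option. A sketch of what the request payload looks like, assuming the Anthropic Messages API's `thinking` parameter; the model name and token budgets are placeholders, so verify names and limits against the current documentation:

```python
# Hypothetical request payload for the Anthropic Messages API.
# The `thinking` block enables extended thinking with a token budget.
# Model name and budgets are placeholders -- check current docs.
request = {
    "model": "claude-sonnet-latest",  # placeholder model name
    "max_tokens": 4096,               # must exceed the thinking budget
    "thinking": {
        "type": "enabled",
        "budget_tokens": 2048,  # reasoning budget: spend it on hard problems
    },
    "messages": [
        {"role": "user",
         "content": "Analyze this schema for normalization problems: ..."}
    ],
}
```

The budget is the trade-off knob the chapter discusses: raise it for a multi-constraint architectural decision, drop the `thinking` block entirely for routine tasks.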

The chapter explains both the mechanism and the trade-off, so you know when to reach for extended thinking and when to save your tokens.


What Chapter 1 sets up

By the end of this chapter, you will understand what a language model is actually doing when it generates text, why probability and entropy explain both the model's successes and its failures, why bigger models alone will not solve your productivity problems, why context engineering is the highest-leverage skill you can develop, and when chain-of-thought reasoning is worth the computational cost.

Every technique in the rest of the book — from XML-structured prompting to CLAUDE.md guardrails to MCP connectors to CI/CD automation — builds on these foundations. Chapter 1 is the chapter that makes every other chapter work.

Next in this series: Chapter 2 — The Three Pillars of Claude. We will break down how Chat, Cowork, and Code each serve a distinct role in the Claude ecosystem, and the decision framework for matching the right tool to the right task.


📖 Get the complete book

All twenty chapters, the full technical foundations on probability and entropy, context engineering frameworks, hands-on workflows for Claude Chat, Cowork, and Code, plus the CLI reference, CLAUDE.md templates, MCP examples, and security checklist.

Get Master Claude Chat, Cowork and Code on Amazon →

2026-03-02

Sho Shimoda

I share and organize what I’ve learned and experienced.