01
The chat · what you use every day
System is set once, then User / Assistant take turns
System is the app’s house rules, set once per chat. Then User (you) and Assistant (it) take turns. The turns pile up. Nothing else is in there.
02
The essence · what it actually does
Everything is joined into one text; it guesses the next word
Remove the boxes and the pieces become one long string. The model reads the string and continues from the end, one word at a time. However fancy the AI looks, this is the job.
So how does it “guess”? Let’s flatten the view.
02
The essence · through its eyes
Flattened out: one string, new words popping out at the tail
Bubbles and roles are for you. The model sees one long string of tokens. It reads the whole string, then emits the next word at the end. That is the typing effect you see.
02
How it guesses · ① attention
Before guessing, it looks back at what matters most
Attention is the heart of the transformer. Before picking the next word, it weighs earlier words differently. Here it leans on “To be, or not to be” and the rhythm of the line.
02
How it guesses · ② probability
It never knows the answer — it scores every candidate
After reading everything, it builds a probability table for the next word: arrows .71 / pains .12 / stings .05… “Guessing” means picking from that table. Because it is always guessing, confident does not mean correct. A hallucination is not a separate bug; it is this table doing its normal job.
02
How it guesses · ③ append, repeat
Pick one, append it, guess the next the same way
The chosen word is added to the end. Then it repeats the same process: guess, append, guess again. Word by word, the whole reply gets generated. That is the core of an LLM.
03
The constraint · one “head” first
It reads the honest way: every word × every word
One reader like this is called a “head”. It scores how each word relates to every other word and stores the result as a web: 100 words = 10,000 cells; 10,000 words = 100 million. What you put in is context. The web’s limit is the context length, fixed at build time.
03
The constraint · that was one head
There are dozens of heads — not split work, but angles
Head 1 does not read the front while head 2 reads the back. Every head reads the whole text, each from its own angle: rhythm, names, tone. Dozens run at once. It is faster, but no less work: dozens of webs, all computed. (The picture uses 64.)
03
The constraint · deeper layer by layer
That stack of webs is just layer 1 — dozens more follow
Layer 1 reads the raw text. Each later layer reads what the previous layer produced. Another web, one level deeper (the picture uses 80). The bill multiplies: words × words × heads × layers. Double the window, quadruple the bill. That is why the window needs a cap.
03
The constraint · the one hard limit
It can only read so much — and the middle goes muddy
The string has a hard cap: context length. The two classic complaints, big files choke it and long chats make it forget, both start here. The longer the string gets, the easier it is to miss the middle. The buried line is the one it reads past.
04
Stateless · there is no “just now”
The chat is an illusion: every message is a first meeting
Between replies, nothing stays in its head. A continuous chat is the app resending the history each time. The model rereads from the top and continues. That is also why long chats cost more. It has no state. What you send is what it has.
05
Feed it · start with a handoff list
First, list every piece of knowledge the job takes
Treat it like a brilliant new hire on a permanent first day. What would they need before taking over? Your goal, situation, private materials, latest progress, public knowledge and common sense. Do not sort who supplies what yet. List it all.
06
Cross off · what it comes with
Two items on the list can be crossed right off
Public knowledge and common sense: if it is public, common, and before the cutoff, it saw it in training. Already in its head. Cross those off. Four things remain for you to supply: goal / status / materials / constraints. Those go into the window.
07
Ask · information comes out in talk
Just talk — and for anything complex, have it ask you first
You do not need to fill the list at once. Information comes out in talk: it asks, you answer, the slots fill in. For complex work, start with “Before you start, ask me about anything unclear.” One round of questions saves rounds of rework.
08
Sand it · v1 is usually wrong
Toss “what went wrong” back as-is, round after round
The first version is rarely right. It was guessed into being. Paste the raw result or error back; it revises. Still wrong? Paste again. The chat grows, and the result gets sharper. Give symptoms, not diagnosis. Stop when it is good enough.
09
Save · don’t just close the tab
Have it distill the chat into a document — that is a Skill
If the job worked, add one last instruction: “Distill this chat into a reusable doc.” It extracts the context, final solution, and pitfalls, then writes a Skill. Experience accumulates as documents you can reuse.
09
Save · how it pays off
Next time’s handoff: the Skill fills the slots
Next time the same kind of job appears, drop in the Skill. The how-to, pitfalls, and house rules come with it. You only state this time’s goal and status. Five rounds last time; one round now. That is saved context compounding.
10
Fit the hands · tool
The model has no hands: it writes an instruction; a tool does the work
“Book me a meeting” — the model cannot reach your calendar. It writes a tool call on the notepad — check_calendar(Wed,…) — and stops. A program outside the model checks the calendar: a tool. The result returns as text. The model reads it and answers. The model only talks; the tool does the work.
11
Fit the memory · injection (RAG/Memory/Skill)
Memory is bolted on: it fetches, or someone writes
The model does not remember you by itself. Memory is added around it. Two routes: a tool fetches something, or the harness writes context onto the notepad — documents (RAG), preferences and past results (Memory), reusable how-tos (Skill). All arrive at fixed moments. Who writes them? Coming up.
12
The body moves · the loop (Agent)
The same model, fed a few more rounds = an Agent
With hands and memory attached, the system can run on its own: write → tool runs → result written back, round after round. None of it needs you; the notepad keeps growing. An Agent sounds mysterious. The mechanism is plain: the same model, fed a few more rounds.
13
The full skeleton · harness
The shell that assembles this body is the harness
The harness is the program that writes the notepad, offers tools, and runs the loop: Claude Code, Cursor, your AI app. “Who does the writing?” This is who. It can also run several loops as a workflow. The point: only the model is special. The rest is ordinary code, and you can shape it.
14
Reflexes & red lines · hooks
Fixed points on the loop come with sockets — hooks
“Hook” is an old programmer word: when a fixed event fires, your attached action runs. If dinner starts, “wash hands first” runs. The loop has sockets at key points: after you speak, before it acts, after a tool runs, and before wrap-up. Put gates, injections, and follow-ups there. A prompt asks. A hook enforces.