Glossary of LLM terms — Hector Sanchez

A working glossary. The author maintains this as a quick reference, not as a tutorial. If a term has more than one common meaning, the most common one is listed first. The glossary is not exhaustive; it covers what the author has actually had to look up.

Tokens

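The unit a model reads and writes. One token is roughly 0.75 English words on average, though the exact ratio depends on the tokenizer and the language. By that rule of thumb, a 100,000-token context window holds about 75,000 words.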
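
The arithmetic can be sketched as follows. This is only the 0.75 rule of thumb applied in both directions, not a real tokenizer; the function names are made up for illustration.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb for English text, not a tokenizer


def estimate_words(tokens: int) -> int:
    """Approximate how many English words fit in a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a text from its whitespace-split words."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


print(estimate_words(100_000))  # 75000
```

For real budgeting, count with the tokenizer that belongs to the model in use; the estimate can be off noticeably for code or non-English text.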

Context window

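The maximum amount of text a model can consider at once, measured in tokens. Includes the system prompt, the conversation history, any retrieved documents, and usually the output the model generates. Sizes vary by model and change quickly; at the time of writing, large models commonly offer windows from roughly 100K tokens up to 1M or more.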
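
One way to stay inside the window is to drop the oldest conversation turns first. The sketch below assumes a crude word-based token estimate (the 0.75 words-per-token rule of thumb) and hypothetical names (`rough_tokens`, `fit_history`); a real application would count with the model’s own tokenizer.

```python
import math


def rough_tokens(text: str) -> int:
    """Crude token estimate: about 0.75 English words per token (assumption)."""
    return math.ceil(len(text.split()) / 0.75)


def fit_history(system_prompt, history, window, reserve_for_output):
    """Keep the most recent messages that fit in the window.

    The system prompt is always kept, and some room is reserved for the reply.
    """
    budget = window - reserve_for_output - rough_tokens(system_prompt)
    kept = []
    for message in reversed(history):  # walk from newest to oldest
        cost = rough_tokens(message)
        if cost > budget:
            break  # this message and everything older is dropped
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept
```

Dropping whole messages from the old end is the simplest policy; summarizing the dropped turns is a common alternative.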

System prompt

The instructions the model is given before any user message. The system prompt sets the model’s behavior, role, and constraints. Because it applies to every turn of the conversation, it is usually the highest-leverage part of a prompt.

Prompt caching

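A provider feature that caches a prefix of the prompt (usually the system prompt or a large retrieved context) so that subsequent requests reusing that prefix are processed faster, with the cached tokens typically billed at a reduced rate rather than the full input price. Can substantially reduce cost and latency for long-running conversations that resend the same context.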
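
A rough sketch of the saving, in units of "full-price input tokens". The `cached_rate` parameter is the fraction of the normal price charged for cached tokens; the 0.1 used below is a made-up figure for illustration, not any provider’s price. The sketch also ignores conversation growth and any surcharge for writing to the cache.

```python
def conversation_cost(prefix_tokens, turn_tokens, turns, cached_rate=1.0):
    """Total input cost of a conversation, in full-price token units.

    The first request pays full price for the shared prefix; later requests
    pay `cached_rate` times the full price for it (1.0 means no caching).
    """
    total = 0.0
    for turn in range(turns):
        rate = 1.0 if turn == 0 else cached_rate
        total += prefix_tokens * rate + turn_tokens
    return total


without_cache = conversation_cost(10_000, 100, turns=10)
with_cache = conversation_cost(10_000, 100, turns=10, cached_rate=0.1)  # hypothetical rate
```

The larger the shared prefix is relative to the new tokens per turn, the more of the bill the cache removes.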

Temperature

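A sampling parameter that controls how random the model’s output is. The model’s logits are divided by the temperature before sampling: at 0 the model always picks the most likely token (greedy decoding); at 1 it samples from its unmodified distribution; values above 1 flatten the distribution further. Production code often uses 0 for near-deterministic output or low values (0.2-0.5) for slight variation. Even at 0, many hosted APIs do not guarantee identical output across runs.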
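
A minimal sketch of how temperature is commonly applied: the logits (the model’s raw scores per token) are divided by the temperature before the softmax. The function name and the example logits are illustrative, and temperature 0 is treated as a special case that puts all probability on the top token.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Greedy decoding: all probability mass on the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
greedy = softmax_with_temperature(logits, 0)
sharp = softmax_with_temperature(logits, 0.5)
plain = softmax_with_temperature(logits, 1.0)
```

Lower temperatures concentrate probability on the top token (`sharp[0]` is larger than `plain[0]`), which is why low values give only slight variation.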

Top-p (nucleus sampling)

A sampling parameter used alongside, or instead of, temperature. The model samples from the smallest set of most-likely tokens whose probabilities sum to at least p. Top-p=0.9 means the model picks from the tokens that together account for 90% of the probability mass and ignores the long tail.
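
The selection step can be sketched like this, assuming the token probabilities are already available as a dictionary. The function names and the toy distribution are illustrative only.

```python
import random


def nucleus(probs: dict, p: float) -> dict:
    """Keep the smallest set of top tokens whose mass reaches p, renormalized."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def sample_top_p(probs: dict, p: float, rng: random.Random) -> str:
    """Sample one token from the nucleus."""
    kept = nucleus(probs, p)
    tokens = list(kept)
    weights = [kept[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


toy = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}  # made-up distribution
kept = nucleus(toy, 0.9)  # "zebra" falls outside the nucleus
```

With p=0.9 the unlikely "zebra" can never be sampled, while the three remaining tokens keep their relative odds.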

Tool use

The model is given a description of one or more external tools (functions, APIs) and decides when to call them. The call is structured (usually JSON) and the result is fed back into the model. Tool use is the foundation of agentic workflows.

Function calling

A specific kind of tool use where the tool is a function with typed parameters. The model produces a JSON object matching the function’s schema; the application executes the function and returns the result.
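
The application side of this exchange can be sketched as below. The `{"name": ..., "arguments": ...}` shape of the call and the `get_weather` tool are assumptions for illustration; each provider defines its own wire format, and the model’s output is simulated here as a hard-coded string.

```python
import json


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Hypothetical tool; returns canned data instead of calling a real API."""
    return {"city": city, "temperature": 21, "unit": unit}


TOOLS = {"get_weather": get_weather}


def execute_tool_call(raw_call: str) -> str:
    """Parse the model's JSON call, run the matching function, return JSON."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        result = TOOLS[name](**call["arguments"])
    except TypeError as exc:  # arguments did not match the function signature
        return json.dumps({"error": str(exc)})
    return json.dumps(result)


# Simulated model output; a real model would produce this from the tool schema.
model_output = '{"name": "get_weather", "arguments": {"city": "Lisbon"}}'
tool_result = execute_tool_call(model_output)
```

The returned string is what gets fed back to the model as the tool result. Errors are returned as data rather than raised, so the model has a chance to correct its call.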

MCP (Model Context Protocol)

An open protocol for connecting models to tools and data sources. An MCP server exposes a set of tools (and optionally resources and prompts); an MCP client, running inside the host application, discovers and calls them. Standardizes what was previously a per-vendor integration.

Embedding

A vector representation of text such that texts with similar meaning have nearby vectors, usually measured by cosine similarity. Embeddings are the basis of semantic search, retrieval-augmented generation, and many other techniques. Dimensions commonly range from a few hundred to a few thousand (for example 384 to 3072), depending on the model.
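
Similarity between two embeddings is often measured with cosine similarity. The sketch below uses hand-written three-dimensional vectors as stand-ins; real embeddings come from an embedding model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

related = cosine_similarity(cat, kitten)
unrelated = cosine_similarity(cat, invoice)
```

Semantic search amounts to computing this score between the query’s embedding and every stored embedding, then returning the highest-scoring texts.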

RAG (Retrieval-Augmented Generation)

A pattern where the model is given retrieved documents as part of its context, so it can answer questions grounded in those documents rather than its training data. Reduces hallucination at the cost of a retrieval step.
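
The pattern can be sketched in a few lines. Word overlap stands in for embedding similarity here to keep the example self-contained, and the prompt wording, function names, and documents are all made up for illustration.

```python
import re


def words(text: str) -> set:
    """Lowercased word set, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, documents: list, k: int = 2) -> list:
    """Return the k documents sharing the most words with the query."""
    query_words = words(query)
    ranked = sorted(documents, key=lambda doc: len(query_words & words(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list, k: int = 2) -> str:
    """Place the retrieved documents in the prompt ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the documents below.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {query}"
    )


docs = [
    "The refund window is 30 days.",
    "Shipping takes 5 business days.",
    "Our office is in Madrid.",
]
prompt = build_prompt("How long is the refund window?", docs, k=1)
```

The resulting prompt is what gets sent to the model; swapping the overlap score for embedding similarity changes only `retrieve`.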

Agent

A system that uses a model to decide what to do next, possibly including tool calls, file writes, or other actions. The loop runs until the model decides the task is complete or hits a stopping condition. Agents vary from 20 lines of code to thousands.
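
A sketch at the small end of that range. The model is replaced by a scripted function so the loop can run on its own, and the action format (`type`, `tool`, `arguments`, `answer`) is an assumption for illustration rather than any framework’s API.

```python
def run_agent(model, tools, task, max_steps=5):
    """Ask the model for the next action until it finishes or steps run out."""
    history = [("task", task)]
    for _ in range(max_steps):
        action = model(history)
        if action["type"] == "finish":
            return action["answer"]
        # Otherwise the action is a tool call: run it and record the result.
        result = tools[action["tool"]](**action["arguments"])
        history.append((action["tool"], result))
    return None  # stopping condition: step limit reached without an answer


def scripted_model(history):
    """Stand-in for an LLM: calls `add` once, then reports the result."""
    if len(history) == 1:
        return {"type": "tool", "tool": "add", "arguments": {"a": 2, "b": 3}}
    return {"type": "finish", "answer": f"The sum is {history[-1][1]}"}


tools = {"add": lambda a, b: a + b}
answer = run_agent(scripted_model, tools, "Add 2 and 3")
```

The `max_steps` limit is the stopping condition that keeps a model that never finishes from looping forever.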

Fine-tuning

Continuing to train an existing model on a smaller, task-specific dataset to specialize it. More expensive than prompting but far cheaper than training from scratch. Used when prompting alone cannot get the required behavior, format, or style out of the base model.

Distillation

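Training a smaller model (the student) to mimic a larger one (the teacher), typically by training on the teacher’s outputs or output probabilities. The smaller model can retain much of the larger model’s capability on the targeted tasks at a fraction of the compute cost. Common in production deployments.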
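
One common formulation of the training signal is the KL divergence between the large model’s and the small model’s softened output distributions. The sketch below computes that loss for a single position with made-up logits; it is not a full training loop, and it leaves out the scaling and label-loss terms that real recipes often add.

```python
import math


def softmax(logits, temperature=1.0):
    """Probabilities from raw scores, softened by a temperature above 1."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(large_logits, small_logits, temperature=2.0):
    """KL(large || small) over the softened distributions; 0 when they match."""
    large = softmax(large_logits, temperature)
    small = softmax(small_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(large, small) if p > 0)


large_logits = [4.0, 1.0, 0.2]  # made-up scores from the larger model
close = distillation_loss(large_logits, [3.5, 1.2, 0.3])
far = distillation_loss(large_logits, [0.2, 1.0, 4.0])
```

Training lowers this loss, pulling the small model’s distribution toward the large model’s; `close` is smaller than `far` because those logits already rank the tokens the same way.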