- Model
- The parameters. Stateless - does next-token prediction and nothing else. Cannot do anything agentic on its own.
- Parameters
- The numbers inside a model - often billions - tuned during training. Everything the model knows lives in them. Also called weights.
- Training
- The process that sets a model's parameters by exposing it to vast amounts of text and adjusting them to improve next-token prediction.
- Inference
- Running a trained model to generate output - what happens on every model provider request. Parameters stay fixed.
- Token
- The atomic unit a model reads and writes. Often a whole word, sometimes only part of one. Context window size and cost are measured in tokens, and latency grows with token count.
- Next-token prediction
- What the model actually does. Computes a probability distribution over the next token given the context, samples one token from it, appends it, and runs again. Its only mode of operation.
- Non-determinism
- The same input can produce different output. A property of how models generate text and how providers serve requests.
- Model provider
- Whatever serves a model for inference. Usually remote (Anthropic, OpenAI, Google), but can also be local (Ollama, llama.cpp).
- Harness
- Everything around the model that turns it into an agent: tools, system prompt, context-window management, permissions, hooks.
- Model provider request
- One round-trip from the harness to the model provider. The harness sends context; the provider returns one response.
- Input tokens
- Tokens the harness sends on each model provider request. Typically billed at a lower rate than output tokens.
- Output tokens
- Tokens the model generates back. Typically billed at a higher rate than input tokens, since they are generated one at a time and cost more compute to produce.
- Prefix cache
- The provider-side store that lets a request skip re-processing a prefix it shares with an earlier request. The provider bills those tokens at a lower rate.
- Cache tokens
- Input tokens the provider serves from its prefix cache because a previous request already processed them. Billed at a much lower rate than other input tokens.
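
Several of the terms above (parameters, next-token prediction, non-determinism, and the input, output, and cache token split) can be tied together in a small sketch. Everything in it is an assumption made for illustration: `TOY_PARAMETERS` is a hand-written lookup table standing in for a real model's parameters, the toy model looks only at the last token rather than the whole context, and the numbers in `RATES` are made-up units, not any provider's pricing.

```python
import random

# Hypothetical stand-in for a model's parameters: for each last token,
# a probability distribution over the next token. A real model computes
# this from the whole context using billions of learned numbers.
TOY_PARAMETERS = {
    "<start>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "dog": {"sat": 0.7, "<end>": 0.3},
    "sat": {"<end>": 1.0},
}

# Made-up billing rates in arbitrary units per token (assumption).
RATES = {"input": 10, "cached": 1, "output": 50}


def next_token(context, rng):
    """Sample one next token from the distribution implied by the context."""
    dist = TOY_PARAMETERS[context[-1]]
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def generate(prompt, rng, max_tokens=10):
    """Inference loop: sample a token, append it to the context, run again."""
    context = list(prompt)
    output = []
    for _ in range(max_tokens):
        token = next_token(context, rng)
        if token == "<end>":
            break
        context.append(token)
        output.append(token)
    return output


def request_cost(input_tokens, cached_tokens, output_tokens, rates):
    """Cost of one request; cached_tokens is the cached subset of input_tokens."""
    uncached = input_tokens - cached_tokens
    return (
        uncached * rates["input"]
        + cached_tokens * rates["cached"]
        + output_tokens * rates["output"]
    )


if __name__ == "__main__":
    # Same prompt, different random seeds: the output may differ.
    for seed in range(3):
        print(seed, generate(["<start>"], random.Random(seed)))
    print(request_cost(1000, 800, 100, RATES))
```

With these made-up rates, a request with 1000 input tokens, 800 of them cached, and 100 output tokens costs 200 × 10 + 800 × 1 + 100 × 50 = 7800 units. The parameters never change during `generate`, which matches the description of inference above. The variation between runs comes only from sampling.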