Agentic Engineering — A Plain-English Glossary

If you’ve been paying attention to AI over the past year, you’ve probably noticed the word agent getting thrown around a lot. It’s worth knowing what it actually means when people use it — and what the different terms mean in practice.

At its core, an AI agent is a language model wrapped in a loop that can take actions, observe results, and keep iterating until it reaches a goal. The ReAct paper from Yao et al. (2023) formalised this pattern, showing that interleaving reasoning with action is a reliable way to get models to solve multi-step problems.

But underneath the buzzwords, a real discipline has formed around building these systems. Agentic engineering covers everything from the basic loop to the safety layers, memory systems, and evaluation frameworks that turn a clever demo into something you’d actually ship.

Here’s a glossary of the terms you’ll encounter. I’ve tried to keep it practical — what each thing is, why it matters, and what you’d reach for in production.

Agentic loop architecture diagram showing the flow from user input through the agent/LLM, tools, memory, and back to output

The core agentic loop: user input flows through the agent, which calls tools and manages memory, then produces output. The loop repeats until the goal is achieved.

Core Loop & Capabilities

Agent — A language model wrapped in a loop that can take actions (call tools), observe results, and decide what to do next, until a goal is met. The model brings the intelligence; the loop brings the reliability.

Agentic loop — The perceive → think → act → observe cycle the agent repeats. Each iteration, the model reasons about the current state, decides on an action, the action executes, and the result feeds back into the next round of reasoning.

ReAct (Reason+Act) — The loop style that interleaves a reasoning step and a tool action each iteration. Popularised by the ReAct paper (Yao et al., 2023), it’s cheap, improvised, and works surprisingly well for simple tasks before it starts going off the rails.

Plan-and-Execute — Make one upfront plan, then execute it step by step. Better on multi-step tasks; costs an extra call; the plan can be wrong. The improvement over ReAct is structural — it reduces the chance of the agent drifting off course mid-execution.

Reflexion — After answering, the agent critiques its own output and may retry. Self-correction without human feedback; must cap iterations. Related to Shinn et al.’s Reflexion paper (2023) which showed agents improve when given feedback on their own attempts.

Spec-driven development — The agent works from a structured, human-reviewed spec (requirements → design → tasks) as the durable source of truth, instead of improvising from a one-line prompt. The spec is written and approved before execution, versioned, and stays the reference the result is verified against. It turns “vibe coding” into something reviewable and testable — similar to how Amazon’s working backwards process uses PR/FAQ documents to align teams before building.

Tool — A callable function the model can invoke with structured args and get a result. The unit of “doing something”. From a calculator to a database query to a browser action — anything needed to interact with the world. This is how agents go beyond chat — instead of just talking, they can actually do things.

Tool/function calling — The model API feature where the model emits a structured call (name + args) instead of prose. Native function-calling > prompted JSON. The shift from “ask the model to output JSON and parse it” to “the model natively produces structured calls” was one of the biggest practical improvements in agent engineering — see OpenAI’s function calling docs for how this works.

Skill — A packaged capability (instructions + optional scripts/resources) the model loads on demand (progressive disclosure): a cheap manifest always in context, full body loaded only when needed. Procedural knowledge — not facts, but how to do things. A recipe book — the model only opens the recipe for what it needs to cook.

Sandbox — An isolated environment to run untrusted or model-written code so it can’t harm the host. The spectrum runs from Docker containers through gVisor to Firecracker microVMs — each step is a stronger security boundary, at the cost of startup time and complexity. E2B is a popular managed option for agent sandboxes.

AI coding assistant interface showing an agentic workflow

Tools extend what the model can do — from reading files to running code to browsing the web.

Workflow & Orchestration Patterns

Anthropic formalised five patterns for structuring LLM calls beyond a single agent loop. They’re useful regardless of which framework you use.

Prompt chaining — Split a task into a fixed sequence of steps; each LLM call’s output feeds the next. Simplest orchestration pattern. Good when the task decomposes into a linear pipeline and each step is self-contained.

Routing — Classify the input and dispatch it to a specialised prompt or model. The first step is a classifier; the classifier sends the query to the right specialist. Good for heterogeneous workloads.

Parallelisation (fan-out / fan-in) — Run independent subtasks concurrently and aggregate the results. Two reasons: speed (reduce wall-clock time), or quality (self-consistency / majority vote). When the subtasks don’t depend on each other’s output, this is the easiest win.

Orchestrator-workers — A lead LLM dynamically decomposes a task and dispatches subtasks to workers (subtasks not known in advance). This is the general case — the orchestrator decides what work needs doing, not just following a pre-built graph.

Evaluator-optimizer — One LLM generates, another critiques in a loop until a bar is met. Close kin of Reflexion, but with two distinct models doing two distinct jobs. Good for quality-sensitive output where iteration is worth the cost.

Protocols & Interoperability

Two open protocols that matter for agent-to-agent and agent-to-tool communication.

MCP (Model Context Protocol) — Open JSON-RPC 2.0 protocol to connect an agent to external servers exposing tools, resources, and prompt templates. Anthropic’s answer to “USB-C for tools” — it standardises discovery (tools/list) and invocation (tools/call) across vendors. A universal adapter for connecting agents to tools. Learn more on the MCP website.

A2A (Agent-to-Agent) — Open protocol (Google) for agent-to-agent delegation: agents advertise an agent card of capabilities and exchange task requests and responses. Agent ↔ agent (horizontal integration), across trust and billing boundaries. Where MCP lets agents call tools, A2A lets agents call other agents.

Agent card — A manifest (often at /.well-known/agent.json) advertising an agent’s name, capabilities, and endpoint, used for A2A discovery. The agent’s CV, if you will.

Function/tool schema — The JSON schema describing a tool’s args; the validation contract between model and runtime. Every tool call the model makes must conform to this schema.

Abstract network topology with interconnected nodes

Protocols connect agents to tools, and agents to each other.

Control, Safety & Guardrails

This is where agent systems either hold together or fall apart. Safety isn’t an afterthought — it’s foundational.

Hook — A deterministic callback fired at a fixed point in the loop (before_llm, before_tool, after_tool, etc.) that can block or modify — the enforcement mechanism. Hooks are traffic lights in the decision-making process. Contrast with a tracer, which only observes. Hooks are the how; guardrails are the what.

Guardrail — A policy enforced via hooks: input screening, tool allowlists, output filtering. The guardrail defines what’s allowed; hooks are how it gets enforced in the loop. Similar to how content filters on social media platforms screen posts before they’re seen by others.

Prompt injection — Untrusted input that tries to override the agent’s instructions (“ignore previous instructions and…”). Bhargava et al. (2024) showed how common and damaging this is. Defended with input screens + prompt hardening; there’s no silver bullet yet.

Indirect prompt injection — The injection arrives via content the agent ingests (a web page, doc, email, tool output), not the user’s direct message. This is the dominant real-world attack vector — the agent trusts its tools, and the attacker exploits that trust.

Lethal trifecta (Simon Willison, 2025) — An agent with all three of {access to private data, exposure to untrusted content, ability to communicate externally} is exploitable for data theft. Remove any one leg to defuse it. Three legs standing means the agent is a liability. Willison wrote about this on his blog with practical examples.

Data exfiltration — Injected instructions make the agent leak private data over an external channel (e.g. embedding data in a URL of an image tag). The payoff of the lethal trifecta.

Jailbreak — Input crafted to bypass safety policies (vs injection, which hijacks task instructions). Often handled by a classifier like Llama Guard or Prompt Shields.

SSRF (Server-Side Request Forgery) — Trick a tool into fetching internal/cloud metadata endpoints. Blocked by host allowlists. The classic “fetch this URL” tool becomes a liability when it fetches cloud metadata endpoints — see the OWASP SSRF page for examples.

Human-in-the-loop (HITL) — Pause the loop for a human to approve a risky action, then resume. The simplest and most effective safety mechanism. Doesn’t scale to high-frequency operations, but for anything with real consequences, it’s the right call.

Model Armor / Content Safety / Prompt Shields — Managed model-based guardrail services (GCP / Azure). Run the input or output through a safety model before it reaches the main agent or the user.

Confused deputy — A security flaw where an agent is tricked into misusing its privileged credentials on behalf of an attacker. Mitigated with scoped tokens and workload identity. Named after the classic confused deputy problem in computer security — the system “gets confused” about whose interests it should serve.

Red-teaming — Adversarially probing an agent for unsafe/exploitable behaviour before (and after) shipping. Not a one-off exercise — it should be continuous.

Specification gaming / reward hacking — The agent satisfies the literal objective while violating its intent. A classic problem in AI alignment (Hubinger et al., 2019). The agent does exactly what you asked, in the way you never intended.

Memory & Knowledge

Agents need memory in at least four forms, and each serves a different purpose.

Short-term memory — The conversation/context window for the current session; gone when it ends. It’s the messages list.

Episodic memory — The log of what happened (sessions, turns), searchable by time. Let you ask “what did I do last Tuesday?” or resume from where you left off.

Semantic memory — What is true, searchable by meaning via embeddings. Facts, knowledge, anything that’s retrievable by similarity rather than by timestamp.

Procedural memory — How to act (skills, workflows). Not facts, but the know-how — the step-by-step instructions for complex tasks.

Embedding — A numeric vector representing text’s meaning, so similarity = closeness in vector space. The foundation of semantic search. Converting words into coordinates on a map where nearby points have similar meanings.

RAG (Retrieval-Augmented Generation) — Retrieve relevant chunks at query time and paste them into the prompt so the model reasons over fresh facts without retraining. The simplest form of giving agents access to current knowledge. Lewis et al.’s original paper (2020) covers the foundations.

GraphRAG — RAG over a knowledge graph instead of flat chunks; better for multi-hop reasoning and relationships. When the connections between facts matter as much as the facts themselves. Microsoft’s GraphRAG system is the most well-known implementation.

Hybrid search — Combine dense (vector) + sparse (BM25/keyword) retrieval to catch both meaning and exact terms. The modern default for production systems.

Agentic RAG — The agent decides when and what to retrieve in a loop (re-retrieve if context is insufficient), vs a fixed one-shot pipeline. The agent can judge whether the retrieved context is actually useful.

Context engineering / compaction — Actively managing the context window: summarising or pruning old turns so it doesn’t overflow or rot. The context window is finite; compaction is how you stretch it.

Checkpointing — Persisting state after every step so a crash loses at most one step and any worker can resume. The foundation of durable execution. Similar to how game save points let you continue from where you left off instead of restarting.

Data flow visualization with nodes and connections representing memory systems

Memory types: short-term, episodic, semantic, procedural. Each one solves a different problem.

Multi-Agent & Autonomy

One agent is a tool. Multiple agents is a system.

Sub-agent / multi-agent — A lead agent delegating sub-tasks to specialist agents, each with isolated context. The lead agent doesn’t need to know how to do every task — it needs to know what task needs doing and who should do it.

Orchestrator / supervisor — The agent (or graph node) that routes work to others. The traffic controller of a multi-agent system.

Durable execution — Running long agents on a workflow engine (Temporal/LangGraph) so they survive restarts at the workflow level, not just per-step. The difference between “I’ll resume from the last checkpoint” and “the workflow engine manages my entire lifecycle.” Like how Temporal powers workflows that can run for days or weeks without losing progress.

Agent-as-graph — Modelling the agent as an explicit state machine or graph of nodes rather than a free while loop. More structure, more predictability, less “what did the agent decide to do this time?”

Evaluation & Learning

If you’re not measuring quality, you’re not engineering — you’re shipping blind.

Eval — A test that measures agent quality on a dataset (outcome and/or trajectory). Without evals, you have no idea if your changes are helping or hurting.

LLM-as-judge — Using a model to score another model’s output against a rubric. Cheap, surprisingly accurate for well-defined criteria, but you’re still relying on another model’s judgment. Liu et al. (2023) provide a good overview of the approach and its limitations.

Trajectory — The full record of an agent run (inputs, decisions, tool calls, results, output). The unit of eval and of training data. Every trajectory is a potential training example.

Regression suite — Re-running evals on every change to catch quality drops. If you make 100 improvements and 3 regressions, you need to know about the 3.

Verifier — A model or function that scores whether a candidate answer is correct, used to select among candidates. The backbone of test-time compute strategies like best-of-N.

Test-time (inference) compute — Spending more compute at inference for a better answer: best-of-N, self-consistency (majority vote), tree search (ToT/MCTS), debate. The trade-off is simple: more tokens for better quality. Always worth it if the task is hard enough and the cost is bearable. Wang et al.’s self-consistency paper (2022) shows how much this can help.

Flywheel — The loop where captured trajectories + rewards become training data that improves the next model. The system gets better the more it runs. The feedback loop that turns deployment into learning.

Grounding / citation — Tying claims to retrieved sources; groundedness scoring detects hallucination. If the agent can’t point to its source, it’s making things up.

Reliability, Serving & Observability

Agent systems fail in interesting ways. You need to see them failing before users do.

LLM gateway / router — A layer in front of models providing retries, fallback cascades, budgets, rate limits, and caching. The first line of defence against model instability. LiteLLM is the most popular open-source option.

Fallback cascade — On failure, retry on a cheaper or different model. Primary model times out? Try the backup. Backup is slow? Try the fastest available. The fallback chain keeps the system running when individual components fail.

Circuit breaker — Stop calling a failing dependency for a while so it can recover. Prevents cascading failures — one slow model shouldn’t stall thousands of loops.

Token / cost budget — A hard ceiling on tokens or dollars per turn or per agent. Without budgets, agents will spend whatever they’re given. Always cap what you’re willing to lose.

Observability / OTel GenAI — Structured spans per step with token/cost/latency data, following OpenTelemetry GenAI semantic conventions. You can’t debug what you can’t see. Every step of an agent’s decision-making should be traceable.

Span / trace_id — A unit of timed work / the id that correlates all spans of one request across services. The single identifier that lets you follow an agent’s path through a dozen components.

Inference & Serving

Understanding your serving layer isn’t optional when you’re running agents at scale.

Context window — The max tokens a model can attend to; the hard limit that everything works around. The constraint that drives architecture decisions more than anything else.

Continuous batching — Interleave many requests through the GPU at token granularity for throughput. What makes serving hundreds of concurrent agents feasible without hundreds of GPUs. vLLM popularised this approach. A chef preparing multiple dishes at once instead of finishing one completely before starting the next.

Speculative decoding — A small draft model proposes tokens a big model verifies, speeding generation. The draft model is cheap; the verification is selective. Net win when the draft model is close enough to the target.

Quantization — Lower-precision weights (Q4/Q6/MXFP4) to serve bigger models cheaper. The accuracy cost vs the cost savings trade-off is the defining question of local serving. Your local models are probably quantized. Compressing a high-res photo — you lose some detail but gain a lot in file size.

Mixture of Experts (MoE) — Only a subset of parameters (experts) activate per token — big capacity, lower per-token cost. Most modern models are MoE. The architecture that made big models affordable. A team of specialists — you only call on the expert you need for each question, instead of asking everyone.

Capability Surface

Things that exist but are harder to build than the core loop.

Computer use / browser use — Agents that drive a GUI or browser. The frontier of agent capability — if an agent can use a computer, it has access to everything a human does. Google’s Gemini Computer Use and OpenAI’s ChatGPT agent (which absorbed the retired Operator) are the most visible examples right now.

Multimodal / voice — Vision and audio input/output. The gap between text-only agents and agents that can see, hear, and speak.

Self-improving agent — An agent that writes its own tools and skills. The agent that gets better at its job the longer it runs, without human intervention. Still largely aspirational.

The Tooling Landscape

Here’s what you’d actually reach for in production. OSS stands for Open Source Software — tools you can run yourself, versus the managed services from Azure, GCP, and AWS. Open source means you can inspect the code, modify it, and run it anywhere — no vendor lock-in.

A naming note, because the cloud vendors have been busy rebranding: Azure AI Foundry is now Microsoft Foundry (Ignite, Nov 2025), and Vertex AI is now the Gemini Enterprise Agent Platform (Cloud Next, Apr 2026) — shortened to “Agent Platform” in the tables. Sub-service names like Vertex AI Vector Search and Agent Engine carry over for now, so you’ll see both brandings in the wild.

Evaluation & Learning

Capability	OSS	Azure	GCP	AWS
Agent/LLM evals, LLM-as-judge	Inspect AI, promptfoo, DeepEval	Microsoft Foundry	Gen AI Evaluation service	Amazon Bedrock Evaluations
RAG eval (faithfulness/groundedness)	Ragas, TruLens	Microsoft Foundry	Gen AI Evaluation + Check Grounding	Bedrock Knowledge Bases eval
Test-time compute (best-of-N, self-consistency)	DSPy, vLLM n-sampling	Azure ML (DIY)	Agent Platform (DIY), candidateCount	SageMaker (DIY), Bedrock n
RL / fine-tune (SFT, DPO, RLHF)	TRL, Axolotl, OpenRLHF	Azure OpenAI fine-tuning	Agent Platform tuning (SFT/DPO)	Bedrock fine-tuning, SageMaker

Execution & Reliability

Capability	OSS	Azure	GCP	AWS
Durable execution / agent-as-graph	LangGraph, Temporal, Restate	Durable Functions, Foundry Agent Service	Vertex AI Agent Engine	AWS Step Functions, Bedrock AgentCore
Human-in-the-loop	LangGraph interrupts, HumanLayer	Durable Functions, Logic Apps	Workflows callbacks	Step Functions waitForTaskToken
Context engineering / long-term memory	mem0, Letta, Zep	Foundry Agent Service	Vertex AI Memory Bank	Bedrock AgentCore Memory
LLM gateway: retries, fallback, budgets	LiteLLM, Portkey, Helicone	Azure APIM GenAI gateway	Apigee AI gateway	API Gateway + Bedrock
Prompt / response caching	GPTCache, vLLM prefix cache	Azure OpenAI prompt caching	Vertex AI context caching	Bedrock prompt caching

Capability, Security, Observability

Capability	OSS	Azure	GCP	AWS
Observability (OTel spans, cost/latency)	Langfuse, Arize Phoenix	App Insights / Azure Monitor	Cloud Trace + Monitoring	CloudWatch + X-Ray
Vector DB (semantic memory)	pgvector, Qdrant, Weaviate	Azure AI Search, Cosmos DB	Vertex AI Vector Search	OpenSearch, Aurora pgvector
GraphRAG / knowledge graph	Microsoft GraphRAG, Neo4j	Azure AI Search + Cosmos DB	Spanner Graph, Neo4j-on-GCP	Amazon Neptune
Model-based guardrails	Llama Guard, Guardrails AI	Azure AI Content Safety	Model Armor, Sensitive Data Protection	Amazon Bedrock Guardrails
Sandboxed code execution	gVisor, Firecracker, E2B	Container Apps sessions	GKE Sandbox (gVisor)	Bedrock AgentCore Code Interpreter
Agent identity / secrets	SPIFFE/SPIRE, Vault, OPA	Entra ID, Key Vault	Workload Identity Federation	IAM roles, Secrets Manager
Computer / browser use	Playwright, browser-use	Playwright Workspaces	Gemini 2.5 Computer Use	Bedrock AgentCore Browser Tool
Multimodal / voice	Whisper (ASR)	Azure AI Speech	Speech-to-Text, Gemini	Amazon Transcribe / Polly
Multi-agent framework	LangGraph, CrewAI, Microsoft Agent Framework	Foundry Agent Service	Agent Platform + ADK + A2A	Bedrock multi-agent, Strands Agents

The landscape is still forming, which means it’s a good time to get your head around the fundamentals. The basic patterns — the agentic loop, tools, sandboxing, memory, safety hooks, evaluation — are all well-understood. What’s hard are the things you’d rarely hand-roll: model-based guardrails, GraphRAG, identity systems, multimodal capabilities. Those are infrastructure problems, and they’re best left to the cloud platforms.

The people building agent systems right now are the ones who understand which is which. The loop is yours to build. Everything else is a decision about where to buy and where to roll your own.

The patterns are clear. The tools are maturing. The hard part is knowing what to build yourself and what to hand off to someone who’s already solved it.