MCPs are everywhere right now. And honestly, they’ve earned it.
They are an extremely good interaction layer for software, especially for workflows that you already use LLMs for. They’re also pretty easy to add. If you already expose an API, adding an MCP server on top is usually straightforward.
But while designing the MCP server for Simple Observability, we came to a realization: MCP design feels a lot like prompt engineering all over again.
You find yourself spending a surprising amount of time tweaking descriptions, restructuring responses, and renaming tools, hoping to improve the model’s behavior through trial and error. And none of it is measurable.
This post is not about the protocol itself. It’s about what happened when we tried to build our MCP server, and why we increasingly felt like we were doing prompt engineering in disguise.
Here’s the story of how we designed the Simple Observability MCP server.
A paradigm shift
To start, you first have to make a pretty uncomfortable paradigm shift.
As developers, we already know how to design APIs. We know what a clean architecture looks like. REST has decades of conventions and accumulated intuition behind it. Most of the time, you can look at an API and quickly tell whether it feels easy to use or not.
As humans, we also know how to design UIs. All of us can tell pretty easily when an interface feels clunky, overloaded, or unintuitive. Tastes differ, of course, but a remarkably small number of people manage to design UIs that billions can use.
But MCP design is a totally different beast. You’re not designing for a developer, or even a human, but for an LLM. And that changes everything.
So how do you evaluate whether an MCP design is good?
In practice, there are only two approaches.
The first is benchmarks. You build evaluation datasets internally and measure how models behave against them. But this gets hard very quickly. In our case, it meant simulating real-world incidents, plus the realistic context (metrics, logs, etc.) the agent needs to actually exercise the tools meaningfully, and doing all of this with enough samples to reach statistical significance.
The second approach, which is easier, is to lean heavily on best practices from the major labs. Tool calling is a learned capability baked into model training. That means recommendations coming from the same orgs that train the models are a good hint about the data distribution the models were optimized for.
A lot of uncertainty
Another major difference is that the entire ecosystem is extremely recent.
Anthropic introduced MCP in late 2024. At the time of writing, the protocol is less than two years old. Adoption is growing quickly, but there’s still very little accumulated knowledge around what actually makes an MCP design good.
Ironically, LLMs themselves are really bad at it. They’re often mediocre at writing MCP-related code, inconsistent with the surrounding frameworks and SDKs, and surprisingly unreliable at reasoning about which MCP designs work better for models in practice.
And because model capabilities are improving so quickly, and SOTA leaderboard rankings change so often, there is an even deeper uncertainty hanging over all of this work.
Some would even argue that all of this will stop mattering in the future, and that models will just use the APIs as is.
Remember when people talked about “prompt engineering” being a new discipline? In a sense, a lot of MCP design today feels very similar to those early days of LLMs. People will discover patterns that (appear to) improve behavior, then a new model will be released and half of it will be irrelevant.
Visualizing what tools actually are
Before talking about design constraints, it’s useful to understand how an LLM “sees” an MCP server.
MCP has three main primitives: tools, resources, and prompts. Tools are by far the most important because they are the part specifically designed for the model itself.
After the initial handshake, the agent asks the MCP server for the list of available tools. The server responds with a structure that looks roughly like this:
{
  "jsonrpc": "2.0",
  "id": 2,
  "result": {
    "tools": [
      {
        "name": "list_open_incidents",
        "description": "\nLists open incidents from the platform.\n",
        "inputSchema": {
          "properties": {
            "limit": {
              "default": 10,
              "title": "Limit",
              "type": "integer"
            }
          },
          "title": "list_open_incidentsArguments",
          "type": "object"
        },
        "outputSchema": {
          "properties": {
            "result": {
              "title": "Result",
              "type": "string"
            }
          },
          "required": [
            "result"
          ],
          "title": "list_open_incidentsOutput",
          "type": "object"
        }
      },
      ...
But what does that actually mean at the model level? LLMs are fundamentally just token-in, token-out. So where do tools go?
They simply get injected into the prompt, as Anthropic’s documentation shows:
{{ FORMATTING INSTRUCTIONS }}
String and scalar parameters should be specified as is, while lists and objects
should use JSON format. Note that spaces for string values are not stripped.
The output is not expected to be valid XML and is parsed with regular expressions.
Here are the functions available in JSONSchema format:
{{ TOOL DEFINITIONS IN JSON SCHEMA }}
{{ USER SYSTEM PROMPT }}
{{ TOOL CONFIGURATION }}
The exact formatting differs between providers, but the important realization is the same: tool definitions become part of the context window. They consume tokens on every generation.
That changes how to think about MCP design entirely. Every tool name, description, parameter and example competes for context space alongside the actual user conversation and the tool results.
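To make that concrete, here is a rough way to estimate the footprint of a tools/list response like the one above. This is only a sketch: tiktoken’s cl100k_base encoding is a proxy, since Claude, Gemma, and other models each tokenize differently.

import json

import tiktoken  # pip install tiktoken; used only as a rough proxy tokenizer

def estimate_tool_overhead(tools_list_response: dict) -> int:
    """Roughly count the tokens that the serialized tool definitions
    add to the context window of every single generation."""
    serialized = json.dumps(tools_list_response["result"]["tools"])
    encoder = tiktoken.get_encoding("cl100k_base")
    return len(encoder.encode(serialized))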
So you want the smallest possible tool surface area. That’s much harder than it sounds, especially for products with large APIs. Cloudflare published an interesting write-up about reducing the context footprint of their MCP server for exactly this reason.
This constraint becomes even more important with local models, where context windows are smaller.
Because of that, we intentionally did most of our testing on a very small local model: Gemma 3 4B. The reasoning is simple. If an MCP works well on a small model with limited context and weaker reasoning, it will usually work even better on larger models. But more importantly, because we think local agents matter long term, we want MCP servers that can realistically run on models you can host on a “regular” dev laptop, not only frontier APIs with massive context windows.
Designing the tool set
So let’s actually start designing the MCP. We start from our users’ primary use case: investigating an open incident. The prompt might be as simple as: “I have an open incident, look into it”.
From there, the model needs to:
- Fetch the list of open incidents, or retrieve the specific incident
- Understand the entities attached to the incident
- Fetch additional context about those entities
- Understand what logs and metrics are even available to query
- Query logs and metrics to investigate the issue
In our DB model, an incident is mostly a wrapper around other entities. The actual useful context lives elsewhere: servers, alert rules, jobs, etc.
So the naive MCP design naturally becomes something like this:
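Roughly, that means one tool per backend endpoint, something like this (the names here are shorthand, not the literal API):

# Naive surface: one tool per backend endpoint (illustrative names).
NAIVE_TOOLS = [
    "list_open_incidents",
    "get_incident_details",
    "get_server_details",
    "get_alert_rule_details",
    "get_job_details",
    "list_available_metrics",
    "list_available_logs",
    "query_metrics",
    "query_logs",
]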
This is already fairly reasonable. We have fewer than 10 tools, and the implementation is straightforward because it mostly maps onto our existing backend API. What’s great about LLMs is that the model handles all the orchestration itself, so the logic behind each tool stays quite simple.
But can we make it even smaller?
For example, an incident already knows which entities are attached to it. So why expose multiple calls the model must chain together every time?
Instead of:
get_incident_details
get_server_details
get_alert_rule_details
we can collapse that into:
get_incident_context
Internally, that tool does the additional lookups and returns a fully hydrated view of the incident.
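As a minimal sketch of that hydration, using FastMCP from the official Python MCP SDK (the backend helpers and field names are placeholders, not the real implementation):

from mcp.server.fastmcp import FastMCP

# Hypothetical helpers wrapping the existing backend API.
from backend_client import fetch_incident, fetch_server, fetch_alert_rule

mcp = FastMCP("simple-observability")

@mcp.tool()
def get_incident_context(incident_id: str) -> dict:
    """Return an incident together with every entity attached to it
    (servers, alert rules, ...), so the model needs no follow-up lookups."""
    incident = fetch_incident(incident_id)
    return {
        "incident": incident,
        "servers": [fetch_server(sid) for sid in incident["server_ids"]],
        "alert_rules": [fetch_alert_rule(rid) for rid in incident["alert_rule_ids"]],
    }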
Same idea for observability sources.
In practice, the model almost always needs both the available metrics and the available logs before deciding what to query. So splitting them into separate tools mostly creates additional tool calls and more context overhead.
The resulting MCP became:
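In shorthand (the name of the combined sources tool is illustrative):

FINAL_TOOLS = [
    "list_open_incidents",
    "get_incident_context",       # incident plus all attached entities, hydrated
    "get_observability_sources",  # available metrics and logs in a single call
    "query_metrics",
    "query_logs",
]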
At this point, the entire observability workflow fits into five tools.
And this is where MCP design starts feeling strange compared to normal API design.
Traditionally, we try to maximize composability and separation of concerns. With MCPs you do the opposite. You compress workflows together because the real optimization target is not architectural purity, but reducing the cognitive and context burden on the model.
Designing the tools themselves
For each tool, the model essentially sees only three things:
- the name
- the description
- the arguments
The name should be extremely explicit about what the tool actually does. You also have to remember that your MCP server will probably not be the only one installed in the client. A user may have dozens or hundreds of tools available. So namespacing your tool names is also very important.
Descriptions matter just as much. Anthropic recommends: “when writing tool descriptions and specs, think of how you would describe your tool to a new hire on your team”. In practice, this pushes you toward surprisingly verbose descriptions.
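For illustration, a namespaced tool with a “new hire” style description might look like this (the wording is deliberately verbose and purely illustrative):

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("simple-observability")

@mcp.tool(name="simpleobservability_query_logs")
def query_logs(query: str, server_name: str | None = None, limit: int = 50) -> dict:
    """Search log lines collected by the Simple Observability agent.

    Call this only after checking which log sources actually exist.
    `query` is a plain keyword filter, not a regex. `server_name` narrows
    the search to a single server; omit it to search every server. Results
    are capped at `limit` lines, so refine the query rather than raising it.
    """
    ...  # implementation omitted in this sketch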
Arguments are where some of the most counter-intuitive choices come up. Take time ranges as an example. Our query_logs and query_metrics tools obviously need a time window. As developers, the “correct” API design might seem obvious: Unix timestamps. They’re standardized, compact, language-agnostic, timezone-safe, and universally supported. But LLMs are terrible at them.
They routinely hallucinate timestamps, confuse milliseconds with seconds, generate invalid ranges, or fail simple timestamp arithmetic entirely.
So despite Unix timestamps being arguably the cleaner engineering choice, the MCP-friendly design is often something much closer to natural language.
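One way to meet the model halfway is to accept those natural-language windows and resolve them to concrete timestamps on the server side. A sketch (the accepted phrases and the fallback format are arbitrary choices for illustration):

import re
from datetime import datetime, timedelta, timezone

_SECONDS = {"minute": 60, "hour": 3600, "day": 86400}

def resolve_time_range(time_range: str) -> tuple[datetime, datetime]:
    """Turn "last 30 minutes" or "last 2 days" into a (start, end) pair,
    so the model never has to produce Unix timestamps itself."""
    now = datetime.now(timezone.utc)
    match = re.fullmatch(r"last\s+(\d+)\s+(minute|hour|day)s?", time_range.strip().lower())
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        return now - timedelta(seconds=amount * _SECONDS[unit]), now
    # Fallback: an explicit ISO 8601 range such as
    # "2025-01-01T00:00:00+00:00/2025-01-02T00:00:00+00:00".
    start_raw, _, end_raw = time_range.partition("/")
    return datetime.fromisoformat(start_raw), (datetime.fromisoformat(end_raw) if end_raw else now)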
Looking ahead
You can now see why MCP design felt like prompt engineering all over again.
It feels difficult to design for a non-deterministic, probabilistic system, and to try to shape the desired behavior with words instead of logic.
Maybe MCP will evolve. Maybe it gets replaced. Maybe models eventually make most of this abstraction unnecessary.
But the underlying problem does not go away: once you delegate actions to a model (and you will), you stop designing APIs and start designing the conditions under which the desired behavior emerges reliably.
And that is a very different kind of engineering problem.
Simple Observability is a platform that provides full visibility into your servers. It collects logs and metrics through a lightweight agent, supports job and cron monitoring, and exposes everything through a single web interface with centrally managed configuration.
To get started, visit simpleobservability.com.
The agent is open source and available on GitHub.