Ben Newton - Commerce Frontend Specialist

My MCP Went Down Today. I Don't Want To Care About Infrastructure.

But the infrastructure doesn't care what I want.

This afternoon, Google Cloud blocked Railway's account.

Their dashboard went dark. Their API went dark. Every service hosted on Railway, including the BlackOps MCP my customers depend on, went dark with it.

One vendor's bad day. Hours of my customers locked out of the tool they pay me for.

Here's the part I want to be honest about: I don't want to think about distributed architecture. I don't want to think about vendor redundancy. I don't want to think about failover topology. I want to build the product, ship features, and have the platform underneath just work. That's why I picked a managed host in the first place. The whole point was to stop thinking about infrastructure.

But the infrastructure doesn't care what I want.

When I migrated the BlackOps MCP off npx and onto a hosted server, I picked Railway over Vercel because Railway runs persistent Node processes, while Vercel functions are serverless and subject to execution limits. My MCP server streamed over SSE, and serverless functions and long-lived streams don't usually play nice. That was the right call for the architecture I had then.

It became the wrong call the moment I bothered to check my assumptions.

When the outage hit, I went in expecting the refactor to be the wall. It wasn't. SSE had never actually been the blocker. Statefulness was. The MCP transport was holding session state in process memory that it didn't need to hold. Once I ripped that out and made the transport stateless, Vercel functions were perfectly capable of serving it.
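To make the difference concrete, here is a minimal sketch of the failure mode, not the actual BlackOps transport. Every name in it (`Call`, `Reply`, `makeStatefulHandler`, `statelessHandler`, the `authToken` field) is hypothetical. It assumes that each serverless invocation may land on a fresh instance with empty memory, which is what the two separate handler instances stand in for.

```typescript
// Hypothetical shapes for illustration; not the real MCP SDK types.
type Call = { method: string; sessionId?: string; authToken?: string };
type Reply = { ok: boolean; detail: string };

// Stateful version: each instance remembers sessions in its own memory.
function makeStatefulHandler(): (call: Call) => Reply {
  const sessions: Record<string, true> = {};
  return (call: Call): Reply => {
    if (call.method === "initialize" && call.sessionId) {
      sessions[call.sessionId] = true;
      return { ok: true, detail: "session opened" };
    }
    if (!call.sessionId || !sessions[call.sessionId]) {
      return { ok: false, detail: "unknown session" };
    }
    return { ok: true, detail: `ran ${call.method}` };
  };
}

// Stateless version: everything needed arrives with the request itself.
function statelessHandler(call: Call): Reply {
  if (!call.authToken) {
    return { ok: false, detail: "missing token" };
  }
  return { ok: true, detail: `ran ${call.method}` };
}

// Two instances simulate two serverless invocations that share no memory.
const instanceA = makeStatefulHandler();
const instanceB = makeStatefulHandler();

instanceA({ method: "initialize", sessionId: "abc" });
const sameInstance = instanceA({ method: "tools/list", sessionId: "abc" });
const otherInstance = instanceB({ method: "tools/list", sessionId: "abc" });
const statelessReply = statelessHandler({ method: "tools/list", authToken: "token" });

console.log(sameInstance.ok, otherInstance.ok, statelessReply.ok);
```

On a persistent process the stateful handler works because every request hits the same memory. Once requests can land anywhere, only the stateless handler gives the same answer every time.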

So I shipped it. New project on Vercel. Both URLs serve the same deployment: mcp.blackopscenter.com as the canonical, and mcp3.blackopscenter.com aliased for backwards compatibility so existing customer connectors keep working. The install endpoint at blackopscenter.com/api/install now hands out the new URL. End-to-end validated through a real Claude.ai connector before cutover. PR #219 squash-merged to main. Zero customer impact from the cutover itself.
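The hostname arrangement can be sketched roughly as follows. The two hostnames come from the setup described above; the function names and the `serverUrl` response shape are assumptions for illustration, and in practice the aliasing would more likely live in the host's domain configuration than in application code.

```typescript
// Hostnames from the post; everything else here is a hypothetical sketch.
const CANONICAL_HOST = "mcp.blackopscenter.com";
const LEGACY_HOSTS: string[] = ["mcp3.blackopscenter.com"];

// True if a request's Host header points at this deployment,
// whether through the canonical name or a backwards-compatible alias.
function isServedHost(host: string): boolean {
  const normalized = host.toLowerCase();
  return normalized === CANONICAL_HOST || LEGACY_HOSTS.some((h) => h === normalized);
}

// New installs only ever receive the canonical URL,
// so the legacy alias can be retired once old connectors age out.
function installPayload(): { serverUrl: string } {
  return { serverUrl: `https://${CANONICAL_HOST}` };
}

console.log(isServedHost("mcp3.blackopscenter.com"), installPayload().serverUrl);
```

Keeping the alias list separate from the canonical name means existing connectors keep resolving while nothing new is ever pointed at the old URL.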

The whole arc, from "Railway is down" to "migration complete and tested in production," took less time than the outage did.

A few things I'm taking from this:

  1. The cheapest moment to fix vendor risk is before the vendor breaks. I knew the dependency. I just hadn't acted on it.
  2. Check the assumption that locked you in. I'd written off Vercel six months ago because of execution limits and SSE. The actual blocker was a stateful transport I could have rewritten any afternoon. The "we can't move because of X" answer is worth re-litigating every quarter.
  3. Stateless code is portable code. The faster you can lift and shift, the less your vendor choice matters.
  4. Public dependencies become public liabilities. If your customers can see your MCP go dark, they're watching how you respond, not just what you build.
  5. Resilience is a feature, not a checkbox. It shows up in how fast you can move when the floor falls out, not in the diagram on your architecture doc.
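Point 3 is easier to see in code. A rough sketch, assuming the request logic can be written as a pure function: the core knows nothing about the host, and each platform gets a thin adapter. All of the types and function names below are hypothetical and do not match any specific platform's API.

```typescript
// Host-agnostic core: plain data in, plain data out, no platform objects.
type CoreRequest = { method: string; body: string };
type CoreResponse = { status: number; body: string };

function handleCore(req: CoreRequest): CoreResponse {
  if (req.method !== "POST") {
    return { status: 405, body: "method not allowed" };
  }
  return { status: 200, body: `echo:${req.body}` };
}

// Adapter A: long-lived server style, which writes to a response object.
type ProcessRes = { statusCode: number; end(body: string): void };

function processAdapter(method: string, body: string, res: ProcessRes): void {
  const out = handleCore({ method, body });
  res.statusCode = out.status;
  res.end(out.body);
}

// Adapter B: function style, which returns the response as a value.
function functionAdapter(event: { httpMethod: string; payload: string }): CoreResponse {
  return handleCore({ method: event.httpMethod, body: event.payload });
}

// Drive both adapters with the same input.
let capturedBody = "";
const fakeRes: ProcessRes = {
  statusCode: 0,
  end(body: string): void {
    capturedBody = body;
  },
};
processAdapter("POST", "ping", fakeRes);
const fromFunction = functionAdapter({ httpMethod: "POST", payload: "ping" });

console.log(fakeRes.statusCode, capturedBody, fromFunction.status, fromFunction.body);
```

Moving hosts then means writing one more adapter of a few lines, not rewriting the core.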

I'm not anti-Railway. They're well-run, they were transparent through the incident, and the issue wasn't theirs. But the lesson isn't about Railway. It's about the gap between wanting to ignore infrastructure and being able to ignore it. Those are different things.

You earn the right to stop thinking about your stack by making it portable enough that you can stop thinking about your vendor.

I closed that gap today. Back to building.

I wrote this post inside BlackOps, my content operating system for thinking, drafting, and refining ideas, with AI assistance.
