bugs/railway-startcommand-stdio.md

Railway: chained migrate && server blocked port detection

Symptom

API deploys built fine and reported "Migrations complete." in logs, then the deploy was marked FAILED. Public domain returned:

{"status":"error","code":404,"message":"Application not found","request_id":"..."}

No healthcheckFailedAt, no exitCode. The container just vanished after migrations ran.

Root cause

The Dockerfile's CMD was:

CMD ["sh", "-c", "node dist/db/migrate.js && node dist/server.js"]

Two issues compounded:

Railway's stale UI override: the service was initially auto-detected, and Railway had cached startCommand: "pnpm --filter @hive/api start" from that detection. The runtime image doesn't have pnpm, so that command failed before the server could start. Pinning startCommand in railway.json overrode the UI setting.
Port detection lost the process: once the pinned startCommand ran (node dist/db/migrate.js && node dist/server.js), Railway's port detection sniffed the migrate process, saw no listening socket, and marked the deploy unhealthy. The shell started the server only after migrate exited, but Railway didn't re-detect the new listening port. Result: the server was running fine inside the container, but Railway's edge couldn't find it.

Fix

Move migrations to preDeployCommand — Railway runs this in a separate one-shot container BEFORE the main service starts, then spins up the main container with just the server:

{
  "deploy": {
    "preDeployCommand": "node dist/db/migrate.js",
    "startCommand": "node dist/server.js",
    "healthcheckPath": "/health",
    "healthcheckTimeout": 30,
    "restartPolicyType": "ON_FAILURE",
    "restartPolicyMaxRetries": 5
  }
}

If migrations fail, the deploy fails before any traffic is routed. If they succeed, the main container has a clean lifecycle — Railway detects port 4000 immediately.

Commits: ec1502c (pin startCommand) + 2aa35ea (preDeployCommand split).

What made it hard to spot

Migrations succeeded. The last log line was "Migrations complete." — looks like a healthy boot.
No exit code, no healthcheck failure. Railway's status reported FAILED with deploymentStopped: true but exitCode: null and healthcheckFailedAt: null. The container died for a "platform-level" reason that wasn't surfaced.
The server worked when I docker-ran the image locally with sh -c "node dist/server.js". Bypassing migrations isolated the problem to the chained command, but the only symptom of the chained command was "no logs after Migrations complete" rather than an exception.
Railway has both UI overrides AND railway.json, and the precedence isn't always intuitive. Comparing meta.fileServiceManifest.deploy vs meta.serviceManifest.deploy in railway status --json shows what the config file says vs. what's actually being applied.
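
The manifest comparison above can be scripted once the two deploy objects have been pulled out of the railway status --json output. The following is a minimal sketch, not Railway tooling: diffDeployManifests and its types are hypothetical names, and it assumes both objects are flat maps of settings like the deploy block in railway.json.

```typescript
// Hypothetical helper: report every deploy setting whose value in the config
// file differs from the value Railway says it is applying.
type DeployManifest = Record<string, unknown>;

interface ManifestDiff {
  key: string;
  fromFile: unknown; // value taken from meta.fileServiceManifest.deploy
  applied: unknown;  // value taken from meta.serviceManifest.deploy
}

function diffDeployManifests(
  fromFile: DeployManifest,
  applied: DeployManifest,
): ManifestDiff[] {
  // Union of keys from both sides, de-duplicated and sorted for stable output.
  const keys = Object.keys(fromFile)
    .concat(Object.keys(applied))
    .filter((key, i, all) => all.indexOf(key) === i)
    .sort();

  const diffs: ManifestDiff[] = [];
  for (const key of keys) {
    // Comparing the JSON encodings is enough for flat scalar settings;
    // a key missing on one side also counts as a difference.
    if (JSON.stringify(fromFile[key]) !== JSON.stringify(applied[key])) {
      diffs.push({ key, fromFile: fromFile[key], applied: applied[key] });
    }
  }
  return diffs;
}

// Usage with values from this incident: the file pins one start command
// while the stale UI override still applies another.
const diffs = diffDeployManifests(
  { startCommand: "node dist/server.js", healthcheckPath: "/health" },
  { startCommand: "pnpm --filter @hive/api start", healthcheckPath: "/health" },
);
console.log(diffs);
```

With those inputs the only reported key is startCommand, which is the mismatch that hid behind the first failure.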

Lesson: Railway expects one process per container. If the entrypoint forks or chains, port detection can lose the listening process. The canonical pattern is preDeployCommand for one-shot setup work + startCommand for the long-running server.
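
One way to keep this from regressing is a small check over the deploy block of railway.json that flags a startCommand containing shell chaining operators. This is only a sketch under stated assumptions: chainedStartCommandWarning and DeployConfig are hypothetical names, and the operator match is naive (it does not understand shell quoting, so an operator inside a quoted argument is also flagged).

```typescript
// Hypothetical lint check, not part of Railway's tooling.
interface DeployConfig {
  startCommand?: string;
  preDeployCommand?: string;
}

// Returns a warning when startCommand would run more than one process,
// or null when it looks like a single long-running command.
function chainedStartCommandWarning(deploy: DeployConfig): string | null {
  const command = deploy.startCommand;
  if (command === undefined) return null;

  // Two-character operators are listed first so "&&" is not read as two "&".
  const operators = command.match(/&&|\|\||[;|&]/g) || [];
  if (operators.length === 0) return null;

  return (
    `startCommand chains processes with ${operators.join(", ")}; ` +
    "consider moving one-shot work to preDeployCommand"
  );
}

// The chained command from this incident is flagged; the split config passes.
console.log(
  chainedStartCommandWarning({
    startCommand: "node dist/db/migrate.js && node dist/server.js",
  }),
);
console.log(
  chainedStartCommandWarning({
    preDeployCommand: "node dist/db/migrate.js",
    startCommand: "node dist/server.js",
  }),
);
```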