Introducing Headcode: A Unified API for UK Rail Data

UK rail data is brilliantly open and freely available. It is also completely fragmented across dozens of discrete, historically siloed sources. If you want to build an application that tracks trains, you have to piece together obscure feeds from the Rail Data Marketplace, Network Rail Open Data, NaPTAN, and the Office of Rail and Road.

Much of this fragmentation goes back to the legacy of British Rail and the complexities of privatisation. You will encounter Darwin, the real-time passenger information engine operated by the Rail Delivery Group, which feeds the physical departure screens on the platforms. You will stumble into TRUST and TOPS, the legacy Network Rail systems that monitor real-time train progression against schedules and track individual pieces of rolling stock. You will see raw signals from Train Describer (TD) feeds, which follow each train berth by berth as trackside track circuits and axle counters detect it.

Integrating directly with official rail data means wrangling huge CSV datasets, legacy XML endpoints, and Kafka streams, while constantly translating between their data structures. There are open-source proxies like Huxley that wrap Darwin, and excellent mature tools built around this data, like Realtime Trains for power users or OpenTrainTimes for live signalling maps. But if you are a typical developer who just wants a clean JSON payload of what trains are leaving a station right now, the barrier to entry is absurdly high.

I am building Headcode to solve this. The premise is straightforward: one integration instead of twelve, with intelligence baked in. Headcode consumes these scattered raw feeds, performs the complex reconciliation, normalisation, and enrichment, and exposes the result through clean REST endpoints. Downstream developers get the data they need without maintaining stream ingestion pipelines.

The name comes from the terminology used by the UK rail industry itself. A “headcode” is the Train Reporting Number assigned to a service to identify it to signallers and routing systems, for example 1A65. The 1 indicates the class of train, such as an express passenger service. The A specifies the destination area. The 65 is the individual identifying number for that specific service. It is a compact, precise identifier, which is exactly what a good API should provide.
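To make the structure concrete, here is a minimal Go sketch that splits a headcode into its three parts. It assumes the strict digit, letter, two-digit shape described above; the `Headcode` type and `ParseHeadcode` function are names invented for this example and are not part of the Headcode API.

```go
package main

import (
	"errors"
	"fmt"
)

// Headcode is a hypothetical type for illustration only.
type Headcode struct {
	Class  int    // first character: class of train (the 1 in 1A65)
	Area   byte   // second character: destination area letter (the A in 1A65)
	Number string // last two characters: individual service number (the 65 in 1A65)
}

// ParseHeadcode accepts only the shape digit, uppercase letter, digit, digit.
func ParseHeadcode(s string) (Headcode, error) {
	if len(s) != 4 {
		return Headcode{}, errors.New("headcode must be exactly four characters")
	}
	if s[0] < '0' || s[0] > '9' {
		return Headcode{}, errors.New("first character must be a digit")
	}
	if s[1] < 'A' || s[1] > 'Z' {
		return Headcode{}, errors.New("second character must be an uppercase letter")
	}
	for i := 2; i < 4; i++ {
		if s[i] < '0' || s[i] > '9' {
			return Headcode{}, errors.New("last two characters must be digits")
		}
	}
	return Headcode{Class: int(s[0] - '0'), Area: s[1], Number: s[2:]}, nil
}

func main() {
	hc, err := ParseHeadcode("1A65")
	if err != nil {
		fmt.Println("invalid headcode:", err)
		return
	}
	fmt.Printf("class=%d area=%c number=%s\n", hc.Class, hc.Area, hc.Number)
}
```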

The API experience

I wanted the developer experience to feel like a modern platform, not a raw scrape of a government database. Every API request should return insights or connected data that you cannot get from any single raw feed. The service needs to add value in the middle layer.

One of the biggest points of friction in UK rail data is the identifier systems. The industry uses several overlapping codes for the exact same physical location:

  • CRS codes: the three-letter customer-facing codes you see on passenger tickets, like PAD for London Paddington.
  • TIPLOCs: codes of up to seven characters used for routing and scheduling, like PADTON.
  • STANOX codes: a five-digit internal numbering system used by legacy signalling systems.

If you query a raw Darwin live feed, it will identify a station using a TIPLOC. If you query a retail ticketing feed, you will get a CRS. Headcode treats these identifiers natively and forgivingly. You can query the API with any of these codes. The service understands the cross-references and translates them on the fly, returning all known identifiers in the response payload. You do not have to maintain your own mapping tables synced with CORPUS or constantly query reference databases to translate what the API gives you. You ask for Paddington using PAD, and the API knows exactly what you mean.
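The cross-referencing idea can be sketched as a small in-memory index. This is an illustration only, not Headcode's implementation: the `Location` and `Index` types and the `Resolve` method are invented names, and the STANOX value shown for Paddington should be checked against CORPUS before being relied on. The sketch keeps one map per code system and tries them in a fixed order, since a short TIPLOC could in principle look like a CRS code.

```go
package main

import (
	"fmt"
	"strings"
)

// Location holds the overlapping identifiers for one physical place.
type Location struct {
	Name   string
	CRS    string
	TIPLOC string
	STANOX string
}

// Index is a hypothetical cross-reference with one map per code system.
type Index struct {
	crs, tiploc, stanox map[string]Location
}

// NewIndex builds the lookup maps. Codes are assumed to be stored in uppercase.
func NewIndex(locs []Location) *Index {
	idx := &Index{
		crs:    make(map[string]Location),
		tiploc: make(map[string]Location),
		stanox: make(map[string]Location),
	}
	for _, l := range locs {
		if l.CRS != "" {
			idx.crs[l.CRS] = l
		}
		if l.TIPLOC != "" {
			idx.tiploc[l.TIPLOC] = l
		}
		if l.STANOX != "" {
			idx.stanox[l.STANOX] = l
		}
	}
	return idx
}

// Resolve accepts any of the three code types, in any letter case.
// It tries CRS first, then TIPLOC, then STANOX.
func (idx *Index) Resolve(code string) (Location, bool) {
	key := strings.ToUpper(strings.TrimSpace(code))
	for _, m := range []map[string]Location{idx.crs, idx.tiploc, idx.stanox} {
		if loc, ok := m[key]; ok {
			return loc, true
		}
	}
	return Location{}, false
}

func main() {
	idx := NewIndex([]Location{
		// STANOX is shown for illustration; verify against CORPUS.
		{Name: "London Paddington", CRS: "PAD", TIPLOC: "PADTON", STANOX: "73000"},
	})
	for _, q := range []string{"pad", "PADTON", "73000"} {
		if loc, ok := idx.Resolve(q); ok {
			fmt.Printf("%s -> %s (CRS %s, TIPLOC %s, STANOX %s)\n",
				q, loc.Name, loc.CRS, loc.TIPLOC, loc.STANOX)
		}
	}
}
```

Whichever code the caller supplies, the full `Location` comes back, which mirrors the idea of returning all known identifiers in one response.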

The data returned is strictly unified. Headcode seamlessly blends real-time live data with static and historic datasets. When you pull a list of delays or the train order for a specific platform, the response is enriched with static accessibility information, station operator details, and physical coordinates sourced from NaPTAN. You get the whole picture in a single structured response, rather than having to stitch together four separate HTTP calls.

What you can build with it

Right now, Headcode handles the heavy lifting of continuous ingestion. It takes in massive static reference dumps alongside live Darwin Kafka streams, translating thousands of real-time operational messages a second into structured relational data.

You can hit the API for live station departure and arrival boards directly derived from Darwin. You can pull comprehensive station metadata with fuzzy searching. If a user types “Paddingon” instead of “Paddington”, the API will still resolve the correct station and return the full metadata tree without throwing a 404.

There are specific endpoints for detailed service descriptions and network disruptions sourced from operational train alerts or station messages. You can query platform train-order endpoints to see exactly which trains are queued to arrive at a specific platform and in what sequence. Calculating platform order is surprisingly hard when you are looking at baseline timetables because trains run out of order all the time. Headcode does the active calculation based on real-time track progression and provides the corrected sequence.

Every request is authenticated with an API key sent as a Bearer token, validated by a custom auth layer.

To provide a concrete example of what you can build on top of this platform, I am currently developing a Bluesky bot. It will act as an active webhook consumer of the Headcode API (webhooks are a planned feature coming soon to the platform). Instead of aggressively polling endpoints to check if a train is late, it will listen for events and automatically post live train information, severe network disruptions, and active delays for specific high-traffic routes. The bot serves as a demo of the platform in the wild, proving that you can build reactive social tools without needing to understand Kafka messaging protocols or trackside axle counter logic.

Under the hood

The backend is written entirely in Go. The service is built as a single monolithic binary, and I run multiple instances of it, each responsible for a subset of the system. It has to cope with a constant firehose of Kafka messages alongside concurrent HTTP requests. Go’s native concurrency model, built on goroutines and channels, handles stream ingestion without complex worker pool orchestration. The compiled binary gives me predictable memory usage and low-latency API responses.
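A rough idea of what that looks like in practice: one goroutine per feed, all merging onto a single channel that a consumer ranges over. This is a generic fan-in sketch, not Headcode's ingestion code. In-memory slices stand in for Kafka consumers, and the `Message` type, the `Ingest` function, and the feed names are invented for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Message is a hypothetical envelope for one item read from a feed.
type Message struct {
	Feed    string
	Payload string
}

// Ingest starts one goroutine per feed and merges their output onto one channel.
// The returned channel is closed once every feed has been drained.
func Ingest(feeds map[string][]string) <-chan Message {
	out := make(chan Message)
	var wg sync.WaitGroup
	for name, payloads := range feeds {
		wg.Add(1)
		go func(name string, payloads []string) {
			defer wg.Done()
			for _, p := range payloads {
				out <- Message{Feed: name, Payload: p}
			}
		}(name, payloads)
	}
	// Close the merged channel only after all producers have finished.
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	// Slices stand in for live streams; a real consumer would block on the broker.
	feeds := map[string][]string{
		"darwin": {"msg-1", "msg-2"},
		"td":     {"msg-3"},
	}
	count := 0
	for m := range Ingest(feeds) {
		count++
		fmt.Printf("%s: %s\n", m.Feed, m.Payload)
	}
	fmt.Println("processed", count)
}
```

Messages from different feeds may interleave in any order, so anything that depends on ordering has to be handled downstream.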

The core product is served as REST JSON. I am using OpenAPI code generation to maintain strict contracts between the server responses and the generated documentation. I use ConnectRPC over HTTP/2 specifically for health checks and internal service state. ConnectRPC provides a lightweight, strictly typed protocol that is perfect for internal monitoring without pulling in the bloat of standard gRPC dependencies.

PostgreSQL serves as the backbone of the entire system, handling millions of messages per day. I use pgx for the driver and sqlc for database interaction. What makes sqlc brilliant is that it compiles plain SQL queries directly into type-safe Go code. It removes the need for unpredictable ORMs and keeps exact database access patterns explicit during code reviews.

I rely heavily on Postgres extensions to push the heavy lifting down to the database layer. PostGIS powers all the spatial queries. When an endpoint needs to find all stations within a specific physical radius of a GPS coordinate, PostGIS handles the geospatial maths natively and incredibly fast. For text searching, pg_trgm provides trigram matching. This is what enables the fuzzy searching on station names. Standing up and managing a separate Elasticsearch cluster just to handle user typos is unnecessary architectural overhead when Postgres handles it directly and efficiently.
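To show why trigram matching tolerates typos, here is a small Go approximation of the similarity score that pg_trgm computes: lowercase the text, pad each word with two leading spaces and one trailing space, collect the distinct three-character windows, and divide the shared trigrams by the total distinct trigrams. This is a stand-in written for illustration, not the extension's code, and it only splits words on whitespace, whereas pg_trgm also treats other non-alphanumeric characters as separators.

```go
package main

import (
	"fmt"
	"strings"
)

// trigrams returns the set of distinct trigrams for s, padding each word
// with two leading spaces and one trailing space.
func trigrams(s string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, word := range strings.Fields(strings.ToLower(s)) {
		padded := []rune("  " + word + " ")
		for i := 0; i+3 <= len(padded); i++ {
			set[string(padded[i:i+3])] = struct{}{}
		}
	}
	return set
}

// Similarity is shared trigrams divided by total distinct trigrams,
// giving a score between 0 (nothing in common) and 1 (identical sets).
func Similarity(a, b string) float64 {
	ta, tb := trigrams(a), trigrams(b)
	shared := 0
	for t := range ta {
		if _, ok := tb[t]; ok {
			shared++
		}
	}
	union := len(ta) + len(tb) - shared
	if union == 0 {
		return 0
	}
	return float64(shared) / float64(union)
}

func main() {
	fmt.Printf("%.3f\n", Similarity("Paddingon", "Paddington"))
	fmt.Printf("%.3f\n", Similarity("Paddington", "Euston"))
}
```

The misspelling shares 8 of the 13 distinct trigrams with the correct name, a score of roughly 0.615, which is comfortably above pg_trgm's default similarity threshold of 0.3.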

The infrastructure is intentionally simple and vertically scaled. The stack runs on a Rocky Linux VM hosted on Proxmox. I chose Rocky Linux for its strict bug-for-bug compatibility with RHEL and sheer stability as a hypervisor guest. Everything is explicitly provisioned with Terraform. Caddy acts as the web server, and the service is exposed to the internet through Cloudflare Tunnel.

It is a robust, low-maintenance setup ideal for the current scale of Headcode. I will likely migrate the workloads to something else to scale out the system in the future. But spinning up a monolithic VM gets the product built, shipped, and tested with real user traffic without fighting abstraction layers.

Taking it for a spin

Because Headcode abstracts away the domain complexities of the UK rail network, developers can spend their time actually building transit applications instead of debugging XML parsers and cross-referencing obscure alphanumeric identifiers.

While Headcode is currently in closed beta, the documentation is completely open. You can read the comprehensive API documentation, explore the response schemas, and see how the endpoints are structured right now at docs.headcode.dev. I will be posting further updates on the Bluesky bot, API access availability, and new Headcode features over the coming weeks as the system expands.