building a future-proof data pipeline with golang and kafka

Introduction #

The device data processing system is an essential component for business operations. It was, however, becoming unfit for purpose as new systems and teams needed access to the data. The previous system had various limitations, was limited to one cloud platform and was restrictive in how consumers could access the data. We decided to incorporate a new architecture by developing a new system, which presented a significant challenge for stakeholders. They needed quick and efficient access to crucial data but didn’t want to introduce what were considered legacy components into new systems.

We decided to rebuild the data processing pipeline. Device data arrives in high frequencies every second from many devices. The significant challenges of handling and processing data swiftly led us to choose Golang and Kafka. Golang and Kafka’s ability to manage high data throughput efficiently made it ideal for processing and publishing data.

Each message contains device readings. These readings can include frequency, state of charge, voltage, and many more. The readings are either tracked on a sub-second or second-level interval. The reading interval depends on the device configuration. The configuration can vary due to various factors. These include device types, sites, and other options. The aim was to build a new system that is robust and flexible. The system had to be able to handle the inconsistent data structures and high volume of data.

We chose Kafka as the new system’s core messaging platform, using it to streamline data flow between various cloud providers and platforms. This greatly improved our data access and utilisation within the organisation.

We’ve discussed our project’s challenges and goals. We will now delve into the technical solution.

Technical Deep-Dive #

At the heart of our system is a single Golang service. This service pulls messages from RabbitMQ. Then, it splits them into smaller messages.

In our first approach, we tried to define all the fields into a Golang struct. We then used Golang’s standard JSON encoding library to handle the decoding. However, we encountered several challenges due to the nature of the messages. Messages often varied with device configurations, leading to issues such as missing fields and data type inconsistencies, which ultimately caused decoding problems. These hurdles underscored the need for a new approach. Acknowledging these challenges, we decided to shift our message decoding strategy and refactor our message handling.

In the refactor, we adopted the tidwall/gjson package for message handling, enabling us to generically handle retrieval of fields from incoming messages. This approach ensured our system remained versatile, capable of adapting to varied data structures with ease. Utilising the gjson package for message decoding allowed us to eliminate the reliance on Golang structs and hard-coded fields, thereby significantly enhancing both performance and efficiency in decoding messages.

A standout feature of our revised system is its exceptional flexibility in field handling, eliminating the need for predefined fields. This design allows our service to seamlessly accommodate new fields as they emerge, without requiring updates or deployments to implement these changes.

With the technical framework in place, our attention turned to the crucial aspect of performance.

Performance Improvements #

Through custom observability metrics, we meticulously tracked the performance of our message handling logic, ensuring the system operated as anticipated. These metrics offered real-time data and insights, enabling us to pinpoint areas for improvement. Furthermore, they played a crucial role in measuring the impact of our ongoing changes and guiding our optimisation efforts.

Initially, our system relied on Golang structs and custom unmarshalling logic, leading to performance constraints. The variability in data from different devices and their settings introduced complexities that necessitated expensive type-checking logic within the code’s critical paths, slowing down message processing times. As a result, our throughput was capped at approximately 2 Megabits per second (Mbps). Although the performance was acceptable, each new variation in data required alterations or additional custom unmarshalling logic, rendering this approach unsustainable in the long run.

Recognising the need for improving performance and maintainability, we revisited our strategy. Using the metrics from our monitoring platform we were able to do the following:

Refactor RabbitMQ message retrieval
Identify and fixed bottlenecks with message parsing and handling
Develop an improved strategy for monitoring and benchmarking performance going forward

These improvements significantly boosted the service’s overall performance. Thanks to our continuous monitoring, we we’re able to make informed adjustments, leading to a notable increase in throughput to approximately 4 Mbps. Effectively, these changes doubled our previously recorded throughput and message handling capacity.

Addressing the next major challenge required us to investigate alternatives to using hard-coded Golang structs and custom unmarshalling logic. We continued to encounter new message types and discovered missing data fields, the need for a more adaptable approach became evident. Constant modifications to our message handling logic proved to be unsustainable. Thus, we shifted towards a generic solution for message processing, reducing our reliance on intensive type-checking. This transition not only streamlined message processing and simplified maintenance but also significantly decreased the time required to handle each message. The impact of this change was profound, nearly doubling our throughput once more to approximately 8/9 Mbps and enabling our service to process an average of 120,000 messages per second.

With these improvements in performance, we are now ready to explore future enhancements. The next phase of our project focuses on expanding our capabilities and exploring new features. We will also examine how we expose those to stakeholders in the future.

Future Scope #

Looking ahead, our journey with this data processing system is far from over. Successfully integrating data into Kafka is merely the initial step of an ongoing saga. The forthcoming crucial phase will involve the transformation of high-frequency data into aggregated streams. These streams will be meticulously segmented into specific time windows, meticulously designed to cater to the unique requirements of diverse teams across the organisation. By doing so, we aim to provide tailored data streams that enable departments to consume and leverage data more effectively, enhancing decision-making processes and operational efficiency. This strategic approach will not only streamline data accessibility but also ensure that each team has the insights needed to drive their initiatives forward.

A key focus will be on aggregating the data into specific time windows. These include 50 milliseconds, 1 second, and 30-minute increments. These varied time windows will help stakeholders have a comprehensive view of the data. They cater to both real-time monitoring and longer-term analytical needs. This level of detail will allow for more nuanced insights. It will also improve decision-making processes across departments.

Furthermore, we will focus on enhancing our benchmarking and test coverage. Our goal is to establish a robust framework. It can detect any regressions in performance proactively. We aim to maintain the high standards of efficiency and reliability that we’ve set.

Lastly, identifying our current system’s limitations is imperative. We intend to use the improved monitoring to analysis and pinpoint potential bottlenecks and scalability challenges. With a clear understanding of these limitations, we’ll devise mitigation strategies. These strategies may include adopting new technologies, revising our system architecture, or refining our data processing.

Conclusion #

Our project has enhanced data processing capabilities by harnessing the power of Golang and Kafka, effectively addressing the complexities associated with high-frequency data. Through incremental improvements, we’ve managed to increase throughput from 2/3 Mbps to 8/9 Mbps. Moving forward, our focus remains on further optimising data streams and scrutinising the scalability of our system. This effort has not only significantly improved our processing speed but also set a strong base for better handling of high-frequency data.