Picture a beach.
Now: imagine every grain of sand on that beach knitting itself together with the others in real time. Concentrate on each of the billions upon billions of minuscule particles, each one unique. Focus on their route from the ground to the greater whole, stitching together something that would be impossible without every other grain playing its part.
You can’t. Of course you can’t! It’s absurd to try and visualise so many tiny components at such a scale. You can see the end product, and you get the gist of where all that movement and precision is going. But picking out every grain is an impossible task.
The same is true of data. Given how important it is to modern life, and the extent to which it surrounds us, data is incredibly difficult to visualise.
A 1MB file, tiny by modern standards, has over a million bytes of data within it; a 1GB file has over a billion. That incredible, impossible image on the beach is actually what’s happening in your laptop, and the final form is yet another presentation from marketing that you don’t want to read! It’s the spectacular and the mundane of modern life, all at once.
Now consider how people interact with data.
You click ‘download’ on a file, and the file appears. It’s a simple process: action and reaction. We know it travels from device to device, but it doesn’t really exist in a tangible way until the result presents itself to us as a file – or a personalised experience, a numerical pattern, or whatever else it might be.
That’s not much of a problem at a personal level. Most consumers don’t need to understand every aspect of the tech they’re using – just the final result.
Unfortunately, businesses also tend to be made up of human beings. And while consumers might not need to know what their data is doing at any particular moment, companies most certainly do.
There’s a need here to move away from tradition and begin thinking about data not as a static resource, but as something in constant flow. The businesses that can direct that flow to their key decision-making areas can make more accurate, better-contextualised decisions, faster.
So: how do we do that?
Breaking the data dam
Convention dictates that many businesses, from startups to corporate giants, will capture huge volumes of data and only then begin trying to derive any sort of real-world value from it. Our individual limitations as consumers, awaiting the result so we can engage with it, simply scale up.
This means waiting around for data to be in a condition and a structure that allow you to analyse, understand, and action it.
A data lake, for example, will happily store structured, semi-structured, and unstructured data from a myriad of sources, without needing to process or transform it. But all that data is just sitting there untapped until data scientists and analysts interrogate it. A recent study from Loughborough University suggests that up to 65% of the data we produce is ‘dark data’ – data that’s never actually examined or put to use in a way that benefits the business.
Assuming you can identify the right data, convention would usually point a business towards batch processing, which requires you to wait for data to arrive in storage and then wait for the right cut of it to be made available. And then wait for it to be cleaned. And then wait for it to be processed. And then wait for it to be analysed…
Waiting games are not business strategies. The more time that passes once a data point is created, the less value can be derived from it. The business impact of being minutes, hours, or even days behind an accurate understanding of your business can be catastrophic.
Ideally, you’re able to base every decision you make on real-time data. If we’re going to do that, we need a paradigm shift in terms of how we think about data, engaging with it before it reaches the end destination, and injecting as much value and efficiency as we can in transit. Data can’t be a stationary resource that we collect at the end of the pipeline anymore.
Enter: data streaming.
Go with the flow
Rather than a beach, now imagine yourself on the bank of a clear river – so clear that you can stick your head under the water and see everything flowing through it.
Recognising what’s coming towards you takes no longer than the milliseconds required for the light to hit your eye. You can count the different types of silver fish, or the number of trout, or pick out the odd message in a bottle – whatever is of interest to you at any given time.
That’s data streaming: a constant flow of data that can be analysed in real time, even as it’s being generated. You’re not waiting for data to come to a standstill anymore; you’re engaging with it while it’s in motion.
Data streaming achieves this through an event-driven architecture: a software pattern that allows systems to detect, process, manage, and react to events in real time. Every single data point is treated as an event in its own right, and each event triggers the next stage of processing the moment it occurs.

The result is a continuous flow of data that’s being examined, categorised, and processed en route to its destination. It will arrive at the right place more consistently, in better condition, and far faster than any batch-processing approach would allow.
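To make that a little more concrete, here’s a minimal sketch of the pattern using the open-source confluent-kafka Python client. The broker address, topic names, and the enrichment step are illustrative assumptions rather than anything prescribed here; the point is simply that each record is acted on the moment it arrives.

```python
# Minimal event-driven sketch: each record is consumed, enriched, and
# forwarded as soon as it arrives, rather than waiting for a batch window.
# Broker address, topic names, and the enrichment logic are illustrative.
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-enricher",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["orders.raw"])

try:
    while True:
        msg = consumer.poll(1.0)            # wait up to 1s for the next event
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())     # one data point == one event
        event["amount_gbp"] = event["amount_pence"] / 100   # enrich in transit
        producer.produce("orders.enriched", json.dumps(event).encode("utf-8"))
        producer.poll(0)                    # serve delivery callbacks
finally:
    consumer.close()
    producer.flush()
```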
Like a data lake, data streaming can accommodate data from all sources, and in almost any format or volume. In fact, it doesn’t have to replace the lake at all – streams can use it as a source, connecting and accelerating these different parts of the business. The same goes for data fabrics, meshes, and other existing systems that were intended to make data more accessible.
That also applies to whichever cloud provider(s) you might be shackled to, and even many legacy technologies that have survived simply by being too difficult or expensive to replace or modernise. If you can get your data into the stream, the flow will take it where it needs to go.
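As a rough illustration of what getting your data into the stream can look like, the sketch below lifts rows out of a hypothetical legacy SQLite database and publishes each one as an event; the table, query, and topic name are invented for the example. In practice, off-the-shelf source connectors do this sort of thing continuously without custom code.

```python
# Sketch: lifting rows out of an existing system and into the stream.
# The SQLite database, table, and topic names are hypothetical stand-ins
# for whatever store a business already has in place.
import json
import sqlite3
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
conn = sqlite3.connect("legacy_orders.db")

last_seen_id = 0   # track progress so only new rows are published next time
for row in conn.execute(
    "SELECT id, customer, amount FROM orders WHERE id > ? ORDER BY id",
    (last_seen_id,),
):
    event = {"id": row[0], "customer": row[1], "amount": row[2]}
    producer.produce("orders.raw", json.dumps(event).encode("utf-8"))
    last_seen_id = row[0]

producer.flush()   # make sure every event reached the brokers before exiting
```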
Get moving
So: if data streaming can genuinely deliver on this, how can we then use that data to improve the business, both now and in the future?
It’s at this point that we look to AI and automation. Research has suggested that 61% of future jobs will be “radically transformed” by AI, while Confluent’s own data has found that 88% of UK decision-makers see data streaming platforms as a means of democratising access to AI across the business. Feeding these transformative systems with real-time data streams can dramatically change things for the better.
For example, think about the machine learning (ML) algorithms that drive AI. Traditionally, these models have a finite knowledge span – they can only take in new information until you take them out of the training loop and put them to work. But if you have agency and control over a constant flow of data, you can use that flow to keep refining the ML model even as it performs its job. Once again, we’re escaping the static and embracing data in motion.
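Here’s a rough sketch of that idea, using scikit-learn’s partial_fit to keep refining a classifier as events arrive. The event stream is simulated with a generator, and the features and labelling rule are invented purely for illustration; in a real deployment the events would come from a streaming consumer instead.

```python
# Sketch: a model that keeps serving predictions while it keeps learning,
# one event at a time. The stream, features, and labels are all simulated.
import numpy as np
from sklearn.linear_model import SGDClassifier

def event_stream(n_events=10_000, seed=0):
    """Stand-in for a live stream: yields (features, label) one event at a time."""
    rng = np.random.default_rng(seed)
    for _ in range(n_events):
        x = rng.normal(size=4)
        y = int(x.sum() > 0)              # hypothetical labelling rule
        yield x, y

model = SGDClassifier()
classes = np.array([0, 1])                # partial_fit needs the label set up front

for i, (x, y) in enumerate(event_stream()):
    x = x.reshape(1, -1)
    if i > 0:
        prediction = model.predict(x)     # the model keeps doing its job...
    model.partial_fit(x, [y], classes=classes)   # ...and keeps learning from each event
```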
Similarly, one of the things that large language models (LLMs) like ChatGPT enable is searching incredible volumes of data through simple questions or statements. Imagine the impact and value that such an interface can deliver for strategic decision-making when given dominion over a company’s proprietary data; your CEO can ask a database for specific information, delivered in moments, regardless of where it lives.
Whether it’s the nitty-gritty of machine learning algorithms or the glossy veneer of ChatGPT, being able to harness a constant flow of data isn’t just about best practice for data management. Bringing data to life, and life to data, can elevate everything else that a business already has available to it.
All of this comes from refusing to stand still; from not accepting that data is something we have to wait to be ready for us. If we can think beyond convention, the momentum of data in motion can move any business forward.
Peter Pugh-Jones
Peter Pugh-Jones is Director of Financial Services at Confluent. Peter works with the customer success team to help clients attain maximum value from data streaming.