Did you know that we are living in the Zettabyte era? IDC expects 175 zettabytes of data worldwide by 2025. With data processing growing rapidly over the last decade, performance expectations for data applications have risen with it, and running analytics on that data is now crucial for businesses that want to discover actionable insights.
When it comes to data processing, stream and batch processing are the two most common approaches. Neither is superior, and which one you use comes down to your individual business requirements.
This article will explain the differences between batch and stream processing. The term ETL will be used throughout the article. It stands for extract, transform, load, meaning that you extract the raw data from a source, transform it by deduplication, merging or other methods, and finally load the data into the target database.
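To make the three ETL steps concrete, here is a minimal sketch in Python. It is an illustration only: the record fields (`id`, `email`), the in-memory source list, and the dictionary standing in for the target database are all assumptions made for this example.

```python
from typing import Iterable


def extract(source: Iterable[dict]) -> list[dict]:
    """Extract: read the raw records from a source (here, an in-memory list)."""
    return list(source)


def transform(records: list[dict]) -> list[dict]:
    """Transform: drop duplicate ids (first one wins) and normalize emails."""
    seen: set[int] = set()
    cleaned: list[dict] = []
    for record in records:
        if record["id"] in seen:
            continue  # deduplication
        seen.add(record["id"])
        cleaned.append({**record, "email": record["email"].strip().lower()})
    return cleaned


def load(records: list[dict], target: dict[int, dict]) -> None:
    """Load: write cleaned records into the target (a dict stands in for a database)."""
    for record in records:
        target[record["id"]] = record


# Hypothetical raw data containing one duplicate id.
raw = [
    {"id": 1, "email": " Ada@Example.com "},
    {"id": 2, "email": "bob@example.com"},
    {"id": 1, "email": "ada@example.com"},
]
database: dict[int, dict] = {}
load(transform(extract(raw)), database)
```

After the run, `database` holds two records, and the first record's email has been trimmed and lowercased.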
Batch processing vs. stream processing
Let's define batch and stream processing before diving into the details.
With batch processing, the data first gets collected in batches. Large, finite quantities of data are processed at once, albeit with a gap for the collection to take place. In simpler terms, the data collection is separate from the processing.
With stream processing, there is no gap between collection and processing. The data is processed in real time, one record or event at a time, in continuous streams with a near-instant response. To sum up:
- In batch processing, data is first collected as a batch, and then processed all at once.
- In stream processing, data is processed in real time as it enters the system, with no wait time between collecting and processing.
Both processing methods have different use cases, benefits, and limitations.
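The difference between the two models can be sketched in a few lines of Python. This is a toy illustration, assuming the "processing" is simply summing integers: the batch function collects everything and computes once, while the stream function updates and emits a result for every event as it arrives.

```python
from typing import Iterable, Iterator


def process_batch(events: Iterable[int]) -> int:
    """Batch: collect the full, finite set first, then process it in one pass."""
    collected = list(events)  # collection phase
    return sum(collected)     # processing phase


def process_stream(events: Iterable[int]) -> Iterator[int]:
    """Stream: process each event on arrival and emit an updated result."""
    running_total = 0
    for event in events:
        running_total += event
        yield running_total


events = [5, 10, 20]
batch_result = process_batch(events)           # one answer, after collection
stream_results = list(process_stream(events))  # one answer per event
```

The batch call produces a single total once all data is in, while the stream call yields an up-to-date total after each event.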
When to use batch processing
So, what are the key factors when it comes to selecting between batch processing and stream processing, and when should you pick a particular one?
Data volume and the batch job's time sensitivity are the most commonly cited factors when considering batch processing. In batch processing, we wait for a certain amount of data to pile up before running an ETL job. Usually, this interval varies from a few minutes to a few days, and in the vast majority of cases, the ETL job is scheduled to run during off-peak hours, allowing application infrastructure to be used for higher-priority tasks during business hours. This frees up personnel to work on more pressing tasks.
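The "wait for data to pile up" behavior can be sketched as a small buffer that triggers a job once enough records have accumulated or enough time has passed. The class name `BatchBuffer`, its thresholds, and the `run_job` callback are hypothetical names chosen for this sketch; a production system would more likely rely on a scheduler, and timestamps are passed in explicitly here only to keep the example deterministic.

```python
from typing import Callable


class BatchBuffer:
    """Collects records and runs a job when a size or age threshold is reached."""

    def __init__(
        self,
        max_records: int,
        max_age_seconds: float,
        run_job: Callable[[list[str]], None],
    ) -> None:
        self.max_records = max_records
        self.max_age_seconds = max_age_seconds
        self.run_job = run_job
        self._records: list[str] = []
        self._opened_at = 0.0

    def add(self, record: str, now: float) -> None:
        if not self._records:
            self._opened_at = now  # the first record opens a new batch
        self._records.append(record)
        too_big = len(self._records) >= self.max_records
        too_old = now - self._opened_at >= self.max_age_seconds
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if self._records:
            self.run_job(self._records)
            self._records = []  # start a fresh batch


# Each "job" here just stores the batch it was handed.
jobs: list[list[str]] = []
buffer = BatchBuffer(max_records=3, max_age_seconds=60.0, run_job=jobs.append)
for second, record in enumerate(["a", "b", "c", "d"]):
    buffer.add(record, now=float(second))
buffer.add("e", now=100.0)  # arrives after the age limit has passed
```

In this run the first job fires because the batch reached three records, and the second fires because the open batch grew older than 60 seconds.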
A batch process is best suited if the data volume is large, but known and finite, and if business use cases do not require a high degree of data freshness. In other words, it is acceptable to use historical data that was previously collected at a certain point in time to derive insights into the business, and there is no need to consider using real-time data. Batch processing is useful when performing repetitive, high-volume tasks that do not require much user interaction, and when the dataset requires complex computations or heavy analysis.
Specifically, batch processing is applicable in cases such as these:
- E-commerce order processing and fulfillment
- Billing or payroll
- Non-time sensitive reporting or analytics jobs
Payment processing every two weeks is an example of batch processing. Similarly, a company that bills its customers based on a monthly subscription model is a good use case for batch processing accounts data.
Disadvantages of batch processing
Batch processing can be highly efficient when performed properly, but because jobs run at large scale, failures tend to be large in scale as well.
Batch systems are often monolithic in nature, complex to manage, and can require expensive scale-up SMP (symmetric multiprocessing) machines to run. Typically, they need to be manually tuned to meet specific configuration requirements, making them expensive to build and hard to manage without specially trained staff.
Corrupted data is another issue that can arise with batch-processing systems. If the system goes down, the entire operation is halted until someone can fix the problem. The result is lost productivity and higher maintenance costs. In these scenarios, it is critical to have IT experts on hand who can debug the system quickly.
Additionally, since batch processing has its own built-in level of latency, it is not suitable for use cases that require real-time data handling.
When to use stream processing
Stream processing is used when real-time analytics is needed, and data is generated as a continuous stream with high velocity. With stream-based data processing, logic is executed on the data instantly — or within a sub-second timeframe — upon hitting the storage layer. For the user, this is perceived as real-time processing.
Stream processing is most useful when split-second decision-making and visibility are important for the business. To support this, data must be processed continuously and at high speed as it is generated, so that it stays fresh.
Some use cases where stream data processing is best include:
- Fraud protection
- Machine learning
- Real-time system log and security monitoring
A practical use case of data streaming is fraud monitoring by credit card companies. It is a time-sensitive process: the credit card processor must be able to detect suspicious activity the moment it happens in order to reject a transaction.
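As a rough illustration of per-event decisions, the sketch below applies a simple "velocity" rule to each transaction as it arrives. The rule itself (reject a card that exceeds a set number of transactions within a sliding time window) and all names and thresholds are assumptions made for this example, not how any particular card processor works.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    card_id: str
    amount: float
    timestamp: float  # seconds since some reference point


class VelocityCheck:
    """Hypothetical rule: allow at most max_count transactions per card per window."""

    def __init__(self, max_count: int, window_seconds: float) -> None:
        self.max_count = max_count
        self.window_seconds = window_seconds
        self._recent: dict[str, deque[float]] = defaultdict(deque)

    def approve(self, tx: Transaction) -> bool:
        recent = self._recent[tx.card_id]
        # Drop timestamps that have fallen out of the sliding window.
        while recent and tx.timestamp - recent[0] >= self.window_seconds:
            recent.popleft()
        # Rejected attempts are recorded too, so they count toward the limit.
        recent.append(tx.timestamp)
        return len(recent) <= self.max_count


check = VelocityCheck(max_count=2, window_seconds=60.0)
stream = [
    Transaction("card-1", 20.0, 0.0),
    Transaction("card-1", 35.0, 10.0),
    Transaction("card-1", 900.0, 20.0),  # third attempt within 60 seconds
    Transaction("card-2", 15.0, 25.0),
    Transaction("card-1", 12.0, 90.0),   # window has cleared by now
]
decisions = [check.approve(tx) for tx in stream]
```

Each decision is made as soon as the transaction is seen, using only the small amount of state kept per card, which is what lets the check run inline with the payment.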
Disadvantages of stream processing
One of the biggest challenges of stream processing is that it is difficult to implement at scale. For example, you need a built-in mechanism to provide resiliency against stream data imperfections, including missing and out-of-order data. Without adequate storage and processing resources, the system can easily get overloaded when a large amount of data suddenly arrives and needs to be processed. This means that additional processing capacity must remain on standby to meet the real-time processing needs of the application.
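One common way to cope with out-of-order data is to hold events briefly and release them in timestamp order once a "watermark" has passed them. The sketch below is a simplified, single-process illustration of that idea; the class name `ReorderBuffer`, the `allowed_lateness` parameter, and the choice to simply set aside events that arrive too late are assumptions for this example.

```python
import heapq


class ReorderBuffer:
    """Releases events in timestamp order, tolerating a bounded amount of lateness."""

    def __init__(self, allowed_lateness: float) -> None:
        self.allowed_lateness = allowed_lateness
        self._heap: list[tuple[float, str]] = []
        self._max_seen = float("-inf")
        self.dropped: list[tuple[float, str]] = []

    def push(self, timestamp: float, payload: str) -> list[tuple[float, str]]:
        # Events older than the current watermark are too late to be ordered.
        if timestamp < self._max_seen - self.allowed_lateness:
            self.dropped.append((timestamp, payload))
            return []
        heapq.heappush(self._heap, (timestamp, payload))
        self._max_seen = max(self._max_seen, timestamp)
        watermark = self._max_seen - self.allowed_lateness
        ready = []
        # Everything at or before the watermark can be released safely.
        while self._heap and self._heap[0][0] <= watermark:
            ready.append(heapq.heappop(self._heap))
        return ready

    def flush(self) -> list[tuple[float, str]]:
        """Release whatever is still buffered, in order (for example at shutdown)."""
        return [heapq.heappop(self._heap) for _ in range(len(self._heap))]


buffer = ReorderBuffer(allowed_lateness=2.0)
arrivals = [(1.0, "a"), (3.0, "c"), (2.0, "b"), (6.0, "d"), (1.5, "late"), (5.0, "e")]
ordered: list[tuple[float, str]] = []
for timestamp, payload in arrivals:
    ordered.extend(buffer.push(timestamp, payload))
ordered.extend(buffer.flush())
```

The trade-off is latency versus completeness: a larger `allowed_lateness` reorders more stragglers but delays every result, and it requires more memory to hold the buffered events.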
Using the hybrid approach
In response to a multitude of different data use cases, an increasing number of companies are opting for a hybrid lambda architecture approach, which combines batch and stream data processing. Using this approach, non-urgent ETL tasks are handled by batch processing, and real-time data requirements are handled by streaming. However, it is important to note that not all infrastructure is suitable for stream processing. For example, you need adequate compute capacity to process streaming data on the fly, and you need to distribute processing transparently across multiple compute engines to achieve incremental scalability.
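A heavily simplified sketch of the hybrid idea follows, assuming the workload is counting page views. A batch job periodically builds a view from historical events, a speed layer counts events that arrived since that job ran, and queries merge the two. The function and class names are hypothetical, and real lambda architectures involve considerably more machinery than this.

```python
from collections import Counter


def build_batch_view(historical_events: list[dict]) -> Counter:
    """Batch layer: recompute page-view counts from the full historical dataset."""
    return Counter(event["page"] for event in historical_events)


class SpeedLayer:
    """Speed layer: counts events that arrived after the last batch run."""

    def __init__(self) -> None:
        self.recent: Counter = Counter()

    def ingest(self, event: dict) -> None:
        self.recent[event["page"]] += 1

    def reset(self) -> None:
        """Called once a new batch view has absorbed these events."""
        self.recent.clear()


def query(page: str, batch_view: Counter, speed_layer: SpeedLayer) -> int:
    """Serving layer: merge the batch view with the real-time increments."""
    return batch_view[page] + speed_layer.recent[page]


historical = [{"page": "/home"}, {"page": "/home"}, {"page": "/pricing"}]
batch_view = build_batch_view(historical)

speed = SpeedLayer()
speed.ingest({"page": "/home"})
speed.ingest({"page": "/docs"})

home_views = query("/home", batch_view, speed)
docs_views = query("/docs", batch_view, speed)
pricing_views = query("/pricing", batch_view, speed)
```

Queries see both the accurate-but-stale batch results and the fresh-but-partial streaming results, at the cost of maintaining two processing paths.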
There is a better way: you can focus on designing your application, and leave streaming at the database layer to Fauna. Fauna is a serverless, developer-friendly database with streaming capabilities. As a serverless database, Fauna lets you pay for what you use, use as much as you need, and protect against outages under high loads.