Challenges in Data-Intensive Applications
Most web-based applications and systems we interact with daily are not CPU-bound, meaning the latency we experience is not due to a huge amount of data being crunched.
Instead, latency can usually be attributed to the time it takes to gather all the data needed to fulfill a request, or to complete the required data-manipulation operations.
Kleppmann calls these data-intensive applications: systems where the primary technical challenge is dealing with the data.
Still, this challenge can come in many forms:
- The challenge can be related to the volume of data, so plenty of storage would be needed.
- Maybe the usage patterns allow us to keep fresher data in hotter storage and archive older data to colder storage.
- The application may require achieving low-latency reads, so data must be arranged ahead of time for easier retrieval (indexing).
- Maybe the data changes rapidly and we need to propagate changes in near real time to many clients (stream processing).
- Maybe we need to crunch a large amount of historical data to generate summarizing reports at predictable intervals (batch processing).
- The data may represent relationships between entities, which must be kept consistent on every operation (relational databases).
- Maybe data is dealt with one individual block at a time, and blocks can be handled independently of each other (key-value or document databases).
- Storing the result of an expensive or frequent operation for fast retrieval (caching).
- Keeping track of tasks to be performed asynchronously, and distributing them as computing resources become available (job queues).
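To make the indexing point above concrete, here is a minimal sketch of a hash index: a dictionary built ahead of time maps a key to a record's position, turning a full scan into a direct lookup. The record layout and field names are illustrative assumptions, not taken from the book.

```python
# Records as they might sit in storage; finding one by id would
# normally require scanning the whole list (O(n)).
records = [
    {"id": 17, "name": "alice"},
    {"id": 42, "name": "bob"},
    {"id": 99, "name": "carol"},
]

# Build the index once, ahead of read time: key -> position in storage.
index_by_id = {r["id"]: pos for pos, r in enumerate(records)}

def find(record_id):
    """Low-latency read: one dict lookup instead of a full scan."""
    pos = index_by_id.get(record_id)
    return records[pos] if pos is not None else None
```

Real database indexes (B-trees, LSM-trees) are far more involved, but the trade-off is the same: extra work and storage at write time in exchange for faster reads.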
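The caching bullet can likewise be sketched in a few lines, here using Python's `functools.lru_cache` to memoize an "expensive" function; the call counter simulating the cost is an illustrative assumption.

```python
import functools

calls = {"count": 0}  # tracks how often the real work actually runs

@functools.lru_cache(maxsize=None)
def expensive(x):
    """Stand-in for a costly query or computation."""
    calls["count"] += 1
    return x * x

expensive(4)  # computed and stored
expensive(4)  # served from the cache; the function body does not run again
```

The usual catch, which the bullet's "frequent operation" wording hints at, is invalidation: a cache is only useful while the stored result still matches the underlying data.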
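Finally, the job-queue bullet can be sketched with the standard library: producers enqueue tasks, and a small pool of worker threads drains them asynchronously as capacity allows. The worker count, sentinel scheme, and the doubling "work" are illustrative assumptions.

```python
import queue
import threading

jobs = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    """Pull tasks until a None sentinel signals there is no more work."""
    while True:
        task = jobs.get()
        if task is None:
            jobs.task_done()
            break
        with lock:
            results.append(task * 2)  # stand-in for the real work
        jobs.task_done()

workers = [threading.Thread(target=worker) for _ in range(2)]
for w in workers:
    w.start()

for task in range(5):      # producer side: enqueue tasks
    jobs.put(task)
for _ in workers:          # one sentinel per worker to shut them down
    jobs.put(None)

jobs.join()                # block until every task has been processed
for w in workers:
    w.join()
```

Production systems use a broker (e.g. a message queue) instead of an in-process `Queue`, but the shape is the same: work is recorded durably and handed out as computing resources become available.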
References
- Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media, 2017.