Challenges in Data-Intensive Applications
Most web-based applications and systems we interact with daily are not CPU-bound, meaning the latency we experience is not due to a huge amount of data being crunched.
Instead, latency can usually be attributed to the time it takes to gather all the data needed to fulfill a request, or to complete the required data-manipulation operations.
Kleppmann calls these data-intensive applications: systems where the primary technical challenge is dealing with the data.
Still, this challenge can come in many forms:
- The challenge can be related to the volume of data, so plenty of storage would be needed.
- Maybe the usage patterns allow us to keep fresher data in hotter storage and archive older data to colder storage.
- The application may require achieving low-latency reads, so data must be arranged ahead of time for easier retrieval (indexing).
- Maybe the data changes rapidly and we need to propagate changes in near real time to many clients (stream processing).
- Maybe we need to crunch a large amount of historical data to generate summarizing reports at predictable intervals (batch processing).
- The data may represent relationships between entities, which must be kept consistent on every operation (relational databases).
- Maybe data is dealt with one individual block at a time, and blocks can be handled independently of each other (key-value or document databases).
- Storing the result of an expensive or frequent operation for fast retrieval (caching).
- Keeping track of tasks to be performed asynchronously, and distributing them as computing resources become available (job queues).
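To make the indexing point above concrete, here is a minimal sketch of a hash index: a dictionary built ahead of time maps a key to a record's position, turning a full scan into a direct lookup. The record layout and field names are illustrative assumptions, not taken from the book.

```python
# Records as they might sit in storage; finding one by id would
# normally require scanning the whole list (O(n)).
records = [
    {"id": 17, "name": "alice"},
    {"id": 42, "name": "bob"},
    {"id": 99, "name": "carol"},
]

# Build the index once, ahead of read time: key -> position in storage.
index_by_id = {r["id"]: pos for pos, r in enumerate(records)}

def find(record_id):
    """Low-latency read: one dict lookup instead of a full scan."""
    pos = index_by_id.get(record_id)
    return records[pos] if pos is not None else None
```

Real database indexes (B-trees, LSM-trees) are far more involved, but the trade-off is the same: extra work and storage at write time in exchange for faster reads.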
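The caching bullet can likewise be sketched in a few lines, here using Python's `functools.lru_cache` to memoize an "expensive" function; the call counter simulating the cost is an illustrative assumption.

```python
import functools

calls = {"count": 0}  # tracks how often the real work actually runs

@functools.lru_cache(maxsize=None)
def expensive(x):
    """Stand-in for a costly query or computation."""
    calls["count"] += 1
    return x * x

expensive(4)  # computed and stored
expensive(4)  # served from the cache; the function body does not run again
```

The usual catch, which the bullet's "frequent operation" wording hints at, is invalidation: a cache is only useful while the stored result still matches the underlying data.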
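Finally, the job-queue bullet can be sketched with the standard library: producers enqueue tasks, and a small pool of worker threads drains them asynchronously as capacity allows. The worker count, sentinel scheme, and the doubling "work" are illustrative assumptions.

```python
import queue
import threading

jobs = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    """Pull tasks until a None sentinel signals there is no more work."""
    while True:
        task = jobs.get()
        if task is None:
            jobs.task_done()
            break
        with lock:
            results.append(task * 2)  # stand-in for the real work
        jobs.task_done()

workers = [threading.Thread(target=worker) for _ in range(2)]
for w in workers:
    w.start()

for task in range(5):      # producer side: enqueue tasks
    jobs.put(task)
for _ in workers:          # one sentinel per worker to shut them down
    jobs.put(None)

jobs.join()                # block until every task has been processed
for w in workers:
    w.join()
```

Production systems use a broker (e.g. a message queue) instead of an in-process `Queue`, but the shape is the same: work is recorded durably and handed out as computing resources become available.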
References
- Kleppmann, Martin. Designing Data-Intensive Applications. O'Reilly Media, 2017.