Automatic Performance Tuning for Distributed Data Stream Processing Systems
Distributed data stream processing systems (DSPSs) such as Storm, Flink, and Spark Streaming are now routinely used to process continuous data streams in (near) real-time. However, achieving the low latency and high throughput demanded by today's streaming applications can be a daunting task, especially since the performance of DSPSs highly depends on a large number of system parameters that control load balancing, degree of parallelism, buffer sizes, and various other aspects of system execution. This tutorial offers a comprehensive review of the state-of-the-art automatic performance tuning approaches that have been proposed in recent years. The approaches are organized into five main categories based on their methodologies and features: cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. The categories of approaches will be analyzed in depth and compared to each other, exposing their various strengths and weaknesses. Finally, we will identify several open research problems and challenges related to automatic performance tuning for DSPSs.
The performance implications of tuning are well-known in the industry, with good configurations leading to significant performance benefits, while bad ones can cause severe performance degradation. A substantial amount of research has been introduced in the last few years for automating performance tuning in DSPSs using a variety of approaches, classified into five categories: cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. The above table outlines and compares the various key features and functionalities of approaches falling in the five categories in terms of modeling, need for statistics or runs, prediction accuracy, and adaptability to workload and system changes. These approach categories will be examined in depth and compared within the context of DSPSs during this tutorial.
The tutorial is planned for 1.5 hours (90 minutes) and will have the following structure:
- Introduction and motivation (10 mins). We introduce the problem of performance tuning in large-scale data stream processing systems and motivate the need for automatic tuning approaches with several applications/scenarios.
- Method taxonomy (10 mins). We present the taxonomy of automatic performance tuning approaches in DSPSs, including cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning.
- Parameter tuning approaches (50 mins). We introduce representative approaches from each category for tuning the performance of streaming applications.
- Comparison of approaches (10 mins). We compare the solutions in the various tuning categories and across the two streaming models, i.e., record-based and batched streaming.
- Open problems and challenges (10 mins). We discuss open problems and challenges for performance tuning.
- H. Herodotou, Y. Chen, and J. Lu, “A Survey on Automatic Parameter Tuning for Big Data Processing Systems,” ACM Computing Surveys(CSUR), vol. 53, no. 2, pp. 1–37, 2020.
- P. Carbone, M. Fragkouliset al., “Beyond Analytics: The Evolution of Stream Processing Systems,” in SIGMOD. ACM, 2020, pp. 2651–2658.
- H. Isah, T. Abughofa, S. Mahfuzet al., “A Survey of Distributed Data Stream Processing Frameworks,” IEEE Access, vol. 7, 2019.
- M. Dayarathna and S. Perera, “Recent Advancements in Event Processing,” ACM Computing Surveys (CSUR), vol. 51, no. 2, p. 33, 2018.
Herodotos Herodotou is an Assistant Professor at the Cyprus University of Technology. His research work focuses on automatic performance tuning of both centralized and distributed data-intensive computing systems. His Ph.D. dissertation on MapReduce performance tuning received the ACM SIGMOD Jim Gray Doctoral Dissertation Award Honorable Mention.
Lambros Odysseos is a Ph.D. student at the Cyprus University of Technology. He has been working as a research associate for the past few years and his research interests include stream processing, data analytics and visualizations, smart data processing, Internet of Things (IoT), and machine learning.
Yuxing Chen is a senior engineer at the Tencent Inc. His research topics are parameter tuning on big data systems and transaction processing.
Jiaheng Lu is a Professor at the University of Helsinki, Finland. He has written four books on Hadoop and NoSQL databases, and more than 100 journal and conference papers published in SIGMOD, VLDB, TODS, TKDE, etc.