Big data system benchmarking enables practitioners and developers to assess the functionality and performance of big data systems, so that they can make informed decisions when choosing a system or improving one. With various benchmarks for big data systems emerging and evolving, whether macro-benchmarks or micro-benchmarks, it is crucial to thoroughly study, analyze, and understand the key techniques and applications of those benchmarks. In this tutorial, we offer a comprehensive presentation of a wide range of state-of-the-art benchmarks with a focus on big data systems. We classify these benchmarks into five categories: MapReduce-based system benchmarking, SQL-based analytical system benchmarking, NoSQL-based database benchmarking, big graph system benchmarking, and multi-model database benchmarking. We discuss the key techniques of each approach, as well as current practices. We also provide insights into the research challenges and directions for benchmarking different big data systems.


Jiaheng Lu is a Professor at the University of Helsinki, Finland. His main research interests lie in big data management and database systems. He has published more than one hundred journal and conference papers. He has extensive experience in industrial collaboration with IBM, Microsoft, and Huawei on projects involving NoSQL databases and performance tuning of distributed systems. He has published several books on XML, Hadoop, and NoSQL databases. His book on Hadoop was one of the top-10 best-selling books in the computer software category in China in 2013. He frequently serves as a PC member for conferences including SIGMOD, VLDB, ICDE, EDBT, and CIKM.



Chao Zhang is a senior Ph.D. candidate at the University of Helsinki (UH), Finland. Prior to joining UH, Chao spent one year in Ph.D. studies at Renmin University of China (RUC). His research focuses on multi-model database benchmarking and query optimization. He is the main contributor to the UniBench project, the first benchmark for multi-model databases. He has published five journal and conference papers in the field of databases, with a focus on multi-model database benchmarking.

