What is Hadoop Adaptive Mapreduce

Abstract:In recent years, the database community has witnessed the advent and breakthrough of many new system designs like the famous Hadoop MapReduce or main memory databases like MonetDB, Hekaton, SAP Hana, and HyPer to solve the problems of "Big Data". The system architectures in this generation of emerging systems often differ radically from traditional relational databases. For database research, this trend creates new challenges to design and optimize data structures for those novel architectures. Premier candidates for innovations are index structures, which are traditionally among the most crucial performance factors in databases. In this thesis, we focus on efficient indexing methods for Hadoop Map Reduce and main memory databases. Our work consists of three independent parts that resulted from different research projects. In the first part, we introduce HAIL, a novel approach for efficient static and adaptive indexing in Hadoop MapReduce. We believe that efficient indexing becomes increasingly important in the context of very large data sets. HAIL combines very low index build times that are often even invisible for users, with significant runtime improvements for selective MapReduce jobs. We provide extensive experiments, and show that HAIL can improve job runtimes by up to 68x over Hadoop. In the second part of this thesis, we present an in-depth evaluation of the adaptive radix tree ART, a recent and very promising competitor in the domain of tree indexes for main memory databases. ART was reported by its inventors to be significantly faster than previous tree indexes and even competitive to hash tables. However, the original evaluation of ART did not consider Judy Arrays, which is, to the best of our knowledge, the first data structure introducing adaptivity to radix trees. Furthermore, the hash table used in the comparison with ART was just a textbook implementation of chained hashing and not a more sophisticated state-of-the-art hash tables. We provide an extended analysis and experimental evaluation of ART, including a detailed comparison to Judy Arrays, hashing via quadratic probing, and three variants of Cuckoo hashing. Our results give a more differentiated look on ART. In particular, we present striking conceptual similarities between ART and Judy Arrays and show that well-engineered hash tables can beat the lookup throughput of adaptive radix trees by up 6x. In the third part, motivated by our previous results, we take a closer look at hashing methods in main memory databases. We identify seven key factors that influence hashing performance, evaluate their impact, and discuss the implications on hashing in modern databases. Our study indicates that choosing the right hashing method and configuration can make an order of magnitude difference in insert and lookup performance. We also provide a guideline for practitioners on when to use which hashing method.
In the past few years, the database community has seen the emergence and breakthrough of many novel system designs whose overarching goal is to solve the problems of big data. Important examples of this are the famous Hadoop MapReduce or main memory databases such as MonetDB, Hekaton, SAP Hana and Hyper. The architectures in this generation of systems often differ fundamentally from the approaches of traditional relational databases. For database research, this development creates new challenges in connection with the design of suitable data structures and their optimization in this new context. The outstanding candidates for innovation include index structures, which have traditionally been one of the most important performance factors in databases. In this thesis we concentrate on efficient indexing methods for Hadoop MapReduce and main memory databases. Our contribution consists of three independent parts, each of which arose from different research projects in this area. In the first part we present HAIL as a new method for efficient static and adaptive indexing in Hadoop MapReduce, a framework for distributed data processing on large computer networks. We believe that efficient indexing is particularly important given the very large amounts of data in typical Hadoop systems. HAIL combines a very short indexing time, which mostly remains invisible to the user, with considerable improvements in runtime for selective MapReduce jobs. We present extensive experiments on this and show that HAIL improves job runtimes by up to 68 times compared to Hadoop. In the second part of this thesis we present a comprehensive analysis of ART, an adaptive radix tree, which is a very promising competitor in the field of index trees for main memory databases. The inventors of ART explain in their work that ART is significantly faster than previous index trees and even approaches the performance of hash tables. However, this original study lacks a comparison of ART with Judy arrays, which to our knowledge is the first adaptive radix tree. Furthermore, the results shown in connection with hash tables are only based on a comparison of ART with a simple implementation of chained hashing, instead of optimized hash tables based on the latest technology. We provide an expanded study of ART, including detailed comparisons with Judy arrays, with quadratic probing, and with three varieties of cuckoo hashing. Our results allow a more differentiated assessment of ART. In particular, we also explain the striking conceptual similarities between ART and Judy arrays. Furthermore, our experimental results show that the throughput for search queries with hash tables is up to a factor of 6 higher than with adaptive radix trees. In the third part, to which our previous results motivated us, we take a closer look at hashing methods for main memory databases. We identify seven key factors that affect the performance of hash tables. In addition, we discuss the effects of these factors and the consequences for hashing in modern databases. Our study clearly shows that choosing the right hash algorithms and correctly configuring them can make an order of magnitude difference in performance for insert and search operations. Finally, we also offer a practical guide to help you find the best hashing method for a specific problem.