Accelerating Iterative Big Data Computing Through MPI
Fan Liang, Xiaoyi Lu
In this paper, we first analyze the overhead of the shuffle operation in Hadoop and Spark when running the PageRank workload. We then propose an event-driven pipeline and an in-memory shuffle design that better overlap computation and communication, implemented as DataMPI-Iteration, an MPI-based library for iterative big data computing. Our performance evaluation shows that DataMPI-Iteration achieves a 9X~21X speedup over Apache Hadoop and a 2X~3X speedup over Apache Spark for PageRank and K-means.