Distributed Data-Parallel Platforms Revolutionize Multi-Way Join Queries, Slashing Communication Costs
The study targets the efficiency of multi-way join queries in big data analysis. The researchers developed AutoMJ, a framework that automatically selects a join strategy based on the estimated sizes of a query's intermediate results. In tests on Apache Spark, the one-round join method ran up to 159.3 times faster than the traditional approach on queries with large intermediate results. For queries with small intermediate results, the traditional method was instead 2.1 to 6.2 times faster. The results indicate that AutoMJ's strategy selection model effectively picks the optimal join method, with very low estimation errors on datasets from Twitter and Wikidata.
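The selection idea can be illustrated with a minimal sketch. It is not AutoMJ's actual model, which the summary does not describe. It assumes a single row-count threshold, and the names `QueryEstimate`, `choose_strategy`, and `threshold_rows` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class JoinStrategy(Enum):
    ONE_ROUND = "one_round"      # the one-round join method named in the summary
    TRADITIONAL = "traditional"  # the baseline method it is compared against


@dataclass(frozen=True)
class QueryEstimate:
    # Hypothetical input: estimated row counts of a query's intermediate results.
    intermediate_rows: Sequence[int]


def choose_strategy(estimate: QueryEstimate, threshold_rows: int) -> JoinStrategy:
    """Pick the one-round join when the largest estimated intermediate
    result exceeds the threshold, otherwise the traditional method.

    Assumption: a single fixed threshold stands in for whatever
    selection model the real framework uses.
    """
    largest = max(estimate.intermediate_rows, default=0)
    if largest > threshold_rows:
        return JoinStrategy.ONE_ROUND
    return JoinStrategy.TRADITIONAL


if __name__ == "__main__":
    big = QueryEstimate(intermediate_rows=[10_000, 5_000_000])
    small = QueryEstimate(intermediate_rows=[10_000, 20_000])
    print(choose_strategy(big, threshold_rows=1_000_000).value)
    print(choose_strategy(small, threshold_rows=1_000_000).value)
```

A fixed threshold keeps the sketch short. In practice the crossover point would likely depend on cluster size, data skew, and the query's shape, so the threshold here should be read as a placeholder for a learned or cost-based decision.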