An alternative framework for distributed computing tames the ever-increasing costs of big data

by Tsinghua University Press

Diagram showing the locations of various computational and other costs incurred in the proposed Non-MapReduce framework. Image credit: Big Data Mining and Analytics, Tsinghua University Press

The sheer volume of “Big Data” being produced by various sectors today is beginning to overwhelm even the highly efficient computing techniques developed to sift through all this information. But a new computational framework based on random sampling may finally bring the ever-growing communication, storage, and energy costs of big data under control.

An article describing the framework was published in the journal Big Data Mining and Analytics.

The amount of data produced from social networking, business transactions, the “Internet of Things”, finance, healthcare and beyond has exploded in recent years. This era of so-called big data has offered incredible statistical power to discover patterns and provide insights previously unimaginable. But the volume of big data produced is gradually reaching the limits of computing power.

The scalability of complex algorithms in a computer cluster or in cloud computing falters at about a terabyte of data, or a trillion bytes. The New York Stock Exchange, for example, produces about a terabyte of trading data every day, while Facebook users generate 500 terabytes in the same amount of time.

Distributed computing plays a crucial role in storing, processing and analyzing such big data. This framework employs a “divide and conquer” strategy to handle it efficiently and quickly, partitioning a large data file into a series of smaller files called “data block files”.


These blocks of data are stored across the many nodes of a computer cluster. Each of these blocks is then processed in parallel rather than sequentially, radically speeding up processing time. The results from these local nodes are then fed back to a central location and re-integrated, producing a global result.

This divide-and-conquer operation is managed by a distributed file system, which in turn is governed by a programming model. The file system divides the large data files, and the programming model divides an algorithm into parts that can be executed across the data blocks.
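As a rough sketch of that divide step, the following Python snippet splits one large file into fixed-size “data block files” of the kind described above; the 128 MB block size and the file-naming scheme are illustrative assumptions, not the behavior of any particular distributed file system.

```python
# Minimal sketch of the "divide" step: split one large file into fixed-size
# block files that could then be spread across the nodes of a cluster.
# The 128 MB block size and the ".blockNNNNN" naming scheme are illustrative
# assumptions, not the behavior of any real distributed file system.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB per data block file

def partition_file(path: str, block_size: int = BLOCK_SIZE) -> list[str]:
    block_paths = []
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(block_size)
            if not chunk:
                break
            block_path = f"{path}.block{index:05d}"
            with open(block_path, "wb") as dst:
                dst.write(chunk)
            block_paths.append(block_path)
            index += 1
    return block_paths

# Example: partition_file("transactions.csv") would return something like
# ["transactions.csv.block00000", "transactions.csv.block00001", ...]
```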

MapReduce, developed by Google, is the most widely used programming model for distributed computing running on clusters and across the cloud. The name comes from its two basic operations. The Map operation is performed on the data block in a node to produce a local result; running it on many nodes in parallel is what delivers the huge speedup in processing time. The Reduce operation then combines all of these local results into one global result.
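As an illustration, here is a minimal single-machine sketch of these two operations applied to the classic word-count task; the list of text strings merely simulates blocks stored on different nodes, so the real parallelism and shuffling of a cluster are not reproduced.

```python
from collections import Counter
from functools import reduce

# Map: each node counts the words in its own data block, producing a local result.
def map_word_count(block: str) -> Counter:
    return Counter(block.split())

# Reduce: the local results are gathered at a central point and merged
# into one global word count.
def reduce_word_counts(local_counts: list[Counter]) -> Counter:
    return reduce(lambda a, b: a + b, local_counts, Counter())

# Simulated data blocks standing in for blocks stored on different nodes.
blocks = ["big data big compute", "data moves compute stays", "big results"]
local_results = [map_word_count(b) for b in blocks]  # done in parallel on a real cluster
global_result = reduce_word_counts(local_results)
print(global_result.most_common(3))
```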

The Reduce stage involves transmitting the local results to the master or central nodes that perform the Reduce operation, and all this data shuffling is extremely expensive in terms of communication traffic and memory.

“These huge communication costs are manageable to a point,” said Xudong Sun, the paper’s lead author and a computer scientist at Shenzhen University’s College of Computer Science and Software Engineering. “If the desired task involves only a single pair of Map and Reduce operations, such as counting the frequency of a word across a large number of web pages, MapReduce can be run extremely efficiently across thousands of nodes, even on a gigantic big data file.”


“But if the desired task involves a series of iterations of Map and Reduce pairs, then MapReduce becomes very sluggish because of the high communication costs and the resulting memory and computation costs,” he added.

So the researchers developed a new framework for distributed computing, which they call Non-MapReduce, to improve the scalability of cluster computing on big data by reducing those communication and storage costs.

To do this, they rely on a novel data representation model called Random Sample Partition, or RSP. Instead of processing all of the distributed data blocks of a large data file, a set of RSP data blocks is randomly selected for processing when the file is analyzed, and the local results are then integrated at a global level to approximate the result that would have been obtained had the entire data file been processed.
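A minimal sketch of that select-and-integrate step, assuming a simple statistic (the mean) and plain Python lists standing in for RSP data blocks, might look like this; a real RSP system additionally guarantees that every block is itself a random sample of the whole file, which this toy example only imitates.

```python
import random

def approximate_mean(rsp_blocks: list[list[float]], num_selected: int) -> float:
    # Randomly select a subset of RSP data blocks instead of processing all of them.
    selected = random.sample(rsp_blocks, num_selected)
    # Analyze each selected block locally, with no cross-node data shuffling.
    local_means = [sum(block) / len(block) for block in selected]
    # Integrate the local results into one global estimate of the full-data mean.
    return sum(local_means) / len(local_means)

# Toy data: 100 "blocks" of 1,000 values each, built so that every block is
# roughly a random sample of the whole data set (the key RSP property).
full_data = [random.gauss(50.0, 10.0) for _ in range(100_000)]
blocks = [full_data[i::100] for i in range(100)]

# Estimate the overall mean from only 10 of the 100 blocks.
print(f"approximate mean: {approximate_mean(blocks, 10):.2f}")
```

Because communication happens only when the small set of local results is combined, the shuffle cost no longer grows with the full size of the data file.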

In this way, the technique works much like statistical analysis, which uses random sampling to describe the characteristics of a whole population. The RSP approach of Non-MapReduce is thus a form of “approximate computing”, an emerging paradigm in data processing that returns an approximate rather than an exact result in exchange for greater energy efficiency.

Approximate computing is useful in situations where an approximately accurate result that is cheap to compute is sufficient for the task at hand, and is preferable to a computationally expensive attempt to produce a perfectly accurate one.

The Non-MapReduce computing framework will be of significant use for a number of tasks, such as executing a suite of algorithms directly on local random samples without requiring data communication between nodes, and facilitating big data exploration and cleansing. In addition, the framework saves considerable energy in cloud computing.


The team now hopes to apply its Non-MapReduce framework to major big data platforms and use it for real-world applications. Ultimately, they want to use it to solve application problems in analyzing extremely large amounts of data distributed across multiple data centers.

More information: Xudong Sun et al., Survey of Distributed Computing Frameworks for Supporting Big Data Analysis, Big Data Mining and Analytics (2023). DOI: 10.26599/BDMA.2022.9020014

Provided by Tsinghua University Press