Aug 3, 2022

Conventional database management vs. Hadoop data systems

 Without going into the definitions of the traditional and Hadoop data management systems, the following are the top highlighted points of comparison between these two systems based on the parallel processing effect.

Hadoop architecture is designed based on parallel and distributed processing systems. Each feature selector of Hadoop can work separately into subtasks, and the subtasks can then be working and processed parallel. (Hodge, 2016) Multifeatured selectors in Hadoop also can be processed in parallel, allowing multi-feature selectors to be compared, and this is unique for Hadoop, whereas the conventional RDBMS does not have this feature. 

In the field of data mining, Hadoop or MapReduce algorithms run on a parallel processing system (parallelized) for individual feature selection, which enables Hadoop to analyze a vast scale of data mining. But other data mining tools such as Weka, Matlab, and SPSS are designed for small-scale data mining because of how their multiprocessing algorithms are developed.

YARN in Hadoop is highly configurable and can assign work to the nodes in a cluster where the Hadoop Distributed File System (HDFS) spans all the nodes in the Hadoop cluster with just a single namespace. YARN can run any existing application by superseding MapReduce in Hadoop, and this ability makes the Hadoop system a vast scale of parallel data processing. An ApplicationMaster controls the node management and cluster management in Hadoop. Its job is to negotiate resources from the central resource manager and nade manager agents to monitor and execute the works. On the other hand, the MapReduce procedures map the parallel processing in separate chunks, combine the results, and issue a signle output. So, the inputs and outputs are stored in the HDFS for future analysis.  HDFS in Hadoop uses parallel processing to link all file systems on many local nodes into an extensive file system for future analysis. This linking between resources in Hadoop creates a highly aggregate bandwidth across the cluster.

So, with all said, RDBMS is used for data storage, manipulation, and retrieval. In contrast, Hadoop is an open-source multiprocessing-based system architecture for storing data and running applications and processes parallel. (GeeksforGeeks, 2022) Therefore, from my point of view, Hadoop is a more highly parallel processing-based system than the conventional RDBMS.


Reference

GeeksforGeeks. (2022, February 25). Difference Between RDBMS and Hadoop. Retrieved 2022, from https://www.geeksforgeeks.org/difference-between-rdbms-and-hadoop/

Hodge, V. J., O’Keefe, S., & Austin, J. (2016). Hadoop neural network for parallel and distributed feature selection. In Neural Networks (Vol. 78, pp. 24–35). Elsevier BV. https://doi.org/10.1016/j.neunet.2015.08.011

No comments:

Post a Comment

Big Data migrates to hybrid and multi-cloud environment

 IDC research predicts that the Global Datasphere will grow to 175 Zettabytes by 2025, and China's data sphere is on pace to become th...