Correlation Analysis of Yahoo! Properties and Instant Messaging Data Using Hadoop

Mikiavonty Endrawati Mirabel, Henry Novianus Palit, Andreas Handojo


In the past few years, Big Data has become a trend in the world of Information Technology. As organizations grow, so does the data they own; some data sets reach the scale of terabytes to petabytes. When an organization wants to analyze data of this size, the analysis takes a long time because of limited CPU and memory. To overcome this, there is a paradigm called distributed computing. Many tools aim to process Big Data, such as Apache Hadoop and Apache Spark. The tool analyzed here is Apache Hadoop.

The adoption rate of Big Data in Indonesia is projected at 20% over the next 2 to 3 years [1]. Given this, more and more companies are planning to adopt Big Data to analyze their data. Because of the increasing use of Big Data and Apache Hadoop, an exploratory analysis of data correlation was conducted on Apache Hadoop. In addition, Apache Hadoop was tested with varying numbers of nodes, mappers, and reducers, and with different block sizes [10].

The exploratory analysis of data correlation on Apache Hadoop was carried out by building four data analysis applications: two that compute correlation values for the Yahoo! Messenger data and two that build classification trees. The test results show that the R application is better suited to small data sizes, while Hadoop is better suited to large data sizes. For large data, the R application uses a higher percentage of CPU and memory than Hadoop. The combination of mappers and reducers that gave the best execution time was a number of mappers around 2/3 of the total CPU cores and a number of reducers around 1/3 of the total CPU cores.
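As a minimal illustration of how a correlation value can be computed in a MapReduce style, the sketch below assumes the Pearson coefficient (the abstract does not say which correlation measure was used) and emulates the map and reduce phases in plain Python rather than calling the Hadoop API. Each map task reduces its input split to six partial sums, and the reduce step adds them up and applies the Pearson formula. The function names, and the helper `suggest_task_counts` that turns the 2/3 and 1/3 ratio above into task counts, are hypothetical and not taken from the paper.

```python
import math
from functools import reduce


def map_partial_sums(records):
    """Map step: summarize one input split of (x, y) pairs as six partial sums."""
    n = sx = sy = sxx = syy = sxy = 0.0
    for x, y in records:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    return (n, sx, sy, sxx, syy, sxy)


def combine(a, b):
    """Reduce step: partial sums from two splits are merged by element-wise addition."""
    return tuple(p + q for p, q in zip(a, b))


def pearson_from_sums(stats):
    """Apply the Pearson formula to the merged sums (assumes non-constant x and y)."""
    n, sx, sy, sxx, syy, sxy = stats
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return numerator / denominator


def pearson_mapreduce(splits):
    """Run the emulated map phase over every split, then reduce to one coefficient."""
    partials = [map_partial_sums(split) for split in splits]
    return pearson_from_sums(reduce(combine, partials))


def suggest_task_counts(total_cores):
    """Hypothetical helper: about 2/3 of the cores for mappers, the rest for reducers."""
    mappers = max(1, round(total_cores * 2 / 3))
    reducers = max(1, total_cores - mappers)
    return mappers, reducers
```

Because the partial sums are additive, the result does not depend on how the records are divided into splits, which is what lets the computation scale across nodes. For a cluster with 12 CPU cores, `suggest_task_counts(12)` yields 8 mappers and 4 reducers.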


Hadoop; Big Data; MapReduce; Correlation Analysis




Aggarwal, A. 2015. Managing Big Data Integration in the Public Sector. IGI Global.

Apache. 2016. Apache Hadoop 2.7.2 – HDFS Architecture. URI =

Gravetter, F. J., & Wallnau, L. B. 2013. Statistics for the Behavioral Sciences. Canada: Jon-David Hague.

Harshawardhan S. Bhosale, P. D. 2014. A Review Paper on Big Data and Hadoop. International Journal of Scientific and Research Publications, 4(10), 1-7.

Kshemkalyani, A. D., & Singhal, M. 2011. Distributed Computing: Principles, Algorithms, and Systems. New York: Cambridge University Press.

Leszek, R., Maciej, J., & Lena, P. 2014. Classification and Regression Trees (CART) Theory and Applications. Information Sciences, 266, 1-15.

Loh, W.-Y. 2011. Classification and regression trees. WIREs Data Mining and Knowledge Discovery, 14-23.

Lublinsky, B., Smith, K. T., & Yakubovich, A. 2013. Professional Hadoop® Solutions. Indianapolis: John Wiley & Sons, Inc.

Maitreya, S., & Jhab, C. 2015. MapReduce: Simplified Data Analysis of Big Data. Procedia Computer Science, 57, 563-571.

Marr, B. 2015. Why only one of the 5 Vs of big data really matters | IBM Big Data & Analytics Hub. URI =

Turkington, G. 2013. Hadoop Beginner's Guide. Birmingham: Packt Publishing Ltd.

Yahoo! About Us | URI =

