Frequent Pattern Mining in Big Data
Author(s):
Nagaraju T., Subhashini P.
Year of Publication:
2015
International Journal of Computer Science and Engineering Communications
Abstract
Frequent Pattern Mining (FPM) is one of the most well-known techniques for extracting knowledge from data. The combinatorial explosion of Frequent Itemset Mining (FIM) methods becomes even more problematic when they are applied to Big Data. Fortunately, recent improvements in the field of parallel programming already provide good tools to tackle this problem. However, these tools come with their own technical challenges, e.g., balanced data distribution and inter-communication costs. In this paper, we investigate the applicability of FIM techniques on the MapReduce platform. We introduce two new methods for mining large datasets: Dist-Eclat focuses on speed, while BigFIM is optimized to run on very large datasets. Our experiments demonstrate the scalability of these methods.
Mining frequent itemsets is one of the most investigated fields in data mining; it is a fundamental and crucial task. To improve efficiency and process the data in parallel, a MapReduce algorithm for mining frequent itemsets is proposed. First, a binary-string data structure is employed to represent the database. Hadoop is used to process the big data in parallel with MapReduce: large files are stored in the Hadoop Distributed File System, and the input files are processed to find the frequent patterns they contain. Here, the system acting as a source can produce structured noise of its own, and hence the dependency on helpers may be reduced.
Keywords: FPM, FIM, MapReduce, Hadoop
INTRODUCTION
Frequent Pattern Mining (FPM) is one of the most well-known techniques for extracting knowledge from data. The combinatorial explosion of FIM methods becomes even more problematic when they are applied to Big Data. Fortunately, recent improvements in the field of parallel programming already provide good tools to tackle this problem. However, these tools come with their own technical challenges, e.g., balanced data distribution and inter-communication costs. In this paper, we investigate the applicability of FIM techniques on the MapReduce platform. We introduce two new methods for mining large datasets: Dist-Eclat focuses on speed, while BigFIM is optimized to run on very large datasets. Our experiments demonstrate the scalability of these methods.

The data in the Hadoop Distributed File System (HDFS) is scattered and requires considerable time to retrieve. MapReduce, which operates on datasets of key-value pairs, is the programming paradigm for large distributed operations. The proposed work aims to minimize the data-retrieval time taken by the MapReduce program in HDFS. The main idea is to use in-memory caching in the map phase, which gives a fast and efficient way of searching for frequent itemsets in the MapReduce paradigm. For real-time processing on Hadoop, a search mechanism is implemented in HDFS. This approach improves availability, performance, and scalability when searching for frequent itemsets.
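The MapReduce-style support counting discussed above can be illustrated with a small sketch. The following Python example is a hypothetical, single-machine simulation, not the paper's Hadoop implementation: all function names are illustrative, each list partition stands in for one mapper's input split, and candidate generation is deliberately simplified to all k-subsets of the item universe.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(partition, candidates):
    """Mapper: emit (itemset, 1) for every candidate contained in a transaction."""
    for transaction in partition:
        tx = set(transaction)
        for itemset in candidates:
            if itemset <= tx:  # candidate is a subset of this transaction
                yield (itemset, 1)

def reduce_phase(mapped_pairs):
    """Reducer: sum the partial counts emitted by all mappers, per itemset key."""
    counts = defaultdict(int)
    for itemset, one in mapped_pairs:
        counts[itemset] += one
    return counts

def frequent_itemsets(partitions, min_support, k):
    # Simplified candidate generation: every k-subset of the item universe.
    items = {i for p in partitions for t in p for i in t}
    candidates = [frozenset(c) for c in combinations(sorted(items), k)]
    mapped = []
    for partition in partitions:  # each partition = one mapper's input split
        mapped.extend(map_phase(partition, candidates))
    counts = reduce_phase(mapped)
    return {s: c for s, c in counts.items() if c >= min_support}

# Two "splits" of a small transaction database.
partitions = [
    [["a", "b", "c"], ["a", "b"]],
    [["a", "c"], ["b", "c"], ["a", "b", "c"]],
]
print(frequent_itemsets(partitions, min_support=3, k=2))
```

In a real Hadoop job the reduce step would run in parallel, with the framework grouping pairs by key; the in-memory caching idea mentioned above would correspond to mappers keeping candidate sets and partial counts in memory rather than re-reading them from HDFS.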