
Frequent Pattern Mining in Big Data


Abstract
Frequent Pattern Mining (FPM) is one of the most well-known techniques for extracting knowledge from data. The combinatorial explosion of Frequent Itemset Mining (FIM) methods becomes even more problematic when they are applied to Big Data. Fortunately, recent improvements in the field of parallel programming already provide good tools to tackle this problem. However, these tools come with their own technical challenges, e.g. balanced data distribution and inter-communication costs. In this paper, we investigate the applicability of FIM techniques on the MapReduce platform. We introduce two new methods for mining large datasets: Dist-Eclat focuses on speed, while BigFIM is optimized to run on very large datasets. Our experiments show the scalability of our methods. Mining frequent itemsets is one of the most investigated fields in data mining; it is a fundamental and crucial task. To improve efficiency and process the data in parallel, a MapReduce algorithm for mining frequent itemsets is proposed. First, a binary-string data structure is employed to represent the database. Hadoop is used to process the big data in parallel using MapReduce: large files are stored in the Hadoop Distributed File System, and the input files are processed to find the frequent patterns they contain. Here, the system acting as a source can produce structured noise of its own, so the dependency on helpers may be reduced.
Keywords: FPM, FIM, MapReduce, Hadoop
INTRODUCTION
Frequent Pattern Mining (FPM) is one of the most well-known techniques for extracting knowledge from data. The combinatorial explosion of Frequent Itemset Mining (FIM) methods becomes even more problematic when they are applied to Big Data. Fortunately, recent improvements in the field of parallel programming already provide good tools to tackle this problem. However, these tools come with their own technical challenges, e.g. balanced data distribution and inter-communication costs. In this paper, we investigate the applicability of FIM techniques on the MapReduce platform and introduce two new methods for mining large datasets: Dist-Eclat focuses on speed, while BigFIM is optimized to run on very large datasets. Our experiments show the scalability of our methods. The data in the Hadoop Distributed File System (HDFS) is scattered and takes a long time to retrieve. MapReduce, which operates on data sets of key-value pairs, is the programming paradigm for large distributed operations. The proposed work aims to minimize the data retrieval time taken by the MapReduce program in HDFS. The main idea is to use in-memory caching in the map phase, which gives a fast and efficient way of searching for frequent itemsets in the MapReduce paradigm. For real-time processing on Hadoop, a search mechanism is implemented in HDFS. This approach improves the availability, performance, and scalability of frequent-itemset search.
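To make the key-value MapReduce paradigm concrete, the following is a minimal single-process sketch of Apriori-style frequent itemset counting expressed as map and reduce phases. It is not the paper's Dist-Eclat or BigFIM implementation, and the toy transaction database, the `MIN_SUPPORT` threshold, and the function names are all illustrative assumptions; a real Hadoop job would distribute the map and reduce calls across the cluster.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy transaction database; each transaction is a set of items.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
]
MIN_SUPPORT = 3  # assumed absolute minimum support threshold

def map_phase(transaction, k, candidates=None):
    """Emit (itemset, 1) key-value pairs for each size-k itemset in a transaction."""
    for itemset in combinations(sorted(transaction), k):
        if candidates is None or itemset in candidates:
            yield itemset, 1

def reduce_phase(pairs):
    """Sum the counts per itemset key and keep only the frequent itemsets."""
    counts = defaultdict(int)
    for itemset, n in pairs:
        counts[itemset] += n
    return {s: c for s, c in counts.items() if c >= MIN_SUPPORT}

# Pass 1: count single items and keep the frequent 1-itemsets.
pairs = (p for t in TRANSACTIONS for p in map_phase(t, 1))
frequent_1 = reduce_phase(pairs)

# Pass 2: count only candidate pairs built from frequent items (Apriori pruning).
items = sorted(i for (i,) in frequent_1)
candidates = set(combinations(items, 2))
pairs = (p for t in TRANSACTIONS for p in map_phase(t, 2, candidates))
frequent_2 = reduce_phase(pairs)

print(frequent_1)
print(frequent_2)
```

Each MapReduce pass corresponds to one level of the Apriori lattice; pruning the pass-2 candidates with the pass-1 output is what keeps the intermediate key-value traffic between mappers and reducers manageable.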
