Co-Clustering the Documents and Words Using K-Means Clustering

IJCSEC Front Page
Abstract:
In this paper, we propose a novel constrained coclustering method to achieve two goals. First, we combine information theoretic coclustering and constrained clustering to improve clustering performance. Second, we adopt both supervised and unsupervised constraints to demonstrate the effectiveness of our algorithm. The unsupervised constraints are automatically derived from existing knowledge sources, thus saving the effort and cost of using manually labeled constraints. To achieve our first goal, we develop a two-sided hidden Markov random field (HMRF) model to represent both document and word constraints. We then use an alternating expectation maximization (EM) algorithm to optimize the model. We also propose two novel methods to automatically construct and incorporate document and word constraints to support unsupervised constrained clustering: 1) automatically construct document constraints based on overlapping named entities (NE) extracted by an NE extractor; 2) automatically construct word constraints based on their semantic distance inferred from 20 Newsgroups Data set. The results of our evaluation using 20 Newsgroups demonstrate the superiority of our approaches against the k-Means similarity approaches.

Keywords:Datamining, Clustering ,K-means clustering and Pattern set mining

References:

  1. A. Jain, M. Murty, and P. Flynn, “Data Clustering: A Review,”ACM Computing Surveys,1999.
  2. Y. Cheng and G.M. Church, “Biclustering of Expression Data,”Proc. Int’l System for Molecular Biology Conf. (ISMB)
  3. I.S. Dhillon, “Co-Clustering Documents and Words Using Bipartite Spectral Graph Partitioning,” Proc. Seventh ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD)
  4. I.S. Dhillon, S. Mallela, and D.S. Modha, “Information-Theoretic Co-Clustering,” Proc. Ninth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD)
  5. H. Cho, I.S. Dhillon, Y. Guan, and S. Sra, “Minimum Sum-Squared Residue Co-Clustering of Gene Expression Data,”Conf. Data. Mining (SDM), 2004.
  6. Semi-Supervised Learning, O. Chapelle, B. Scho¨lkopf, and A. Zien,eds. MIT Press, http://www.kyb.tuebingen.mpg.de/ssl-book,2006.
  7. S. Basu, I. Davidson, and K. Wagstaff, Constrained Clustering:Advances in Algorithms, Theory, and Applications. Chapman & Hall / CRC, 2008.
  8. R.G. Pensa and J.-F. Boulicaut, “Constrained Co-Clustering of Gene Expression Data,” Proc. SIAM Int’l Conf. Data Mining (SDM),pp. 25-36, 2008.
  9. F. Wang, T. Li, and C. Zhang, “Semi-Supervised Clustering via Matrix Factorization,” Proc. SIAM Int’l Conf. Data. Mining (SDM),pp. 1-12, 2008.
  10. Y. Chen, L. Wang, and M. Dong, “Non-Negative Matrix Factorization for Semi-Supervised Heterogeneous Data Co-Clustering,” IEEE Trans. Knowledge and Data Eng., vol. 22, no. 10,pp. 1459-1474, Oct. 2010.
  11. A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D.S. Modha, “A Generalized Maximum Entropy Approach to Bregman Co-Clustering and Matrix Approximation,” J. Machine Learning Research, vol. 8, pp. 1919-1986, 2007.