¡¡Chinese Journal of Computers   Full Text
  TitleHeuristic Breadth-First Search Algorithm for Informative Gene Selection Based on Gene Expression Profiles
  AuthorsWANG Shu-Lin1),2) WANG Ji1) CHEN Huo-Wang1) LI Shu-Tao3) ZHANG Bo-Yun1)
  Address1)(School of Computer Science, National University of Defense Technology, Changsha 410073)
2)(School of Computer and Communication, Hunan University, Changsha 410082)
3)(College of Electrical and Information Engineering, Hunan University, Changsha 410082)
  Year2008
  IssueNo.4(636¡ª649)
  Abstract &
  Background
Abstract The tumor diagnosis method based on gene expression profiles will be developed into the fast and effective method in clinical domain in the near future. Although DNA microarray experiments provide us with huge amount of gene expression data, only a few of genes are related to tumor in gene expression profiles. Moreover, it is difficult to select informative genes related to tumor from gene expression profiles because of its characteristics such as high dimensionality, small sample set and many noises in gene expression profiles. According to its characteristic, a novel heuristic breadth-first search algorithm based on support vector machines is proposed, which can simultaneously find as many informative gene subsets as possible in which the number of informative genes is almost least but its classification performance is almost highest in spite of its time-consuming characteristic. Three tumor sample sets are examined by the novel approach and experiments show that the novel approach is feasible and effective in tumor classification. Experiment results show that 100% of 4-fold cross-validation accuracy has been achieved by only two, four and four genes for leukemia, colon tumor and SRBCT (Small Round Blue Cells Tumor) datasets, respectively, which is superior to the results of other tumor classification methods. To avoid the affect of different partition of sample set, the full-fold cross-validated method that can more objectively evaluate the classification performance of informative gene subset is proposed.

keywords gene expression profiles; tumor classification; informative gene selection; support vector machines; full-fold cross-validated method

background This research is supported by the Program for New Century Excellent Talents in University and the Excellent Youth Foundation of Hunan Province (Grant No. 06JJ1010) which aims to develop novel methods of feature extraction and gene selection to further improve the tumor subtype classification performance and finally attempts to find the optimum tumor classification model based on gene expression profiles. To achieve this purpose, the research group has published 19 relevant papers.
Although a large number of tumor classification methods have been proposed and proved effectively by experiments on tumor dataset, how many genes is minimum number for tumor classification and how many gene subsets are like this are still unknown due to the complexity of tumor development. The authors have designed a heuristic breadth-first search algorithm for finding the minimum cardinal number of informative gene subset with prior knowledge obtained by factor analysis and independent component analysis. Experiment results show that 3~4 informative genes are enough for tumor subtype classification with highly predictive accuracy. Therefore, this work has revealed the structure characteristics of tumor gene expression profiles, which is of great benefit to the exploration of tumor development mechanism and the discovery of regulatory tumor gene network.