From patterns to pathways: gene expression data analysis comes of age

Many different biological questions are routinely studied using transcriptional profiling on microarrays. A wide range of approaches is available for gleaning insights from the data obtained from such experiments. The appropriate choice of data-analysis technique depends both on the data and on the goals of the experiment. This review summarizes some of the common themes in microarray data analysis, including detection of differential expression, clustering, and prediction of sample characteristics. Several approaches to each problem, and their relative merits, are discussed, and key areas for additional research are highlighted.

Acknowledgements

I thank Gene Brown, Lenore Cowen, Steve Haney, Andrew Hill, Steve Rozen and Timm Triplett for helpful discussions and comments.