From patterns to pathways: gene expression data analysis comes of age
Many different biological questions are routinely studied using transcriptional profiling on microarrays. A wide range of approaches is available for gleaning insights from the data obtained from such experiments. The appropriate choice of data-analysis technique depends both on the data and on the goals of the experiment. This review summarizes some of the common themes in microarray data analysis, including detection of differential expression, clustering, and predicting sample characteristics. Several approaches to each problem, and their relative merits, are discussed, and key areas for additional research are highlighted.
Acknowledgements
I thank Gene Brown, Lenore Cowen, Steve Haney, Andrew Hill, Steve Rozen and Timm Triplett for helpful discussions and comments.