
Analysis of Time-Series Gene Expression Data: Methods, Challenges, and Opportunities

I.P. Androulakis,1 E. Yang,1 and R.R. Almon2

1Biomedical Engineering Department, Rutgers University, Piscataway, New Jersey 08854; email: [email protected], [email protected]
2Department of Biological Sciences, and Department of Pharmaceutical Sciences, State University of New York at Buffalo, Buffalo, New York 14260; email: [email protected]

Annu. Rev. Biomed. Eng. 2007. 9:3.1–3.24

This article's doi: 10.1146/annurev.bioeng.9.060906.151904

Key Words
microarrays, bioinformatics, regulation, clustering

Abstract
Monitoring the change in expression patterns over time provides the distinct possibility of unraveling the mechanistic drivers characterizing cellular responses. Gene arrays measuring the level of mRNA expression of thousands of genes simultaneously provide a method of high-throughput data collection necessary for obtaining the scope of data required for understanding the complexities of living organisms. Unraveling the coherent complex structures of transcriptional dynamics is the goal of a large family of computational methods aiming at upgrading the information content of time-course gene expression data. In this review, we summarize the qualitative characteristics of these approaches, discuss the main challenges that this type of complex data presents, and, finally, explore the opportunities in the context of developing mechanistic models of cellular response.
Contents
TEMPORAL GENE EXPRESSION ANALYSIS
METHODS
    Point-Wise Distance-Based Clustering Methods
    Model-Based Clustering Methods
    Feature-Based Clustering Methods
    Clustering Across Conditions
    Summary
CHALLENGES
    Small Sample Size: Information or Noise?
    Knowledge-Based Clustering
    Judging the Quality of Gene-Expression Clustering
OPPORTUNITIES

TEMPORAL GENE EXPRESSION ANALYSIS
At any given time a cell will express only a small fraction of the thousands of genes in the organism's genome. Expressed genes reflect the structure and functional capacities of the cell as well as the ability of the cell to respond to external stimuli. In a complex organism, external stimuli to a great extent take the form of chemical messages whose purpose is to coordinate the function of the complex society of cells (1). Gene arrays, which measure the level of mRNA expression of thousands of genes simultaneously, provide a method of high-throughput data collection necessary for obtaining the scope of data required for understanding the complexities of living organisms. Monitoring the change in expression patterns over time using gene arrays provides an approach for capturing the multidimensional dynamics of complex biological systems. By using gene arrays in a time series paradigm, we are able to observe the emergence of coherent temporal responses of many interacting components. The data should provide the basis for understanding evolving but complex biological processes, such as disease progression, growth, development, and drug responses.
Global gene expression analysis has been celebrated as a major revolution in modern biology (2). The ability to monitor simultaneously the expression of the genes composing the entire genome has generated unimaginable possibilities (3–5). Despite some criticism regarding the cross-platform reproducibility of expression experiments (6, 7), more recent evidence (8, 9) supports the informative nature of the experiment and the importance of the approach (10). Microarray analysis has found widespread applications, from characterizing terminal states, e.g., benign versus malignant tumors (11), to attempts to decipher the evolution of complex diseases and cell fates (12–16). Hence, the nature of the data broadly defines the nature of the problems to be addressed.
Boundary problems use the expression measurements as feature vectors that characterize static points in multidimensional spaces. Therefore, multiple samples, for example, from the same tissue of different patients (diseased/nondiseased), would define a database of multidimensional feature objects with as many dimensions as genes whose mRNA has been quantified and as many objects as the number of patients monitored. Critical questions then arise, such as how to identify coherent patterns, i.e., combinations of up- or downregulated genes that distinctly characterize the two or more classes of patients (17, 18).
Monitoring the change in expression patterns over time provides a profoundly different type of information. Instead of concentrating on terminal points in time of binary nature (benign versus malignant, type A versus type B, etc.), we now have the opportunity to observe the emergence of coherent temporal responses of many interacting components. The orchestrated response of an organism to an external stimulus and the monitoring of the temporal progression offer numerous opportunities for reverse-engineering the mechanisms that regulate the host responses (19). The latter, in turn, will define the rational foundation for the generation of testable hypotheses. Thus, the challenge becomes how to upgrade the information content of such multidimensional trajectories to address critical questions such as the characterization of the state of evolution of a system; the identification of activated pathways, their relation, and the rate-limiting steps; and the synthesis of interaction networks and the characterization of points of control.
A number of questions could potentially be addressed that fall, broadly, under the following categories:

1. Biological systems analysis: Specific systems are monitored over time and information is assembled to understand the driving dynamics. Prototypical examples include cell cycles and circadian clocks (21, 22).
2. Response dynamics: Systems are subjected to controlled perturbations and the broad gene expression response of the system is monitored over time. Examples include drug dosing and defined trauma (39, 50).
3. Development: Morphing of organisms during development involves complex sequences of cell proliferation and differentiation. Many models have been used over the years (23) to address the process of development. Particularly exciting are the opportunities offered by recent advances in stem cell differentiation (24).
4. Disease progression: Genome-wide temporal profiling offers the possibility of elucidating the underlying pathophysiologies of human diseases (25). Instead of focusing on predefined hypotheses, global expression offers the possibility of unraveling the systemic evolution of pathological conditions.
The purpose of this review is not to provide a detailed account of the enormous complexities and uncertainties surrounding the collection of the required data (14, 26). The following sections will highlight conceptually the basic foundations of the computational approaches that have been recently proposed for the analysis of temporal gene expression data, the opportunities that exist, and, more importantly, the challenges that need to be addressed.
The goal of temporal gene expression analysis is to identify broad sequences of molecular events in time. Such sequences can be associated with an ongoing biological process, such as the cell cycle, circadian rhythms, or development, or can be initiated by some input perturbation, such as the administration of a drug or a defined trauma. The host state is defined as the ensemble of all possible metabolites, proteins, small molecules, etc., which define the observed phenotype of the organism. For some purposes, such as the consideration of circadian rhythms, the host state may be dynamic. Any perturbation sets in motion the information transfer that defines the blueprint for the production of the relevant components of the response by activating appropriate genes whose transcription to mRNA and subsequent translation to proteins catalyzes critical functions. Therefore, the implicit underlying assumption of transcription profiling is that gene expression is causal to phenotypic responses through production of specific proteins coded by the expressed mRNAs.
One of the major limitations of monitoring exclusively mRNA transcripts is the role of posttranslational modifications, mRNA stability, and other destabilizing and complicating factors that render the products of transcription (mRNA) an inaccurate proxy for the abundance of active products of translation (27, 28). Nevertheless, analysis of the products of transcription has already provided significant insight and is undoubtedly a critical source of information. Associated with expression profiling is the implicit assumption that gene expression is tightly controlled by a fine-tuned, intricate, and robust regulatory mechanism that appropriately activates and deactivates the machinery guiding the expression of genes. By now it is almost taken for granted that genes exhibiting similar responses to signals ought to be controlled by similar regulatory mechanisms. This is often referred to as the guilt-by-association principle (29). Therefore, identifying coherent expression responses is important in the sense that if coexpression can be linked to coregulation, then the underlying machinery driving expression can be isolated to smaller groups, deciphered, and quantified. One of the most critical problems is to verify the common regulatory mechanism (30). Hence, the first and most critical step in this endeavor is to identify those measured transcripts that appear to be somehow correlated to each other. From a computational point of view, this problem belongs to a more general class, namely, the characterization (indexing and clustering) of multidimensional trajectories (31).
Numerous methods have been developed and applied to this very challenging problem in the context of analyzing gene expression data (20, 32, 33). At the core of all the methods is the concept of similarity, and we will segregate the approaches based on the relative use of this term. Given that we treat expression measurements as multidimensional trajectories, sampled at discrete points, the first definition of similarity ought to be based on some kind of point-wise metric measuring the distance among the various objects, using the relative mRNA abundance as the multidimensional feature set. Various metrics have been utilized, most notably Euclidean distances. Methods such as k-means and hierarchical clustering by and large fall under this family of approaches. Pair-wise comparisons are made and then combined to assess the relative degree of similarity. A second family of methods assumes the existence of a finite set of undetermined processes that generate the observations. Two trajectories are thus declared similar if they are the product of similar processes. Therefore, the comparison is made not in the finite-dimensional space of raw data, but in the infinite-dimensional space of the functions that generate these data. Finally, a third general family of methods defines the similarity in terms of global features and characteristics of the trajectories. Instead of focusing on point-wise differences or specific functional forms giving rise to data, these methods aim at identifying structural characteristics of the responses and define similarity based on pattern recognition methods that aim at finding salient changes and similarities in the responses.
A comprehensive review on clustering methods was recently presented in Reference 34. In the following section we attempt to partition qualitatively the methods by describing the essential characteristics of each approach.
Point-Wise Distance-Based Clustering Methods
A concise review of distance-based clustering methods was recently presented (35). Without loss of generality, we assume that the data are given in matrix form:

$$E = \{E(i,t)\}, \quad i = 1, \ldots, N_g, \quad t = 1, \ldots, N_t,$$

assuming the $N_g$ genes are measured at $N_t$ discrete time points. The goal of point-wise distance-based clustering methods (PwDbM) is to quantify the distance between any two samples and agglomerate samples that fall within a predefined threshold. Usual metrics include various definitions of norm-based distances and combinations of known correlation expressions. Indicative definitions are provided in Table 1.
Table 1 Similarity metrics for PwDbM

Maximum-norm distance: $d_{ij} = \max_{t} |E(i,t) - E(j,t)|$
Correlation-type term: $\sum_{t=1}^{N_t} (E(i,t) - \bar{E}(i))(E(j,t) - \bar{E}(j))$

The fundamental difference between the various distance-based methods is in the way these distances are combined to identify the proper partitioning of the data. Two major classes of methods exist: (a) partitioning and (b) hierarchical. Among partitioning methods, the prototypical example is k-means clustering, although self-organizing maps (SOM) also follow the same basic principles (15). Using these methods, a predetermined number of partitions of the feature space, each defined by a nominal center, is constructed. Points are subsequently assigned to each partition based on their relative proximity to the center, with the overall objective of minimizing the distance of each point from its respective center:

$$\text{Error} = \sum_{k=1}^{N_k} \sum_{i \in C_k} \sum_{t=1}^{N_t} |E(i,t) - M(k,t)|^2,$$

where $C_k$ denotes the set of profiles assigned to cluster $k$. In this general definition of the objective (Error), the objects (temporal gene expression of $N_g$ genes) have been partitioned into $N_k$ clusters. The purpose of the optimization search is to assign each profile to one of these clusters, such that the sum of the distances of each profile from the center of the cluster it has been assigned to, $M(k,t)$, is minimized.
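The k-means objective described above can be illustrated with a minimal sketch. It is not the implementation used in any of the cited studies; the function names, the toy profiles, and the random initialization from existing profiles are assumptions made for illustration only.

```python
import random

def squared_distance(p, q):
    """Sum over time points of |E(i,t) - M(k,t)|^2 for one profile and one center."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(profiles, n_clusters, n_iter=100, seed=0):
    """Plain k-means on expression profiles (hypothetical helper, not a library API)."""
    rng = random.Random(seed)
    # Assumption: initialize centers from randomly chosen profiles.
    centers = [list(p) for p in rng.sample(profiles, n_clusters)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each profile goes to its nearest center.
        new_labels = [
            min(range(n_clusters), key=lambda k: squared_distance(p, centers[k]))
            for p in profiles
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center becomes the mean profile of its members.
        for k in range(n_clusters):
            members = [p for p, lab in zip(profiles, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    error = sum(squared_distance(p, centers[lab]) for p, lab in zip(profiles, labels))
    return labels, centers, error

# Toy data: two rising and two falling profiles over four time points.
profiles = [
    [0.0, 1.0, 2.0, 3.0],
    [0.0, 1.1, 2.1, 2.9],
    [3.0, 2.0, 1.0, 0.0],
    [3.1, 2.0, 0.9, 0.0],
]
labels, centers, error = kmeans(profiles, n_clusters=2)
```

On this toy input the two rising profiles end up in one cluster and the two falling profiles in the other; with real data the result generally depends on the initialization and on the chosen number of clusters.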
Recently, interesting combinations of k-means and kernel methods have emerged (36, 37). Essentially, kernel methods aim at identifying appropriate nonlinear transformations of the original data through the use of kernel functions that render the data linearly separable (38). The distances are then defined on the transformed data rather than the original. Even though kernel-based methods have a number of advantages, such as creating separability through transformations or relative robustness to noise, they do require the identification of a number of input parameters that would render the estimation problem user-specific, and the appropriate estimation of the necessary parameters is not trivial.
Hierarchical methods create a hierarchy of relative distances (hence the name) and place multidimensional points along a one-dimensional axis based on the relative distance between points. The result of the analysis is presented in the form of a dendrogram in which the relative positions of points define their relative distance as well. The dendrogram is essentially a binary tree with the root representing the entire data set and each leaf node representing a data object. Intermediate nodes represent the extent to which objects are close to each other. Among the strongest criticisms raised for classical hierarchical clustering is the fact that these algorithms lack robustness to noise and are therefore sensitive to outliers.
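The binary-tree structure of a dendrogram can be made concrete with a small agglomerative sketch. Average linkage and Euclidean distance are assumptions chosen here for illustration; the function names and toy profiles are hypothetical.

```python
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def agglomerate(profiles):
    """Average-linkage agglomerative clustering.

    Returns the merge history as (cluster_a, cluster_b, height) tuples.
    Leaves are numbered 0..n-1; each merge creates the next unused id.
    """
    clusters = {i: [i] for i in range(len(profiles))}
    merges = []
    next_id = len(profiles)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                # Average pairwise distance between members of the two clusters.
                total = sum(
                    euclidean(profiles[i], profiles[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                height = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, height))
        next_id += 1
    return merges

profiles = [
    [0.0, 1.0, 2.0, 3.0],
    [0.0, 1.1, 2.1, 2.9],
    [3.0, 2.0, 1.0, 0.0],
    [3.1, 2.0, 0.9, 0.0],
]
merges = agglomerate(profiles)
```

For n profiles the history always contains n - 1 merges, which is the binary tree described in the text: the last merge is the root, and the merge heights give the relative distances read off a dendrogram.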
The literature on applications of distance-based clustering in the analysis of microarray data is abundant. Typical examples include the work of Eisen and colleagues and Gasch and colleagues (39, 40) for hierarchical clustering and the work of Tavazoie and colleagues (41) using k-means clustering. Eisen et al. (39) clustered expression data in the budding yeast Saccharomyces cerevisiae and found that clustering grouped together genes of known similar function, interpreting this observation as an indication of the status of cellular processes. They applied hierarchical clustering in which the linkage was determined using a similarity score based on a correlation coefficient. Gasch et al. (40) evaluated the response of yeast to numerous environmental perturbations to unravel the effects of environmental stresses on the cell. Tavazoie et al. (41) measured 15 time points across two cell cycles of Saccharomyces cerevisiae and analyzed the results using variance-normalized profiles and k-means clustering. Clusters were subsequently characterized based on their relative functional enrichment to demonstrate the concentration of similar functions within each cluster.
The use of correlation-based distance metrics, such as Pearson's coefficient, represents a variation upon Euclidean distance by providing a scale-free distance metric between two feature vectors. This metric has been widely used as well and is particularly useful when the baseline magnitudes of different mRNA messages differ greatly.
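A short sketch makes the scale-free property concrete. Using 1 - r as the correlation-based distance is an assumption (one common convention), and the two profiles are made-up values with the same shape at a ten-fold different magnitude.

```python
import math

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation; 0 means identical shape regardless of scale."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)

low_abundance = [1.0, 2.0, 4.0, 3.0]
high_abundance = [10.0 * v for v in low_abundance]  # same shape, ten-fold higher level

d_euclid = euclidean_distance(low_abundance, high_abundance)
d_pearson = pearson_distance(low_abundance, high_abundance)
```

Here the Euclidean distance is large because the two messages differ in baseline magnitude, while the correlation-based distance is essentially zero because their temporal shapes coincide.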
Model-Based Clustering Methods
Model-based clustering methods (MbCMs) (42–46) shift the similarity emphasis from the data to an unknown model that describes the data. These are methods based on variants of mixture models (44). The general idea is that finite mixtures of distributions provide a flexible approach to modeling. Therefore, each point (i.e., expression profile) is taken to be the outcome of the superposition of a finite number of processes, much like expansion over a basis set, with a number of unknown parameters to be determined based on the available experimental data. Therefore, the objective is to identify this underlying set of functions (models) whose appropriate combination (mixture) assigns the data properly. The existence of such a finite and coherent set of basis functions indicates the existence of an underlying set of limited common processes that give rise to the observed behavior. Without loss of generality we will use the formalism of Pan et al. (42) to illustrate the approach. Let $y$ denote any measurement; then each such data point is assumed to be the superposition of distributions given by

$$f(y) = \sum_{k=1}^{K} \pi_k \, \phi(y; \mu_k, V_k),$$

where $K$ is the number of mixture components. The density function, $\phi$, depends on appropriate parameters, $\mu$ and $V$, which, for example, could correspond to the mean and covariance matrix. These mixtures are appropriately weighted by means of the mixing proportion $\pi$. Thus the model to be estimated based on the data is composed of the triplet $(\pi, \mu, V)$, with parameters determined through the use of appropriate expectation-maximization algorithms.
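To make the estimation of the triplet (π, μ, V) concrete, the following is a minimal expectation-maximization sketch for a univariate Gaussian mixture. It is a simplification assumed for illustration: the measurements are scalars, so V reduces to a variance, and the function names, initial guesses, and toy data are hypothetical rather than taken from the cited work.

```python
import math

def normal_pdf(y, mu, var):
    return math.exp(-((y - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def fit_mixture(data, mu, var, pi, n_iter=50):
    """EM for a 1-D Gaussian mixture; mu, var, pi are per-component initial guesses."""
    mu, var, pi = list(mu), list(var), list(pi)
    n_components = len(mu)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each measurement.
        resp = []
        for y in data:
            weighted = [pi[k] * normal_pdf(y, mu[k], var[k]) for k in range(n_components)]
            total = sum(weighted)
            resp.append([w / total for w in weighted])
        # M-step: re-estimate mixing proportions, means, and variances.
        for k in range(n_components):
            n_k = sum(r[k] for r in resp)
            pi[k] = n_k / len(data)
            mu[k] = sum(r[k] * y for r, y in zip(resp, data)) / n_k
            spread = sum(r[k] * (y - mu[k]) ** 2 for r, y in zip(resp, data)) / n_k
            var[k] = max(spread, 1e-6)  # floor keeps the density well defined
    return pi, mu, var

# Toy measurements drawn around two levels (values are made up).
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
pi, mu, var = fit_mixture(data, mu=[0.0, 6.0], var=[1.0, 1.0], pi=[0.5, 0.5])
```

Each measurement is then associated with the component that has the highest responsibility for it, which is how a fitted mixture induces a clustering.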
It is important to realize the fundamental difference between the distance-based and the model-based approaches. The emphasis is now placed on the speculated underlying model, thus making the approach more robust in the presence of noisy data. However, the assumption is that such underlying processes do exist, requiring that the data follow a set of predetermined distributions. A slight variation was recently proposed in Reference 45, whereby an autoregressive model able to account for time delays was assumed to exist and was subsequently estimated based on the data. Similar in spirit are methods based on hidden Markov models (HMMs) (47–49), which assume an underlying HMM describing the sequence of events corresponding to the transformed temporal gene expression profiles. An interesting method was proposed (50) in which a linear dynamic model is invoked to simulate the level of mRNA that gives rise to time-dependent profiles, which are considered to be sums of exponentials. The associated parameters of the model are estimated through nonlinear regression. The number of exponentials is also minimized by making use of information-theoretic arguments quantifying Occam's razor, such as minimum description length (51) and the Akaike information criterion (52). Model-based approaches have been proposed (53, 54) that consider the cell to be a system where the behaviors (responses) of the cell depend completely on the current internal state plus any external inputs, and the proposed method regards a time-course gene expression dataset as a set of time series generated by a number of stochastic processes. Each stochastic process defines a cluster and is described by an autoregressive model. Along those lines, significance analysis of time-course microarray experiments was also recently proposed as a competitive alternative (55). The method is applicable to detecting changes in expression over time within a single biological group and to detecting differences in the behavior of expression over time between two or more groups.
In summary, the fundamental assumption of model-based approaches is that the expression profiles are clusters in the space of the functionals that characterize them. The question thus becomes how to identify this functional decomposition of the data, as opposed to decomposing the raw data. One of the key drivers for such methods is the speculation that gene expression profiles are generated by time-dependent models, in the sense that the current state is a function of the cellular state at previous times (45). Therefore, these methods attempt to quantify this assumption.
Feature-Based Clustering Methods
Feature-based clustering methods (FbCMs) aim at detecting salient features and local or global shapes characteristic of the expression profiles. One of the key motivating arguments for such methods is the realization that in the presence of noise and uncertainties associated with measuring mRNA abundance, looking for specific quantifiable metrics may not necessarily yield the most informative interpretation. Instead, robust, coherent, and dominating qualitative features and similarities could be a more informative proxy for the information content of the expression experiment. The raw data are transformed to sequences of events or symbols, and these are further analyzed for consistencies, either local or global (56). Looking for general shapes as opposed to quantifying distances allows for, among other things, a more flexible representation, which uncovers more intricate relations among expression profiles, such as time shifts and inversion in expression profiles (57). Syeda-Mahmood (58) has proposed a pattern recognition approach aimed at capturing salient features of the time-varying gene expression patterns, such as inflection points, based on the idea that dissimilar curves, when represented as two-dimensional curves, show a significant number of twists and turns. New frameworks were also recently proposed (59, 60). Both approaches share a critical similarity: the transformation of the raw expression data to a sequence of symbols and the subsequent analysis of the symbolic representation of the time series. This type of approach, motivated by recent advances in the symbolic representation of streaming data (61), effectively reduces the dimensionality of the time series from an infinite-dimensional space (continuous representation of expression level) to a finite, quantized representation where each profile is represented by a sequence of symbols. In effect, the most significant variation introduced by these methods is a fine-grained clustering, with a potentially enormous number of clusters defined.
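The idea of turning a profile into a sequence of symbols and grouping identical sequences can be sketched as follows. The three-letter alphabet (U for up, D for down, N for no change), the threshold, and the gene names are assumptions made for illustration; they are not the encodings used in References 59 or 60.

```python
from collections import defaultdict

def symbolize(profile, threshold=0.5):
    """Encode each step between consecutive time points as U, D, or N."""
    symbols = []
    for previous, current in zip(profile, profile[1:]):
        change = current - previous
        if change > threshold:
            symbols.append("U")
        elif change < -threshold:
            symbols.append("D")
        else:
            symbols.append("N")
    return "".join(symbols)

def group_by_symbols(profiles, threshold=0.5):
    """Profiles sharing the same symbol sequence fall into the same cluster."""
    groups = defaultdict(list)
    for name, profile in profiles.items():
        groups[symbolize(profile, threshold)].append(name)
    return dict(groups)

# Hypothetical genes and expression values.
profiles = {
    "geneA": [0.0, 1.0, 2.0, 2.0],
    "geneB": [0.0, 1.2, 2.5, 2.4],
    "geneC": [2.0, 1.0, 0.0, 0.1],
}
groups = group_by_symbols(profiles)
```

With three symbols and Nt time points there are 3^(Nt - 1) possible sequences in this encoding, which illustrates why a symbolic representation can define a very large number of fine-grained clusters.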
The two methods subsequently differ significantly. One is based on the relative probabilities of each symbolic sequence (59) and the other is based on the ability of selected subsets to reproduce the overall dynamic response (60), with selection criteria ranking the importance of the respective clusters. Because the method proposed by Ernst & Bar-Joseph (59) needs to postulate a priori the putative sequence of events, the method is best suited for short time series, whereas the method proposed by Yang et al. (60) has complexity that is effectively linear with respect to the number of genes.
Interesting algorithms for clustering (expression) data are emerging that explore graph-theoretic properties. We discuss them in Feature-Based Clustering Methods because essentially the structure of the graph created from the original data is analyzed. In other words, in graph-based methods, the nature, structure, properties, and characteristics of a graph whose nodes represent data points and whose arcs represent relative distances between those points are treated as the features to be further analyzed. Thus, subgraphs are formed and identified containing enough nodes for effective similarity computations. Effectively, a tree representation converts the multidimensional problem to a tree partitioning problem.
Among the most popular methods are those that explore the concept of the minimum spanning tree (MST) (62) and effectively attempt to identify cliques within the data set (63, 64). An interesting extension of the MST concept is discussed in Reference 65, where a metric for assessing the clustering potential based on geometric arguments is presented. Assessing the “clusterability” potential of a dataset a priori will greatly enable further analyses. In the case of temporal data, this remains an open question.
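The conversion of a multidimensional clustering problem into a tree partitioning problem can be sketched with a minimum spanning tree. The variant below, which builds the tree with Prim's algorithm and removes the longest edges, is one common formulation assumed here for illustration; it is not necessarily the procedure of the cited references, and all names and data are hypothetical.

```python
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def minimum_spanning_tree(points):
    """Prim's algorithm; returns edges as (length, i, j) tuples."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Shortest edge connecting the tree to a point not yet in it.
        best = min(
            (euclidean(points[i], points[j]), i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append(best)
        in_tree.add(best[2])
    return edges

def mst_clusters(points, n_clusters):
    """Cut the (n_clusters - 1) longest MST edges; return a cluster id per point."""
    edges = sorted(minimum_spanning_tree(points))
    kept = edges[: len(points) - n_clusters]
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for _, i, j in kept:
        parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]

profiles = [
    [0.0, 1.0, 2.0, 3.0],
    [0.0, 1.1, 2.1, 2.9],
    [3.0, 2.0, 1.0, 0.0],
    [3.1, 2.0, 0.9, 0.0],
]
labels = mst_clusters(profiles, n_clusters=2)
```

Removing the single longest edge of the tree separates the rising from the falling profiles in this toy example; each remaining connected subtree is one cluster.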
FbCMs offer a higher degree of flexibility. Appropriate selection of characteristic features offers the possibility of defining a time-course representation using variables that potentially capture intrinsic and implicit characteristics of the responses. Undoubtedly, the definition of such features results in some kind of lumping of the response, which can potentially result in loss of fine-grain detail.
Clustering Across Conditions
Each gene expression experiment is essentially a set of observations generated from a single perturbation of the system, whether it is a particular growth condition, an injury, or the administration of a drug. It can be argued that a single perturbation contains little information. Therefore, methods that attempt to analyze multiple conditions simultaneously are attracting increased attention (66). Biclustering in the context of gene expression analysis was first suggested by Church (67) and refers to simultaneous clustering across “columns” and “rows” in expression data by expanding the concept of similarity so that it is not a function of pairs of genes or pairs of conditions, as is normally the case, but rather a measure of coherence of the genes and conditions. Heard and coworkers (68) expanded their original Bayesian model-based agglomerative clustering scheme (69) for time-course data. Their approach uses a spline approximation to capture the temporal variation within each cluster. The approach is particularly intriguing in that the time courses are explicitly treated. Undoubtedly, biclustering (or coclustering) methods hold tremendous promise as more systemic perturbations become available and the need grows for consistent representations across multiple conditions. The underlying assumption, as argued below, is that biological systems are treated as systems to which external perturbations are applied. Therefore, the underlying dynamics should be consistent across conditions, independent of the type of perturbation, which provides a basis for assessing the biologically informative nature of conclusions drawn from any kind of computational analysis of transcriptional responses.
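The notion of a coherence measure over a set of genes and a set of conditions, as used in biclustering, can be illustrated with the mean squared residue, a score commonly used for this purpose. Treating it as the score intended by Reference 67 is an assumption, and the expression matrix below is made up.

```python
def mean_squared_residue(matrix, rows, cols):
    """Coherence of the submatrix given by the selected rows (genes) and columns (conditions).

    A value of 0 means every selected gene follows the same pattern across the
    selected conditions up to an additive offset.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(r[j] for r in sub) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)

# Rows are genes, columns are conditions (hypothetical values).
expression = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 3.0, 4.0, 0.0],
    [5.0, 6.0, 7.0, 4.0],
    [0.0, 9.0, 1.0, 7.0],
]
coherent = mean_squared_residue(expression, rows=[0, 1, 2], cols=[0, 1, 2])
whole = mean_squared_residue(expression, rows=[0, 1, 2, 3], cols=[0, 1, 2, 3])
```

The first three genes are perfectly coherent over the first three conditions, so that submatrix scores essentially zero, whereas the full matrix does not. The score depends jointly on the chosen genes and conditions rather than on any single pair of genes or pair of conditions.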
Summary

Clustering can be characterized as the process of establishing associations. At a conceptual level, the nature of the associations becomes more abstract as the methods evolve from hierarchical, to partition, to model, to feature based. Figure 1 depicts this increased level of abstraction. Hierarchical clustering would basically associate the two convex and the concave hypothetical patterns of expression by quantifying the relative differences among all members. K-means or SOM will draw an association between the raw profiles and the putative centers of the domains within which each profile lies (centers indicated by the dashed lines). Model-based methods will establish the association between individual profiles and functional representation according to the values of the model parameters. Objects therefore are associated with sets of parameters. Finally, feature-based methods will associate each profile with macroscopic features characteristic of the overall shape of the response. The relative distance between transformed individual members defines the proximity. One could argue that even feature-based methods are essentially distance-based methods. However, the transformation from raw data to features relaxes the proximity restrictions and allows for the introduction of soft comparisons.

Figure 1 Notional comparison of clustering methods. Given the four hypothetical trajectories, a distance-based method will compare the similarities pointwise, potentially creating an appropriate dendrogram quantifying such distances. A model-based approach will attempt to quantify a functional description of the data in the form of a generalized model “f,” whereas a feature-based method will attempt to identify critical features, such as a sequence of events or trends, shared by various elements.
CHALLENGES
Clustering is by definition an unsupervised task. We can loosely define clustering as the process of organizing objects (expression profiles) into groups whose members are similar in some way. In evaluating the effectiveness of clustering, one could argue that if the groups are similar given some metric, then the clustering was successful. However, the fascinating thing in biology is that similarity in the input space is not the final arbiter. If the genotype is the input, the actual observable is the phenotype. The information encoded by the objects of the clusters, the mechanisms that brought the objects together, the implications of bringing the objects together, in a nutshell, the biological insight gained by analyzing the objects that were brought together, is what will decide the effectiveness of the computational analysis. Therefore, defining the quality of the clustering algorithms is not as straightforward as it may appear. A major challenge in the clustering of microarray data lies in the fact that the metric for evaluating the overall quality of a result is still an open area of research (70). Without a well-defined metric, it becomes difficult to ascertain which method outperforms the others.
Various evaluations have been proposed to quantify the relative advantages of clustering methods for microarray expression data (71, 72), and the metrics for comparison quantified the ability of various methods to generate well-separated clusters. By and large any such comparison is biased, and the results to a great extent depend on the specific use of the method as well as the nature and type of data. We believe that a head-on comparison between clustering methods based exclusively on some optimality criterion will probably be misleading. The complexities of the underlying biological system will probably render such analysis moot. Methods should be evaluated, and not compared, based on their ability to generate insightful information, and it is quite possible that the evaluation could be problem dependent. It is critical to realize that the computational steps, in the context of transcriptional analysis, should be an integrated component of the overall effort and not a separate, independent activity. Therefore, the effectiveness of a computational approach should be evaluated in the grand scheme of the biological content of a specific analysis.
In the sections that follow, we identify three elements of the computational analysis of time-course gene expression data that we believe could potentially impact the conclusions drawn. Thorough and detailed analyses of the challenges associated with high-dimension clustering in general have been nicely presented elsewhere (73).
Small Sample Size: Information or Noise?
The term “data deluge” is often used in conjunction with microarray data (74). However, this could not be a more misleading characterization. There is no doubt that the observables in a microarray experiment are in the thousands, particularly in temporal experiments. A typical animal study with m replicates (animals) at n time points recording k genes would produce m × n × k data points. However, the number of objects, in terms of the machine-learning problem, is quite minimal, and definitely not up to par with the number of features. Examples of the types of objects we refer to include the number of patients in a cancer study or the number of system perturbations (types/severity of trauma, or drug dosing).
Technological and other practical limitations severely restrict either the number of time points that can be measured or, more importantly, the number of biological and technical replicates that can be used. In the machine-learning community, this is an age-old problem known as learning in almost empty spaces (75). In such cases, it is quite difficult to distinguish noise from structure unless something is known about the underlying concept generating the data. A simple yet informative example of errors introduced by subsampling is presented in Reference 76. New technologies that are emerging, such as the living cell array (77), which will provide extensive data at least for model systems, will expose the host system to a wide range of insults and will create a more integrated list of cause-effect relationships. Currently, the only way to condition the data to overcome the lack of a critical mass of observations is to couple the expression data with available prior biological information and analyze multiple perturbations simultaneously. The inability of sparse data to properly capture the complexity of a classification problem is also discussed in Reference 76; however, recent advances in theoretical work on clustering sparse data (78–80) should help significantly.
As noted above, a key complexity of microarray experiments is the essential lack of observables (cell lines or tissue samples) to support the large number of probes monitored. The consequences of the small ratio of samples to features in microarrays were discussed in Reference 81, and the impact of the small sample size problem in array expression data is discussed in References 82 and 83, which comment on the number of samples required for robust estimation under certain assumptions regarding the distribution of the measurements.
The implication of the ratio of features to samples is critical, as sparsely populated datasets can very easily lead to random features appearing to be informative. It should be expected that simple minimization of the number of features (genes) in a model need not necessarily provide the best answer. Additional complexity restrictions will have to be imposed to balance the lack of available data, although no definite answer can be provided, as no analysis can replace accurate and adequate data. Recently, a novel method for characterizing the information content of short time-course gene expression data was presented; it effectively quantifies the random nature of the signal encoded in the expression time series (84).
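The claim that random features can appear informative when samples are scarce can be illustrated numerically. The following is a minimal sketch, not a method from the cited references: it assumes purely random (Gaussian) expression values and a random outcome, and the function name, sample sizes, and seed are hypothetical choices made for illustration.

```python
import numpy as np

def max_spurious_correlation(n_samples=10, n_features=5000, seed=0):
    """Largest |Pearson r| between a random outcome and pure-noise features.

    Every feature is independent noise, so any strong correlation found
    here is spurious by construction.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_features))  # samples x features
    y = rng.standard_normal(n_samples)                # random outcome

    # Center, then compute the Pearson correlation of y with every column.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return float(np.max(np.abs(r)))

# Few samples: some noise feature looks strongly "informative".
print(max_spurious_correlation(n_samples=10))
# Many samples: the best noise feature is only weakly correlated.
print(max_spurious_correlation(n_samples=200))
```

With only 10 samples, at least one of the 5000 noise features correlates strongly with the outcome, whereas with 200 samples the strongest spurious correlation is much weaker. This is the sense in which additional complexity restrictions, or more data, are needed.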
Knowledge-Based Clustering
Although genome-wide mRNA expression analysis is slowly becoming a routine tool, translating computational results to biological information remains a major challenge.
As previously mentioned, one of the key challenges is the improper conditioning of the data. Approaches are being developed that attempt to integrate prior knowledge into the analysis of expression data. In a report by Pan (85), the mixture model for clustering expression data is extended to incorporate gene ontology information as prior knowledge to increase the specificity of the method (85–87). To take advantage of accumulating gene functional annotations, Huang & Pan (88) proposed incorporating known gene functions into a new distance metric that shrinks a gene expression-based distance toward 0 if and only if the two genes share a common gene function.
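The idea of shrinking an expression-based distance when two genes share an annotation can be sketched as follows. This is an illustrative assumption, not the formulation of Reference 88: the Euclidean base distance, the multiplicative shrinkage factor, and the function and parameter names are all hypothetical.

```python
import numpy as np

def knowledge_shrunken_distance(x, y, funcs_x, funcs_y, shrink=0.5):
    """Expression distance, shrunk toward 0 only if annotations overlap.

    x, y             : expression profiles (equal-length sequences)
    funcs_x, funcs_y : iterables of functional annotation labels
    shrink           : hypothetical factor in [0, 1]; 0 collapses the
                       distance to zero, 1 leaves it unchanged
    """
    d = float(np.linalg.norm(np.asarray(x, float) - np.asarray(y, float)))
    if set(funcs_x) & set(funcs_y):   # genes share at least one function
        return shrink * d
    return d                          # no shared function: distance unchanged

# Two profiles at Euclidean distance 5.
a, b = [0.0, 0.0], [3.0, 4.0]
print(knowledge_shrunken_distance(a, b, {"glycolysis"}, {"glycolysis", "TCA"}))
print(knowledge_shrunken_distance(a, b, {"glycolysis"}, {"apoptosis"}))
```

Any distance-based clustering method can then be run on the modified distances; the choice of the shrinkage factor controls how strongly prior knowledge overrides the expression data.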
The incorporation of biological knowledge (or any other type of prior knowledge) into clustering algorithms will be greatly enabled by recent advances in the area of constraint-based clustering (89, 90), which aims at developing consistent methodologies that incorporate prior knowledge during the analysis, as opposed to postprocessing the results to validate the consistency of the conclusions given what is known about the system. However, one needs to be aware of the constraints that explicit, hard modeling of prior knowledge imposes in terms of discovering new knowledge about the system: Over-restricting and constraining the analysis goes against the very essence of integrative -omics approaches and data-driven (systems) approaches, as opposed to hypothesis-driven research.
Judging the Quality of Gene-Expression Clustering
Clearly there are a number of analytical rationales used to parse genes into groups.
However, the quality of any grouping must be judged by its ability to provide insight into the underlying mechanistic biology. A general concern regarding the validity of existing algorithms stems from the observation that classification algorithms can lead to conflicting results, which are often method dependent (91). The current practice is to evaluate methods based on their ability to generate results consistent with biological reality in terms of functional ontologies and putative transcription factors of coexpressed genes (92–95). Although it is not surprising that different methods yield different results, the fact is that there is a correct answer. Living organisms exist as a fine balance between entropy and enthalpy. Maintaining such a balance requires that the expression of the thousands of genes in the organism’s genome be highly coordinated. At any point in time, the amount of mRNA for a particular gene is the balance between its synthesis and degradation. Processes such as circadian rhythms or input perturbations such as drugs can change the amount of an mRNA by increasing or decreasing synthesis, degradation, or some combination of both. The power of the gene array time series is that it allows the observer to broadly “watch” the dynamics of the system. The objective in conducting time-series experiments is to understand the complex sequence of regulatory events that drives the system. Clustering, regardless of method, attempts to parse genes into groups with certain defined commonalities.
These groups are useful to the biologist to the degree that they represent genes with common mechanisms of regulation. In essence, each proffered group represents a testable hypothesis. If the hypothesis is correct, then certain biological requirements follow. For example, if a group of genes is regulated by a common mechanism, then their response to a different input perturbation should be the same. On the one hand, if the process being examined is natural, such as development, cell cycle, or circadian rhythm, then a perturbation that disrupts the natural process should change the profile over time of all genes in a cluster. To the degree that it does not, the result suggests that the cluster is not entirely valid. On the other hand, if the process being examined is an input to a biological system, such as a drug treatment, then genes that belong in the same cluster should have the same response profile regardless of dosing regimen. In reality, a single temporal response profile probably does not provide sufficient constraint to accomplish biologically valid mechanistic clustering.
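The test described above, in which genes clustered under one perturbation should respond alike under another, can be sketched as a simple coherence check. This is one possible reading rather than a procedure from the cited work: the use of mean pairwise Pearson correlation as the coherence score, and all names below, are assumptions.

```python
import numpy as np

def cluster_coherence(profiles, labels):
    """Mean pairwise Pearson correlation of profiles within each cluster.

    profiles : genes x time points, measured under a SECOND perturbation
    labels   : cluster assignment of each gene, derived from the FIRST one
    Returns {cluster label: coherence}; values near 1 support the cluster,
    low or negative values suggest it is not entirely valid.
    """
    profiles = np.asarray(profiles, dtype=float)
    labels = np.asarray(labels)
    scores = {}
    for c in np.unique(labels):
        members = profiles[labels == c]
        if len(members) < 2:
            scores[c.item()] = float("nan")   # undefined for a single gene
            continue
        corr = np.corrcoef(members)           # gene-by-gene correlations
        upper = np.triu_indices(len(members), k=1)
        scores[c.item()] = float(corr[upper].mean())
    return scores

t = np.linspace(0.0, 1.0, 6)
second_condition = np.vstack([t, 2 * t + 1, t, -t])
print(cluster_coherence(second_condition, [0, 0, 1, 1]))
```

In this toy input, cluster 0 stays coherent under the second condition, whereas the two genes of cluster 1 respond in opposite directions, which would argue against a common regulatory mechanism.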
A second test of the validity of clusters involves the mechanism of control of gene expression. The expression of genes is controlled by transcription factors (TFs). TFs are gene products, proteins that bind to specific sites in the DNA and either promote or inhibit the expression of a gene. Some TFs act on their own or in combination with other TFs, whereas some, such as the glucocorticoid receptor and the estrogen receptor, require the binding of an external ligand for activation. If a group of genes is regulated by a common mechanism, then they should contain common features in their regulatory regions. However, because transcription factor binding site (TFBS) motifs are short (5–9 base pairs) and fairly degenerate, most putative TFBS matches occur by chance alone and are not functional. One method that has been proposed to identify which TFBSs are bona fide functional sites is to exclude those that are not in evolutionarily conserved regions. Indeed, the upstream noncoding regions do not evolve in a uniform fashion among sites, but rather show blocks of fairly conserved areas interspersed with fast-evolving stretches. These fast-evolving stretches quickly lose homology with evolutionary time and are subject to insertions and deletions.
Nonetheless, even among comparatively distant taxa (e.g., rodents and humans), conserved, alignable segments are preserved (96, 97). Identifying common features in the regulatory region of genes in putative clusters not only provides a degree of validation but also should provide insight into the mechanism of regulation.
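Why short, degenerate motifs match so often by chance can be made concrete with a back-of-the-envelope calculation. The sketch below assumes independent, equiprobable bases (a simplification that real promoters do not satisfy); the function name, the 1000-bp upstream region, and the example motifs are hypothetical.

```python
def expected_chance_matches(allowed_per_position, region_length,
                            both_strands=False):
    """Expected number of chance motif matches in a random DNA region.

    allowed_per_position : for each motif position, how many of the four
                           bases are tolerated (1 = exact, 4 = any base)
    region_length        : length of the scanned region in base pairs
    Assumes independent, equiprobable bases (an idealization).
    """
    k = len(allowed_per_position)
    p_match = 1.0
    for n_allowed in allowed_per_position:
        p_match *= n_allowed / 4.0          # probability this position matches
    windows = max(region_length - k + 1, 0)  # candidate start positions
    strands = 2 if both_strands else 1
    return strands * windows * p_match

# Exact 6-bp motif in a 1000-bp upstream region: about 0.24 chance hits.
print(expected_chance_matches([1] * 6, 1000))
# Same motif with one 2-fold degenerate and one fully degenerate position.
print(expected_chance_matches([1, 1, 2, 4, 1, 1], 1000))
```

Under these assumptions, an exact 6-bp motif is expected about once in every four such regions, and modest degeneracy raises this to roughly two hits per region, so a match by itself carries little evidence of function.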
OPPORTUNITIES
The traditional way of interpreting time-course expression data is to evaluate the biological similarities implied by similarities in expression profiles, and a large number of publications advocate the potential of such analyses (12, 13, 15, 57, 98).
A healthy body of literature utilizes time-course expression data to reverse-engineer primarily regulatory networks, given that interference at the level of regulation holds significant promise for drug discovery. Targeting expression by controlling the regulatory process through the corresponding transcription factors is emerging as a viable option for the identification of drug targets (99, 100) and controlling disease progression (101). In recent years, significant efforts have been made, experimentally and computationally, to identify transcription factors, their target genes, and the interaction mechanisms that control (regulate) gene expression (102, 103). Prominent examples are the decomposition-based methods, which combine ChIP and microarray data, and inversion of regression techniques to estimate transcription factor activities (TFAs) (104–107). Singular value decomposition and regression methods were combined (19) to reverse-engineer regulatory networks, and in a report by Bussemaker et al. (108), promoter elements were linearly combined to quantify the contribution of the promoter architecture on a gene’s expression. Network component analysis (NCA) (109–113) was introduced as an alternative for quantifying the strength of the regulatory interactions and for elucidating true TFAs. Similarly, others (114) explored a linear superposition of expression profiles and TFA combined appropriately using binding affinities in lieu of stoichiometric coefficients, and a Bayesian error analysis of an, effectively, linear method was presented in Reference 114. The main goal of this reverse-engineering is to identify the activation program of transcription modules under particular conditions (115) so as to hypothesize how activation/deactivation of expression can be induced/suppressed (116).
A fundamental difference among the methods is whether the weights of the approximation should be estimated through regression (109–113) or associated with binding affinities (114). To gain more mechanistic insight, recent approaches aim at combining time-course expression data and semimechanistic models of gene expression in an effort to evaluate the kinetics of gene expression. In a report by Yugi et al. (117), a microarray data-based kinetic method (MASK) was proposed that combined expression data with RNA synthesis kinetic models to evaluate kinetic parameters of the expression activation and repression processes, whereas in a report by Thomas et al. (118), the so-called S-system framework was explored (119).
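The linear approximations discussed above share a common core: expression is modeled as a weighted combination of transcription factor activities. The sketch below shows only the regression step under the assumption that the weight matrix is already known (for example, taken from binding affinities); it is not an implementation of NCA or of any cited method, and the matrix names E, A, and P are illustrative.

```python
import numpy as np

def estimate_tf_activities(E, A):
    """Least-squares estimate of TF activities P in the model E ~ A @ P.

    E : genes x time points, expression data
    A : genes x TFs, assumed-known regulatory weights (hypothetical input)
    Returns P : TFs x time points.
    """
    P, *_ = np.linalg.lstsq(A, E, rcond=None)
    return P

# Synthetic check: recover known activities from noiseless data.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 3))       # 20 genes, 3 transcription factors
P_true = rng.standard_normal((3, 5))   # activities at 5 time points
E = A @ P_true
print(np.allclose(estimate_tf_activities(E, A), P_true))
```

When the weights are not known, both factors must be estimated, which requires additional structure such as the connectivity constraints used in decomposition-based approaches; that case is beyond this sketch.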
Truly fascinating, however, are the opportunities offered by combining time-course expression data with mechanistic models of expression in the form of pharmacokinetic/pharmacodynamic expressions. Using gene arrays in the time-series paradigm can provide the scope of data necessary for analyzing dynamic complex biological phenomena. The time series can capture the dynamic nature of processes such as disease progression or drug responses, whereas the gene arrays provide a method of high-throughput data collection necessary to address the complexity. For years, complex pathologies such as diabetes, hypertension, and obesity have, for the most part, been addressed one or two genes at a time. Although such pathologies may be instigated by a single gene defect induced through gene knockout, in general, this is not the case. For example, a major animal model for obesity and related pathologies is the ZDF rat (120). This rat contains a single gene defect, in the leptin receptor.
In reality, this is not the human condition. Obesity and metabolic syndrome result from a complex interplay of many genes in multiple tissues (121). However, taking advantage of the opportunity to analyze such dynamic complex biological phenomena requires quantitative approaches that are able to accommodate both the dynamics and the scope of data. Indirect effect mathematical modeling provides an approach to addressing this problem. Although developed for pharmacokinetics and pharmacodynamics, the basic approach can be applied to any dynamic biological system (122).
The basic premise of such modeling is that a measured response (R) to an input perturbation may be produced by indirect mechanisms; for example, factors controlling the input or production of the response variable (k_in) may be either inhibited or stimulated, or the determinants of loss of the response variable (k_out) may be inhibited or stimulated. The rate of change of the response over time with no input perturbation present can be described by

\[ \frac{dR}{dt} = k_{in} - k_{out} R \]

where k_in represents the zero-order constant for production of the response and k_out defines the first-order rate constant for loss of the response. It is assumed that k_in and k_out fully account for production and loss of the response. The response variable R may be a directly measured entity or an observed response, which is immediately proportional to the concentration of R. The basic assumption that both production and loss can be stimulated or inhibited leads to the four basic equations shown in Figure 2. The initial input perturbation is used as the driving force for the primary response or set of responses. However, the primary response(s) can then be employed as the driving force for a set of secondary responses. Using this approach, dynamic models for ever-more complicated converging and diverging sequences of molecular events in time can be constructed. For example, suppose a drug enhances the expression of two different transcription factors. These represent primary responses. If these transcription factors change the expression of other genes, then these become secondary responses. Changes brought by these genes become tertiary responses. In this way, the four basic models can be used to construct experimentally testable models for quite complex response cascades. However, clustering of gene array time series data not only provides the foundation of such dynamic models but also determines their validity.

Figure 2
Four basic mechanism-based indirect effect models for response dynamics indicating production and consumption inhibition, and production and consumption stimulation. Panels (each labeled Response): (a) inhibition of k_in; (b) inhibition of k_out; (c) stimulation of k_in; (d) stimulation of k_out.
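The baseline balance between production and loss, and the four ways a perturbation can act on it, can be explored numerically. The sketch below is an assumption-laden illustration: the four rate laws are written in the form commonly used in the indirect response literature (a fractional inhibition I(t) or stimulation S(t) multiplying k_in or k_out), not copied from Figure 2, and the function name, model labels, and parameter values are hypothetical.

```python
def simulate_indirect_response(model, effect, kin, kout, t_end, dt=0.01):
    """Integrate an indirect response model with fixed-step RK4.

    model  : "inhibit_kin", "inhibit_kout", "stimulate_kin", "stimulate_kout"
    effect : function t -> I(t) in [0, 1] for inhibition, or S(t) >= 0
             for stimulation (assumed functional form of the perturbation)
    Starts at the unperturbed steady state R0 = kin / kout.
    Returns a list of (time, response) pairs.
    """
    def rate(t, r):
        e = effect(t)
        if model == "inhibit_kin":      # dR/dt = kin(1 - I) - kout R
            return kin * (1.0 - e) - kout * r
        if model == "inhibit_kout":     # dR/dt = kin - kout(1 - I) R
            return kin - kout * (1.0 - e) * r
        if model == "stimulate_kin":    # dR/dt = kin(1 + S) - kout R
            return kin * (1.0 + e) - kout * r
        if model == "stimulate_kout":   # dR/dt = kin - kout(1 + S) R
            return kin - kout * (1.0 + e) * r
        raise ValueError(f"unknown model: {model}")

    t, r = 0.0, kin / kout
    trajectory = [(t, r)]
    for _ in range(int(round(t_end / dt))):
        k1 = rate(t, r)
        k2 = rate(t + dt / 2, r + dt * k1 / 2)
        k3 = rate(t + dt / 2, r + dt * k2 / 2)
        k4 = rate(t + dt, r + dt * k3)
        r += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += dt
        trajectory.append((t, r))
    return trajectory

# Sustained 50% inhibition of production halves the steady-state response.
final_t, final_r = simulate_indirect_response(
    "inhibit_kin", lambda t: 0.5, kin=10.0, kout=0.5, t_end=40.0)[-1]
print(round(final_r, 3))
```

Cascades of the kind described above can be sketched by passing the simulated primary response, suitably scaled, as the effect function of a secondary response.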
ACKNOWLEDGMENTS
The authors would like to acknowledge insightful comments, suggestions, and guidance from Prof. W.J. Jusko and Prof. D. DuBois. I.P.A. and E.Y. acknowledge support from the National Science Foundation under an NSF-BES 0519563 Metabolic Engineering Grant and the Environmental Protection Agency under grant EPA-GADR 832721-101. R.R.A. acknowledges support by grants GM 24211 and GM 67650 from the National Institute of General Medical Sciences, NIH, Bethesda, MD, and by a grant from NASA.
1. Alberts B, Johnson A, Lewis J, Raff M, Roberts K, Walter P. 2002. Molecular Biology of the Cell. New York: Garland Sci.
2. Kafatos FC. 2002. A revolutionary landscape: the restructuring of biology and its convergence with medicine. J. Mol. Biol. 319(4):861–67
3. Bowtell DD. 1999. Options available—from start to finish—for obtaining expression data by microarray. Nat. Genet. 21(Suppl. 1):25–32
4. Brown PO, Botstein D. 1999. Exploring the new world of the genome with DNA microarrays. Nat. Genet. 21(Suppl. 1):33–37
5. Cheung VG, Morley M, Aguilar F, Massimi A, Kucherlapati R, Childs G. 1999. Making and reading microarrays. Nat. Genet. 21(Suppl. 1):15–19
6. Tan PK, Downey TJ, Spitznagel EL Jr, Xu P, Fu D, et al. 2003. Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Res. 31(19):5676–84
7. Miklos GL, Maleszka R. 2004. Microarray reality checks in the context of a complex disease. Nat. Biotechnol. 22(5):615–21
8. Larkin JE, Frank BC, Gavras H, Sultana R, Quackenbush J. 2005. Independence and reproducibility across microarray platforms. Nat. Methods 2(5):337–44
9. Irizarry RA, Warren D, Spencer F, Kim IF, Biswal S, et al. 2005. Multiple-laboratory comparison of microarray platforms. Nat. Methods 2(5):345–50
10. Schena M, Shalon D, Davis RW, Brown PO. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270(5235):467–70
11. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, et al. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531–37
12. Jayaraman A, Yarmush ML, Roth CM. 2005. Evaluation of an in vitro model of hepatic inflammatory response by gene expression profiling. Tissue Eng. 11(1–2):50–63
13. Huang S, Eichler G, Bar-Yam Y, Ingber DE. 2005. Cell fates for high-dimensional attractor states of a complex regulatory network. Phys. Rev. Lett. 94:128701
14. Cobb JP, Mindrinos MN, Miller-Graziano C, Calvano SE, Baker HV, et al. 2005. Application of genome-wide expression analysis to human health and disease. Proc. Natl. Acad. Sci. USA 102(13):4801–6
15. Eichler GS, Huang S, Ingber DE. 2003. Gene expression dynamics inspector (GEDI): for integrative analysis of expression profiles. Bioinformatics 19(17):2321–22
16. Deleted in proof
17. Khan J, Wei JS, Ringnér M, Saal LH, Ladanyi M, et al. 2001. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. 7(6):673–79
18. Greer BT, Khan J. 2004. Diagnostic classification of cancer using DNA microarrays and artificial intelligence. Ann. NY Acad. Sci. 1020:49–66
19. Yeung MK, Tegner J, Collins JJ. 2002. Reverse engineering gene networks using singular value decomposition and robust regression. Proc. Natl. Acad. Sci. USA 99(9):6163–68
20. Bar-Joseph Z. 2004. Analyzing time series gene expression data. Bioinformatics
21. Straume M. 2004. DNA microarray time series analysis: automated statistical assessment of circadian rhythms in gene expression patterning. Methods Enzymol. 383:149–66
22. Simon I, Barnett J, Hannett N, Harbison CT, Rinaldi NJ, et al. 2001. Serial regulation of transcriptional regulators in the yeast cell cycle. Cell 106(6):697–708
23. Altenhein B, Becker A, Busold C, Beckmann B, Hoheisel JD, Technau GM. 2006. Expression profiling of glial genes during Drosophila embryogenesis. Dev. Biol. 296(2):545–60
24. Ko MS. 2006. Expression profiling of the mouse early embryo: reflections and perspectives. Dev. Dyn. 235(9):2437–48
25. Calvano SE, Xiao W, Richards DR, Felciano RM, Baker HV, et al. 2005. A network-based analysis of systemic inflammation in humans. Nature 437(7061):1032–37
26. Heller MJ. 2002. DNA microarray technology: devices, systems, and applications. Annu. Rev. Biomed. Eng. 4:129–53
27. Greenbaum D, Colangelo C, Williams K, Gerstein M. 2003. Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol. 4(9):117
28. Gygi SP, Rochon Y, Franza BR, Aebersold R. 1999. Correlation between protein and mRNA abundance in yeast. Mol. Cell Biol. 19(3):1720–30
29. Allocco DJ, Kohane IS, Butte AJ. 2004. Quantifying the relationship between coexpression, coregulation and gene function. BMC Bioinform. 5:18
30. Park PJ, Butte AJ, Kohane IS. 2002. Comparing expression profiles of genes with similar promoter regions. Bioinformatics 18(12):1576–84
31. Vlachos M, Hadjieleftheriou M, Gunopulos D, Keogh E. 2006. Indexing multidimensional time-series. VLDB J. 15(1):1–20
32. Jiang DX, Tang C, Zhang AD. 2004. Cluster analysis for gene expression data: a survey. IEEE Trans. Knowl. Data Eng. 16(11):1370–86
33. Schliep A, Costa IG, Steinhoff C, Schonhuth A. 2005. Analyzing gene expression time-courses. IEEE/ACM Trans. Comp. Biol. Bioinform. 3(2):179–93
34. Xu R, Wunsch D. 2005. Survey of clustering algorithms. IEEE Trans. Neural
35. D’Haeseleer P. 2005. How does gene expression clustering work? Nat. Biotechnol.
36. Camastra F, Verri A. 2005. A novel kernel method for clustering. IEEE Trans. Pattern Anal. Mach. Intell. 27(5):801–5
37. Smola AJ, Scholkopf B. 1998. On a kernel-based method for pattern recognition, regression, approximation, and operator inversion. Algorithmica 22(1–2):211–31
38. Scholkopf B, Smola A, Muller KR. 1998. Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput. 10(5):1299–19
39. Eisen MB, Spellman PT, Brown PO, Botstein D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95(25):14863–68
40. Gasch AP, Spellman PT, Kao CM, Carmel-Harel O, Eisen MB, et al. 2000. Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell 11(12):4241–57
41. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. 1999. Systematic determination of genetic network architecture. Nat. Genet. 22(3):281–85
42. Pan W, Lin J, Le CT. 2002. Model-based cluster analysis of microarray gene-expression data. Genome Biol. 3(2):RESEARCH0009
43. Ghosh D, Chinnaiyan AM. 2002. Mixture modelling of gene expression data from microarray experiments. Bioinformatics 18(2):275–86
44. McLachlan GJ, Bean RW, Peel D. 2002. A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 18(3):413–22
45. Ramoni MF, Sebastiani P, Kohane IS. 2002. Cluster analysis of gene expression dynamics. Proc. Natl. Acad. Sci. USA 99(14):9121–26
46. Holter NS, Maritan A, Cieplak M, Fedoroff NV, Banavar JR. 2001. Dynamic modeling of gene expression data. Proc. Natl. Acad. Sci. USA 98(4):1693–98
47. Schliep A, Steinhoff C, Schonhuth A. 2004. Robust inference of groups in gene expression time-courses using mixtures of HMMs. Bioinformatics 20(Suppl. 1):I283–89
48. Schliep A, Schonhuth A, Steinhoff C. 2003. Using hidden Markov models to analyze gene expression time course data. Bioinformatics 19(Suppl. 1):i255–63
49. Ji X, Li-Ling J, Sun Z. 2003. Mining gene expression data using a novel approach based on hidden Markov models. FEBS Lett. 542(1–3):125–31
50. Giurcaneanu CD, Tabus L, Astola J. 2005. Clustering time series gene expression data based on sum-of-exponentials fitting. Eurasip J. Appl. Signal Process. 2005(8):1159–73
51. Vitanyi PMB, Li M. 2000. Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Trans. Informat. Theory 46(2):446–64
52. Akaike H. 1974. A new look at the statistical model identification. IEEE Trans.
53. Wu FX, Zhang WJ, Kusalik AJ. 2005. Dynamic model-based clustering for time-course gene expression data. J. Bioinform. Comput. Biol. 3(4):821–36
54. Wu FX, Zhang WJ, Kusalik AJ. 2004. Modeling gene expression from microarray expression data with state-space equations. Pac. Symp. Biocomput. 2004:581–92
55. Storey JD, Xiao WZ, Leek JT, Tompkins RG, Davis RW. 2005. Significance analysis of time course microarray experiments. Proc. Natl. Acad. Sci. USA 102(36):12837–42
56. Balasubramaniyan R, Hullermeier E, Weskamp N, Kamper J. 2005. Clustering of gene expression data using a local shape-based similarity measure. Bioinformatics 21(7):1069–77
57. Qian J, Dolled-Filhart M, Lin Y, Yu HY, Gerstein M. 2001. Beyond synexpression relationships: local clustering of time-shifted and inverted gene expression profiles identifies new, biologically relevant interactions. J. Mol. Biol. 314(5):1053–66
58. Syeda-Mahmood T. 2003. Clustering time-varying gene expression profiles using scale-space signals. In IEEE Comput. Soc. Bioinform. Conf. (CSB’03), p. 48. San Jose, CA: IBM Almaden Res. Cent.
59. Ernst J, Bar-Joseph Z. 2006. STEM: a tool for the analysis of short time series gene expression data. BMC Bioinform. 7(1):191
60. Yang E, Berthiaume F, Yarmush ML, Androulakis IP. 2006. An integrative systems biology approach for analyzing liver hypermetabolism. Presented at 9th Int. Symp. Process Syst. Eng./16th Eur. Symp. Comput. Aided Process Eng. Garmisch-Partenkirchen/Ger: Elsevier
61. Lin J, Keogh E, Lonardi S, Chiu B. 2003. A symbolic representation of time series, with implication for streaming algorithms. In Proc. 8th ACM SIGMOD Workshop Res. Issues Data Min. Knowl. Discov., San Diego, CA
62. Gower JC, Ross GJS. 1969. Minimum spanning trees and single linkage analysis.
63. Xu Y, Olman V, Xu D. 2002. Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics 18(4):536–45
64. Xu Y, Olman V, Xu D. 2001. Minimum spanning trees for gene expression data clustering. Genome Inform. 12:24–33
65. Ho TK, Basu M. 2002. Complexity measures of supervised classification problems. IEEE Trans. Pattern Anal. Mach. Intell. 24(3):289–300
66. Madeira SC, Oliveira AL. 2004. Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Trans. Comput. Biol. Bioinform. 1(1):24–45
67. Cheng Y, Church GM. 2000. Biclustering of expression data. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8:93–103
68. Heard NA, Holmes CC, Stephens DA, Hand DJ, Dimopoulos G. 2005. Bayesian coclustering of Anopheles gene expression time series: study of immune defense response to multiple experimental challenges. Proc. Natl. Acad. Sci. USA 102(47):16939–44
69. Heard NA, Holmes CC, Stephens DA. 2006. A quantitative study of gene regulation involved in the immune response of anopheline mosquitoes: An application of Bayesian hierarchical clustering of curves. J. Am. Stat. Assoc. 101(473):18–29
70. Hirano S, Tsumoto S. 2003. Empirical evaluation of dissimilarity measures for time-series multiscale matching. Found. Intell. Syst. 2871:454–62
71. Datta S, Datta S. 2003. Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19(4):459–66
72. Thalamuthu A, Mukhopadhyay I, Zheng XJ, Tseng GC. 2006. Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics 22(19):2405–12
73. Steinbach M, Ertoz L, Kumar V. 2003. Challenges of clustering high dimensional data. In New Vistas in Statistical Physics-Applications in Econophysics, Bioinformatics, and Pattern Recognition, ed. LT Wille. Berlin/New York: Springer-Verlag
74. Gershon D. 2002. Dealing with the data deluge. Nature 416(6883):889–91
75. Duin RPW. 2000. Classifiers in almost empty spaces. In ICPR15 Proc. 15th Int. Conf. Pattern Recognit., Barcelona, Spain, ed. A Sanfeliu, JJ Villanueva, M Vanrell, R Alquezar, AK Kain, J Kittler, 2:1–7. Los Alamitos: IEEE Comput. Soc. Press
76. Ho TK. 2002. A data complexity analysis of comparative advantages of decision forest constructors. Pattern Anal. Appl. 5(2):102–12
77. Thompson DM, King KR, Wieder KJ, Toner M, Yarmush ML, Jayaraman A. 2004. Dynamic gene expression profiling using a microfabricated living cell array. Anal. Chem. 76:4098–4103
78. Ganascia JG, Velcin J. 2004. Clustering of conceptual graphs with sparse data. Concept. Struct. Work. Lect. Notes Comput. Sci. 3127:156–69. Berlin: Springer-Verlag
79. Partsinevelos P, Agouris P, Stefanidis A. 2005. Reconstructing spatiotemporal trajectories from sparse data. Isprs J. Photogr. Remote Sensing 60(1):3–16
80. Velcin J, Ganascia JG. 2005. Default clustering from sparse data sets. Symb. Quant. Approach. Reason. Uncertain. Lect. Notes Comput. Sci. 3571:968–79
81. Jain A, Zongker D. 1997. Feature selection: Evaluation, application, and small sample performance. IEEE Trans. Pattern Anal. Mach. Intell. 19(2):153–58
82. Dougherty ER. 2001. Small sample issues for microarray-based classification. Comp. Funct. Genomics 2(1):28–34
83. Hwang D, Schmitt WA, Stephanopoulos G. 2002. Determination of minimum sample size and discriminatory expression patterns in microarray data. Bioinformatics 18(9):1184–93
84. Yang EH, Androulakis IP. 2006. Assessing the information content of short time series expression data. In Proc. 28th IEEE EMBS Annu. Int. Conf., New York
85. Pan W. 2006. Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics 22(7):795–801
86. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, et al. 2005. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102(43):15545–50
87. Hanisch D, Zien A, Zimmer R, Lengauer T. 2002. Co-clustering of biological networks and gene expression data. Bioinformatics 18(Suppl. 1):S145–54
88. Huang D, Pan W. 2006. Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data. Bioinformatics 22(10):1259–68
89. Qian Y, Zhang K, Lai W. 2004. Constraint-based graph clustering through node sequencing and partitioning. Adv. Knowl. Dis. Data Min. Lect. Notes Comput. Sci. 3056:41–51
90. Tung AKH, Han J, Lakshmanan LVS, Ng RT. 2001. Constraint-based clustering in large databases. Proc. Int. Conf. Database Theory. Lect. Notes Comput. Sci., pp. 405–19. Berlin: Springer-Verlag
91. Almon RR, DuBois DC, Jin JY, Yao Z, Hazra A, et al. 2006. Development, analysis and use of pharmacogenomic time series for pharmacokinetic/pharmacodynamic modeling of multi-tissue polygenic responses to corticosteroids. In New Research on Pharmacogenetics, Chpt. 2, ed. LP Barnes. New York: Nova Sci.
92. Gibbons FD, Roth FP. 2002. Judging the quality of gene expression-based clustering methods using gene annotation. Genome Res. 12(10):1574–81
93. Bolshakova N, Azuaje F, Cunningham P. 2005. A knowledge-driven approach to cluster validity assessment. Bioinformatics 21(10):2546–47
94. Yeung KY, Haynor DR, Ruzzo WL. 2001. Validating clustering for gene expression data. Bioinformatics 17(4):309–18
95. Handl J, Knowles J, Kell DB. 2005. Computational cluster validation in postgenomic data analysis. Bioinformatics 21(15):3201–12
96. Moses AM, Chiang DY, Pollard DA, Iyer VN, Eisen MB. 2004. MONKEY: identifying conserved transcription-factor binding sites in multiple alignments using a binding site-specific evolutionary model. Genome Biol. 5(12):R98
97. Almon RR, DuBois DC, Jusko WJ. 2005. Corticosteroid-regulated genes in rat kidney: mining time series array data. Am. J. Physiol. Endocrinol. Metab. 289(5):E870–82
98. Almon RR, DuBois DC, Piel WH, Jusko WJ. 2004. The genomic response of skeletal muscle to methylprednisolone using microarrays: tailoring data mining to the structure of the pharmacogenomic time series. Pharmacogenomics 5(5):525–52
99. Darnell JE Jr. 2002. Transcription factors as targets for cancer therapy. Nat.
100. Levy DE, Darnell JE Jr. 2002. Stats: Transcriptional control and biological impact. Nat. Rev. Mol. Cell. Biol. 3(9):651–62
101. Ruminy P, Gangmeux C, Claeyssens S, Scotte M, Daveau M, Salier MP. 2001. Gene transcription in hepatocytes during the acute phase of a systemic inflammation: from transcription factors to target genes. Inflamm. Res. 50(8):383–90
102. Iyer VR, Horak CE, Scafe CS, Botstein D, Snyder M, Brown PO. 2001. Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 409(6819):533–38
103. van Steensel B, Delrow J, Bussemaker HJ. 2003. Genomewide analysis of Drosophila GAGA factor target genes reveals context-dependent DNA binding. Proc. Natl. Acad. Sci. USA 100(5):2580–85
104. Alter O, Golub GH. 2004. Integrative analysis of genome-scale data by using pseudoinverse projection predicts novel correlation between DNA replication and RNA transcription. Proc. Natl. Acad. Sci. USA 101(47):16577–82
105. Kato M, Hata N, Banerjee N, Futcher B, Zhang MQ. 2004. Identifying combinatorial regulation of transcription factors and binding motifs. Genome Biol. 5(8):R56
106. Gao F, Foat BC, Bussemaker HJ. 2004. Defining transcriptional networks through integrative modeling of mRNA expression and transcription factor binding data. BMC Bioinform. 5:31
107. Boulesteix AL, Strimmer K. 2005. Predicting transcription factor activities from combined analysis of microarray and ChIP data: a partial least squares approach. Theor. Biol. Med. Model 2:23
108. Bussemaker HJ, Li H, Siggia ED. 2001. Regulatory element detection using correlation with expression. Nat. Genet. 27(2):167–71
109. Liao JC, Boscolo R, Yang YL, Tran LM, Sabatti C, Roychowdhury VP. 2003. Network component analysis: reconstruction of regulatory signals in biological systems. Proc. Natl. Acad. Sci. USA 100(26):15522–27
110. Tran LM, Brynildsen MP, Kao KC, Suen JK, Liao JC. 2005. gNCA: A framework for determining transcription factor activity based on transcriptome: identifiability and numerical implementation. Metab. Eng. 7(2):128–41
111. Kao KC, Yang YL, Boscolo, Sabatti, Roychowdhury V, Liao JC. 2004. Network component analysis of Escherichia coli transcriptional regulation. Abstr. Pap. Am. Chem. Soc. 227:U216–17
112. Kao KC, Yang YL, Boscolo R, Sabatti C, Roychowdhury V, Liao JC. 2004. Transcriptome-based determination of multiple transcription regulator activities in Escherichia coli by using network component analysis. Proc. Natl. Acad. Sci. USA 101(2):641–46
113. Kao KC, Tran LM, Liao JC. 2005. A global regulatory role of gluconeogenic genes in Escherichia coli revealed by transcriptome network analysis. J. Biol. Chem. 280(43):36079–87
114. Sun N, Carroll RJ, Zhao H. 2006. Bayesian error analysis model for reconstructing transcriptional regulatory networks. Proc. Natl. Acad. Sci. USA 103(21):7988–93
115. Wang W, Cherry JM, Botstein D, Li H. 2002. A systematic approach to reconstructing transcription networks in Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. USA 99(26):16893–98
116. Ng A, Bursteinas B, Gao Q, Mollison E, Zvelebil M. 2006. pSTIING: a ‘systems’ approach towards integrating signaling pathways, interaction and transcriptional regulatory networks in inflammation and cancer. Nucleic Acids Res. 34:D527–34
117. Yugi K, Nakayama Y, Kojima S, Kitayama T, Tomita M. 2005. A microarray data-based semikinetic method for predicting quantitative dynamics of genetic networks. BMC Bioinform. 6:299
118. Thomas R, Mehrotra S, Papoutsakis ET, Hatzimanikatis V. 2004. A model-based optimization framework for the inference on gene regulatory networks from DNA array data. Bioinformatics 20(17):3221–35
119. Savageau MA. 1985. A theory of alternative designs for biochemical control systems. Biomed. Biochim. Acta 44(6):875–80
120. Janssen SW, Martens GJ, Sweep CG, Ross HA, Hermus AR. 1999. In Zucker diabetic fatty rats plasma leptin levels are correlated with plasma insulin levels rather than with body weight. Horm. Metab. Res. 31(11):610–15
121. Dandona P, Aljada A, Chaudhuri A, Mohanty P, Garg R. 2005. Metabolic syndrome: a comprehensive perspective based on interactions between obesity, diabetes, and inflammation. Circulation 111(11):1448–54
122. Dayneka NL, Garg V, Jusko WJ. 1993. Comparison of four basic models of indirect pharmacodynamic responses. J. Pharmacokinet Biopharm. 21(4):457–78

Source: http://vc.cs.nthu.edu.tw/home/paper/codfiles/cjhung/200904210526/Analysis%20of%20Time-Series%20Gene%20Expression%20Data.pdf