This shows you the differences between two versions of the page.

clustering [2011/05/23 08:16] aamadoz [Distance functions] |
clustering [2017/05/24 10:36] (current) |
||
---|---|---|---|

Line 1: | Line 1: | ||

+ | ====== Introduction ====== | ||

+ | ===== Class discovery or unsupervised problems ===== | ||

+ | |||

+ | There are two classes of problems extensively addressed in the microarray field: supervised and unsupervised problems. We talk about **unsupervised** problems when we do not know beforehand (or it is not part of our hypothesis) the structure of classes that our data set has. Typical examples of unsupervised problems would be the definition of the molecular variability of a population of bacteria, or finding the groups of co-expressing genes. In none of these cases we know beforehand if we are going to find one unique group of homogeneous individuals or many groups, and we do not have an idea of how many individuals per group we are going to find (in case of discovering groups). This type of problems are known as **unsupervised or class discovery** problems.These problems are approached with a family of methods generically called **clustering methods**. These methods use (implicitly or explicitly) **a distance function and an algorithm to join together these elements** that are more similar among them. Typically arrays are represented as matrices of data in which columns correspond to arrays and rows correspond to genes. See Figure 1. | ||

+ | |||

+ | |||

+ | {{ :images:clustering:clustering_figure1.jpg?100px |Figure 1}} | ||

+ | |||

+ | Common **distance functions** usually estimate the **similarity among elements** (that can be rows, and then we will be clustering genes or columns and then we will be clustering arrays) by using all the elements of the vectors in the comparison. **Important**: if we are clustering genes, all the experiments are supposed to have the same importance and should contribute equally to the result of the distance function. Conversely, if we are clustering arrays we must remember that many of the genes could be displaying irrelevant trends, which might have nothing to do with any reasonable biological hypothesis we can have in mind. And this is specially true in the case of "uncontrolled" samples such as patients, samples from nature, etc. | ||

+ | **The use of different distance functions and different clustering strategies will produce, obviously, different results**. There is nothing wrong in it, we are just grouping the elements (genes or arrays) under **different criteria**. | ||

+ | |||

+ | ===== Gene expression patterns ===== | ||

+ | A **gene expression pattern** (or profile) is a list of **expression data** for a gene along a number of different **experimental conditions** and would correspond to a row in the representation (Figure 2). | ||

+ | |||

+ | {{:images:clustering:clustering_figure2.jpg |Figure 2 }} | ||

+ | |||

+ | ===== Scale Transformation ===== | ||

+ | In many cases a **log transformation** of the gene expression profiles is necessary. If one is working with expression ratios, the patterns will appear in an **asymmetrical scale**: over-expressions will have values between 1 and infinite while over-repressions will be between 1 and 0. **In order to give the same weight to both over-expressions and over-repressions we need to transform the scale** (Figure 3). The most common function used for this task is the **log-transformation**. Check the [[preprocessing|preprocessing section]] of this tutorial for this type of transformation. | ||

+ | |||

+ | {{:images:clustering:clustering_figure3.jpg |Figure 3 }} | ||

+ | |||

+ | ===== Distance functions ===== | ||

+ | |||

+ | Depending on the feature, one is interested on different distance functions can be used. There are two **main families of distances**: | ||

+ | |||

+ | * **euclidean**, that depend on the point to point differences and accounts for absolute differences | ||

+ | * **correlation** that accounts for trends. | ||

+ | |||

+ | {{:images:clustering:clustering_figure4.jpg?350px|Figure 4 }} | ||

+ | |||

+ | Figure 4 shows the different results we can obtain using the different types of distances. Using an **Euclidean distance**, B and C patterns will be clustered with preference to A, on the basis of their lower absolute differences. Using a **correlation**, the two red patterns will be preferentially clustered because of their similar trends. **Usually correlation has more biological meaning**. | ||

+ | |||

+ | |||

+ | The **Pearson Squared** distance measures the similarity in shape between two profiles, but can also capture inverse relationships. For example, consider the following gene profiles: | ||

+ | |||

+ | {{:images:clustering:diagram_pearson_distance_metric.gif|}} | ||

+ | |||

+ | In the figure on the left, the black profile and the red profile have almost perfect Pearson correlation despite the differences in basal expression level and scale. These genes would cluster together with either Pearson Correlation or Pearson Squared distance. In the figure on the right, the black and red profiles are almost perfectly anti-correlated. These genes would be placed in remote clusters using Pearson Correlation, but would be put in the same cluster using Pearson Squared. | ||

+ | ===== Effect of the standarization ===== | ||

+ | In many cases standardising the data is a good practise. Standardisation (actually normalization in statistical textbooks, but we are call it standardisation therein just to not mix this concept with the concept of normalisation of the array) is a simple procedure by means of which we **transform all the data to the same scale by subtracting the mean and dividing by the variance**. **When data are standardised, Euclidean distances account the for trends**. That is, there are equivalent to correlations. | ||

+ | |||

+ | |||

+ | ====== Clustering methods ====== | ||

+ | |||

+ | ===== Clustering with UPGMA ===== | ||

+ | The **UPGMA** is an **agglomerative hierarchical method** (Sneath and Sokal, 1973), which is probably the most popular among the clustering algorithms in the field of microarrays. **It is not the more accurate among the methods but is really extensively used** (D’haeseler, 2005). Hierarchical clustering starts by calculating the all-to-all distance matrix. The two closest patterns are merged and the all-against-all distance matrix is calculated again but using the new cluster instead of the two merged patterns. This process is repeated until the complete dendrogram is built. | ||

+ | |||

+ | {{:images:clustering:clustering_figure5.jpg |Figure 5 }} | ||

+ | |||

+ | Figure 5. **Hierarchical clustering**. A distance matrix all-against-all is obtained (top right of the figure) and the two closest elements (according to the distance function) are clustered. A) genes B and C are the closest elements (red square in the matrix). B) a new matrix is recalculated with the elements B and C as a new pseudo-element. The distance of this element is obtained as the average distance (and the method is know as the average linkage; other ways of calculating this distance constitute different “flavours” of the method, often named as complete linkage, single linkage, etc.). C) the genes A and D are found to be closest in the distance matrix now and then are merged and the matrix rebuilt again. D) the process ends up when all the elements have been linked. | ||

+ | |||

+ | |||

+ | ===== Clustering with SOTA ===== | ||

+ | **SOTA** is a **divisive method** developed by our group (Dopazo and Carazo, 1997; Herrero et al., 2001), which has recently become popular and has been included in several packages (such as the TMeV). SOTA starts the classification with a binary topology composed of a root node with two leaves. The self-organizing process splits the data (e.g. experimental gene profiles) into two clusters (see Figure 8). After reaching convergence at this level, the network is inspected. If the level of variability in one, or more, terminal nodes is over a given threshold, then, the tree grows by expanding these terminal nodes. Two new descendants are generated from this heterogeneous node that becomes internal and does not receive direct updates from the data in future steps. The growth of the network is directed by the resource value that is defined as the mean value of the distances between a node and the expression profiles associated with it (Dopazo and Carazo, 1997; Herrero et el., 2001). This makes SOTA almost independent on cluster size, unlike SOM. | ||

+ | |||

+ | {{:images:clustering:clustering_figure8.jpg |Figure 8 }} | ||

+ | |||

+ | Figure 8. A) **Schematic representation of the SOTA algorithm**. The neurons that compose the network are represented as vectors, whose components correspond to the rows (or columns) in the original data matrix. On the top is the network in its initial state with two neurons connected through a mother neuron. In a cycle we repeat as many epoch as necessary to reach convergence (error below a given threshold). When a cycle ends, the network grows up through the node with the highest variability (which do not necessarily correspond to the more populated node). Then this node becomes “mother” node and generates two new daughter nodes. Then the process is repeated.. B), C), D) and E) represent different consecutive steps of growing of the SOTA algorithm. | ||

+ | |||

+ | **SOTA structures grow from the root of the tree, toward the leaves, producing a representation of the data from lower to higher resolution**. If the growth of the network is not restricted, the outcome would be a binary tree which contains only one profile in each leave. Nevertheless, if a threshold of resource value has been fixed, once all terminal nodes fall below such a threshold, the expansion of the binary topology will be stopped. This allows the generation of a hierarchical cluster structure at the desired level of resolution (see Herrero et al., 2001 for details) | ||

+ | |||

+ | |||

+ | ===== Clustering with k-means ===== | ||

+ | **K-means** clustering (Hartigan 1979) is a **simple and widely used partitioning method** for data analysis. Its helpfulness in discovering groups of coexpressed genes has been demonstrated. **The number of clusters k in the data is needed as an input for the algorithm**, which then initialises the mean vector for each of the k clusters either by hard assignment (e.g., from the input) or random generation. These initial mean vectors are called the seeds. Next, The iterative procedure of the k-means algorithm, consisting of the following two steps, begins. (1) Using the given mean vectors, the algorithm assigns each genes (or experiments) to the cluster with the closest mean vector. (2) The algorithm recalculates the mean vectors (which are the sample means) for all the clusters. The iterative procedure converges when all the mean vectors of the clusters remain stationary.A big problem associated with k-means algorithm, shared with SOM, is the **arbitrariness of predefine of the number of clusters**, since it is difficult to predict the number of clusters in advance. In practice, this makes it necessary to use a trial-and-error approach where a comparison and biological validation of several runs of the algorithm with different parameter settings are necessary (Moreau et al., 2002)]]. Another parameter that will influence the result of k-means clustering is the choice of the seeds. The algorithm suffers from the problem of local-minimum. This means that **with different seeds, the algorithm can yield very different result**. Preferably, the seeds should be chosen close to the centre of the natural clusters. Of course, this is hard to achieve if no prior knowledge about the clusters is available, which is often the case. | ||

+ | |||

+ | {{:images:clustering:clustering_figure9.jpg |Figure 9 }} | ||

+ | |||

+ | Figure 9. **Dynamics of clustering by k-means**. Seeds are introduced and the mean change in successive iterations. | ||

+ | |||

+ | |||

+ | ===== Some caveats on clustering methods ===== | ||

+ | |||

+ | The **relative performances** of several methods have been compared in the literature. In general, **different methods perform distinctly in different situations**, but, in general, **average link, k-means, SOTA and SOM tend to perform robustly**. The figures below, taken from Handl et al. 2005, show some benchmarking. | ||

+ | |||

+ | {{ :images:clustering:clustering_figure10.jpg | Figure 10 }} | ||

+ | |||

+ | Figure 10. Clustering results on a **synthetic dataset** for (from left to right) k-means, average link, single link, SOM and SOTA for (top) k = 2 clusters and (bottom) k = 6 clusters.(Figure taken from Handl et al. 2005) | ||

+ | |||

+ | {{ :images:clustering:clustering_figure11.jpg?800px |Figure 11 }} | ||

+ | |||

+ | Figure 11. Adjusted Rand Index, Silhouette Width, Dunn Index and stability (averages over 21 runs) for k-means, SOM, SOTA, average link and single link agglomerative clustering on the **Leukemia test set**. The evaluation under the adjusted Rand Index (comparing to the known class labels) shows that average link, k-means, SOTA and SOM perform robustly on these data. They identify the three main clusters (AML, B-lineage ALL and T-lineage ALL), and assign most of the samples correctly. Naturally, this is knowledge that would not be available in a real-life cluster analysis, and it is therefore interesting to see whether the results under the internal validation measures would have led to the same conclusion. The performance curves under the Silhouette Width clearly indicate the high quality of the three-cluster solution. This result is best seen when comparing with the results obtained under the null model and ‘removing’ the bias towards large numbers of clusters, which arises due to the small size of the dataset: for large numbers of clusters, the resulting partitionings contain many singleton clusters, which score highly under the Silhouette Width. The stability-based technique is less consistent: for k-means and SOM, the performance peak at k = 3 is well pronounced, but it is much weaker for SOTA and average link. Both the Silhouette Width and the stability-based method indicate the lack of structure in the single link solutions. The application of the Dunn Index is somewhat less successful: it fails to predict the insufficiency of single link, and it mis-estimates the number of clusters for average link. | ||

+ | |||

+ | Interesting revisions can be found in: D’haeseler (2005) and Handl et al. (2005). | ||

+ | |||

+ | |||

+ | |||

+ | ====== Worked Examples ====== | ||

+ | |||

+ | ===== Example 1. Fibroblasts K-means clustering ===== | ||

+ | |||

+ | - Go to the Babelomics page and select the Clustering option from the //Expression// menu. | ||

+ | - Press //Online Examples//, select the **example number 1** and you will see how the parameters and form fields are now filled. As you can notice, this example is prepared to perform a clustering analysis on genes(rows) and conditions(columns) using the K-means algorithm with 5 sample-clusters and 15 gene-clusters. Here, the selected distance is Euclidean (square). | ||

+ | - Press run, and wait for your job to be finished. | ||

+ | - When the process finishes, a new //green job// is shown at the right side of the web page. Press it to check your results. | ||

+ | |||

+ | ** Questions ** | ||

+ | |||

+ | These are some questions that you should be able to answer about the previous example: | ||

+ | |||

+ | * Do you think that the clustering was able to differentiate any group of coexpressed genes? | ||

+ | * How many sample clusters are there? and gene clusters? | ||

+ | |||

+ | Launch **online examples number 3** (Fibroblasts SOTA clustering) and **number 4** (Fibroblasts UPGMA clustering). Compare the results. | ||

+ | |||

+ | * Do you obtain the same result? | ||

+ | * Which are the differences between the results of these three examples? | ||

+ | * Why are they different? | ||

+ | |||

+ | |||

+ | |||

+ | ===== Example 2. Rheumatoid SOTA clustering ===== | ||

+ | |||

+ | - Go to the Babelomics page and select the Clustering option from the //Expression// menu. | ||

+ | - Press //Online Examples//, select the **example number 2** and you will see how the parameters and form fields are now filled. As you can notice, this example is prepared to perform a clustering analysis on genes(rows) and conditions(columns) using the SOTA algorithm and Euclidean (square) distance. | ||

+ | - Press run, and wait for your job to be finished. | ||

+ | - When the process finishes, a new //green job// is shown at the right side of the web page. Press it to check your results. | ||

+ | |||

+ | ** Questions ** | ||

+ | |||

+ | These are some questions that you should be able to answer about the previous example: | ||

+ | |||

+ | * Do you think that the clustering was able to differentiate any group of coexpressed genes? | ||

+ | * How many groups/clusters? | ||

+ | * What is your answer based on? | ||

+ | * Do your selected groups represent different functional classes? | ||

+ | |||

+ | Try to use other distance and clustering methods by selecting different options from the Babelomics interface. Compare the results. | ||

+ | |||

+ | * Do you obtain the same result? | ||

+ | * Which is the main difference between the hierarchical and non-hierarchical results? | ||

+ | * Does the distance method affect to your results? | ||

+ | |||

+ | |||

+ | |||

+ | |||

+ | ====== Exercises ====== | ||

+ | |||

+ | ===== Exercise 1. Random dataset ===== | ||

+ | |||

+ | Download this {{example_data:clustering:random_array.txt|random dataset}} and perform a clustering analysis. | ||

+ | |||

+ | * What would we obtain for an analysis of data with no structure? | ||

+ | * Do you obtain a result? | ||

+ | * What can you say about this result? | ||

+ | |||

+ | ===== Exercise 2. Response of human fibroblasts to serum ===== | ||

+ | |||

+ | Download {{example_data:clustering:fibro.txt|this dataset}} and perform a clustering analysis. | ||

+ | |||

+ | This dataset was explored in detail in [[http://genome-www.stanford.edu/serum/clusters.html|Iyer et al. (1999) study]]. A functional validation was made on detected clusters. You can perform the same clustering analysis and take a look from the biological interpretation that was made of the different clusters. | ||

+ | |||

+ | Select a cluster and click on the highlighted region. Continue your analysis sending it to [[enrichment_analysis|Fatigo]] to compare them against the rest of genome. | ||

+ | |||

+ | * Do you obtain any interesting results? If not, you can try with another cluster of genes. | ||

+ | ===== Exercise 3. Zebrafish embryogenesis data ===== | ||

+ | |||

+ | Download {{:example_data:clustering:zebrafish_embryo.txt|this file}} and perform a hierarchical clustering analysis of its genes. This example file contains the first 999 genes of the 3,657 genes that showed significant levels of differential expression in [[http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.0010029|Mathavan et al. study (2005)]]. | ||

+ | |||

+ | * Do you see any patterns of gene expression between different developmental stages? | ||

+ | * Are gene clusters of different developmental stages functionally enriched? | ||

+ | |||

+ | |||

+ | ===== Exercise 4. Golub data ===== | ||

+ | * Run a differential expression analysis with the **Golub data** (train dataset from Predictor exercise [[predictors|Predictors]]): | ||

+ | \\ | ||

+ | {{:images:clustering:diff_expr_golub.jpg|}} | ||

+ | |||

+ | \\ | ||

+ | * Then, redirect the differentially expressed genes to clustering: | ||

+ | \\ | ||

+ | {{:images:clustering:golub_redirect_clustering.jpg|}} | ||

+ | |||

+ | * Do the samples clusters make sense? are samples of the some conditions clustered together? Does the genes cluster make sense? Are functionally related? | ||

+ | \\ | ||

+ | \\ | ||

+ | |||

+ | ====== References ====== | ||

+ | * Dopazo, J. (2007) Clustering - Class discovery in the post-genomic era in Fundamentals of data mining in genomics and proteomics. Springer-Verlag, New York Eds. W. Dubitzky, M. Granzow and D.P. Berrar [[http://bioinfo.cipf.es/node/477|link to paper]] | ||

+ | * D’haeseler, P. (2005) How does gene expression clustering work? Nat. Biotech. 23:1499-1501. | ||

+ | * Dopazo, J. and Carazo, J.M. (1997) Phylogenetic reconstruction using an unsupervised growing neural network that adopts the topology of a phylogenetic tree. J Mol Evol, 44, 226-233 | ||

+ | * Handl, J., Knowles J and Kell D.B. (2005) Computational cluster validation in post-genomic data analysis 21:3201–3212 | ||

+ | * Hartigan, J. and Wong, M. (1979) A k-means clustering algorithm. Applied Statistics, 28, 100-108 | ||

+ | * Herrero, J., Valencia, A. and Dopazo, J. (2001) A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17, 126-136. | ||

+ | * Iyer et al., (1999) The transcriptional program in the response of human fibroblasts to serum. Science 283:83-87 | ||

+ | * Kohonen, T. (1997) Self-organizing maps. Springer-Verlag, Berlin | ||

+ | * Moreau Y., De Smet F., Thijs G., Marchal K., and De Moor B( 2002) Functional Bioinformatics of microarray data: from expression to regulation. Proceedings of the IEEE, 90(11):1722–1743 | ||

+ | * Sneath, P. and Sokal, R. (1973) Numerical Taxonomy. W. H. Freeman, San Francisco | ||

+ | * Mathavan S, Lee SGP, Mak A, Miller LD, Murthy KRK, et al. (2005) Transcriptome Analysis of Zebrafish Embryogenesis Using Microarrays. PLoS Genet 1(2): e29. |