Activity 2

Objective: Cluster analysis of samples (arrays, individuals, etc.)

Data: RNA-Seq data for 30 Breast Invasive Carcinoma (BRCA) samples taken from The Cancer Genome Atlas (TCGA) data portal. The dataset contains 10 normal samples and 20 tumor samples from 2 subtypes (Basal-like and Her2-enriched).

Workflow + questions:

Open the file “TCGA_265_mod_gene_BRCA_subtype_HER_Basal_Normal.txt” in a text editor and inspect its content. How many genes do we have? Can you clearly identify the samples that belong to each group?
Upload your file to Babelomics 5.0, then go to the section Expression > Clustering.
Cluster samples for different scenarios:
1. UPGMA + Euclidean (square)
2. UPGMA + Correlation coeff. (Spearman)
3. Which distance parameter is better for proper clustering?
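The two hierarchical scenarios above can also be tried outside Babelomics. The following is a minimal Python sketch, assuming NumPy and SciPy are installed. It runs on synthetic data, not on the TCGA file, and treats UPGMA as average linkage. The 3 groups of 10 samples, the number of genes, and all variable names are hypothetical choices made for illustration, and Babelomics may preprocess or scale the data differently.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from scipy.stats import rankdata

# Synthetic stand-in for an expression matrix (NOT the TCGA data):
# rows are samples, columns are genes. Group layout is an assumption.
rng = np.random.default_rng(0)
n_genes = 200
labels_true = np.repeat([0, 1, 2], 10)                  # 3 hypothetical groups
group_means = rng.normal(0.0, 3.0, size=(3, n_genes))   # one profile per group
X = group_means[labels_true] + rng.normal(0.0, 1.0, size=(30, n_genes))

# Scenario 1: UPGMA (average linkage) on squared Euclidean distances.
d_euclid = pdist(X, metric="sqeuclidean")
tree_euclid = linkage(d_euclid, method="average")

# Scenario 2: UPGMA on 1 - Spearman correlation.
# Spearman correlation is the Pearson correlation of the ranks, so we rank
# the genes within each sample and then use the "correlation" metric.
ranks = np.apply_along_axis(rankdata, 1, X)
d_spearman = pdist(ranks, metric="correlation")
tree_spearman = linkage(d_spearman, method="average")

# Cut each tree into 3 clusters so the two scenarios can be compared.
clusters_euclid = fcluster(tree_euclid, t=3, criterion="maxclust")
clusters_spearman = fcluster(tree_spearman, t=3, criterion="maxclust")

print("Euclidean (square):", clusters_euclid)
print("Spearman:          ", clusters_spearman)
```

On real expression data the two distances can disagree, because squared Euclidean distance is sensitive to the overall magnitude of expression while a rank correlation only compares the ordering of genes within each sample.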
Repeat the analysis using the same distance parameters and the SOTA method:
1. SOTA + Euclidean (square)
2. SOTA + Correlation coeff. (Spearman)
3. Do the results change based on the method or the distance parameter?
Try to cluster your samples with K-means:
1. Set the k-value to 6 and use Correlation coeff. (Spearman).
2. Check the results of K-means.
3. Are the results acceptable?
4. Does the dendrogram represent any hierarchy among the samples?
Repeat the previous step with a k-value of 3:
1. Is your result the same as the previous one?
Try to cluster your samples with K-means:
1. Set the k-value to 2 and use Correlation coeff. (Spearman).
2. Can we say that K-means is good at distinguishing tumor samples from normal samples?
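The K-means steps above can be approximated in Python as well. This is only a sketch, assuming NumPy and SciPy are installed, and it is not how Babelomics implements K-means. Standard K-means works with Euclidean distance, so the sketch ranks the genes within each sample and standardizes each sample. After that transformation, the squared Euclidean distance between two samples equals 2 * n_genes * (1 - Spearman correlation). The centroids are plain means and are not re-standardized, so this is an approximation to clustering with a Spearman distance. The data are synthetic and the group layout and names are hypothetical.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.stats import rankdata

# Synthetic stand-in for an expression matrix (NOT the TCGA data):
# rows are samples, columns are genes. Group layout is an assumption.
rng = np.random.default_rng(1)
n_genes = 200
labels_true = np.repeat([0, 1, 2], 10)                  # 3 hypothetical groups
group_means = rng.normal(0.0, 3.0, size=(3, n_genes))
X = group_means[labels_true] + rng.normal(0.0, 1.0, size=(30, n_genes))

# Rank genes within each sample, then z-score each sample (row).
# For rows prepared this way:
#   ||z_a - z_b||^2 = 2 * n_genes * (1 - Spearman correlation of a and b)
ranks = np.apply_along_axis(rankdata, 1, X)
Z = (ranks - ranks.mean(axis=1, keepdims=True)) / ranks.std(axis=1, keepdims=True)

# Run K-means for the three k-values used in the activity.
results = {}
for k in (6, 3, 2):
    centroids, labels = kmeans2(Z, k, minit="++", seed=0)
    results[k] = labels
    print(f"k={k}: cluster sizes = {np.bincount(labels, minlength=k)}")
```

K-means returns a flat partition, so the cluster labels carry no information about how the clusters relate to one another, and the result depends on both the chosen k and the random initialization.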