Activity 2
Objective: Cluster analysis of samples (arrays, individuals, ...)
Data: RNA-Seq data of 30 Breast Invasive Carcinoma (BRCA) samples taken from The Cancer Genome Atlas (TCGA) data portal. The dataset contains 10 normal samples and 20 tumor samples of 2 subtypes (Basal-like and Her2-enriched).
Workflow + questions:
- Open the file “TCGA_265_mod_gene_BRCA_subtype_HER_Basal_Normal.txt” in a text editor and inspect its content. How many genes do we have? Are the samples for each group clearly identifiable?
- Upload your file to Babelomics 5.0. Go to section Expression>Clustering
- Cluster samples for different scenarios:
- UPGMA + Euclidean (square)
- UPGMA + Correlation coeff. (Spearman)
- Which distance parameter is better for proper clustering?
- Repeat the analysis using the same distance parameters and SOTA method:
- SOTA + Euclidean (square)
- SOTA + Correlation coeff. (Spearman)
- Do the results change based on the method or the distance parameter?
- Try to cluster your samples with K-means:
- Set the k-value to 6 and use Correlation coeff. (Spearman).
- Check the results of K-means.
- Are the results acceptable?
- Does the dendrogram represent any hierarchy among the samples?
- Repeat the previous step with k-value 3:
- Is your result the same as the previous one?
- Try to cluster your samples with K-means:
- Set the k-value to 2 and use Correlation coeff. (Spearman).
- Can we say that K-means is good at distinguishing tumor from normal samples?
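
Optional: the comparisons in this activity can also be explored outside Babelomics. The sketch below is only an illustration, not the Babelomics implementation. It assumes Python with NumPy and SciPy, and it uses a small synthetic matrix (10 "normal" and 20 "tumor" samples, 50 made-up genes) instead of the TCGA file, whose exact layout is not assumed here. The helper name matches_truth and all simulation parameters are hypothetical. UPGMA is taken to be average linkage, and Spearman distance is computed as 1 minus the Spearman correlation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist
from scipy.stats import rankdata

rng = np.random.default_rng(0)

# Synthetic stand-in for the expression matrix (NOT the TCGA data):
# rows are samples, columns are genes; 10 "normal" and 20 "tumor" samples.
n_genes = 50
truth = np.array([0] * 10 + [1] * 20)        # 0 = normal, 1 = tumor
baseline = rng.normal(0.0, 1.0, n_genes)     # shared per-gene expression level
shift = np.zeros(n_genes)
shift[:25] = 4.0                             # genes simulated as "up" in tumor
X = baseline + np.outer(truth, shift) + rng.normal(0.0, 0.3, (30, n_genes))

# Spearman correlation is Pearson correlation computed on ranks, so ranking
# each sample's genes and using the "correlation" metric gives 1 - rho.
ranks = rankdata(X, axis=1)
distances = {
    "euclidean_square": pdist(X, metric="sqeuclidean"),
    "spearman": pdist(ranks, metric="correlation"),
}


def matches_truth(clusters, truth):
    """Return True if the clusters split the samples exactly like the labels."""
    a, b = clusters[truth == 0], clusters[truth == 1]
    return len(set(a)) == 1 and len(set(b)) == 1 and a[0] != b[0]


# UPGMA is average linkage; cut each tree into two clusters and compare.
upgma = {}
for name, d in distances.items():
    tree = linkage(d, method="average")
    upgma[name] = fcluster(tree, t=2, criterion="maxclust")
    print(name, "recovers tumor/normal:", matches_truth(upgma[name], truth))

# SciPy's k-means uses Euclidean distance. On z-scored ranks the squared
# Euclidean distance between two samples equals 2 * n_genes * (1 - rho),
# so this only approximates a k-means run with Spearman distance.
z = (ranks - ranks.mean(axis=1, keepdims=True)) / ranks.std(axis=1, keepdims=True)
_, km_labels = kmeans2(z, 2, minit="++", seed=0)
for group, label in ((0, "normal"), (1, "tumor")):
    print(label, "samples per k-means cluster:", np.bincount(km_labels[truth == group], minlength=2))
```

Changing the second argument of kmeans2 to 3 or 6 mirrors the other k-values used above. Unlike the UPGMA runs, k-means returns only a flat assignment of samples to clusters, which is worth keeping in mind when answering the dendrogram question.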