Objective: Cluster analysis of samples (arrays, individuals, ...)
Data: RNA-Seq data from 30 Breast Invasive Carcinoma (BRCA) samples taken from The Cancer Genome Atlas (TCGA) data portal. The set contains 10 normal samples and 20 tumor samples from 2 subtypes (Basal-like and Her2-enriched).
Workflow + questions:
Open the file “TCGA_265_mod_gene_BRCA_subtype_HER_Basal_Normal.txt” in a text editor and inspect its content. How many genes does it contain? Are the samples for each group clearly identifiable?
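The inspection can also be done programmatically. The following is a minimal sketch, not a description of the actual file: it assumes a tab-delimited matrix with one header row of sample names and one gene per row, with the gene identifier in the first column. The helper name summarize_matrix and the toy content are hypothetical.

```python
import csv
import io


def summarize_matrix(handle):
    """Return (number of gene rows, list of sample names).

    Assumed layout (not confirmed by the worksheet): tab-delimited,
    first row = header, first column = gene identifier.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_names = header[1:]  # drop the gene-identifier column
    # Count non-empty rows after the header; each one is a gene.
    n_genes = sum(1 for row in reader if row and row[0].strip())
    return n_genes, sample_names


# Hypothetical toy content standing in for the real file.
toy_file = io.StringIO(
    "gene\tsample_A\tsample_B\tsample_C\n"
    "GENE1\t5.1\t4.8\t9.2\n"
    "GENE2\t0.3\t0.4\t7.7\n"
)
n_genes, sample_names = summarize_matrix(toy_file)
print(n_genes, len(sample_names))
```

For the real data, pass an opened file handle instead of the toy StringIO object, and check first that the layout matches the assumption above.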
Upload your file to Babelomics 5.0. Go to section Expression>Clustering
Cluster samples for different scenarios:
UPGMA + Euclidean (square)
UPGMA + Correlation coeff. (Spearman)
Which distance parameter gives the better clustering of the samples?
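To see what the two scenarios above compute, here is a minimal sketch outside Babelomics, using SciPy. It is not the Babelomics implementation. UPGMA corresponds to average linkage, and the Spearman distance is taken here as 1 minus the Spearman correlation (an assumption about the exact definition). The data are synthetic: three artificial groups of 10 samples, with samples as rows. The function names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from scipy.stats import rankdata


def spearman_distance(samples):
    """Condensed matrix of 1 - Spearman rho between rows (samples)."""
    # Pearson correlation computed on per-sample ranks is Spearman's rho.
    ranks = np.apply_along_axis(rankdata, 1, samples)
    return pdist(ranks, metric="correlation")


def upgma_clusters(samples, distance, n_clusters):
    """Average-linkage (UPGMA) clustering cut into at most n_clusters groups."""
    if distance == "sqeuclidean":
        condensed = pdist(samples, metric="sqeuclidean")
    elif distance == "spearman":
        condensed = spearman_distance(samples)
    else:
        raise ValueError(f"unknown distance: {distance}")
    tree = linkage(condensed, method="average")  # 'average' is UPGMA
    return fcluster(tree, t=n_clusters, criterion="maxclust")


def make_toy_samples(seed=0, n_per_group=10, n_genes=50):
    """Synthetic data: 3 groups, each with its own mean expression profile."""
    rng = np.random.default_rng(seed)
    groups = []
    for _ in range(3):
        profile = rng.normal(0.0, 5.0, size=n_genes)
        noise = rng.normal(0.0, 1.0, size=(n_per_group, n_genes))
        groups.append(profile + noise)
    return np.vstack(groups)


toy = make_toy_samples()
labels_euclid = upgma_clusters(toy, "sqeuclidean", n_clusters=3)
labels_spearman = upgma_clusters(toy, "spearman", n_clusters=3)
print(labels_euclid)
print(labels_spearman)
```

On real expression data the two distances can disagree, because squared Euclidean distance is sensitive to overall expression magnitude while a rank correlation only compares the ordering of genes within each sample.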
Repeat the analysis using the same distance parameters and the SOTA method:
SOTA + Euclidean (square)
SOTA + Correlation coeff. (Spearman)
Do the results change based on the method or the distance parameter?
Try to cluster your samples with K-means:
Set the k-value to 6 and use Correlation coeff. (Spearman).
Check the results of K-means.
Are the results acceptable?
Does the dendrogram represent any hierarchy between the samples?
Repeat the previous step with a k-value of 3:
Is your result the same as the previous one?
Try to cluster your samples with K-means:
Set the k-value to 2 and use Correlation coeff. (Spearman).
Can we say that K-means is good at distinguishing tumor from normal samples?
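The K-means steps above can be approximated outside Babelomics as follows. This is a sketch under stated assumptions, not the Babelomics algorithm: scikit-learn's KMeans only supports Euclidean distance, so each sample is first converted to ranks and standardized. For such rows the squared Euclidean distance is proportional to 1 minus the Spearman correlation, although cluster centroids are not themselves re-standardized. The data are synthetic (three artificial groups of 10 samples, samples as rows) and the function names are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.cluster import KMeans


def rank_standardize(samples):
    """Rank each row, then scale it to mean 0 and unit variance."""
    ranks = np.apply_along_axis(rankdata, 1, samples)
    centered = ranks - ranks.mean(axis=1, keepdims=True)
    return centered / ranks.std(axis=1, keepdims=True)


def kmeans_spearman(samples, k, seed=0):
    """K-means on rank-standardized rows (approximates Spearman distance)."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    return model.fit_predict(rank_standardize(samples))


def make_toy_samples(seed=0, n_per_group=10, n_genes=50):
    """Synthetic data: 3 groups, each with its own mean expression profile."""
    rng = np.random.default_rng(seed)
    groups = []
    for _ in range(3):
        profile = rng.normal(0.0, 5.0, size=n_genes)
        noise = rng.normal(0.0, 1.0, size=(n_per_group, n_genes))
        groups.append(profile + noise)
    return np.vstack(groups)


toy = make_toy_samples()
labels_by_k = {k: kmeans_spearman(toy, k) for k in (6, 3, 2)}
for k, labels in labels_by_k.items():
    print(k, labels)
```

K-means returns a flat partition for whatever k is requested, so comparing the assignments for k = 6, 3 and 2 against the known sample groups is what shows whether a given k is reasonable.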