Activity 2

Objective: Cluster analysis of samples (arrays, individuals, etc.)

Data: RNA-Seq data from 30 Breast Invasive Carcinoma (BRCA) samples taken from The Cancer Genome Atlas (TCGA) data portal. The dataset contains 10 normal samples and 20 tumor samples of 2 subtypes (Basal-like and Her2-enriched).

Workflow + questions:

  1. Open the file “TCGA_265_mod_gene_BRCA_subtype_HER_Basal_Normal.txt” in a text editor and inspect its content. How many genes does it contain? Are there clearly identifiable samples for each group?
  2. Upload your file to Babelomics 5.0, then go to the section Expression>Clustering.
  3. Cluster samples for different scenarios:
    1. UPGMA + Euclidean (square)
    2. UPGMA + Correlation coeff. (Spearman)
    3. Which distance parameter is better for proper clustering?
  4. Repeat the analysis using the same distance parameters and SOTA method:
    1. SOTA + Euclidean (square)
    2. SOTA + Correlation coeff. (Spearman)
    3. Do the results change based on the method or the distance parameter?
  5. Try to cluster your samples with K-means:
    1. Set the k-value to 6 and use Correlation coeff. (Spearman).
    2. Check the results of K-means.
    3. Are the results acceptable?
    4. Does the dendrogram represent any hierarchy among the samples?
  6. Repeat the previous step with a k-value of 3:
    1. Is your result the same as the previous one?
  7. Cluster your samples with K-means once more:
    1. Set the k-value to 2 and use Correlation coeff. (Spearman).
    2. Can we say that K-means is good to distinguish tumor from normal?
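
For readers who want to see what the distance and method choices in steps 3 and 5 to 7 do outside the web interface, the following is a minimal Python sketch using NumPy and SciPy. It is not how Babelomics implements these options internally. The data are a small synthetic matrix standing in for the TCGA file, and the helper names (`upgma_clusters`, `kmeans`) are made up for illustration. UPGMA is taken to be average linkage, the Spearman distance is taken to be 1 minus Spearman's rho, and K-means with a Spearman distance is approximated by running ordinary K-means on rank-standardized samples, which is one common workaround and an assumption here. SOTA is not covered.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from scipy.stats import rankdata

# Synthetic stand-in data (NOT the TCGA matrix): 3 groups x 5 samples, 50 genes.
# Rows are samples, columns are genes.
rng = np.random.default_rng(0)
groups = ["Normal", "Basal", "Her2"]
labels = np.repeat(groups, 5)
profiles = rng.normal(0.0, 2.0, size=(3, 50))   # one made-up gene profile per group
X = np.repeat(profiles, 5, axis=0) + rng.normal(0.0, 0.5, size=(15, 50))

# Rank genes within each sample; Pearson correlation on ranks is Spearman's rho.
ranks = np.apply_along_axis(rankdata, 1, X)

# Condensed pairwise distances between samples.
d_sqeuclid = pdist(X, metric="sqeuclidean")      # squared Euclidean
d_spearman = pdist(ranks, metric="correlation")  # 1 - Spearman rho


def upgma_clusters(condensed, k):
    """UPGMA (average linkage) tree cut into at most k flat clusters."""
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=k, criterion="maxclust")


def kmeans(data, k, n_iter=50):
    """Plain Lloyd's K-means with a deterministic farthest-first start."""
    centers = [data[0]]
    for _ in range(k - 1):
        dist = np.min([((data - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(data[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # Keep the old center if a cluster ends up empty.
        new = np.array([data[assign == j].mean(axis=0) if np.any(assign == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return assign


# Standardize the ranks per sample so that squared Euclidean distance between
# two samples is proportional to (1 - Spearman rho).
z = (ranks - ranks.mean(axis=1, keepdims=True)) / ranks.std(axis=1, keepdims=True)

hc_euclid = upgma_clusters(d_sqeuclid, 3)
hc_spearman = upgma_clusters(d_spearman, 3)
km_spearman = kmeans(z, 3)

for name, result in [("UPGMA + sq. Euclidean", hc_euclid),
                     ("UPGMA + Spearman", hc_spearman),
                     ("K-means + Spearman", km_spearman)]:
    print(name, result.tolist())
```

Changing the last argument of `kmeans` to 6 or 2 mirrors steps 5 and 7. Note that K-means returns only a flat partition of the samples, whereas the UPGMA tree encodes a hierarchy, which is worth keeping in mind when answering the question about the dendrogram.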