Activity 2

Objective: Supervised classification of two experimental groups

Data:

  • TCGA_265_mod_gene_LUSC_train.txt: RNA-Seq data from lung squamous cell carcinoma (LUSC) samples, taken from The Cancer Genome Atlas (TCGA) data portal. Contains 11 Normal and 150 Tumor samples. NOTE: Balanced group sizes are recommended to avoid biased training; with 11 Normal versus 150 Tumor samples, this training set is strongly imbalanced.
  • TCGA_265_mod_gene_LUSC_test.txt: RNA-Seq data from lung squamous cell carcinoma (LUSC) samples, taken from The Cancer Genome Atlas (TCGA) data portal. Contains 6 Normal and 75 Tumor samples.
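
To illustrate why the imbalance matters, the following minimal sketch computes the accuracy of a trivial classifier that always predicts the larger group, using only the sample counts listed above. The function name majority_baseline is illustrative and is not part of Babelomics.

```python
def majority_baseline(class_counts):
    """Return (majority label, accuracy) when always predicting the largest class."""
    total = sum(class_counts.values())
    label = max(class_counts, key=class_counts.get)
    return label, class_counts[label] / total


# Sample counts as stated in the Data section.
train_counts = {"Normal": 11, "Tumor": 150}
test_counts = {"Normal": 6, "Tumor": 75}

for name, counts in (("train", train_counts), ("test", test_counts)):
    label, acc = majority_baseline(counts)
    print(f"{name}: always predicting {label} gives accuracy {acc:.3f}")
# train: always predicting Tumor gives accuracy 0.932
# test: always predicting Tumor gives accuracy 0.926
```

A classifier trained on these files should therefore be judged against a baseline of roughly 93% accuracy rather than 50%, and the error rate on the Normal group deserves separate attention.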

Workflow + questions:

  1. Open both files in a text editor:
    • How many genes and samples are there in each file?
    • Is there anything notable about the header lines?
  2. Upload both files to Babelomics 5.0, then go to Expression > Class Prediction.
  3. Select these parameters:
    • SVM, KNN and Random Forest
    • Leave-one-out for error estimation
    • Correlation-based Feature Selection (CFS)
  4. Download test_result.txt and answer these questions:
    • Which supervised classification method(s) perform best?
    • How many genes were used for the prediction?
    • Are the selected genes the same for all methods?
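
For readers who want to see what leave-one-out error estimation does, here is a minimal sketch using a k-nearest-neighbours classifier in plain Python. It only illustrates the idea: it is not how Babelomics implements these methods, the function names are hypothetical, and the expression values are toy numbers, not taken from the TCGA files.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_error(x, y, k=3):
    """Hold out each sample in turn, classify it from the rest, return the error rate."""
    errors = 0
    for i in range(len(x)):
        rest_x = x[:i] + x[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_x, rest_y, x[i], k) != y[i]:
            errors += 1
    return errors / len(x)


# Toy expression values for two hypothetical genes (assumed data for illustration).
x = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
y = ["Normal", "Normal", "Normal", "Tumor", "Tumor", "Tumor"]

print(leave_one_out_error(x, y, k=1))  # 0.0 for these well-separated toy groups
```

Each sample is classified by a model that never saw it, so the resulting error rate is a less optimistic estimate than the error measured on the training samples themselves.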