Activity 2
Objective: Supervised classification of two experimental groups
Data:
- TCGA_265_mod_gene_LUSC_train.txt: RNA-Seq data of Lung squamous cell carcinoma (LUSC) samples taken from The Cancer Genome Atlas (TCGA) data portal. Contains 11 Normal and 150 Tumor samples. NOTE: Balanced class sizes are always recommended to avoid biased training; in this file the two classes are strongly imbalanced.
- TCGA_265_mod_gene_LUSC_test.txt: RNA-Seq data of Lung squamous cell carcinoma (LUSC) samples taken from The Cancer Genome Atlas (TCGA) data portal. Contains 6 Normal and 75 Tumor samples.
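To see why the imbalance matters, consider a trivial classifier that labels every sample as Tumor. The sketch below uses only the sample counts listed above and compares its plain accuracy with its balanced accuracy (the mean of the per-class recalls). The function name is illustrative and not part of Babelomics.

```python
def majority_baseline(n_normal, n_tumor):
    """Score a classifier that predicts 'Tumor' for every sample."""
    total = n_normal + n_tumor
    accuracy = n_tumor / total   # every Tumor sample is counted as correct
    recall_tumor = 1.0           # all Tumor samples are recovered
    recall_normal = 0.0          # no Normal sample is recovered
    balanced_accuracy = (recall_tumor + recall_normal) / 2
    return accuracy, balanced_accuracy


# Sample counts taken from the Data section above
for name, n_normal, n_tumor in [("train", 11, 150), ("test", 6, 75)]:
    acc, bal = majority_baseline(n_normal, n_tumor)
    print(f"{name}: accuracy={acc:.3f}, balanced accuracy={bal:.3f}")
```

On both files this useless classifier reaches an accuracy above 0.92 while its balanced accuracy stays at 0.5, so accuracy alone can hide a model that never recognizes a Normal sample.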
Workflow + questions:
- Explore both files in a text editor:
  - How many genes and samples does each file contain?
  - Is there anything notable about the header lines?
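The questions above can also be checked with a short script instead of counting by eye. This is a minimal sketch under assumptions that the activity does not confirm: header lines start with `#`, fields are tab-separated, and each remaining row is one gene identifier followed by one expression value per sample. The function names and the demo lines are made up for illustration.

```python
def summarize_expression_lines(lines, comment_prefix="#", sep="\t"):
    """Count header lines, genes (data rows) and samples (value columns).

    Assumed layout (verify against the real files): header lines start with
    `comment_prefix`; every other non-empty line is a gene id followed by
    one expression value per sample.
    """
    headers, n_genes, sample_counts = [], 0, set()
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue                         # skip blank lines
        if line.startswith(comment_prefix):
            headers.append(line)             # keep headers for inspection
            continue
        fields = line.split(sep)
        n_genes += 1
        sample_counts.add(len(fields) - 1)   # first field is the gene id
    return {
        "headers": headers,
        "genes": n_genes,
        "samples_per_row": sorted(sample_counts),
    }


def summarize_expression_file(path):
    """Apply the summary to a file on disk."""
    with open(path, encoding="utf-8") as handle:
        return summarize_expression_lines(handle)


# Tiny made-up example, not real TCGA data
demo = [
    "#EXAMPLE_HEADER\ts1\ts2\ts3\n",
    "geneA\t1.0\t2.5\t0.3\n",
    "geneB\t0.0\t4.1\t7.2\n",
]
summary = summarize_expression_lines(demo)
print(summary["genes"], summary["samples_per_row"], len(summary["headers"]))
```

For the real data, call `summarize_expression_file("TCGA_265_mod_gene_LUSC_train.txt")` and check that `samples_per_row` contains a single value; more than one value would mean some rows have a different number of columns.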
- Upload both files to Babelomics 5.0, then go to Expression > Class Prediction.
- Select these parameters:
  - Classifiers: SVM, KNN and Random Forest
  - Error estimation: leave-one-out
  - Feature selection: Correlation-based Feature Selection (CFS)
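To make the leave-one-out setting concrete, here is a minimal sketch of leave-one-out error estimation with a plain k-nearest-neighbours classifier. It is not the Babelomics implementation: the Euclidean distance, the value of `k`, and the two-gene toy data are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Predict the majority label among the k nearest training samples."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_error(samples, labels, k=3):
    """Hold out each sample once, train on the rest, return the error rate."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if knn_predict(train_x, train_y, samples[i], k) != labels[i]:
            errors += 1
    return errors / len(samples)


# Made-up expression values for two genes, not real TCGA data
samples = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9),
           (5.0, 5.2), (5.3, 4.8), (4.9, 5.1)]
labels = ["Normal", "Normal", "Normal", "Tumor", "Tumor", "Tumor"]
print(leave_one_out_error(samples, labels, k=3))
```

With N samples the classifier is trained N times, each time on N - 1 samples, so every sample is used exactly once for testing.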
- Download test_result.txt and answer these questions:
  - Which supervised classification method(s) perform best?
  - How many genes were used for the prediction?
  - Are the selected genes the same for all methods?
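For the last question, the gene lists can be compared with set operations once they have been copied out of the results. The layout of test_result.txt is not described here, so this sketch assumes the lists have already been extracted by hand; the gene names and the function name are hypothetical.

```python
def compare_gene_sets(selected):
    """Return genes shared by all methods and genes unique to each method."""
    sets = {method: set(genes) for method, genes in selected.items()}
    shared = set.intersection(*sets.values())
    unique = {
        method: genes - set().union(
            *(other for name, other in sets.items() if name != method)
        )
        for method, genes in sets.items()
    }
    return shared, unique


# Hypothetical gene lists, for illustration only
selected = {
    "SVM": ["GENE1", "GENE2", "GENE3"],
    "KNN": ["GENE1", "GENE2", "GENE4"],
    "RandomForest": ["GENE1", "GENE2", "GENE3"],
}
shared, unique = compare_gene_sets(selected)
print(sorted(shared))
print({method: sorted(genes) for method, genes in unique.items()})
```

If every `unique` entry is empty and `shared` equals each individual list, the methods used the same genes.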