Activity 1
Goal: We would like to build a predictor that classifies patients as ALL (Acute Lymphoblastic Leukemia) or AML (Acute Myeloid Leukemia).
Data:
- datatraingolub.txt: microarray expression data for two different kinds of leukemia. 30 arrays: 22 ALL and 9 AML. NOTE: It is always recommended to use balanced class sizes to avoid biased training.
- datatestgolub.txt: microarray expression data for several individuals to classify. It contains 6 ALL and 2 AML samples. To check the accuracy of the prediction, compare it with the correct labels for the test file:
ALL ALL ALL ALL ALL ALL AML AML
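The two files can also be inspected programmatically, which helps with counting genes and samples. The following is a minimal sketch, assuming the files are tab-delimited, that metadata lines start with `#`, and that the first column of every other line is the gene identifier. These are assumptions about the layout, not a confirmed description of it, and the `#NAMES` line and the values in the demo data are made up for illustration.

```python
def read_expression(lines):
    """Split lines into header lines (starting with '#') and a gene -> values map."""
    headers, genes = [], {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            headers.append(line)  # assumed metadata/header line
            continue
        fields = line.split("\t")
        genes[fields[0]] = [float(value) for value in fields[1:]]
    return headers, genes


def summarize(genes):
    """Return (number of genes, sorted list of distinct row lengths)."""
    row_lengths = {len(values) for values in genes.values()}
    return len(genes), sorted(row_lengths)


# Tiny made-up example standing in for the real files (hypothetical layout).
demo_train = ["#NAMES\ts1\ts2\ts3", "geneA\t1.0\t2.0\t3.0", "geneB\t0.5\t0.1\t0.9"]
demo_test = ["#NAMES\tt1\tt2", "geneA\t1.5\t2.5", "geneB\t0.2\t0.3"]

train_headers, train_genes = read_expression(demo_train)
test_headers, test_genes = read_expression(demo_test)

print(summarize(train_genes))                 # genes and samples in the training demo
print(summarize(test_genes))                  # genes and samples in the test demo
print(set(train_genes) == set(test_genes))    # True when both files list the same genes
```

To try it on the real data, pass an open file instead of the demo lists, for example `with open("datatraingolub.txt") as fh: headers, genes = read_expression(fh)`. If the real files use a different separator or header convention, the parsing will need to be adjusted.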
Workflow:
- Explore both files from a text editor:
  - How many genes and samples are there in each file? Are the same genes present in both files?
  - Is there anything specific to note about the headers?
- Upload your files to Babelomics 5.0. Go to section Expression > Class Prediction
- Select these parameters:
  - Algorithm: KNN
  - Error estimation: KFold. Repeats: 10; folds: 5
  - Feature selection: Correlation-based Feature Selection (CFS)
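To make the parameters above more concrete, here is a minimal pure-Python sketch of a k-nearest-neighbours classifier evaluated with repeated k-fold cross-validation (10 repeats, 5 folds). It is not Babelomics' implementation: the number of neighbours (`k=3`), the Euclidean distance, the function names, and the toy two-dimensional data are all assumptions chosen for illustration, and the CFS feature selection step is left out.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples closest to `query` (Euclidean)."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def repeated_kfold_error(x, y, k=3, folds=5, repeats=10, seed=0):
    """Mean misclassification rate over `repeats` random splits into `folds` folds."""
    rng = random.Random(seed)
    indices = list(range(len(x)))
    errors = []
    for _ in range(repeats):
        rng.shuffle(indices)  # a fresh random partition for every repeat
        wrong = 0
        for f in range(folds):
            held_out = indices[f::folds]  # every folds-th shuffled index forms one fold
            held_out_set = set(held_out)
            train_x = [x[i] for i in indices if i not in held_out_set]
            train_y = [y[i] for i in indices if i not in held_out_set]
            wrong += sum(knn_predict(train_x, train_y, x[i], k) != y[i] for i in held_out)
        errors.append(wrong / len(x))  # each sample is held out exactly once per repeat
    return sum(errors) / repeats


# Toy data (made up): two well-separated groups standing in for ALL and AML arrays.
toy_x = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (0.0, 0.4), (0.4, 0.1),
         (5.0, 5.1), (5.2, 5.0), (5.1, 5.3), (5.3, 5.2), (5.0, 5.4), (5.4, 5.1)]
toy_y = ["ALL"] * 6 + ["AML"] * 6

cv_error = repeated_kfold_error(toy_x, toy_y)
print(cv_error)
```

Repeating the split several times and averaging makes the error estimate less dependent on one particular random partition, which matters with as few arrays as in this activity.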
Questions:
- Train results:
  - The summary includes three interesting tables and a summary plot. Could you explain the meaning of each of them?
  - How many genes were used for the prediction?
  - Are there any samples that are more difficult to classify?
- Test results:
  - Could you comment on the final results for the group of new individuals?
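When commenting on the test results, it can help to compare the predicted labels with the correct ones systematically. The sketch below assumes the predictions have been copied by hand into a list. The `example_predicted` values are a hypothetical placeholder, not real Babelomics output, and the function name `score` is only illustrative.

```python
from collections import Counter

# Correct labels for the test file, as listed in the Data section.
true_labels = ["ALL"] * 6 + ["AML"] * 2


def score(true, predicted):
    """Return overall accuracy and a Counter of (true, predicted) label pairs."""
    if len(true) != len(predicted):
        raise ValueError("label lists must have the same length")
    confusion = Counter(zip(true, predicted))
    correct = sum(n for (t, p), n in confusion.items() if t == p)
    return correct / len(true), confusion


# Hypothetical predictions for illustration only: replace with your own results.
example_predicted = ["ALL", "ALL", "ALL", "ALL", "ALL", "AML", "AML", "AML"]

accuracy, confusion = score(true_labels, example_predicted)
print(accuracy)                      # fraction of test samples classified correctly
print(confusion[("ALL", "AML")])     # ALL samples predicted as AML
print(confusion[("AML", "ALL")])     # AML samples predicted as ALL
```

Because the test set holds 6 ALL and only 2 AML samples, a predictor that always answers ALL would already reach 6/8 = 0.75 accuracy, so the per-class counts are more informative than the overall accuracy alone.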