Activity 1

Goal: Build a predictor that classifies patients as ALL (Acute Lymphoblastic Leukemia) or AML (Acute Myeloid Leukemia).

Data:

datatraingolub.txt: microarray expression data for two different kinds of leukemia. 30 arrays: 22 ALL and 9 AML. NOTE: Balanced class sizes are recommended to avoid biased training.
datatestgolub.txt: microarray expression data for several individuals to classify. Contains 6 ALL and 2 AML samples. To check the accuracy of the prediction, compare it with the correct labels for the test file:
```
 ALL	ALL	ALL	ALL	ALL	ALL	AML	AML 
```
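Once the predictor has produced its calls, checking them against the labels above is a simple count of matches. The sketch below is only an illustration: the `predicted` list is made up for the example and is not a Babelomics result, and the `accuracy` helper is a hypothetical name.

```python
# Correct labels for the test file, as listed above.
true_labels = "ALL ALL ALL ALL ALL ALL AML AML".split()

def accuracy(true, predicted):
    """Fraction of samples whose predicted label matches the true label."""
    if len(true) != len(predicted):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true, predicted))
    return correct / len(true)

# Hypothetical predictions (invented for illustration): one ALL sample
# is wrongly called AML.
predicted = ["ALL"] * 5 + ["AML"] * 3

print(accuracy(true_labels, predicted))  # 7 of 8 correct -> 0.875
```

With only 8 test samples, each misclassified sample changes the accuracy by 12.5 percentage points, so the figure should be read with caution.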

Workflow:

Explore both files in a text editor:
- How many genes and samples are there in each file? Are the same genes present in both files?
- Is there anything notable about the headers?
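The same checks can also be done with a few lines of code instead of a text editor. The sketch below rests on assumptions about the file layout that you should verify yourself: tab-separated values, header lines starting with `#`, and one gene per row with the gene identifier in the first column. The toy lines and the `#NAMES` header are invented for illustration, and `summarize` is a hypothetical helper.

```python
def summarize(lines):
    """Split an expression file into '#' header lines and gene rows.

    Assumes tab-separated values with the gene ID in the first column.
    """
    header_lines = [ln.rstrip("\n") for ln in lines if ln.startswith("#")]
    rows = [ln.rstrip("\n").split("\t") for ln in lines
            if ln.strip() and not ln.startswith("#")]
    genes = [row[0] for row in rows]
    n_samples = len(rows[0]) - 1 if rows else 0
    return header_lines, genes, n_samples

# Toy stand-ins for the two files (invented data, not the real content).
toy_train = ["#NAMES\ts1\ts2\ts3\n", "geneA\t1.0\t2.0\t3.0\n", "geneB\t0.5\t0.1\t0.9\n"]
toy_test = ["#NAMES\tt1\tt2\n", "geneA\t1.5\t2.5\n", "geneB\t0.2\t0.3\n"]

train_header, train_genes, train_samples = summarize(toy_train)
test_header, test_genes, test_samples = summarize(toy_test)

print(len(train_genes), "genes,", train_samples, "samples in train")
print(len(test_genes), "genes,", test_samples, "samples in test")
print("same genes in both files:", set(train_genes) == set(test_genes))
```

To run it on the real data, replace the toy lists with the lines of each file, for example `open("datatraingolub.txt").readlines()`.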
Upload your files to Babelomics 5.0. Go to the section Expression > Class Prediction.
Select these parameters:
- Algorithm: KNN
- Error estimation: KFold. Repeats: 10; folds: 5
- Correlation-based Feature Selection (CFS)
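To get a feel for what these parameters mean, here is a minimal sketch of a k-nearest-neighbour classifier evaluated with repeated k-fold cross-validation (10 repeats, 5 folds). It is not the Babelomics implementation: the function names, the choice of k = 3, the Euclidean distance, and the toy two-gene data are all assumptions made for illustration, and the CFS feature-selection step is left out.

```python
import math
import random
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Label a query by majority vote among its k nearest training samples."""
    nearest = sorted(zip(train_x, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def repeated_kfold_accuracy(x, y, folds=5, repeats=10, k=3, seed=0):
    """Mean accuracy over `repeats` random splits into `folds` folds."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        idx = list(range(len(x)))
        rng.shuffle(idx)
        for f in range(folds):
            test_idx = idx[f::folds]          # every folds-th shuffled sample
            held_out = set(test_idx)
            train_idx = [i for i in idx if i not in held_out]
            tx = [x[i] for i in train_idx]
            ty = [y[i] for i in train_idx]
            hits = sum(knn_predict(tx, ty, x[i], k) == y[i] for i in test_idx)
            scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)

# Toy expression values for two genes (invented, well separated by class).
x = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
     (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
y = ["ALL"] * 5 + ["AML"] * 5

print(repeated_kfold_accuracy(x, y))  # 1.0 on this cleanly separated toy data
```

In each fold the held-out samples are never used for training, so the averaged score estimates how the classifier would behave on new individuals. Real expression data is far noisier than this toy set, so expect a lower estimate.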

Questions:

Train results:
- The summary includes three tables and a summary plot. Could you explain the meaning of each of them?
- How many genes were used for the prediction?
- Are any samples more difficult to classify than others?
Test results:
- Could you comment on the final results for the group of new individuals?