datatraingolub.txt: microarray expression data for two types of leukemia, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). 30 arrays: 22 ALL and 9 AML. NOTE: Balanced sample sizes are always recommended to avoid biased training.
datatestgolub.txt: microarray expression data for several individuals to classify. Contains 6 ALL and 2 AML samples. To check the accuracy of the prediction, compare it with the correct labels for the test file:
ALL ALL ALL ALL ALL ALL AML AML
Workflow:
Explore both files in a text editor:
How many genes and samples are there in each file? Are the genes the same in both files?
Is there anything particular about the headers?
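The same checks can also be scripted. The following is a minimal sketch, assuming (this is not confirmed by the files themselves) that each file is tab-delimited, that header or annotation lines start with "#", and that every remaining row holds a gene identifier followed by one value per sample. The function name summarize_matrix and the demo rows are hypothetical.

```python
def summarize_matrix(lines, sep="\t", comment="#"):
    """Return header lines, gene IDs, and the sample count of an expression matrix."""
    headers, genes, widths = [], [], set()
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith(comment):
            headers.append(line)  # assumed header/annotation line
            continue
        fields = line.split(sep)
        genes.append(fields[0])      # first column: gene identifier (assumption)
        widths.add(len(fields) - 1)  # remaining columns: one value per sample
    if len(widths) > 1:
        raise ValueError(f"rows have differing numbers of values: {sorted(widths)}")
    n_samples = widths.pop() if widths else 0
    return headers, genes, n_samples


# Tiny made-up example standing in for a real file
demo = [
    "#NAMES\ts1\ts2\ts3\n",
    "geneA\t1.0\t2.0\t3.0\n",
    "geneB\t0.5\t0.1\t0.7\n",
]
headers, genes, n_samples = summarize_matrix(demo)
print(len(genes), "genes;", n_samples, "samples;", len(headers), "header line(s)")
```

With the real files, pass an open file handle instead of the demo list (for example open("datatraingolub.txt")), then compare the two gene lists with set(train_genes) == set(test_genes).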
Upload your files to Babelomics 5.0, then go to the section Expression > Class Prediction.
Select these parameters:
Algorithm: KNN
Error estimation: KFold. Repeats: 10; folds: 5
Feature selection: Correlation-based Feature Selection (CFS)
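To make these parameters concrete, here is a minimal sketch of a KNN classifier evaluated with repeated K-fold cross-validation (10 repeats, 5 folds). It is not Babelomics' implementation. The number of neighbours (k=3), the Euclidean distance, the function names, and the toy data are all assumptions chosen for illustration. CFS is omitted; in a careful procedure, feature selection would be run inside each training fold.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples nearest to `query` (Euclidean)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def repeated_kfold_error(x, y, k=3, folds=5, repeats=10, seed=0):
    """Mean misclassification rate over `repeats` random `folds`-fold splits."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    errors = []
    for _ in range(repeats):
        rng.shuffle(idx)  # new random partition for every repeat
        wrong = 0
        for f in range(folds):
            held_out = idx[f::folds]  # every folds-th shuffled index forms one fold
            held = set(held_out)
            train = [i for i in idx if i not in held]
            train_x = [x[i] for i in train]
            train_y = [y[i] for i in train]
            wrong += sum(knn_predict(train_x, train_y, x[i], k) != y[i]
                         for i in held_out)
        errors.append(wrong / len(x))  # each sample is tested once per repeat
    return sum(errors) / repeats


# Made-up, well-separated toy data: 2 "genes", 10 samples
x = [[0.1 * i, 0.0] for i in range(5)] + [[5.0 + 0.1 * i, 5.0] for i in range(5)]
y = ["ALL"] * 5 + ["AML"] * 5
cv_error = repeated_kfold_error(x, y, k=3, folds=5, repeats=10)
print("mean cross-validation error:", cv_error)
```

Repeating the K-fold split with different random partitions reduces the dependence of the error estimate on any single split, which matters with only a few dozen arrays.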
Questions:
Train results:
The summary includes three interesting tables and a summary plot. Could you explain the meaning of each of them?
How many genes were used for the prediction?
Are there any samples that are more difficult to classify?
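One way to think about "difficult" samples is to count how often each one is misclassified across the cross-validation repeats. The sketch below assumes you have the predicted label of every sample for each repeat. The function name, the labels, and the predictions are made up for illustration and are not Babelomics output.

```python
def per_sample_error_rate(true_labels, predictions_by_repeat):
    """Fraction of repeats in which each sample was misclassified."""
    n_repeats = len(predictions_by_repeat)
    rates = []
    for i, truth in enumerate(true_labels):
        misses = sum(preds[i] != truth for preds in predictions_by_repeat)
        rates.append(misses / n_repeats)
    return rates


# Made-up labels and predictions for 4 samples over 3 repeats
truth = ["ALL", "ALL", "AML", "AML"]
runs = [
    ["ALL", "ALL", "AML", "ALL"],
    ["ALL", "AML", "AML", "ALL"],
    ["ALL", "ALL", "AML", "ALL"],
]
rates = per_sample_error_rate(truth, runs)
# Flag samples that are wrong in more than half of the repeats
hardest = [i for i, r in enumerate(rates) if r > 0.5]
print(rates, hardest)
```

A sample that is misclassified in most repeats is worth a closer look: it may be mislabeled, of poor quality, or biologically atypical.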
Test results:
Could you comment on the final results for the group of new individuals?
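To quantify the test results, the predicted classes can be compared with the correct labels listed for datatestgolub.txt. The following is a minimal sketch: the true labels come from this exercise, while the predicted labels are hypothetical placeholders to be replaced with the classes reported by Babelomics.

```python
from collections import Counter

# Correct labels for the test file, as given in this exercise
true_labels = "ALL ALL ALL ALL ALL ALL AML AML".split()


def confusion_counts(truth, predicted):
    """Count (true class, predicted class) pairs."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    return Counter(zip(truth, predicted))


def accuracy(truth, predicted):
    """Fraction of samples whose predicted class matches the true class."""
    counts = confusion_counts(truth, predicted)
    correct = sum(n for (t, p), n in counts.items() if t == p)
    return correct / len(truth)


# Hypothetical predictions (NOT real results): replace with your own output
predicted = "ALL ALL ALL ALL ALL AML AML AML".split()

counts = confusion_counts(true_labels, predicted)
for (t, p), n in sorted(counts.items()):
    print(f"true {t} -> predicted {p}: {n}")
print("accuracy:", accuracy(true_labels, predicted))
```

With only 8 test samples, each misclassification changes the accuracy by 12.5 percentage points, so the per-class counts are more informative than the single accuracy figure.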