Activity 1

Goal: Build a predictor that classifies patients as ALL (Acute Lymphoblastic Leukemia) or AML (Acute Myeloid Leukemia).

Data:

  • datatraingolub.txt: microarray expression data for two different kinds of leukemia. 30 arrays: 22 ALL and 9 AML. NOTE: Balanced class sizes are generally recommended to avoid biased training.
  • datatestgolub.txt: microarray expression data for several individuals to classify. It contains 6 ALL and 2 AML samples. To check the accuracy of the prediction, compare it against the correct labels for the test file:
     ALL	ALL	ALL	ALL	ALL	ALL	AML	AML 
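
The gene and sample counts of each file can also be checked with a short script instead of a text editor. The following is only a minimal sketch under assumptions not stated in this handout: the files are taken to be tab-separated, with one gene per row, the gene identifier in the first column, and any header or metadata lines starting with "#". The function name summarize and the toy strings (including the "#NAMES" line) are made up for illustration and are not the real Golub data.

```python
def summarize(text, sep="\t", comment="#"):
    """Count header lines, genes (rows) and samples (columns) in a matrix file."""
    headers, genes, sample_counts = [], [], set()
    for line in text.splitlines():
        if not line.strip():
            continue                          # skip blank lines
        if line.startswith(comment):
            headers.append(line)              # assumed header/metadata line
            continue
        fields = line.split(sep)
        genes.append(fields[0])               # assumed: first column is the gene ID
        sample_counts.add(len(fields) - 1)    # remaining columns: one per sample
    return {
        "headers": headers,
        "genes": genes,
        "sample_counts": sorted(sample_counts),
    }


# Toy stand-ins for the two files (hypothetical content).
toy_train = "#NAMES\ts1\ts2\ts3\nGENE_A\t1.0\t2.0\t3.0\nGENE_B\t0.5\t0.1\t0.9\n"
toy_test = "#NAMES\tt1\tt2\nGENE_A\t1.5\t2.5\nGENE_B\t0.7\t0.2\n"

train = summarize(toy_train)
test = summarize(toy_test)
same_genes = train["genes"] == test["genes"]  # same genes, same order?

print(len(train["genes"]), train["sample_counts"], same_genes)  # 2 [3] True

# For the real files, something like:
#   train = summarize(open("datatraingolub.txt").read())
```

If sample_counts contains more than one value, the rows do not all have the same number of columns, which is worth inspecting before uploading the file.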

Workflow:

  1. Explore both files from a text editor:
    • How many genes and samples are there in each file? Are the same genes present in both files?
    • Is there anything specific about the headers?
  2. Upload your files to Babelomics 5.0 and go to the section Expression > Class Prediction.
  3. Select these parameters:
    • Algorithm: KNN
    • Error estimation: KFold. Repeats: 10; folds: 5
    • Feature selection: Correlation-based Feature Selection (CFS)
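
To make the error estimation settings above more concrete, the sketch below shows what a KNN classifier evaluated with 5 folds and 10 repeats amounts to. It is not the Babelomics implementation, which this handout does not describe. The number of neighbours (k=3), the Euclidean distance, the function names and the toy two-gene data are all assumptions for illustration, and the CFS feature selection step is left out.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples closest to query."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def repeated_kfold_accuracy(x, y, folds=5, repeats=10, k=3, seed=0):
    """Mean accuracy over `repeats` random splits into `folds` folds."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        idx = list(range(len(x)))
        rng.shuffle(idx)                      # new random partition per repeat
        correct = 0
        for f in range(folds):
            held_out = idx[f::folds]          # samples predicted in this fold
            held_set = set(held_out)
            fit = [i for i in idx if i not in held_set]
            fit_x = [x[i] for i in fit]
            fit_y = [y[i] for i in fit]
            correct += sum(
                knn_predict(fit_x, fit_y, x[i], k) == y[i] for i in held_out
            )
        accuracies.append(correct / len(x))   # each sample is predicted once
    return sum(accuracies) / len(accuracies)


# Toy data: two well-separated groups described by two made-up "genes".
x = [[0.1 * i, 0.0] for i in range(10)] + [[5.0 + 0.1 * i, 5.0] for i in range(10)]
y = ["ALL"] * 10 + ["AML"] * 10

acc = repeated_kfold_accuracy(x, y, folds=5, repeats=10, k=3)
print(acc)
```

Within one repeat every sample is held out exactly once, so a sample that is misclassified in many of the 10 repeats is a candidate for a "difficult" sample.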

Questions:

  1. Train results:
    • The summary includes three tables and a summary plot. Could you explain the meaning of each of them?
    • How many genes were used for the prediction?
    • Are there any samples that are more difficult to classify?
  2. Test results:
    • Could you comment on the final results for the group of new individuals?
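
When commenting on the test results, it can help to tabulate the predictions against the correct labels given in the Data section. The sketch below assumes the predictions have been copied by hand into a list. The predicted list shown here is a made-up placeholder, not an actual Babelomics result, and should be replaced with your own output.

```python
from collections import Counter

# Correct labels for datatestgolub.txt, as listed in the Data section.
true_labels = ["ALL", "ALL", "ALL", "ALL", "ALL", "ALL", "AML", "AML"]

# Hypothetical predictions for illustration only; replace with your results.
predicted = ["ALL", "ALL", "ALL", "ALL", "ALL", "AML", "AML", "AML"]

# Count each (true label, predicted label) pair: a small confusion table.
counts = Counter(zip(true_labels, predicted))

# Accuracy: fraction of samples whose prediction matches the true label.
accuracy = sum(n for (t, p), n in counts.items() if t == p) / len(true_labels)

for (t, p), n in sorted(counts.items()):
    print(f"true={t} predicted={p}: {n}")
print(f"accuracy = {accuracy}")  # 0.875 for the placeholder predictions
```

With only 8 test samples, a single misclassification changes the accuracy by 0.125, so the per-class counts are usually more informative than the overall figure.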