Preprocessing Data-Matrix Methods

We present a tool for preprocessing microarray gene expression or SNP data. The purpose of this step is to bring your data into a distribution that is suitable for the later steps of the analysis. The tool analyses the data, suggests the most appropriate transformations, and applies them once the user agrees. The usual preprocessing steps include:

  • Scale: logarithmic transformations
    Microarray data is usually evaluated by looking at ratios. This can be the ratio between two conditions on the same array, or the ratio between the absolute values from single-dye experiments. Taking a ratio cancels out the systematic errors of the spot. However, although ratios provide an intuitive measure of expression changes, they have the disadvantage of treating up- and downregulated genes differently. Genes upregulated by a factor of 2 have an expression ratio of 2, whereas those downregulated by the same factor have an expression ratio of 0.5. Upregulated genes therefore span the range from 1 upwards, while all downregulated genes are squeezed between 0 and 1, so a graph of raw ratios gives the upregulated genes a much wider range. A log transformation turns this positively skewed data into a more symmetrical distribution around 0 (usually close to a normal distribution): with base 2, a ratio of 2 becomes 1 and a ratio of 0.5 becomes -1. In the resulting graph, up- and downregulated genes are treated in a similar fashion and occupy similar parts of the graph.
  • Replicate handling
    Usually the data matrix contains replicated measurements. For further analysis, these replicates have to be merged into single values. Here we present a module for merging replicates, where the final value can be either the average or the median.
  • Management of missing values
    • Filter missing values if the percentage of existing values is below a Minimum (%)
    • Impute missing values, filling them with 0, the row average, the row median, or a value from the KNN impute method. KNN impute is a standard missing value imputation method that takes advantage of the correlation structure in microarray data: it selects genes with expression profiles similar to the gene of interest and uses them to impute the missing values. The KNN method is relatively insensitive to the exact value of K when K is in the range of 10-20 neighbours.
  • Filter genes by names
    In this step we can replace our platform probe IDs with other gene names or IDs. A list of gene names has to be provided.
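
The scale, replicate and missing-value steps above can be sketched in a few lines of plain Python. This is only an illustration and not the tool's own implementation: the function names, the use of None to mark a missing value, the choice of log base 2, and the reading of the Minimum (%) filter as a per-row filter are all assumptions made for the example.

```python
import math
from statistics import mean, median


def log2_ratios(rows):
    """Log2-transform every ratio; None (missing) and non-positive values become None."""
    return [
        [math.log2(v) if v is not None and v > 0 else None for v in row]
        for row in rows
    ]


def merge_replicates(row, groups, how=median):
    """Collapse replicate columns into one value per group.

    `groups` lists the column indices that belong to each replicate group
    (a hypothetical layout). `how` is the summary function: mean or median.
    """
    merged = []
    for cols in groups:
        present = [row[c] for c in cols if row[c] is not None]
        merged.append(how(present) if present else None)
    return merged


def filter_by_presence(rows, min_percent):
    """Keep only rows whose percentage of existing values is at least min_percent."""
    kept = []
    for row in rows:
        present = sum(v is not None for v in row)
        if 100.0 * present / len(row) >= min_percent:
            kept.append(row)
    return kept


def impute_row(row, how=mean):
    """Fill missing values with the row average (or the row median if how=median)."""
    present = [v for v in row if v is not None]
    fill = how(present) if present else 0.0
    return [fill if v is None else v for v in row]


# Toy matrix of expression ratios: two genes, two conditions with two replicates each.
ratios = [
    [2.0, 0.5, 4.0, None],
    [1.0, None, None, None],
]
groups = [(0, 1), (2, 3)]

logged = log2_ratios(ratios)
merged = [merge_replicates(row, groups) for row in logged]
filtered = filter_by_presence(merged, min_percent=50)
completed = [impute_row(row) for row in filtered]
```

In the toy matrix, the ratios 2 and 0.5 become 1 and -1 after the log2 step, which is the symmetry around 0 described above. Passing how=mean to merge_replicates, or how=median to impute_row, selects the other option named in the text.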

The processed data set can be sent to other pattern analysis tools.
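
For the KNN option listed among the imputation methods, the idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the tool's implementation: the function name knn_impute is hypothetical, None marks a missing value, similarity is measured as a root-mean-square distance over the columns both rows have in common, and the K neighbours are averaged without weighting. The default k=10 is taken from the 10-20 range mentioned in the text.

```python
import math


def knn_impute(rows, k=10):
    """Fill each missing value with the mean of that column over the k most similar rows."""

    def distance(a, b):
        # Compare two rows only on the columns where both have a value.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other rows that have a value in column j
            # and share at least one observed column with the current row.
            candidates = []
            for n, other in enumerate(rows):
                if n == i or other[j] is None:
                    continue
                d = distance(row, other)
                if d != math.inf:
                    candidates.append((d, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                # Unweighted average of the neighbours' values (a simplification).
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


# Toy example: the first gene is missing its third value.
profiles = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [5.0, -4.0, -9.0],
]
filled = knn_impute(profiles, k=2)
```

With k=2, the missing value is filled from the second and third profiles, which are the closest to the first one, and the dissimilar fourth profile is ignored. Values that cannot be imputed because no suitable neighbour exists are left as None.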

datamatrix_methods.txt · Last modified: 2017/05/24 10:36 (external edit)