Microarray Normalization
DNA microarray technologies allow the simultaneous measurement of thousands of genomic features, such as gene expression, copy number variation or SNP variants. The accuracy and reproducibility of microarray measurements have been extensively validated in recent years. Even so, any set of microarray measurements contains technological artifacts that may hide the true biological signal. These distortions can affect the signal within each array and also across the different arrays in the set.
Causes of non-biological variation in microarray measurements include differences in sample preparation and in the hybridization process, as well as dye biases, cross-hybridization and scanner differences.
The goal of normalization is to adjust for the effects that are due to variations in the technology rather than the biology.
When starting a new microarray analysis, we first have to explore the raw data to detect artifacts that we may want to correct. The easiest way to assess such artifacts is to look at plots of the raw data. We then transform the raw data to correct the undesirable effects. After the transformation we need to check whether the modified data (the normalized data, in microarray terminology) are free of the artifacts we wanted to remove. This assessment uses the same plots we used to interpret the raw data: we check that the undesired characteristics are gone.
There are two general assumptions about the data upon which Babelomics' normalization methods rely:
These assumptions seem reasonable in general microarray experiments that record whole-genome expression data. Nevertheless, researchers should review their particular experimental context before using the methodologies presented here.
What these hypotheses basically imply, in statistical terms, is that if there were no technical artifacts in the data, there should be no general trend or pattern in the gene differences between any two samples of the data set. Most normalization algorithms exploit this fact by first fitting the trend of the raw data and then correcting the data for that trend.
Thus it seems clear that, at some point in the normalization process, all the arrays in the dataset must be treated together. The general rule of microarray data processing is to normalize together all the microarrays that are going to be analyzed together later.
Originally, microarrays were used under the two color scheme of competitive hybridization. Here, two biological samples, labeled with different fluorescent dyes, are hybridized onto the same array slide. The attachment of the genomic material to the glass preserves the proportion of the molecular concentrations in the samples. Thus, the intensities measured in the two channels represent the abundance of molecules in one sample relative to the other.
Generally, in two color studies, the logarithm of the ratio of the two intensity channels (the log ratio) is reported as a summary of the differences between the two samples. If the assumptions mentioned above hold, the statistical distribution of the log ratios of each array will be centered at zero and the variability across arrays will be similar.
Many microarray applications still use this two color approach, but some newer microarray technologies are hybridized using a single channel protocol. They use only one type of dye to label each sample, which is hybridized on its own on the array. In this approach, each microarray yields intensity measurements that represent the absolute abundance of molecules in a single biological sample.
There are many different microarray platforms or manufacturers. Each of them uses its own technology and design to build its microarrays and suggests different hybridization protocols. Hence, microarrays from different platforms have particularities that the normalization algorithms need to take into account.
But the platform is not the only thing that defines the data we get from the microarray. The scanner used to read the chip and the image processing software determine the final raw data format. Some manufacturers, like Affymetrix or Agilent, provide their own scanner along with the microarray slides. Other platforms (including home-made ones) produce microarrays to be read in general purpose scanners like those of GenePix.
Hence, the first thing we require from normalization software is that it can read the format of our raw data files.
Babelomics can read 5 file formats from 3 different platforms (or, more appropriately, from 3 different scanners):
In Babelomics (as in general microarray contexts) we consider such files to be the raw data of the microarray experiment: the starting point of the data analysis process. However, this does not mean you have to normalize your data using Babelomics. If you already have pre-processed data, you can input the normalized values into Babelomics to perform further steps in your analysis.
There are four general steps to be followed in microarray data preprocessing. Not all of them are necessary in all contexts; for instance, the Within Array Correction is not necessary, or even meaningful, for one color arrays.
These steps are described below in a particular order, which is useful for understanding what has to be done when normalizing microarray data. In practice, however, most methodologies or algorithms have an effect on several of them.
The aim of this step is to correct for what is usually known as the background effect, that is, any source of technical variation reflected in a spatial pattern of the intensity measurements.
Probes or features are usually scattered randomly across the surface of the microarray. Therefore, there is no biological reason to expect a spatial effect or trend in the intensity measurements; any such effect may be caused by irregularities in the glass surface, differences in hybridization efficiency, array washing problems or scanner effects.
In two color microarrays the background effect may affect each of the intensity channels differently. Hence, the background correction has to deal with each of the colors separately. Similarly, the background effect will differ between arrays in the experiment, so each array needs its own correction. Nevertheless, this does not mean that background correction algorithms deal with the arrays one at a time. Some background correction methodologies, like RMA, use information from all the arrays in the experiment to correct each of them. Hence the final corrected values for one array will change if it is normalized within a different dataset.
Some array platforms, like Agilent or other spotted arrays scanned using GenePix, provide a local background estimate for each of the features. In the background correction step, the normalization algorithms use this information to make a first correction of the foreground measurements. This first correction affects each feature (within each color) independently of the others.
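As an illustration of this feature-by-feature correction, here is a minimal sketch in Python. It assumes the simplest possible scheme: plain subtraction of the local background from the foreground. The function name `background_correct`, the `floor` parameter and the example intensities are hypothetical and are not taken from Babelomics.

```python
def background_correct(foreground, background, floor=1.0):
    """Subtract each feature's local background from its foreground intensity.

    Results below `floor` are raised to `floor` so that a logarithm can
    still be taken later. The floor is an assumption of this sketch,
    not a Babelomics setting.
    """
    if len(foreground) != len(background):
        raise ValueError("foreground and background must have the same length")
    return [max(fg - bg, floor) for fg, bg in zip(foreground, background)]


# Hypothetical red-channel readings for four spots of one array.
red_fg = [520.0, 130.0, 2050.0, 95.0]
red_bg = [100.0, 120.0, 110.0, 105.0]
red_corrected = background_correct(red_fg, red_bg)
print(red_corrected)
```

In a two color array the same function would be called once per channel, since each color has its own foreground and background estimates.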
Affymetrix array designs do not have a background estimate for each probe. Instead, they include a mismatch (MM) probe or set of probes to correct for cross-hybridization or non-specific binding. Background correction algorithms may take advantage of these probes.
In two color microarray technologies, two biological samples, each labeled with a different dye, are hybridized onto the same chip. Ideally, the ratios of the two intensities are representative of the concentration ratios of the genes in the two samples.
But differences in the processing of the two samples, in the efficiency of the dye labeling, or in how the scanner reads the red and the green channels may end up distorting these ratios. This distortion is known as dye bias.
Dye bias correction deals with non-biological differences between the two channel intensities of each array. It is the first aim of the within array correction, but this step also has a second purpose: summarization. The two color signals of each gene or feature in the array are merged into a single measurement by computing the log ratio of the two intensity measurements. This log ratio is generally called the M-value.
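The log ratio computation can be sketched in a few lines of Python. The function name `ma_values` and the intensities are hypothetical. The sketch also returns the average log intensity, commonly called the A-value, which is not discussed in the text but is the usual companion of the M-value when looking for intensity-dependent dye bias.

```python
import math


def ma_values(red, green):
    """Return (M, A) for paired, background-corrected channel intensities.

    M = log2(red / green)                  -> log ratio of the two channels
    A = (log2(red) + log2(green)) / 2      -> average log intensity
    """
    if len(red) != len(green):
        raise ValueError("both channels must have the same number of features")
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a


# Hypothetical intensities for three features of one array.
red = [1024.0, 256.0, 512.0]
green = [256.0, 256.0, 2048.0]
m, a = ma_values(red, green)
print(m)  # positive: brighter in red; negative: brighter in green
print(a)
```

Intensities must be strictly positive for the logarithm to be defined, which is one reason background-corrected values are usually kept above a small positive floor.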
After the log ratio transformation, the M-values should have a distribution centered around zero. Some normalization algorithms, like loess normalization, use this to fit the trend of the noise and correct for it. This transformation relies upon the general assumption that a similar number of genes will have increased or decreased expression levels in one channel relative to the other.
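The "fit the trend, then subtract it" idea can be sketched as follows. This is not loess: to stay within the Python standard library, the smooth local regression is replaced by a much cruder stand-in, the median M-value within bins of average log intensity. The function name, the `n_bins` parameter and the example values are all hypothetical.

```python
from statistics import median


def binned_median_correct(m, a, n_bins=4):
    """Subtract from each M-value the median M of its intensity bin.

    `m` holds log ratios and `a` average log intensities, one per feature.
    Features are sorted by `a` and split into `n_bins` consecutive bins;
    the median M of each bin plays the role of the fitted trend.
    """
    if len(m) != len(a):
        raise ValueError("m and a must have the same length")
    order = sorted(range(len(m)), key=lambda i: a[i])
    bin_size = max(1, -(-len(m) // n_bins))  # ceiling division
    corrected = [0.0] * len(m)
    for start in range(0, len(order), bin_size):
        idx = order[start:start + bin_size]
        trend = median(m[i] for i in idx)
        for i in idx:
            corrected[i] = m[i] - trend
    return corrected


# Hypothetical array whose log ratios drift downwards as intensity grows.
a = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
m = [0.9, 1.1, 0.4, 0.6, -0.1, 0.1, -0.6, -0.4]
corrected = binned_median_correct(m, a, n_bins=4)
print(corrected)
```

After the correction, the M-values in every intensity bin are centered on zero, which is the property the assumption above says unbiased data should have.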
In this step, measurements from all microarrays are rescaled into a single final distribution. This is necessary in order to calibrate data from different samples against each other; otherwise, any analysis of the data will be meaningless.
Generally, a consensus distribution is defined from the dataset and then the data from each array are transformed into that distribution. This is the basis of, for instance, quantile normalization.
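A minimal sketch of quantile normalization in plain Python is shown below. It assumes that all arrays have the same number of features and no missing values, and it takes the consensus distribution to be the mean of the k-th smallest values across arrays. Tied values are not averaged here, although dedicated implementations usually handle ties. The function name and the data are hypothetical.

```python
def quantile_normalize(arrays):
    """Force every array onto the same (consensus) distribution.

    `arrays` is a list of equal-length lists, one list per array.
    Each value is replaced by the consensus value of its rank.
    """
    n = len(arrays[0])
    if any(len(arr) != n for arr in arrays):
        raise ValueError("all arrays must have the same number of features")
    sorted_arrays = [sorted(arr) for arr in arrays]
    # Consensus: mean of the k-th smallest value over all arrays.
    consensus = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for arr in arrays:
        order = sorted(range(n), key=lambda i: arr[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = consensus[rank]
        normalized.append(out)
    return normalized


# Three hypothetical arrays measuring the same four features.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [8.0, 1.0, 5.0, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
for arr in normalized:
    print(arr)
```

Every normalized array contains exactly the same set of values; only the assignment of values to features differs, following the ranks in the original array.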
If we succeed in removing dye bias in two color arrays, all array median values should be centered at zero. The only thing that the Between Array Scaling then has to do is standardize their variabilities.
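One way to standardize variabilities can be sketched as follows, assuming that spread is measured by the median absolute deviation (MAD) and that every array is rescaled to the median MAD of the dataset. Both choices are assumptions made for illustration, not a description of what Babelomics does, and the function names are hypothetical.

```python
from statistics import median


def mad(values):
    """Median absolute deviation from the median."""
    center = median(values)
    return median(abs(v - center) for v in values)


def scale_to_common_mad(m_arrays):
    """Rescale each array of M-values so that all arrays share one MAD.

    Assumes the arrays are already centered at zero, so multiplying by
    a constant changes their spread but not their center.
    """
    mads = [mad(arr) for arr in m_arrays]
    if any(s == 0 for s in mads):
        raise ValueError("an array with zero MAD cannot be rescaled")
    target = median(mads)
    return [[v * target / s for v in arr] for arr, s in zip(m_arrays, mads)]


# Three hypothetical arrays, all centered at zero but with different spread.
m_arrays = [
    [-1.0, -0.5, 0.0, 0.5, 1.0],
    [-4.0, -2.0, 0.0, 2.0, 4.0],
    [-2.0, -1.0, 0.0, 1.0, 2.0],
]
scaled = scale_to_common_mad(m_arrays)
print([mad(arr) for arr in scaled])
```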
This is the final step (at least conceptually) in the pre-processing of microarray data.
In general array designs there may be several probes or spots designed to hybridize with the same gene, transcript or biological feature. There may also be control spots designed for quality checking, background signal estimation or to measure cross hybridization.
In this final step, array intensities are summarized into a final measurement for each biological feature of interest in the study. If, for instance, we are doing an experiment at the gene level, all probes matching a gene will be averaged in some way into a single number reflecting the expression of that gene. If we are investigating at the exon level, the probes of the array will be summarized for each exon.
Also in this final step, control spots in the array are removed so that only biological measurements remain in the normalized data.
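The summarization and the removal of control spots can be sketched together for a single array. The median is used here as the summary statistic, which is only one of several reasonable choices. The function name, the probe identifiers and the probe-to-gene annotation are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def summarize_by_feature(probe_values, probe_to_feature, control_probes=()):
    """Collapse probe-level values into one value per biological feature.

    `probe_values` maps probe id -> normalized value for one array.
    `probe_to_feature` maps probe id -> gene (or exon, transcript...).
    Probes listed in `control_probes` are dropped before summarizing.
    """
    grouped = defaultdict(list)
    for probe, value in probe_values.items():
        if probe in control_probes:
            continue
        grouped[probe_to_feature[probe]].append(value)
    return {feature: median(values) for feature, values in grouped.items()}


# Hypothetical array: three probes for geneA, one for geneB, one control spot.
probe_values = {"p1": 8.0, "p2": 9.0, "p3": 10.0, "p4": 5.0, "ctrl1": 14.0}
probe_to_feature = {"p1": "geneA", "p2": "geneA", "p3": "geneA", "p4": "geneB"}
summary = summarize_by_feature(probe_values, probe_to_feature, {"ctrl1"})
print(summary)
```

Changing the annotation in `probe_to_feature` from genes to exons is all that is needed to summarize the same probes at the exon level.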
A good example of how the steps mentioned above are not always performed in the order described is that control spots are generally removed at the very beginning of the analysis, even before background correction, so that they do not influence further transformations of the data.
Babelomics stores the normalized intensity measurements from all arrays in a single normalized data matrix. In this matrix, genes are arranged in rows and arrays (or experiments) in columns.
This matrix can be downloaded as a tab-delimited text file or redirected to other Babelomics modules for further analysis.
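For readers who want to process such a matrix outside Babelomics, here is a minimal sketch of writing and reading a genes-by-arrays matrix as tab-delimited text with Python's standard `csv` module. The exact layout (a header row whose first cell is `feature`, followed by the array names) is an assumption made for this example; check the downloaded file for its actual header and any comment lines.

```python
import csv
import io


def write_matrix(rows, array_names, handle):
    """Write one row per gene and one column per array, tab-delimited."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["feature"] + list(array_names))
    for feature, values in rows.items():
        writer.writerow([feature] + list(values))


def read_matrix(handle):
    """Read the layout produced by write_matrix back into Python objects."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    array_names = header[1:]
    rows = {record[0]: [float(x) for x in record[1:]] for record in reader}
    return array_names, rows


# Hypothetical normalized matrix: two genes measured on three arrays.
rows = {"geneA": [9.0, 8.5, 9.25], "geneB": [5.0, 5.5, 4.75]}
array_names = ["array1", "array2", "array3"]

buffer = io.StringIO()
write_matrix(rows, array_names, buffer)
buffer.seek(0)
names_back, rows_back = read_matrix(buffer)
print(names_back)
print(rows_back)
```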
To assess how well the normalization has performed on your data, some plots are also provided with the normalized data. These plots are devised to represent the distribution of the array data and can be used to compare datasets normalized using different methodologies. They can also be used to compare normalized data with raw data and see how much the normalization reduced the noise.
General plots provided after normalization will be:
In the Example Datasets section of this tutorial you can find several raw data files from different platforms. Download them and have a go with the tool.