The purpose of this set of tests is to study the association between genetic markers and phenotypes or traits. In general, population association studies aim to identify patterns of polymorphisms that vary systematically between individuals with different disease states and could therefore represent the effects of risk-enhancing or protective alleles.

The strength of the association between genotype and phenotype can be assessed with the different tests proposed in this section. The choice of test depends mainly on the type of input data.

(More details on each specific test are given below this introduction.)

The **chi-square** test statistic is designed **to test** the null hypothesis that there is no **association** between the rows and columns of a contingency table. For example, to determine whether there is an association between a particular SNP variant and phenotype (case/control), one might collect data that can be assembled into a 2×2 table. In this case, the two columns are defined by whether the subject has the disease (case) or not (control), while the rows represent the two alleles of the SNP. The cells of the table contain the number of observations defined by these two variables.

For every SNP, a 2×2 contingency table is built by counting the number of times each allele of the SNP appears in case and in control samples. The test then checks whether the allele proportions differ between the two phenotype groups (case and control).
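The counting step described above can be sketched as follows. This is only an illustration, assuming genotypes are given as two-character strings and phenotypes as `"case"` / `"control"` labels; the function name `allele_table` and the input layout are hypothetical and not part of the tool.

```python
from collections import Counter

def allele_table(genotypes, phenotypes, alleles=("A", "T")):
    """Return the 2x2 allele-count table [[a, b], [c, d]].

    Rows follow `alleles`, columns are (case, control).
    """
    counts = Counter()
    for genotype, phenotype in zip(genotypes, phenotypes):
        for allele in genotype:  # each sample contributes two alleles
            counts[(allele, phenotype)] += 1
    return [[counts[(allele, group)] for group in ("case", "control")]
            for allele in alleles]

# Hypothetical toy data: three cases followed by three controls.
genotypes = ["AA", "AT", "AA", "AT", "TT", "TT"]
phenotypes = ["case", "case", "case", "control", "control", "control"]
table = allele_table(genotypes, phenotypes)
print(table)  # [[5, 1], [1, 5]]
```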

The statistic is the sum, over all cells, of the squared difference between the observed and expected counts divided by the expected count. When the observed counts deviate significantly from the expected counts, the null hypothesis is unlikely to be true and a row-column association is likely. Conversely, a small chi-square value indicates that the observed values are close to the expected values, so the null hypothesis remains plausible.

In terms of p-values, a chi-square p-value of 0.05 or less is conventionally taken as justification for rejecting the null hypothesis that the row variable is unrelated to the column variable.
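As a minimal sketch of the calculation just described, the snippet below computes the statistic and its p-value for a 2×2 table using only the Python standard library. It assumes one degree of freedom and no continuity correction; the counts are made up for illustration, and the function name is hypothetical.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the upper tail of the chi-square
    # distribution equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: rows are alleles, columns are case / control.
statistic, p_value = chi_square_2x2([[30, 20], [10, 40]])
print(round(statistic, 3), p_value < 0.05)  # 16.667 True
```

Note that the function fails with a division by zero if any row or column total is zero, which is acceptable for a sketch but would need handling in real data.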

**Example:**
Observed values for data presented in a 2×2 contingency table (columns represent phenotype, rows genotype)

| | Case | Control | Total |
|---|---|---|---|
| allele A | a | b | a+b |
| allele T | c | d | c+d |
| Total | a+c | b+d | n |

**HINT**: When the table contains a **small number of counts**, the **chi-square test statistic may not be appropriate**. Specifically, it has been recommended that this test not be used if any cell in the table has an expected count of less than one, or if more than 20 percent of the cells have an expected count of less than five. In this scenario, *Fisher's exact* test is recommended instead.
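A small helper can make this check concrete. The sketch below applies the commonly cited rule of thumb (no expected count below one, and at most 20 percent of cells with an expected count below five); the function names are hypothetical and the thresholds are taken from that rule, not from the tool itself.

```python
def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_is_appropriate(table):
    """Apply the rule of thumb to the expected counts of a table."""
    cells = [e for row in expected_counts(table) for e in row]
    if min(cells) < 1:
        return False
    small = sum(1 for e in cells if e < 5)
    return small / len(cells) <= 0.20

print(chi_square_is_appropriate([[30, 20], [10, 40]]))  # True
print(chi_square_is_appropriate([[3, 1], [1, 3]]))      # False
```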

The purpose of Fisher's exact test is similar to that of the chi-square test: it studies the **association between genotype** and **disease trait** (phenotype) using contingency tables. Unlike the chi-square test, however, Fisher's exact test is used when the **sample sizes are small**.
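To illustrate how an exact p-value is obtained for a 2×2 table, here is a minimal standard-library sketch. It assumes the two-sided definition that sums the probabilities of all tables (with the same margins) that are no more likely than the observed one; the tool's own implementation may use a different convention, and the function name is hypothetical.

```python
from math import comb

def fisher_exact_2x2(table):
    """Two-sided Fisher's exact p-value for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def weight(x):
        # Number of ways to obtain x in the top-left cell with fixed margins
        # (the hypergeometric probability times comb(n, col1)).
        return comb(row1, x) * comb(row2, col1 - x)

    low = max(0, col1 - row2)
    high = min(row1, col1)
    observed = weight(a)
    # Sum the weights of all tables at most as likely as the observed one.
    extreme = sum(weight(x) for x in range(low, high + 1)
                  if weight(x) <= observed)
    return extreme / comb(n, col1)

p_value = fisher_exact_2x2([[3, 1], [1, 3]])
print(round(p_value, 4))  # 0.4857
```

Working with integer weights rather than floating-point probabilities avoids rounding problems when comparing tables that are equally likely.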

We provide p-values and adjusted (corrected) p-values to assess significance when testing the null hypothesis that there is no association between the variables…
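The text does not state which correction method is applied. Purely as an illustration of what an adjusted p-value is, the sketch below uses the Benjamini-Hochberg procedure, which is an assumption here and may differ from the method the tool actually uses.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four SNPs.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print([round(p, 4) for p in adjusted])  # [0.04, 0.0533, 0.0533, 0.2]
```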

In the results file …

The **linear model** allows for multiple covariates when testing for both **quantitative trait** and disease trait SNP association, and for interactions with those covariates. The **covariates** can be either **continuous** or binary; categorical covariates must first be converted into a set of binary dummy variables.
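Converting a categorical covariate into binary dummy variables can be sketched as below. The function name, the `is_` column prefix, and the example population codes are all hypothetical; one level is dropped as the reference so that the dummy columns are not collinear with the intercept.

```python
def dummy_code(values, reference=None):
    """Turn a categorical covariate into 0/1 dummy columns.

    A covariate with k levels yields k - 1 columns; the reference level
    is represented by all dummies being 0.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]
    kept = [level for level in levels if level != reference]
    return {f"is_{level}": [1 if v == level else 0 for v in values]
            for level in kept}

# Hypothetical categorical covariate with three levels.
dummies = dummy_code(["EU", "AS", "AF", "EU"])
print(dummies)  # {'is_AS': [0, 1, 0, 0], 'is_EU': [1, 0, 0, 1]}
```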

This test is implemented by calling PLINK (a whole-genome association analysis toolset).

For more information about the properties of this test, see the PLINK documentation: linear model.

Like the linear model, the **logistic model** allows for multiple covariates. In this case, however, the logistic model is the appropriate choice when the **phenotype is binary** (a case/control disease trait).

This test is implemented by calling PLINK (a whole-genome association analysis toolset).

For more information about the properties of this test, see the PLINK documentation: logistic model.
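For readers who want to reproduce the PLINK-based tests outside the web tool, the sketch below assembles a command line using PLINK's `--bfile`, `--linear`, `--logistic`, `--covar` and `--out` options. How the tool itself invokes PLINK is not documented here, so this is an assumption; the file names `mydata` and `covars.txt` and the function name are hypothetical.

```python
import shlex

def plink_association_command(bfile, model, covar_file=None, out="assoc"):
    """Build (but do not run) a PLINK command for a regression model."""
    if model not in ("linear", "logistic"):
        raise ValueError("model must be 'linear' or 'logistic'")
    command = ["plink", "--bfile", bfile, f"--{model}"]
    if covar_file is not None:
        command += ["--covar", covar_file]  # optional covariate file
    command += ["--out", out]
    return command

command = plink_association_command("mydata", "logistic",
                                    covar_file="covars.txt")
print(shlex.join(command))
# plink --bfile mydata --logistic --covar covars.txt --out assoc

# To execute it (requires PLINK on the PATH):
# import subprocess; subprocess.run(command, check=True)
```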

The purpose of the transmission/disequilibrium test (TDT) is to provide basic family-based association testing for disease traits. It is useful for data from families with at least one affected child: the test evaluates the transmission of the associated marker allele from a heterozygous parent to an affected offspring.
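The TDT statistic itself is simple to compute once transmissions have been counted. As a minimal sketch: if heterozygous parents transmit the marker allele to an affected child `transmitted` times and fail to transmit it `untransmitted` times, the statistic is the squared difference divided by the sum, compared against a chi-square distribution with one degree of freedom. The counts below are made up and the function name is hypothetical.

```python
import math

def tdt(transmitted, untransmitted):
    """TDT chi-square statistic and p-value (1 degree of freedom)."""
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    statistic = (transmitted - untransmitted) ** 2 / total
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts from heterozygous parents of affected children.
statistic, p_value = tdt(transmitted=60, untransmitted=40)
print(statistic, round(p_value, 4))  # 4.0 0.0455
```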

Spielman R.S., McGinnis R.E., Ewens W.J. (1993) “Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM)”. Am J Hum Genet. 52(3): 506–516.