# Association tests

These tests study the association between genetic markers and phenotypes or traits. In general, population association studies aim to identify patterns of polymorphisms that vary systematically between individuals with different disease states and could therefore represent the effects of risk-enhancing or protective alleles.

How strongly genotype and phenotype are associated can be assessed with the different tests described in this section. Which test to use depends mainly on the type of input data.

Each test is described in more detail below.

## Chi-square case/control

The chi-square test statistic is designed to test the null hypothesis that there is no association between the rows and columns of a contingency table. For example, to determine whether there is an association between a particular SNP variant and a phenotype (case/control), one might collect data that can be assembled into a 2×2 table. In this case, the two columns are defined by whether the subject has the disease (case) or not (control), while the rows represent the two alleles of the SNP. The cells of the table contain the number of observations defined by these two variables.

For every SNP, the test builds a 2×2 contingency table by counting how many times each allele of the SNP appears in the case samples and in the control samples. It then checks whether the allele proportions differ between the two phenotype groups (case and control).
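The counting step can be sketched as follows. This is a minimal illustration, not the tool's implementation: the function name `allele_table`, the two-character genotype strings, and the `"case"`/`"control"` labels are assumptions made for the example.

```python
from collections import Counter

def allele_table(genotypes, phenotypes, alleles):
    """Build a 2x2 allele-count table for one biallelic SNP.

    genotypes  -- two-character strings such as "AT", one per sample (assumed format)
    phenotypes -- "case" or "control" labels, in the same order as genotypes
    alleles    -- the two alleles of the SNP, e.g. ("A", "T")
    Returns [[a, b], [c, d]]: rows are alleles, columns are case and control.
    """
    counts = {"case": Counter(), "control": Counter()}
    for genotype, status in zip(genotypes, phenotypes):
        # Each sample contributes two alleles to its phenotype group
        counts[status].update(genotype)
    return [[counts["case"][al], counts["control"][al]] for al in alleles]

genotypes = ["AA", "AA", "AT", "AT", "TT", "TT"]
phenotypes = ["case", "case", "case", "control", "control", "control"]
table = allele_table(genotypes, phenotypes, ("A", "T"))
print(table)
```

Because alleles are counted rather than individuals, the table total is twice the number of samples.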

The statistic is the sum, over all cells of the table, of the squared difference between the observed and expected counts, divided by the expected count. The expected count of a cell is its row total multiplied by its column total, divided by the overall total. When the observed counts deviate substantially from the expected counts, the null hypothesis is unlikely to be true and a row-column association is likely. Conversely, a small chi-square value indicates that the observed values are close to the expected values, so the null hypothesis remains plausible.

In terms of p-values, a chi-square p-value of 0.05 or less is conventionally interpreted as justification for rejecting the null hypothesis that the row variable is unrelated to the column variable.

Example: Observed values for data presented in a 2×2 contingency table (columns represent the phenotype, rows the alleles of the SNP)

|          | Case | Control | Total |
|----------|------|---------|-------|
| allele A | a    | b       | a+b   |
| allele T | c    | d       | c+d   |
| Total    | a+c  | b+d     | n     |
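Using the cell names `a`, `b`, `c`, `d` from the table, the statistic and its p-value can be computed with a short sketch like the one below. It assumes the plain Pearson chi-square with one degree of freedom and no continuity correction; whether the tool applies a correction is not stated here. The function name `chi_square_2x2` is hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
        [[a, b],
         [c, d]]
    Returns (statistic, p_value) with 1 degree of freedom.
    All row and column totals must be non-zero."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

stat, p = chi_square_2x2(30, 20, 20, 30)
print(f"chi-square = {stat:.3f}, p-value = {p:.4f}")
```

For this example every expected count is 25, so the statistic is 4.0 and the p-value is about 0.0455, just under the 0.05 threshold.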

HINT: When the table contains small counts, the chi-square test may not be appropriate. Specifically, it has been recommended that this test not be used if any cell in the table has an expected count of less than one, or if more than 20 percent of the cells have an expected count of less than five. In that situation, Fisher's exact test is recommended instead.
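A check of this kind can be sketched as below. It encodes the commonly cited rule of thumb (no expected count below one, and no more than 20 percent of cells with an expected count below five); the function names are hypothetical and the thresholds are the only assumptions.

```python
def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_is_appropriate(table):
    """Return True if the rule of thumb allows the chi-square test."""
    cells = [e for row in expected_counts(table) for e in row]
    if min(cells) < 1:
        return False  # some expected count is below one
    small = sum(1 for e in cells if e < 5)
    # At most 20 percent of the cells may have an expected count below five
    return small / len(cells) <= 0.20

print(chi_square_is_appropriate([[30, 20], [20, 30]]))
print(chi_square_is_appropriate([[3, 1], [1, 3]]))
```

In a 2×2 table a single cell already makes up 25 percent of the cells, so the second condition fails as soon as any expected count drops below five.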

## Fisher's exact test

Like the chi-square test, Fisher's exact test studies the association between genotype and disease trait (phenotype) using contingency tables. Unlike the chi-square test, it is used when sample sizes are small.
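For a 2×2 table the exact p-value can be computed directly from the hypergeometric distribution. The sketch below assumes a two-sided test that sums the probabilities of all tables no more likely than the observed one; this is a common convention, and it is an assumption that the tool uses the same one. The function name `fisher_exact_2x2` is hypothetical.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)   # smallest feasible top-left count
    highest = min(row1, col1)      # largest feasible top-left count
    p_observed = prob(a)
    # Sum over all tables that are as likely as, or less likely than, the observed one
    total = sum(p for p in map(prob, range(lowest, highest + 1))
                if p <= p_observed * (1 + 1e-7))
    return min(1.0, total)

p_value = fisher_exact_2x2(3, 1, 1, 3)
print(f"p-value = {p_value:.4f}")
```

The small relative tolerance in the comparison guards against floating-point ties between equally likely tables.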

We provide p-values and adjusted (corrected) p-values to assess significance when testing the null hypothesis that there is no association between the variables…
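The text does not say which correction produces the adjusted p-values. Purely as an illustration, the sketch below shows two widely used adjustments, Bonferroni and Benjamini-Hochberg; both the choice of methods and the function names are assumptions, not a description of the tool's output.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg (false discovery rate) adjusted p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, keeping the result monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.20]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```

Both functions return the adjusted values in the same order as the input, so they can be reported alongside the raw p-value of each SNP.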

In the results file …

## Linear Model

The linear model allows for multiple covariates when testing for both quantitative trait and disease trait SNP association, and for interactions with those covariates. The covariates can be either continuous or binary; a categorical covariate must first be converted into a set of binary dummy variables.
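Converting a categorical covariate into binary dummy variables can be sketched as follows. The function name `dummy_code` and the example covariate are hypothetical. Dropping one reference level, so that k levels give k-1 columns, is a common convention to avoid redundancy with the model intercept and is an assumption here.

```python
def dummy_code(values, reference=None):
    """Turn one categorical covariate into binary (0/1) dummy columns.

    One level acts as the reference and gets no column of its own,
    so a covariate with k levels yields k-1 columns.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]
    kept = [level for level in levels if level != reference]
    # One 0/1 column per non-reference level, in sample order
    return {level: [1 if v == level else 0 for v in values] for level in kept}

site = ["north", "south", "east", "north"]
dummies = dummy_code(site, reference="north")
print(dummies)
```

Each resulting column can then be supplied as an ordinary binary covariate.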

This test is implemented by calling PLINK (a whole-genome association analysis toolset).

## Logistic test

Like the linear model, the logistic model allows for multiple covariates. It is intended for the case where the phenotype is binary (a case/control disease trait).

This test is implemented by calling PLINK (a whole-genome association analysis toolset).
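How the call to PLINK is made internally is not described here. As a rough sketch, a command line for either regression model could be assembled as below. `--bfile`, `--linear`, `--logistic`, `--covar` and `--out` are PLINK 1.x options; the helper name `plink_command`, the file names, and the assumption that `plink` is on the `PATH` are illustrative only.

```python
def plink_command(model, bfile, covar=None, out="assoc"):
    """Assemble a PLINK 1.x command line for a regression-based association test.

    model -- "linear" or "logistic"
    bfile -- prefix of the binary fileset (.bed/.bim/.fam)
    covar -- optional covariate file
    out   -- prefix for the result files
    """
    if model not in ("linear", "logistic"):
        raise ValueError("model must be 'linear' or 'logistic'")
    cmd = ["plink", "--bfile", bfile, "--" + model]
    if covar is not None:
        cmd += ["--covar", covar]
    cmd += ["--out", out]
    return cmd

cmd = plink_command("logistic", "mydata", covar="covariates.txt")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if it is later passed to `subprocess.run`.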