Quality Control

Preliminaries

Software used in this practical:

  • FastQC: A quality control tool for high-throughput sequencing data.
  • cutadapt: A tool to remove adapter sequences from high-throughput sequencing data.

File formats explored in this practical

FASTQ. See:

Overview

  1. User FastQC to explore the raw data.
  2. User cutadapt to remove adapters.
  3. Use cutadapt to filter reads based on quality.
  4. Use FastQC to explore the filtered data.

Exercise

Create an empty directory to work in the exercise :

mkdir quality_control
cd quality_control

Donwload the raw data to it (you can use the next command or download the file from this link)

wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/quality_control/f010_raw_mirna.fastq

Explore the raw data using some Linux shell commands

The file f010_raw_mirna.fastq contains reads from a microRNA sequencing experiment. Use the command head to have a view of the first lines of the file:

head f010_raw_mirna.fastq

Use the command wc to count how many reads there are in the file (remember you have to divide by 4)

wc -l f010_raw_mirna.fastq

Explore the raw data quality using FastQC

First create a directory to store the results of the fastqc analysis:

mkdir f020_res_fastqc

Then execute fastqc storing the results in the created directory (option -o):

fastqc -o f020_res_fastqc f010_raw_mirna.fastq

Find the results in the fastqc_report.html file and discus them. There are many Overrepresented sequences. Explore whether some of them correspond to miRNAs using the miRBase (SearchBy sequence).

Handling adapters

There are 2 known adapters used in this experiment:

CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA
CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC

Use the command grep to see whether they are still present in your data:

grep "CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA" f010_raw_mirna.fastq 
grep "CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC" f010_raw_mirna.fastq 

Do the sequences appear systematically at the beginning or at the end of the reads?

But the adapters could also appear in the reverse, complementary or reverse complementary mode.

Compute the reverse, complementary and the reverse complementary sequences of the two adapters, and find out which of them appear in your data.

To compute those sequences you can use some online resources as the one in:

Use grep form Linux shell to find out which of the versions of the adapter is in your data.

Adapter 1

grep CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA f010_raw_mirna.fastq | wc -l  ## present in the sample (at the beginning of the reads)
 
grep GACCCTTTAGTGGTATTTGCACTTTACAGAAACCTAAACCCTTAGAATATTCAAGACATACTCTGGTGAGATTTTT f010_raw_mirna.fastq | wc -l 
 
grep TTTTTAGAGTGGTCTCATACAGAACTTATAAGATTCCCAAATCCAAAGACATTTCACGTTTATGGTGATTTCCCAG f010_raw_mirna.fastq | wc -l  ## present in the sample (at the end of the read) ... but not so numerous
 
grep AAAAATCTCACCAGAGTATGTCTTGAATATTCTAAGGGTTTAGGTTTCTGTAAAGTGCAAATACCACTAAAGGGTC f010_raw_mirna.fastq | wc -l 

But sometimes the adapter does not appear complete. It may be there just the first part:

grep CTGGGAAATCACCATAAACGTGAAATGTCTTTGGA f010_raw_mirna.fastq | wc -l 
 
grep GACCCTTTAGTGGTATTTGCACTTTACAGAAACCT f010_raw_mirna.fastq | wc -l 
 
grep TTTTTAGAGTGGTCTCATACAGAACTTATAAGATT f010_raw_mirna.fastq | wc -l 
 
grep AAAAATCTCACCAGAGTATGTCTTGAATATTCTAA f010_raw_mirna.fastq | wc -l

or the end part of it:

grep AATCTTATAAGTTCTGTATGAGACCACTCTAAAAA f010_raw_mirna.fastq | wc -l 
 
grep TTAGAATATTCAAGACATACTCTGGTGAGATTTTT f010_raw_mirna.fastq | wc -l 
 
grep TCCAAAGACATTTCACGTTTATGGTGATTTCCCAG f010_raw_mirna.fastq | wc -l 
 
grep AGGTTTCTGTAAAGTGCAAATACCACTAAAGGGTC f010_raw_mirna.fastq | wc -l 

NOTE: in the code above I did cut just the 35 first or last nucleotides of the primer in its different arrangements, but this is an arbitrary length. We are just trying to discover which of the arrangements are present in our sample and whether there are allocated in the 5’ or in the 6’ end.

Adapter 2

grep CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC f010_raw_mirna.fastq | wc -l   ## present in the sample (at the end of the read) ... but not so numerous
 
grep GAAAAAAAGCAGGAAAGGTGTTCTATATATTTCGGTTCTTTAGCTTTATGAAAGTTCAATGCCATTCG f010_raw_mirna.fastq | wc -l 
 
grep GCTTACCGTAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAAAAAAAG f010_raw_mirna.fastq | wc -l   ## present in the sample (at the beginning of the reads)
 
grep CGAATGGCATTGAACTTTCATAAAGCTAAAGAACCGAAATATATAGAACACCTTTCCTGCTTTTTTTC f010_raw_mirna.fastq | wc -l 

As before, sometimes the adapter does not appear complete. It may be there just the first part:

grep CTTTTTTTCGTCCTTTCCACAAGATATATA f010_raw_mirna.fastq | wc -l 
 
grep GAAAAAAAGCAGGAAAGGTGTTCTATATAT f010_raw_mirna.fastq | wc -l 
 
grep GCTTACCGTAACTTGAAAGTATTTCGATTT f010_raw_mirna.fastq | wc -l 
 
grep CGAATGGCATTGAACTTTCATAAAGCTAAA f010_raw_mirna.fastq | wc -l

or the end part of it:

grep AAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC f010_raw_mirna.fastq | wc -l 
 
grep TTCGGTTCTTTAGCTTTATGAAAGTTCAATGCCATTCG f010_raw_mirna.fastq | wc -l 
 
grep CTTGGCTTTATATATCTTGTGGAAAGGACGAAAAAAAG f010_raw_mirna.fastq | wc -l 
 
grep GAACCGAAATATATAGAACACCTTTCCTGCTTTTTTTC f010_raw_mirna.fastq | wc -l 

Use cutadapt to make an adapter trimming of the reads

Check the options:

  • -a for adapter to the 3' end.
  • -g for adapter to the 5' end.

You can find the help of the program typing in the shell:

cutadapt -h

To get read of the the adapters found in our data we run cutadapt several times:

cutadapt -g CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA -o f030_mirna_trim1.fastq f010_raw_mirna.fastq
 
cutadapt -a TTTTTAGAGTGGTCTCATACAGAACTTATAAGATTCCCAAATCCAAAGACATTTCACGTTTATGGTGATTTCCCAG -o f030_mirna_trim2.fastq f030_mirna_trim1.fastq
 
cutadapt -g GCTTACCGTAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAAAAAAAG         -o f030_mirna_trim3.fastq f030_mirna_trim2.fastq
 
cutadapt -a CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC         -o f030_mirna_trim4.fastq f030_mirna_trim3.fastq

Now you can grep again searching for the adapters

Adapter 1

grep CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA f030_mirna_trim4.fastq | wc -l

#grep GACCCTTTAGTGGTATTTGCACTTTACAGAAACCTAAACCCTTAGAATATTCAAGACATACTCTGGTGAGATTTTT f030_mirna_trim4.fastq | wc -l 

grep TTTTTAGAGTGGTCTCATACAGAACTTATAAGATTCCCAAATCCAAAGACATTTCACGTTTATGGTGATTTCCCAG f030_mirna_trim4.fastq | wc -l

#grep AAAAATCTCACCAGAGTATGTCTTGAATATTCTAAGGGTTTAGGTTTCTGTAAAGTGCAAATACCACTAAAGGGTC f030_mirna_trim4.fastq | wc -l 

Adapter 2

grep CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC f030_mirna_trim4.fastq | wc -l

#grep GAAAAAAAGCAGGAAAGGTGTTCTATATATTTCGGTTCTTTAGCTTTATGAAAGTTCAATGCCATTCG f030_mirna_trim4.fastq | wc -l 

grep GCTTACCGTAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAAAAAAAG f030_mirna_trim4.fastq | wc -l

#grep CGAATGGCATTGAACTTTCATAAAGCTAAAGAACCGAAATATATAGAACACCTTTCCTGCTTTTTTTC f030_mirna_trim4.fastq | wc -l 

Explore the quality of the trimmed file using FastQC

Check the data quality again using fastqc:

mkdir f040_res_fastqc_trimmed
fastqc -o f040_res_fastqc_trimmed f030_mirna_trim4.fastq

Some of the reads seems to be too short and some others may not have enough quality.

Use cutadapt to filter reads by quality and length

Check the options:

  • -q quality cutoff.
  • -m minimum length.
  • -M maximum length.

You can find the help of the program typing in the shell:

cutadapt -h

Run cutadapt for length and quality purge of the reads.

cutadapt -m 17 -M 30 -q 10 -o f040_mirna_cut.fastq f030_mirna_trim4.fastq

Check the data quality again using fastqc:

mkdir f050_res_fastqc_trimmed_purged

fastqc -o f050_res_fastqc_trimmed_purged f040_mirna_cut.fastq

Explore again the Overrepresented sequences in miRBase (SearchBy sequence).

Count how many reads are left for the analysis (divide by 4)

wc -l f010_raw_mirna.fastq

wc -l f040_mirna_cut.fastq

NOTE: Before continuing with the next step you should go back to your working directory:

cd ..

Mapping

Preliminaries

In this hands-on will learn how to align DNA and RNA-seq data with most widely used software today. Building a whole genome index requires a lot of RAM memory and almost one hour in a typical workstation, for this reason in this tutorial we will work with chromosome 21 to speed up the exercises. The same steps would be done for a whole genome alignment. Two different datasets, high and low quality have been simulated for DNA, high quality contains 0.1% of mutations and low quality contains 1%. For RNA-seq a 100bp and 150bp datasets have been simulated.

NGS aligners used:

  • BWA: BWA is a software package for mapping DNA low-divergent sequences against a large reference genome, such as the human genome. The new project repository is available at GitHub BWA.

Other software used in this hands-on:

  • SAMTools: SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
  • dwgsim (optional): dwgsim can perform whole genome simulation.

File formats explored:

  • SAM: Sequence alignment format, plain text.
  • BAM: Binary and compressed version of SAM.

Data used in this practical

Create a mapping folder in your working directory and download the reference genome sequence to be used (human chromosome 21) and simulated datasets from HERE.

mkdir mapping
cd mapping

Download reference genome from Ensembl

Working with NGS data requires a high-end workstations and time for building the reference genome indexes and alignment. During this tutorial we will work only with chromosome 21 to speed up the runtimes. You can download it from the Download link at the top of Ensembl website and then to Download data via FTP. We are going to work with Esembl GRCh37 revision 75 and you can download the chromosome 21 from ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/. The next command will download the chr 21.

wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/f000_chr21_ref_genome_sequence.fa

Or you can use the next commmand to download this file from

NOTE: For working with the whole reference genome the file to be downloaded is Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

Download simulated datasets

For this hands-on we are going to use small DNA datasets simulated from chr 21. Data has been already simulated using dwgsim softeare from SAMtools. You can download all the files with the next commands. The gunzip * command will decompress all files.

mkdir data
cd data
wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/alignment/dna_chr21_100_hq_read1.fastq.gz
wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/alignment/dna_chr21_100_hq_read2.fastq.gz
wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/alignment/dna_chr21_100_lq_read1.fastq.gz
wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/alignment/dna_chr21_100_lq_read2.fastq.gz
gunzip *
cd ..

Notice that the name of the folders and files describe the dataset, ie. dna_chr21_100_hq stands for: DNA type of data from chromosome 21 with 100nt read lengths of high quality. Where hq quality means 0.1% mutations and lq quality 1% mutations. Take a few minutes to understand the different files.

Real Datasets

For those with access to high-end nodes clusters you can index and simulated whole genome datasets or download real datasets from this sources:

Exercise

In this exercise we’ll learn how to download, install, build the reference genome index and align in single-end and paired-end mode with BWA. But first, we need to create a folder to store the alignment results and an index folder to store the genome index, you can create both folders by executing:

mkdir results
mkdir index

NOTE: Now your working directory must contain 3 folders: data, index and results. Your working directory should be similar to this, execute ‘tree’:

tree
.
├── data
│   ├── dna_chr21_100_hq_read1.fastq
│   ├── dna_chr21_100_hq_read2.fastq
│   ├── dna_chr21_100_lq_read1.fastq
│   └── dna_chr21_100_lq_read2.fastq
├── f000_chr21_ref_genome_sequence.fa
├── index
└── results

BWA

BWA is probably the most used aligner for DNA. AS the documentation states it consists of three different algorithms: BWA, BWA-SW and BWA-MEM. The first algorithm, which is the oldest, is designed for Illumina sequence reads up to 100bp, while the rest two for longer sequences. BWA-MEM and BWA-SW share similar features such as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it is faster and more accurate. BWA-MEM also has better performance than BWA for 70-100bp Illumina reads.

Build the index

First we have to move our reference genome to the index folder by executing:

mv f000_chr21_ref_genome_sequence.fa index/

Now we can create the index by executing:

bwa index index/f000_chr21_ref_genome_sequence.fa

Some files will be created in the index folder, those files constitute the index that BWA uses. NOTE: The index must created only once, it will be used for all the different alignments with BWA.

Aligning with new BWA-MEM in both single-end (SE) and paired-end (PE) modes

BWA-MEM is the recommended algorithm to use now. You can check the options by executing:

bwa mem

To align SE (Single-end) with BWA-MEM execute:

bwa mem -t 4 -R "@RG\tID:foo\tSM:bar\tPL:Illumina\tPU:unit1\tLB:lib1" index/f000_chr21_ref_genome_sequence.fa data/dna_chr21_100_hq_read1.fastq > results/dna_chr21_100_hq_se.sam

Now you can use SAMtools to create the BAM file. First we move to results folder and then we execute samtools command to export SAM file to BAM file:

cd results
samtools view -bS dna_chr21_100_hq_se.sam -o dna_chr21_100_hq_se.bam

You can explore the SAM file with the alignments using the next command (if you press up/down keys you can navigate through the file.

more dna_chr21_100_hq_se.sam

Before the next command we should go back to main directory.

cd ..

To align PE (Paired-end) with BWA-MEM just execute the same command line with the two FASTQ files:

bwa mem -t 4 -R "@RG\tID:foo\tSM:bar\tPL:Illumina\tPU:unit1\tLB:lib1" index/f000_chr21_ref_genome_sequence.fa  data/dna_chr21_100_hq_read1.fastq data/dna_chr21_100_hq_read2.fastq > results/dna_chr21_100_hq_pe.sam

Now you can use SAMtools to create the BAM file (as we did before):

cd results
samtools view -bS dna_chr21_100_hq_pe.sam -o dna_chr21_100_hq_pe.bam

If you want to explore the new SAM file with paired-end alignments you can execute:

more dna_chr21_100_hq_pe.sam

Now you can do the same for the low quality datasets.

NOTE: Before continuing with the next step you should go back to your working directory:

cd ../..

Data visualization

Preliminaries

  • IGV: The Integrative Genomics Viewer is a program for reading several types of indexed database information, including mapped reads and variant calls, and displaying them on a reference genome. It is invaluable as a tool for viewing and interpreting the “raw data” of many NGS data analysis pipelines.
  • samtools: SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.

File formats explored:

  • SAM: Sequence alignment format, plain text.
  • BAM: Binary and compressed version of SAM.

Exercise 1: Visualising sequencing data

Create a folder visualization and download de required files:

mkdir visualization
cd visualization
wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/visualization/example_1/NA12878_child.bam
wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/visualization/example_1/NA12891_dad.bam
wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/visualization/example_1/NA12892_mom.bam

These datasets contain reads only for the GABBR1 gene.

Creating indexed files

Use samtools to index the bam files:

samtools index NA12878_child.bam
samtools index NA12891_dad.bam
samtools index NA12892_mom.bam

Notice that three files have been created (with extension .bai):

ls -1
NA12878_child.bam
NA12878_child.bam.bai
NA12891_dad.bam
NA12891_dad.bam.bai
NA12892_mom.bam
NA12892_mom.bam.bai

Run IGV

You can run this command from the terminal using igv:

igv

Download a reference genome

Run just in case you do not have downloaded Human hg10 genome before:

  • Go to GenomesLoad Genome From Server…Select Human hg19

Loading and browsing files

  • Go to FileLoad from file…Select NA12878_child.bam, NA12891_dad.bam and NA12892_mom.bam

Steps:

  1. Enter the name of our gene (GABBR1) in the search box and hit Go.
  2. Zoom in until you find some SNPs - they might be in exons or introns.
  3. Identify at least one example of a short insertion variant and deletion around exon 4.
  4. Load and look at the SNP track: FileLoad from server…AnnotationsVariants and RepeatsdbSNP

NOTE: Before continuing with the next step you should go back to your working directory:

cd ..

Variant Calling

Preliminaries

Software used in this practical

  • SAMTools: SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
  • Picard: Picard comprises Java-based command-line utilities that manipulate SAM files, and a Java API (SAM-JDK) for creating new programs that read and write SAM files.
  • GATK : Genome Analysis Toolkit - A package to analyse next-generation re-sequencing data, primary focused on variant discovery and genotyping.

File formats explored:

  • SAM: Sequence alignment format, plain text.
  • BAM: Binary and compressed version of SAM.
  • VCF: Variant Call Format.

Exercise 1: Variant calling with paired-end data

Create a directory for this exercise:

mkdir calling
cd calling

Prepare reference genome: generate the fasta file index

Create a new directory called genome:

mkdir genome

Now we can copy the genome from the previous step (mapping):

cp ../mapping/index/f000_chr21_ref_genome_sequence.fa genome/

Or we can download the file with the next command:

wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/f000_chr21_ref_genome_sequence.fa -P genome/

NOTE: we used option -P this time with wget, this option allows us to indicate the output folder.

Use samtools to generate the fasta file index:

samtools faidx genome/f000_chr21_ref_genome_sequence.fa

This creates a file called samtools faidx f000_chr21_ref_genome_sequence.fa.fai, with one record per line for each of the contigs in the FASTA reference file.

Generate the sequence dictionary using Picard.

picard CreateSequenceDictionary REFERENCE=genome/f000_chr21_ref_genome_sequence.fa OUTPUT=genome/f000_chr21_ref_genome_sequence.dict

Prepare BAM file

Create a data folder and download the required BAM by executing the next commands:

mkdir data
cd data
wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/example1/000-dna_chr21_100_hq_pe.bam

NOTE: this BAM file is the same file we created in the maping section.

We must sort and index the BAM file before processing it with Picard and GATK. To sort the bam file we use samtools

samtools sort -o 001-dna_chr21_100_hq_pe_sorted.bam 000-dna_chr21_100_hq_pe.bam

Index the BAM file:

samtools index 001-dna_chr21_100_hq_pe_sorted.bam

Mark duplicates (using picard)

Run the following Picard command to mark duplicates:

picard MarkDuplicates INPUT=001-dna_chr21_100_hq_pe_sorted.bam OUTPUT=002-dna_chr21_100_hq_pe_sorted_noDup.bam METRICS_FILE=002-metrics.txt

This creates a sorted BAM file called 002-dna_chr21_100_hq_pe_sorted_noDup.bam with the same content as the input file, except that any duplicate reads are marked as such. It also produces a metrics file called metrics.txt.

Run the following Picard command to index the new BAM file:

picard BuildBamIndex INPUT=002-dna_chr21_100_hq_pe_sorted_noDup.bam

Local realignment around INDELS (using GATK)

There are 2 steps to the realignment process:

First, create a target list of intervals which need to be realigned

gatk -T RealignerTargetCreator -R ../genome/f000_chr21_ref_genome_sequence.fa -I 002-dna_chr21_100_hq_pe_sorted_noDup.bam -o 003-indelRealigner.intervals

Second, perform realignment of the target intervals

gatk -T IndelRealigner -R ../genome/f000_chr21_ref_genome_sequence.fa -I 002-dna_chr21_100_hq_pe_sorted_noDup.bam -targetIntervals 003-indelRealigner.intervals -o 003-dna_chr21_100_hq_pe_sorted_noDup_realigned.bam

This creates a file called 003-dna_chr21_100_hq_pe_sorted_noDup_realigned.bam containing all the original reads, but with better local alignments in the regions that were realigned.

Base quality score recalibration (using GATK)

Two steps:

First, analyse patterns of covariation in the sequence dataset.

It is imperative that you provide the program with a set of known sites, otherwise it will refuse to run. The known sites are used to build the covariation model and estimate empirical base qualities. For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation.

Download the known sites file:

wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/000-dbSNP_chr21.vcf -P ..

Execute tye GATK command:

gatk -T BaseRecalibrator -R ../genome/f000_chr21_ref_genome_sequence.fa -I 003-dna_chr21_100_hq_pe_sorted_noDup_realigned.bam -knownSites ../000-dbSNP_chr21.vcf -o 004-recalibration_data.table

This creates a GATKReport file called 004-recalibration_data.table containing several tables. These tables contain the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data.

NOTE: this step could take several minutes, if you do not want to wait until the comand finish you can download the result file by executing the next command:

wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/example1/004-recalibration_data.table

Second, apply the recalibration to your sequence data

gatk -T PrintReads -R ../genome/f000_chr21_ref_genome_sequence.fa -I 003-dna_chr21_100_hq_pe_sorted_noDup_realigned.bam -BQSR 004-recalibration_data.table -o 004-dna_chr21_100_hq_pe_sorted_noDup_realigned_recalibrated.bam

NOTE: this step could take several minutes, if you do not want to wait until the comand finish you can download the result files by executing the next commands:

wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/example1/004-dna_chr21_100_hq_pe_sorted_noDup_realigned_recalibrated.bai
wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/example1/004-dna_chr21_100_hq_pe_sorted_noDup_realigned_recalibrated.bam

This creates a file called 004-dna_chr21_100_hq_pe_sorted_noDup_realigned_recalibrated.bam containing all the original reads, but now with exquisitely accurate base substitution, insertion and deletion quality scores. By default, the original quality scores are discarded in order to keep the file size down. However, you have the option to retain them by adding the flag –emit_original_quals to the PrintReads command, in which case the original qualities will also be written in the file, tagged OQ.

Variant calling (using GATK - UnifiedGenotyper)

SNPs are called using the next command:

gatk -T UnifiedGenotyper -R ../genome/f000_chr21_ref_genome_sequence.fa -I 004-dna_chr21_100_hq_pe_sorted_noDup_realigned_recalibrated.bam -o 005-dna_chr21_100_he_pe.vcf

NOTE: this step could take several minutes, if you do not want to wait until the comand finish you can download the result files by executing the next commands:

wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/example1/005-dna_chr21_100_he_pe.vcf
wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/example1/005-dna_chr21_100_he_pe.vcf.idx

Introduce filters in the VCF file

Example: filter SNPs with low confidence calling (QD < 12.0) and flag them as “LowConf”.

gatk -T VariantFiltration -R ../genome/f000_chr21_ref_genome_sequence.fa -V 005-dna_chr21_100_he_pe.vcf --filterExpression "QD < 12.0" --filterName "LowConf" -o 006-dna_chr21_100_he_pe_filtered.vcf

The command –filterExpression will read the INFO field and check whether variants satisfy the requirement. If a variant does not satisfy your filter expression, the field FILTER will be filled with the indicated –filterName. These commands can be called several times indicating different filtering expression (i.e: –filterName One –filterExpression “X < 1” –filterName Two –filterExpression “X > 2”).

QUESTION: How many “LowConf” variants have we obtained?

grep LowConf 006-dna_chr21_100_he_pe_filtered.vcf | wc -l