Differences

This shows you the differences between two versions of the page.

Link to this comparison view

ngs_tutorial [2016/12/12 09:26]
ngs_tutorial [2017/05/24 14:23] (current)
Line 1: Line 1:
 +====== Quality Control =====
 +
 +===== Preliminaries =====
 +==== Software used in this practical: ====
 +  * //[[http://www.bioinformatics.babraham.ac.uk/projects/fastqc|FastQC]]//: A quality control tool for high-throughput sequencing data.
 +  * //[[http://code.google.com/p/cutadapt|cutadapt]]//: A tool to remove adapter sequences from high-throughput sequencing data.
 +
 +==== File formats explored in this practical ====
 +**FASTQ**. See:
 +    * [[http://en.wikipedia.org/wiki/FASTQ_format | Wikipedia]]
 +    * [[http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217 | NAR 2010]]
 +
 +===== Overview =====
 +  - User FastQC to explore the raw data.
 +  - User cutadapt to remove adapters.
 +  - Use cutadapt to filter reads based on quality.
 +  - Use FastQC to explore the filtered data.
 +===== Exercise =====
 +Create an empty directory to work in the exercise :
 +<code bash>
 +mkdir quality_control
 +cd quality_control
 +</code>
 +Donwload the raw data to it (you can use the next command or download the file from this [[http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/quality_control/f010_raw_mirna.fastq|link]])
 +<code bash>
 +wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/quality_control/f010_raw_mirna.fastq
 +</code>
 +==== Explore the raw data using some Linux shell commands ====
 +The file f010_raw_mirna.fastq contains reads from a microRNA sequencing experiment. Use the command **head** to have a view of the first lines of the file:
 +<code bash>
 +head f010_raw_mirna.fastq
 +</code>
 +Use the command **wc** to count how many reads there are in the file (remember you have to divide by 4)
 +<code bash>
 +wc -l f010_raw_mirna.fastq
 +</code>
 +
 +==== Explore the raw data quality using FastQC =====
 +First create a directory to store the results of the fastqc analysis:
 +<code bash>
 +mkdir f020_res_fastqc
 +</code>
 +Then execute **fastqc** storing the results in the created directory (option **-o**):
 +<code bash>
 +fastqc -o f020_res_fastqc f010_raw_mirna.fastq
 +</code>
 +Find the results in the **fastqc_report.html** file and discus them.
 +There are many //Overrepresented sequences//. Explore whether some of them correspond to miRNAs using the [[http://www.mirbase.org/search.shtml|miRBase]] (//Search// -> //By sequence//).
 +
 +==== Handling adapters ====
 +There are 2 known adapters used in this experiment:
 +
 +<code>
 +CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA
 +CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC
 +</code>
 +
 +Use the command grep to see whether they are still present in your data:
 +
 +<code bash>
 +grep "CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA" f010_raw_mirna.fastq 
 +grep "CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC" f010_raw_mirna.fastq 
 +</code>
 +
 +Do the sequences appear systematically at the beginning or at the end of the reads?
 +
 +But the adapters could also appear in the //reverse//, //complementary// or //reverse complementary// mode.
 +
 +Compute the reverse, complementary and the reverse complementary sequences of the two adapters, and find out which of them appear in your data.
 +
 +To compute those sequences you can use some online resources as the one in:
 +
 +Use grep form Linux shell to find out which of the versions of the adapter is in your data.
 +
 +=== Adapter 1 ===
 +<code bash>
 +grep CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA f010_raw_mirna.fastq | wc -l  ## present in the sample (at the beginning of the reads)
 +
 +grep GACCCTTTAGTGGTATTTGCACTTTACAGAAACCTAAACCCTTAGAATATTCAAGACATACTCTGGTGAGATTTTT f010_raw_mirna.fastq | wc -l 
 +
 +grep TTTTTAGAGTGGTCTCATACAGAACTTATAAGATTCCCAAATCCAAAGACATTTCACGTTTATGGTGATTTCCCAG f010_raw_mirna.fastq | wc -l  ## present in the sample (at the end of the read) ... but not so numerous
 +
 +grep AAAAATCTCACCAGAGTATGTCTTGAATATTCTAAGGGTTTAGGTTTCTGTAAAGTGCAAATACCACTAAAGGGTC f010_raw_mirna.fastq | wc -l 
 +</code>
 +
 +But sometimes the adapter does not appear complete. It may be there just the first part:
 +
 +<code bash>
 +grep CTGGGAAATCACCATAAACGTGAAATGTCTTTGGA f010_raw_mirna.fastq | wc -l 
 +
 +grep GACCCTTTAGTGGTATTTGCACTTTACAGAAACCT f010_raw_mirna.fastq | wc -l 
 +
 +grep TTTTTAGAGTGGTCTCATACAGAACTTATAAGATT f010_raw_mirna.fastq | wc -l 
 +
 +grep AAAAATCTCACCAGAGTATGTCTTGAATATTCTAA f010_raw_mirna.fastq | wc -l
 +</code>
 +
 +or the end part of it:
 +
 +<code bash>
 +grep AATCTTATAAGTTCTGTATGAGACCACTCTAAAAA f010_raw_mirna.fastq | wc -l 
 +
 +grep TTAGAATATTCAAGACATACTCTGGTGAGATTTTT f010_raw_mirna.fastq | wc -l 
 +
 +grep TCCAAAGACATTTCACGTTTATGGTGATTTCCCAG f010_raw_mirna.fastq | wc -l 
 +
 +grep AGGTTTCTGTAAAGTGCAAATACCACTAAAGGGTC f010_raw_mirna.fastq | wc -l 
 +</code>
 +
 +NOTE: in the code above I did cut just the 35 first or last nucleotides of the primer in its different arrangements, but this is an arbitrary length. We are just trying to discover which of the arrangements are present in our sample and whether there are allocated in the 5’ or in the 6’ end.
 +
 +
 +=== Adapter 2 ===
 +
 +<code bash>
 +grep CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC f010_raw_mirna.fastq | wc -l   ## present in the sample (at the end of the read) ... but not so numerous
 +
 +grep GAAAAAAAGCAGGAAAGGTGTTCTATATATTTCGGTTCTTTAGCTTTATGAAAGTTCAATGCCATTCG f010_raw_mirna.fastq | wc -l 
 +
 +grep GCTTACCGTAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAAAAAAAG f010_raw_mirna.fastq | wc -l   ## present in the sample (at the beginning of the reads)
 +
 +grep CGAATGGCATTGAACTTTCATAAAGCTAAAGAACCGAAATATATAGAACACCTTTCCTGCTTTTTTTC f010_raw_mirna.fastq | wc -l 
 +</code>
 +
 +
 +As before, sometimes the adapter does not appear complete. It may be there just the first part:
 +
 +
 +
 +<code bash>
 +grep CTTTTTTTCGTCCTTTCCACAAGATATATA f010_raw_mirna.fastq | wc -l 
 +
 +grep GAAAAAAAGCAGGAAAGGTGTTCTATATAT f010_raw_mirna.fastq | wc -l 
 +
 +grep GCTTACCGTAACTTGAAAGTATTTCGATTT f010_raw_mirna.fastq | wc -l 
 +
 +grep CGAATGGCATTGAACTTTCATAAAGCTAAA f010_raw_mirna.fastq | wc -l 
 +
 +</code>
 +
 +or the end part of it:
 +
 +<code bash>
 +grep AAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC f010_raw_mirna.fastq | wc -l 
 +
 +grep TTCGGTTCTTTAGCTTTATGAAAGTTCAATGCCATTCG f010_raw_mirna.fastq | wc -l 
 +
 +grep CTTGGCTTTATATATCTTGTGGAAAGGACGAAAAAAAG f010_raw_mirna.fastq | wc -l 
 +
 +grep GAACCGAAATATATAGAACACCTTTCCTGCTTTTTTTC f010_raw_mirna.fastq | wc -l 
 +</code>
 +
 +==== Use cutadapt to make an adapter trimming of the reads ====
 +
 +Check the options:
 +    * **-a** for adapter to the //3'// end.
 +    * **-g** for adapter to the //5'// end.
 +
 +You can find the help of the program typing  in the shell:
 +<code bash>
 +cutadapt -h
 +</code>
 +
 +To get read of the the adapters found in our data we run **cutadapt** several times:
 +
 +<code bash>
 +cutadapt -g CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA -o f030_mirna_trim1.fastq f010_raw_mirna.fastq
 +
 +cutadapt -a TTTTTAGAGTGGTCTCATACAGAACTTATAAGATTCCCAAATCCAAAGACATTTCACGTTTATGGTGATTTCCCAG -o f030_mirna_trim2.fastq f030_mirna_trim1.fastq
 +
 +cutadapt -g GCTTACCGTAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAAAAAAAG         -o f030_mirna_trim3.fastq f030_mirna_trim2.fastq
 +
 +cutadapt -a CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC         -o f030_mirna_trim4.fastq f030_mirna_trim3.fastq
 +</code>
 +
 +Now you can **grep** again searching for the adapters
 +
 +=== Adapter 1 ===
 +
 +<code>
 +grep CTGGGAAATCACCATAAACGTGAAATGTCTTTGGATTTGGGAATCTTATAAGTTCTGTATGAGACCACTCTAAAAA f030_mirna_trim4.fastq | wc -l
 +
 +#grep GACCCTTTAGTGGTATTTGCACTTTACAGAAACCTAAACCCTTAGAATATTCAAGACATACTCTGGTGAGATTTTT f030_mirna_trim4.fastq | wc -l 
 +
 +grep TTTTTAGAGTGGTCTCATACAGAACTTATAAGATTCCCAAATCCAAAGACATTTCACGTTTATGGTGATTTCCCAG f030_mirna_trim4.fastq | wc -l
 +
 +#grep AAAAATCTCACCAGAGTATGTCTTGAATATTCTAAGGGTTTAGGTTTCTGTAAAGTGCAAATACCACTAAAGGGTC f030_mirna_trim4.fastq | wc -l 
 +</code>
 +
 +=== Adapter 2 ===
 +<code>
 +grep CTTTTTTTCGTCCTTTCCACAAGATATATAAAGCCAAGAAATCGAAATACTTTCAAGTTACGGTAAGC f030_mirna_trim4.fastq | wc -l
 +
 +#grep GAAAAAAAGCAGGAAAGGTGTTCTATATATTTCGGTTCTTTAGCTTTATGAAAGTTCAATGCCATTCG f030_mirna_trim4.fastq | wc -l 
 +
 +grep GCTTACCGTAACTTGAAAGTATTTCGATTTCTTGGCTTTATATATCTTGTGGAAAGGACGAAAAAAAG f030_mirna_trim4.fastq | wc -l
 +
 +#grep CGAATGGCATTGAACTTTCATAAAGCTAAAGAACCGAAATATATAGAACACCTTTCCTGCTTTTTTTC f030_mirna_trim4.fastq | wc -l 
 +</code>
 +
 +==== Explore the quality of the trimmed file using FastQC ====
 +Check the data quality again using **fastqc**:
 +<code>
 +mkdir f040_res_fastqc_trimmed
 +fastqc -o f040_res_fastqc_trimmed f030_mirna_trim4.fastq
 +</code>
 +
 +Some of the reads seems to be too short and some others may not have enough quality.
 +
 +==== Use cutadapt to filter reads by quality and length ====
 +
 +Check the options:
 +    * **-q** quality cutoff.
 +    * **-m** minimum length.
 +    * **-M** maximum length.
 +You can find the help of the program typing  in the shell:
 +<code bash>
 +cutadapt -h
 +</code>
 +
 +
 +Run cutadapt for length and quality purge of the reads.
 +
 +<code>
 +cutadapt -m 17 -M 30 -q 10 -o f040_mirna_cut.fastq f030_mirna_trim4.fastq
 +</code>
 +
 +Check the data quality again using fastqc:
 +<code>
 +mkdir f050_res_fastqc_trimmed_purged
 +
 +fastqc -o f050_res_fastqc_trimmed_purged f040_mirna_cut.fastq
 +</code>
 +
 +Explore again the Overrepresented sequences in [[http://www.mirbase.org/search.shtml|miRBase]] (//Search// -> //By sequence//).
 +
 +
 +Count how many reads are left for the analysis (divide by 4)
 +
 +<code>
 +wc -l f010_raw_mirna.fastq
 +
 +wc -l f040_mirna_cut.fastq
 +</code>
 +
 +**NOTE**: Before continuing with the next step you should go back to your **working directory**:
 +<code bash>
 +cd ..
 +</code>
 +
 +====== Mapping ======
 +===== Preliminaries =====
 +In this hands-on will learn how to align DNA and RNA-seq data with most widely used software today. Building a whole genome index requires a lot of RAM memory and almost one hour in a typical workstation, for this reason **in this tutorial we will work with chromosome 21** to speed up the exercises. The same steps would be done for a whole genome alignment. Two different datasets, high and low quality have been simulated for DNA, high quality contains 0.1% of mutations and low quality contains 1%. For RNA-seq a 100bp and 150bp datasets have been simulated.
 +
 +
 +==== NGS aligners used: ====
 +  * //[[http://bio-bwa.sourceforge.net|BWA]]//: //BWA//  is a software package for mapping **DNA** low-divergent sequences against a large reference genome, such as the human genome. The new project repository is available at [[https://github.com/lh3/bwa|GitHub BWA]].
 +
 +/* * //[[https://github.com/opencb/hpg-aligner/wiki|HPG Aligner]]//: //HPG Aligner//  is a new NGS aligner for mapping both **DNA Genomic** and **RNA-seq** data against a large reference genome. It's has been designed for having a high sensitivity and performance.*/
 +/* * //[[http://bowtie-bio.sourceforge.net/bowtie2/index.shtml|Bowtie2]]//: //Bowtie2// is an ultrafast and memory-efficient tool for aligning **DNA** sequencing reads to long reference sequences.*/
 +/*   * //[[http://ccb.jhu.edu/software/tophat/index.shtml |TopHat2]]//: //TopHat2// is a fast splice junction mapper for **RNA-Seq** reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.*/
 +/*   * //[[https://code.google.com/p/rna-star/|STAR]]//: //STAR// aligns **RNA-seq** reads to a reference genome using uncompressed suffix arrays.*/
 +
 +==== Other software used in this hands-on: =====
 +    * //[[http://www.htslib.org/ |SAMTools]]//: SAM Tools **provide various utilities** for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
 +    * //[[ http://sourceforge.net/apps/mediawiki/dnaa/index.php?title=Whole_Genome_Simulation|dwgsim (optional)]]//: dwgsim can perform whole **genome simulation**.
 +==== File formats explored: ====
 +    * //[[http://samtools.sourceforge.net/SAMv1.pdf|SAM]]//: Sequence alignment format, plain text.
 +    * //[[ http://www.broadinstitute.org/igv/bam |BAM]]//: Binary and compressed version of SAM.
 +==== Data used in this practical ====
 +
 +Create a **mapping** folder in your **working directory** and download the **reference genome sequence** to be used (human chromosome 21) and //simulated datasets// from HERE.
 +
 +<code bash>
 +mkdir mapping
 +cd mapping
 +</code>
 +
 +=== Download reference genome from Ensembl ===
 +Working with NGS data requires a high-end workstations and time for building the reference genome indexes and alignment. During this tutorial we will work only with chromosome 21 to speed up the runtimes. You can download it from the //Download// link at the top of [[http://www.ensembl.org/index.html|Ensembl]] website and then to Download data via FTP. We are going to work with //Esembl GRCh37 revision 75// and you can download the chromosome 21 from [[ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/]]. The next command will download the chr 21.
 +<code bash>
 +wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/f000_chr21_ref_genome_sequence.fa
 +</code>
 +
 +Or you can use the next commmand to download this file from 
 +
 +**NOTE**: For working with the whole reference genome the file to be downloaded is //Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz//
 +
 +=== Download simulated datasets ===
 +For this hands-on we are going to use small DNA datasets simulated from chr 21. Data has been already simulated using //dwgsim// softeare from SAMtools. You can download all the files with the next commands. The **gunzip * ** command will decompress all files.
 +
 +<code bash>
 +mkdir data
 +cd data
 +wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/alignment/dna_chr21_100_hq_read1.fastq.gz
 +wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/alignment/dna_chr21_100_hq_read2.fastq.gz
 +wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/alignment/dna_chr21_100_lq_read1.fastq.gz
 +wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/alignment/dna_chr21_100_lq_read2.fastq.gz
 +gunzip *
 +cd ..
 +</code>
 +
 +Notice that the name of the folders and files describe the dataset, ie. **dna_chr21_100_hq** stands for: DNA type of data from chromosome 21 with 100nt read lengths of high quality. Where hq quality means 0.1% mutations and lq quality 1% mutations. Take a few minutes to understand the different files.
 +
 +=== Real Datasets ===
 +For those with access to high-end nodes  clusters you can index and simulated whole genome datasets or download real datasets from this sources:
 +    *  [[ http://www.1000genomes.org/ | 1000 genomes project ]]
 +    *  [[https://www.ebi.ac.uk/ena/ | European Nucleotide Archive (ENA)]]
 +    *  [[http://www.ncbi.nlm.nih.gov/sra | Sequence Read Archive (SRA)]]
 +
 +
 +=====  Exercise =====
 +
 +In this exercise we’ll learn how to download, install, build the reference genome index and align in single-end and paired-end mode with BWA. But first, we need to create a folder to store the alignment results and an index folder to store the genome index, you can create both folders by executing:
 +
 +<code bash>
 +mkdir results
 +mkdir index
 +</code>
 +
 +**NOTE**: Now your working directory must contain 3 folders: data, index and results. Your working directory should be similar to this, execute ‘tree’:
 +
 +<code bash>
 +tree
 +.
 +├── data
 +│   ├── dna_chr21_100_hq_read1.fastq
 +│   ├── dna_chr21_100_hq_read2.fastq
 +│   ├── dna_chr21_100_lq_read1.fastq
 +│   └── dna_chr21_100_lq_read2.fastq
 +├── f000_chr21_ref_genome_sequence.fa
 +├── index
 +└── results
 +
 +</code>
 +
 +==== BWA ====
 +
 +[[http://bio-bwa.sourceforge.net/|BWA]] is probably the most used aligner for DNA. AS the documentation states it consists of three different algorithms: BWA, BWA-SW and BWA-MEM. The first algorithm, which is the oldest, is designed for Illumina sequence reads up to 100bp, while the rest two for longer sequences. BWA-MEM and BWA-SW share similar features such as long-read support and split alignment, but BWA-MEM, which is the latest, is generally recommended for high-quality queries as it is faster and more accurate. BWA-MEM also has better performance than BWA for 70-100bp Illumina reads.
 +
 +
 +=== Build the index ===
 +
 +First we have to move our reference genome to the index folder by executing:
 +<code bash>
 +mv f000_chr21_ref_genome_sequence.fa index/
 +</code>
 +Now we can create the index by executing:
 +<code bash>
 +bwa index index/f000_chr21_ref_genome_sequence.fa
 +</code>
 +
 +Some files will be created in the **index** folder, those files constitute the index that BWA uses.
 +**NOTE**: The index must created only once, it will be used for all the different alignments with BWA.
 +
 +
 +
 +=== Aligning with new BWA-MEM in both single-end (SE) and paired-end (PE) modes ===
 +
 +BWA-MEM is the recommended algorithm to use now. You can check the options by executing:
 +
 +<code bash>
 +bwa mem
 +</code>
 +
 +To align **SE** (Single-end) with BWA-MEM execute:
 +
 +<code bash>
 +bwa mem -t 4 -R "@RG\tID:foo\tSM:bar\tPL:Illumina\tPU:unit1\tLB:lib1" index/f000_chr21_ref_genome_sequence.fa data/dna_chr21_100_hq_read1.fastq > results/dna_chr21_100_hq_se.sam
 +</code>
 +
 +
 +Now you can use SAMtools to create the BAM file. First we move to **results** folder and then we execute **samtools** command to export SAM file to BAM file:
 +<code bash>
 +cd results
 +samtools view -bS dna_chr21_100_hq_se.sam -o dna_chr21_100_hq_se.bam
 +</code>
 +
 +You can explore the SAM file with the alignments using the next command (if you press up/down keys you can navigate through the file.
 +<code bash>
 +more dna_chr21_100_hq_se.sam
 +</code>
 +
 +Before the next command we should go back to main directory.
 +<code bash>
 +cd ..
 +</code>
 +
 +To align **PE** (Paired-end) with BWA-MEM just execute the same command line with the two FASTQ files:
 +
 +<code bash>
 +bwa mem -t 4 -R "@RG\tID:foo\tSM:bar\tPL:Illumina\tPU:unit1\tLB:lib1" index/f000_chr21_ref_genome_sequence.fa  data/dna_chr21_100_hq_read1.fastq data/dna_chr21_100_hq_read2.fastq > results/dna_chr21_100_hq_pe.sam
 +</code>
 +
 +
 +Now you can use SAMtools to create the BAM file (as we did before):
 +
 +<code bash>
 +cd results
 +samtools view -bS dna_chr21_100_hq_pe.sam -o dna_chr21_100_hq_pe.bam
 +
 +</code>
 +
 +If you want to explore the new SAM file with paired-end alignments you can execute:
 +<code bash>
 +more dna_chr21_100_hq_pe.sam
 +</code>
 +
 +Now you can do the same for the **low** quality datasets.
 +
 +
 +**NOTE**: Before continuing with the next step you should go back to your **working directory**:
 +
 +<code bash>
 +cd ../..
 +</code>
 +
 +
 +====== Data visualization ======
 +
 +===== Preliminaries  =====
 +    * //[[http://www.broadinstitute.org/igv/home|IGV]]//: The Integrative Genomics Viewer is a program for reading several types of indexed database information, including mapped reads and variant calls, and displaying them on a reference genome. It is invaluable as a tool for viewing and interpreting the “raw data” of many NGS data analysis pipelines.
 +    * //[[http://samtools.sourceforge.net/|samtools]]//: SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
 +
 +==== File formats explored: ====
 +    * //[[http://samtools.sourceforge.net/SAMv1.pdf|SAM]]//: Sequence alignment format, plain text.
 +    * //[[ http://www.broadinstitute.org/igv/bam |BAM]]//: Binary and compressed version of SAM.
 +
 +===== Exercise 1: Visualising sequencing data  =====
 +
 +Create a folder **visualization** and download de required files:
 +<code bash>
 +
 +mkdir visualization
 +cd visualization
 +wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/visualization/example_1/NA12878_child.bam
 +wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/visualization/example_1/NA12891_dad.bam
 +wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/visualization/example_1/NA12892_mom.bam
 +</code>
 +
 +These datasets contain reads only for the [[http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000204681;r=6:29523406-29601753|GABBR1]] gene.
 +
 +
 +==== Creating indexed files ====
 +
 +Use **samtools** to index the bam files:
 +<code bash>
 +samtools index NA12878_child.bam
 +samtools index NA12891_dad.bam
 +samtools index NA12892_mom.bam
 +</code>
 +
 +Notice that three files have been created (with extension .bai):
 +<code bash>
 +ls -1
 +NA12878_child.bam
 +NA12878_child.bam.bai
 +NA12891_dad.bam
 +NA12891_dad.bam.bai
 +NA12892_mom.bam
 +NA12892_mom.bam.bai
 +</code>
 +
 +
 +
 +==== Run IGV ====
 +
 +You can run this command from the terminal using **igv**:
 +<code bash>
 +igv
 +</code>
 +
 +==== Download a reference genome  ====
 +
 +Run just in case you do not have downloaded //Human hg10 genome// before:
 +   * //Go to// **Genomes** -> **Load Genome From Server...** ->  //Select// **Human hg19**
 +
 +
 +==== Loading and browsing files  ====
 +    * //Go to// **File** -> **Load from file...** -> //Select NA12878_child.bam, NA12891_dad.bam and NA12892_mom.bam//
 +
 +**Steps**:
 +    - Enter the name of our gene (**GABBR1**) in the search box and hit **Go**.
 +    - Zoom in until you find some SNPs - they might be in exons or introns.
 +    - Identify at least one example of a short insertion variant and deletion around exon **4**.
 +    - Load and look at the SNP track: **File** -> **Load from server...** -> **Annotations** -> **Variants and Repeats** -> **dbSNP**
 +
 +
 +**NOTE**: Before continuing with the next step you should go back to your **working directory**:
 +
 +<code bash>
 +cd ..
 +</code>
 +
 +====== Variant Calling ======
 +===== Preliminaries =====
 +==== Software used in this practical ====
 +    * //[[http://samtools.sourceforge.net/ | SAMTools]]//: SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
 +    *//[[ http://picard.sourceforge.net/|Picard]]//:  Picard comprises Java-based command-line utilities that manipulate SAM files, and a Java API (SAM-JDK) for creating new programs that read and write SAM files.
 +    * //[[ http://www.broadinstitute.org/gatk/ |GATK    ]]//: Genome Analysis Toolkit - A package to analyse next-generation re-sequencing data, primary focused on variant discovery and genotyping.
 +
 +
 +==== File formats explored: ====
 +    * //[[http://samtools.sourceforge.net/SAMv1.pdf|SAM]]//: Sequence alignment format, plain text.
 +    * //[[ http://www.broadinstitute.org/igv/bam |BAM]]//: Binary and compressed version of SAM.
 +    *//[[https://samtools.github.io/hts-specs/VCFv4.2.pdf|VCF]]//: Variant Call Format.
 +
 +===== Exercise 1: Variant calling with paired-end data =====
 +
 +Create a directory for this exercise:
 +<code bash>
 +mkdir calling
 +cd calling
 +</code>
 +
 +==== Prepare reference genome: generate the fasta file index ====
 +
 +Create a new directory called **genome**:
 +<code bash>
 +mkdir genome
 +</code>
 +
 +Now we can copy the genome from the previous step (mapping):
 +<code bash>
 +cp ../mapping/index/f000_chr21_ref_genome_sequence.fa genome/
 +</code>
 +
 +Or we can download the file with the next command:
 +<code bash>
 +wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/f000_chr21_ref_genome_sequence.fa -P genome/
 +</code>
 +**NOTE**: we used option **-P** this time with **wget**, this option allows us to indicate the output folder.
 +
 +Use **samtools** to generate the fasta file index:
 +
 +<code bash>
 +samtools faidx genome/f000_chr21_ref_genome_sequence.fa
 +</code>
 +
 +This creates a file called samtools faidx f000_chr21_ref_genome_sequence.fa.fai, with one record per line for each of the contigs in the FASTA reference file.
 +
 +Generate the sequence dictionary using **Picard**.
 +<code bash>
 +picard CreateSequenceDictionary REFERENCE=genome/f000_chr21_ref_genome_sequence.fa OUTPUT=genome/f000_chr21_ref_genome_sequence.dict
 +</code>
 +
 +
 +==== Prepare BAM file====
 +
 +Create a **data** folder and download the required BAM by executing the next commands:
 +<code bash>
 +mkdir data
 +cd data
 +wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/example1/000-dna_chr21_100_hq_pe.bam
 +</code>
 +**NOTE**: this BAM file is the same file we created in the //maping// section.
 +
 +We must sort and index the BAM file before processing it with Picard and GATK. To sort the bam file we use **samtools**
 +<code bash>
 +samtools sort -o 001-dna_chr21_100_hq_pe_sorted.bam 000-dna_chr21_100_hq_pe.bam
 +</code>
 +
 +Index the BAM file:
 +<code bash>
 +samtools index 001-dna_chr21_100_hq_pe_sorted.bam
 +</code>
 +
 +
 +
 +==== Mark duplicates (using picard) ====
 +Run the following Picard command to mark duplicates:
 +<code bash>
 +picard MarkDuplicates INPUT=001-dna_chr21_100_hq_pe_sorted.bam OUTPUT=002-dna_chr21_100_hq_pe_sorted_noDup.bam METRICS_FILE=002-metrics.txt
 +</code>
 +
 +This creates a sorted BAM file called //002-dna_chr21_100_hq_pe_sorted_noDup.bam// with the same content as the input file, except that any duplicate reads are marked as such. It also produces a metrics file called //metrics.txt//.
 +
 +
 +Run the following Picard command to index the new BAM file:
 +<code bash>
 +picard BuildBamIndex INPUT=002-dna_chr21_100_hq_pe_sorted_noDup.bam
 +
 +</code>
 +
 +
 +==== Local realignment around INDELS (using GATK) ====
 +
 +There are 2 steps to the realignment process:
 +
 +First, create a target list of intervals which need to be realigned
 +
 +<code bash>
 +gatk -T RealignerTargetCreator -R ../genome/f000_chr21_ref_genome_sequence.fa -I 002-dna_chr21_100_hq_pe_sorted_noDup.bam -o 003-indelRealigner.intervals
 +</code>
 +
 +Second, perform realignment of the target intervals
 +
 +<code bash>
 +gatk -T IndelRealigner -R ../genome/f000_chr21_ref_genome_sequence.fa -I 002-dna_chr21_100_hq_pe_sorted_noDup.bam -targetIntervals 003-indelRealigner.intervals -o 003-dna_chr21_100_hq_pe_sorted_noDup_realigned.bam
 +</code>
 +
 +This creates a file called //003-dna_chr21_100_hq_pe_sorted_noDup_realigned.bam// containing all the original reads, but with better local alignments in the regions that were realigned.
 +
 +==== Base quality score recalibration (using GATK) ====
 +
 +Two steps:
 +
 +First, analyse patterns of covariation in the sequence dataset.
 +
 +It is imperative that you provide the program with a set of **known sites**, otherwise it will refuse to run. The known sites are used to build the covariation model and estimate empirical base qualities. For details on what to do if there are no known sites available for your organism of study, please see the online GATK documentation.
 +
 +Download the **known sites** file:
 +<code bash>
 +wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/000-dbSNP_chr21.vcf -P ..
 +</code>
 +
 +Execute tye GATK command:
 +<code bash>
 +gatk -T BaseRecalibrator -R ../genome/f000_chr21_ref_genome_sequence.fa -I 003-dna_chr21_100_hq_pe_sorted_noDup_realigned.bam -knownSites ../000-dbSNP_chr21.vcf -o 004-recalibration_data.table
 +</code>
 +
 +This creates a GATKReport file called //004-recalibration_data.table// containing several tables. These tables contain the covariation data that will be used in a later step to recalibrate the base qualities of your sequence data.
 +
 +**NOTE**: this step could take several minutes, if you do not want to wait until the comand finish you can download the result file by executing the next command:
 +<code bash>
 +wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/example1/004-recalibration_data.table
 +</code>
 +
 +Second, apply the recalibration to your sequence data
 +
 +<code bash>
 +gatk -T PrintReads -R ../genome/f000_chr21_ref_genome_sequence.fa -I 003-dna_chr21_100_hq_pe_sorted_noDup_realigned.bam -BQSR 004-recalibration_data.table -o 004-dna_chr21_100_hq_pe_sorted_noDup_realigned_recalibrated.bam
 +</code>
 +
 +**NOTE**: this step could take several minutes, if you do not want to wait until the comand finish you can download the result files by executing the next commands:
 +<code bash>
 +wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/example1/004-dna_chr21_100_hq_pe_sorted_noDup_realigned_recalibrated.bai
 +wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/example1/004-dna_chr21_100_hq_pe_sorted_noDup_realigned_recalibrated.bam
 +</code>
 +
 +
 +This creates a file called //004-dna_chr21_100_hq_pe_sorted_noDup_realigned_recalibrated.bam// containing all the original reads, but now with exquisitely accurate base substitution, insertion and deletion quality scores. By default, the original quality scores are discarded in order to keep the file size down. However, you have the option to retain them by adding the flag **–emit_original_quals** to the **PrintReads** command, in which case the original qualities will also be written in the file, tagged OQ.
 +
 +
 +==== Variant calling (using GATK - UnifiedGenotyper) ====
 +
 +SNPs are called using the next command:
 +<code bash>
 +gatk -T UnifiedGenotyper -R ../genome/f000_chr21_ref_genome_sequence.fa -I 004-dna_chr21_100_hq_pe_sorted_noDup_realigned_recalibrated.bam -o 005-dna_chr21_100_he_pe.vcf
 +</code>
 +
 +**NOTE**: this step could take several minutes, if you do not want to wait until the comand finish you can download the result files by executing the next commands:
 +<code bash>
 +wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/example1/005-dna_chr21_100_he_pe.vcf
 +wget http://bioinfo.cipf.es/apps-beta/gda16/ngstutorial/calling-old/example1/005-dna_chr21_100_he_pe.vcf.idx
 +</code>
 +
 +==== Introduce filters in the VCF file  ====
 +
 +Example: filter SNPs with low confidence calling (QD < 12.0) and flag them as “LowConf”.
 +
 +<code bash>
 +gatk -T VariantFiltration -R ../genome/f000_chr21_ref_genome_sequence.fa -V 005-dna_chr21_100_he_pe.vcf --filterExpression "QD < 12.0" --filterName "LowConf" -o 006-dna_chr21_100_he_pe_filtered.vcf
 +
 +</code>
 +
 +The command **--filterExpression** will read the INFO field and check whether variants satisfy the requirement. If a variant does not satisfy your filter expression, the field FILTER will be filled with the indicated **--filterName**. These commands can be called several times indicating different filtering expression (i.e: –filterName One –filterExpression “X < 1” –filterName Two –filterExpression “X > 2”).
 +
 +QUESTION: How many “LowConf” variants have we obtained?
 +
 +<code bash>
 +grep LowConf 006-dna_chr21_100_he_pe_filtered.vcf | wc -l
 +</code>