What Percentage of My Reads Should Align
Mapping
Overview
Questions:
- What is mapping?
- What two things are crucial for a correct mapping?
- What is BAM?
Objectives:
- You will learn what mapping is
- A genome browser is shown that helps you to understand your data
Requirements:
- Introduction to Galaxy Analyses
- Sequence analysis
- Quality Control: slides - tutorial hands-on
Time estimation: 60 minutes
Last modification: Nov 15, 2021
Introduction
Sequencing produces a collection of sequences without genomic context. We do not know which part of the genome the sequences correspond to. Mapping the reads of an experiment to a reference genome is a key step in modern genomic data analysis. With the mapping, the reads are assigned to a specific location in the genome, and insights like the expression level of genes can be gained.
The short reads do not come with position information, so we do not know what part of the genome they came from. We need to use the sequence of the read itself to find the corresponding region in the reference sequence. But the reference sequence can be quite long (~3 billion bases for human), making it a daunting task to find a matching region. Since our reads are short, there may be several equally likely places in the reference sequence from which they could have been read. This is particularly true for repetitive regions.
In principle, we could do a BLAST analysis to figure out where the sequenced pieces fit best in the known genome. We would need to do that for each of the millions of reads in our sequencing data. Aligning millions of short sequences this way may, however, take a couple of weeks. And we do not really care about the exact base-to-base correspondence (alignment). What we are interested in is "where these reads came from". This approach is called mapping.
In the following, we will process a dataset with the mapper Bowtie2 and we will visualize the data with the program IGV.
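To make the idea of "finding where a read came from" concrete, here is a minimal toy sketch in Python. It is not how Bowtie2 works internally (real mappers use compressed indexes and handle indels); it only assumes a simple k-mer lookup table, and the reference string, reads and function names are made up for illustration.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def count_mismatches(read, reference, pos):
    """Count mismatching bases when the read is placed at pos (no indels)."""
    window = reference[pos:pos + len(read)]
    if len(window) < len(read):
        return None  # the read would run off the end of the reference
    return sum(1 for a, b in zip(read, window) if a != b)

def map_read(read, reference, index, k, max_mismatches=1):
    """Use the first k bases of the read as a seed, then verify each candidate."""
    hits = []
    for pos in index.get(read[:k], []):
        mismatches = count_mismatches(read, reference, pos)
        if mismatches is not None and mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits

# Hypothetical toy reference containing a repeated "ACGT" motif.
reference = "ACGTACGTTTGACCAGTACGTAAC"
index = build_kmer_index(reference, k=4)

unique_hits = map_read("TTGACCAG", reference, index, k=4)  # one placement
repeat_hits = map_read("ACGTA", reference, index, k=4)     # several placements
print(unique_hits)
print(repeat_hits)
```

The second read falls in the repeated motif and gets several candidate positions, which is the multi-mapping situation described above for repetitive regions.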
Agenda
In this tutorial, we will deal with:
- Prepare the data
- Map reads on a reference genome
- Inspection of a BAM file
- Visualization using a Genome Browser (IGV)
- Visualization using a Genome Browser (JBrowse)
Prepare the data
Hands-on: Data upload
Create a new history for this tutorial and give it a proper name
Tip: Creating a new history
Click the new-history icon at the top of the history panel.
If the new-history icon is missing:
- Click on the galaxy-gear icon (History options) on the top of the history panel
- Select the option Create New from the menu
Tip: Renaming a history
- Click on Unnamed history (or the current name of the history) (Click to rename history) at the top of your history panel
- Type the new name
- Press Enter
Import wt_H3K4me3_read1.fastq.gz and wt_H3K4me3_read2.fastq.gz from Zenodo or from the data library (ask your instructor)
https://zenodo.org/record/1324070/files/wt_H3K4me3_read1.fastq.gz
https://zenodo.org/record/1324070/files/wt_H3K4me3_read2.fastq.gz
Tip: Importing via links
- Copy the link location
- Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)
- Select Paste/Fetch Data
- Paste the link into the text field
- Press Start
- Close the window
Tip: Importing data from a data library
As an alternative to uploading the data from a URL or your computer, the files may also have been made available from a shared data library:
- Go into Shared data (top panel) then Data libraries
- Navigate to the correct folder as indicated by your instructor
- Select the desired files
- Click on the To History button near the top and select as Datasets from the dropdown menu
- In the pop-up window, select the history you want to import the files to (or create a new one)
- Click on Import
By default, Galaxy uses the link as the dataset name, so rename the files.
Rename the files to reads_1 and reads_2
Tip: Renaming a dataset
- Click on the galaxy-pencil pencil icon for the dataset to edit its attributes
- In the central panel, change the Name field
- Click the Save button
We just imported into Galaxy FASTQ files corresponding to paired-end data, as we could get directly from a sequencing facility.
During sequencing, errors are introduced, such as incorrect nucleotides being called. Sequencing errors might bias the analysis and can lead to a misinterpretation of the data. The first step for any type of sequencing data is always to check its quality.
There is a dedicated tutorial for quality control of sequencing data. We will not repeat the steps here. You should follow that tutorial and apply it to your data before going further.
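As a rough illustration of what "checking quality" means at the file level, the sketch below reads FASTQ records and converts the quality string to Phred scores, assuming the common Phred+33 encoding. It is not a replacement for the quality control tutorial; the two records and the helper names are invented for the example.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality_string) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        sequence = handle.readline().rstrip()
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip()
        yield header[1:], sequence, quality

def phred_scores(quality_string, offset=33):
    """Convert ASCII quality characters to Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]

def mean_quality(quality_string):
    scores = phred_scores(quality_string)
    return sum(scores) / len(scores)

# Two made-up records: 'I' encodes Phred 40, '!' encodes Phred 0.
example = io.StringIO("@read_1\nACGT\n+\nIIII\n@read_2\nACGT\n+\n!!II\n")
means = {name: mean_quality(quality) for name, _, quality in read_fastq(example)}
print(means)
```

For a compressed file such as the ones imported above, the same generator could be fed a handle opened with `gzip.open(path, "rt")`.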
Map reads on a reference genome
Read mapping is the process of aligning the reads to a reference genome. A mapper takes as input a reference genome and a set of reads. Its aim is to align each read in the set of reads to the reference genome, allowing mismatches, indels and clipping of some short fragments at the two ends of the reads.
We need a reference genome to map the reads on.
Questions
- What is a reference genome?
- For each model organism, several possible reference genomes may be available (e.g. hg19 and hg38 for human). What do they correspond to?
- Which reference genome should we use?
Solution
- A reference genome (or reference assembly) is a set of nucleic acid sequences assembled as a representative example of a species' genetic material. As they are often assembled from the sequencing of different individuals, they do not accurately represent the set of genes of any single organism, but a mosaic of different nucleic acid sequences from each individual.
- As the cost of DNA sequencing falls, and new full genome sequencing technologies emerge, more genome sequences continue to be generated. Using these new sequences, new alignments are built and the reference genomes improved (fewer gaps, fixed misrepresentations in the sequence, etc). The different reference genomes correspond to the different released versions (called "builds").
- This data comes from the ChIP-seq of mice, so we will use mm10 (Mus musculus).
Currently, there are over 60 different mappers, and their number is growing. In this tutorial, we will use Bowtie2, a fast and memory-efficient open-source tool particularly good at aligning sequencing reads of about 50 up to 1,000s of bases to relatively long genomes.
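For readers curious what a comparable paired-end run might look like outside Galaxy, the sketch below only assembles a Bowtie2 command line in Python; it does not execute anything. The flags `-x`, `-1`, `-2`, `-S` and `-p` are standard Bowtie2 options, but the index basename `mm10`, the file names and the helper function are assumptions for illustration, not what the Galaxy wrapper runs.

```python
import shlex

def bowtie2_paired_command(index_basename, reads_1, reads_2, output_sam, threads=1):
    """Build (but do not run) a basic paired-end Bowtie2 command."""
    return [
        "bowtie2",
        "-x", index_basename,  # basename of a pre-built Bowtie2 index (assumed to exist)
        "-1", reads_1,         # file with the first mates
        "-2", reads_2,         # file with the second mates
        "-S", output_sam,      # SAM output file
        "-p", str(threads),    # number of alignment threads
    ]

# Hypothetical file names matching the datasets renamed earlier.
command = bowtie2_paired_command(
    "mm10", "reads_1.fastq.gz", "reads_2.fastq.gz", "aligned_reads.sam", threads=4
)
print(shlex.join(command))
```

Keeping the command as a list makes it easy to pass to `subprocess.run` later without shell quoting problems.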
Hands-on: Mapping with Bowtie2
- Bowtie2 Tool: toolshed.g2.bx.psu.edu/repos/devteam/bowtie2/bowtie2/2.4.2+galaxy0 with the following parameters
- "Is this single or paired library": Paired-end
- "FASTA/Q file #1": reads_1
- "FASTA/Q file #2": reads_2
- "Do you want to set paired-end options?": No
You should have a look at the parameters there, especially the mate orientation if you know it. They can improve the quality of the paired-end mapping.
- "Will you select a reference genome from your history or use a built-in index?": Use a built-in genome index
- "Select reference genome": Mouse (Mus musculus): mm10
- "Select analysis mode": Default setting only
You should have a look at the non-default parameters and try to understand them. They can have an impact on the mapping and may improve it.
- "Save the bowtie2 mapping statistics to the history": Yes
- Inspect the mapping stats file by clicking on the galaxy-eye (eye) icon
Questions
- What information is provided here?
- How many reads have been mapped exactly 1 time?
- How many reads have been mapped more than 1 time? How is it possible? What should we do with them?
- How many pairs of reads have not been mapped? What are the causes?
Solution
- The information given here is quantitative. We can see how many sequences are aligned. It does not tell us anything about the quality.
- ~90% of reads have been aligned exactly 1 time
- ~7% of reads have been aligned concordantly >1 times. These are called multi-mapped reads. It can happen because of repetitions in the reference genome (multiple copies of a gene for instance), particularly when the reads are small. It is difficult to decide where these sequences come from and therefore most of the pipelines ignore them. Always check the statistics there to be sure of not discarding too much information in any downstream analyses.
- ~3% of read pairs have not been mapped because
- both reads in the pair aligned but their positions do not concord with a pair of reads (aligned discordantly 1 time)
- reads of these pairs are multi-mapped (aligned >1 times in pairs aligned 0 times concordantly or discordantly)
- one read of these pairs is mapped but not the paired read (aligned exactly 1 time in pairs aligned 0 times concordantly or discordantly)
- the rest are not mapped at all
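The percentages discussed above can be recomputed from the raw counts in a Bowtie2 paired-end summary. The following is a minimal sketch that assumes the usual layout of that summary; the numbers in `SUMMARY` are illustrative and are not the statistics of this tutorial's dataset, and `count_for` is a hypothetical helper.

```python
import re

# Illustrative summary text in the layout Bowtie2 prints for paired-end runs.
SUMMARY = """\
10000 reads; of these:
  10000 (100.00%) were paired; of these:
    650 (6.50%) aligned concordantly 0 times
    8823 (88.23%) aligned concordantly exactly 1 time
    527 (5.27%) aligned concordantly >1 times
    ----
    650 pairs aligned concordantly 0 times; of these:
      34 (5.23%) aligned discordantly 1 time
    ----
    616 pairs aligned 0 times concordantly or discordantly; of these:
      1232 mates make up the pairs; of these:
        660 (53.57%) aligned 0 times
        571 (46.35%) aligned exactly 1 time
        1 (0.08%) aligned >1 times
96.70% overall alignment rate
"""

def count_for(text, label):
    """Return the count on the line of the form '<count> (<pct>%) <label>'."""
    pattern = r"(\d+) \([\d.]+%\) " + re.escape(label) + r"\s*$"
    match = re.search(pattern, text, flags=re.MULTILINE)
    if match is None:
        raise ValueError(f"label not found: {label}")
    return int(match.group(1))

pairs = count_for(SUMMARY, "were paired; of these:")
concordant_once = count_for(SUMMARY, "aligned concordantly exactly 1 time")
concordant_multi = count_for(SUMMARY, "aligned concordantly >1 times")
discordant = count_for(SUMMARY, "aligned discordantly 1 time")
mates_once = count_for(SUMMARY, "aligned exactly 1 time")
mates_multi = count_for(SUMMARY, "aligned >1 times")

# Every aligned pair contributes two reads; leftover mates are counted singly.
aligned_reads = 2 * (concordant_once + concordant_multi + discordant) + mates_once + mates_multi
overall_rate = aligned_reads / (2 * pairs)

print(f"uniquely and concordantly aligned pairs: {concordant_once / pairs:.2%}")
print(f"multi-mapped concordant pairs: {concordant_multi / pairs:.2%}")
print(f"overall alignment rate: {overall_rate:.2%}")
```

Recomputing the overall rate from the parts is a quick sanity check that the categories were read correctly.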
Checking the mapping statistics is an important step to do before continuing any analyses. There are several potential sources for errors in mapping, including (but not limited to):
- Polymerase Chain Reaction (PCR) artifacts: Many high-throughput sequencing (HTS) methods involve one or multiple PCR steps. PCR errors will show as mismatches in the alignment, and especially errors in early PCR rounds will show up in multiple reads, falsely suggesting genetic variation in the sample. A related error would be PCR duplicates, where the same read pair occurs multiple times, skewing coverage calculations in the alignment.
- Sequencing errors: The sequencing machine can make an erroneous call either for physical reasons (e.g. oil on an Illumina slide) or due to properties of the sequenced DNA (e.g. homopolymers). As sequencing errors are often random, they can be filtered out as singleton reads during variant calling.
- Mapping errors: The mapping algorithm can map a read to the wrong location in the reference. This often happens around repeats or other low-complexity regions.
So if the mapping statistics are not good, you should investigate the cause of these errors before going further in your analyses.
After that, you should have a look at the reads and inspect the BAM file where the read mappings are stored.
Inspection of a BAM file
A BAM (Binary Alignment Map) file is a compressed binary file storing the read sequences, whether they have been aligned to a reference sequence (e.g. a chromosome), and if so, the position on the reference sequence at which they have been aligned.
Hands-on: Inspect a BAM/SAM file
- Inspect the output of Bowtie2 tool
A BAM file (or a SAM file, the non-compressed version) consists of:
- A header section (the lines starting with @) containing metadata, particularly the chromosome names and lengths (lines starting with the @SQ symbol)
- An alignment section consisting of a table with 11 mandatory fields, as well as a variable number of optional fields:

| Col | Field | Type | Brief Description |
|---|---|---|---|
| 1 | QNAME | String | Query template NAME |
| 2 | FLAG | Integer | Bitwise FLAG |
| 3 | RNAME | String | Reference sequence NAME |
| 4 | POS | Integer | 1-based leftmost mapping POSition |
| 5 | MAPQ | Integer | MAPping Quality |
| 6 | CIGAR | String | CIGAR string |
| 7 | RNEXT | String | Ref. name of the mate/next read |
| 8 | PNEXT | Integer | Position of the mate/next read |
| 9 | TLEN | Integer | Observed Template LENgth |
| 10 | SEQ | String | Segment SEQuence |
| 11 | QUAL | String | ASCII of Phred-scaled base QUALity+33 |
Questions
- Which information do you find in a SAM/BAM file?
- What is the additional information compared to a FASTQ file?
Solution
- Sequences and quality information, similar a FASTQ
- Mapping information, Location of the read on the chromosome, Mapping quality, etc
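To see how the 11 mandatory fields fit together, here is a small sketch that splits one SAM alignment line and decodes its bitwise FLAG. The record is invented for illustration, the class and function names are hypothetical, and only a subset of the FLAG bits defined by the SAM format is listed.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: tuple  # optional fields, kept as raw strings

# A subset of the FLAG bits from the SAM format.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}

def parse_sam_line(line):
    """Split a tab-separated alignment line into its mandatory and optional fields."""
    f = line.rstrip("\n").split("\t")
    return SamRecord(
        f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
        f[6], int(f[7]), int(f[8]), f[9], f[10], tuple(f[11:]),
    )

def decode_flag(flag):
    """List the names of all bits set in the FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

# Made-up alignment line; real lines come from the alignment section of a SAM file.
line = "read_1\t99\tchr2\t98666300\t42\t8M\t=\t98666450\t158\tACGTACGT\tIIIIIIII\tNM:i:0"
record = parse_sam_line(line)
print(record.rname, record.pos, record.mapq, decode_flag(record.flag))
```

Compared with a FASTQ record, the extra columns (RNAME, POS, MAPQ, CIGAR and the mate fields) are exactly the mapping information mentioned in the solution.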
The BAM file includes a lot of information about each read, particularly the quality of mapping.
Hands-on: Summary of mapping quality
- Samtools Stats Tool: toolshed.g2.bx.psu.edu/repos/devteam/samtools_stats/samtools_stats/2.0.2+galaxy2 with the following parameters
- "BAM file": aligned reads (output of Bowtie2 tool)
- "Use reference sequence": Locally cached/Use a built-in genome
- "Using genome": Mouse (Mus musculus): mm10 Full
- Inspect the Stats file
Questions
- What is the proportion of mismatches in the mapped reads when aligned to the reference genome?
- What does the error rate represent?
- What is the average quality? How is it represented?
- What is the insert size average?
- How many reads have a mapping quality score below 20?
Solution
- There are ~21,900 mismatches for ~4,753,900 bases mapped, which on average produces ~0.005 mismatches per mapped base.
- The error rate is the proportion of mismatches per mapped base, so the ratio computed right before.
- The average quality is the mean quality score of the mapping. It is a Phred score like the one used in the FASTQ file for each nucleotide. But here the score is not per nucleotide but per read, and it expresses the mapping quality as a probability.
- The insert size is the distance between the two reads in the pairs.
- To get the info:
- Filter BAM datasets on a variety of attributes tool with a filter to keep only the reads with a mapping quality >= 20
- Stats generate statistics for BAM dataset tool on the output of Filter
Before filtering: 95,412 reads and after filtering: 89,664 reads, so 5,748 reads have a mapping quality score below 20.
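The arithmetic behind these answers can be checked in a few lines. The sketch below uses the counts quoted in the solution and the standard Phred relation Q = -10 log10(p); the function names are just illustrative.

```python
import math

def mismatch_rate(mismatches, bases_mapped):
    """Error rate as reported in the stats: mismatches per mapped base."""
    return mismatches / bases_mapped

def phred_to_error_probability(score):
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-score / 10)

def error_probability_to_phred(probability):
    return -10 * math.log10(probability)

# Counts quoted in the solution above.
rate = mismatch_rate(21_900, 4_753_900)
reads_below_q20 = 95_412 - 89_664
fraction_below_q20 = reads_below_q20 / 95_412

# A mapping quality of 20 means a 1 in 100 chance that the placement is wrong.
p_wrong_at_q20 = phred_to_error_probability(20)

print(f"mismatches per mapped base: {rate:.4f}")
print(f"reads with MAPQ < 20: {reads_below_q20} ({fraction_below_q20:.1%})")
print(f"error probability at MAPQ 20: {p_wrong_at_q20}")
```

This is why 20 is a common cut-off: reads kept by the filter each have at most about a 1% chance of being misplaced.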
Visualization using a Genome Browser (IGV)
The Integrative Genomics Viewer (IGV) is a high-performance visualization tool for interactive exploration of large, integrated genomic datasets. It supports a wide variety of data types, including array-based and next-generation sequence data, and genomic annotations. In the following, we will use it to visualize the mapped reads.
Hands-on: Visualization of the reads in IGV
There are two ways to run it:
- If you have IGV installed (or want to install it):
- Install IGV
- Launch IGV on your computer
- Expand the output of Bowtie2 tool
- Click on the local in display with IGV to load the reads into the IGV browser
- If you do not have IGV
- Click on the Mouse mm10 (or correct organism) in display with IGV to load the reads into the IGV browser
- Zoom on the chr2:98,666,236-98,667,473
The reads have a direction: they are mapped to the forward or reverse strand, respectively. When hovering over a read, extra information is displayed.
Questions
- What could it mean if a bar in the coverage view is colored?
- What could be the reason why a read is white instead of grey?
Solution
- If a nucleotide differs from the reference sequence in more than 20% of quality weighted reads, IGV colors the bar in proportion to the read count of each base.
- They have a mapping quality equal to zero. Interpretation of this mapping quality depends on the mapping aligner, as some commonly used aligners use this convention to mark a read with multiple alignments. In such a case, the read also maps to another location with equally good placement. It is also possible that the read could not be uniquely placed but the other placements do not necessarily give equally good quality hits.
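The 20% rule for colored coverage bars can be sketched as a simple threshold on the bases observed at one position. This is a simplification: IGV weights reads by quality, whereas the sketch below counts every read equally, and the pileup counts and function names are made up for illustration.

```python
def mismatch_fraction(base_counts, reference_base):
    """Fraction of reads at a position whose base differs from the reference."""
    total = sum(base_counts.values())
    if total == 0:
        return 0.0
    return (total - base_counts.get(reference_base, 0)) / total

def is_colored(base_counts, reference_base, threshold=0.2):
    """Unweighted approximation of the 'more than 20% of reads differ' rule."""
    return mismatch_fraction(base_counts, reference_base) > threshold

# Hypothetical pileups at two positions where the reference base is 'A'.
likely_variant = {"A": 12, "G": 8}     # 40% of reads show G
likely_seq_error = {"A": 19, "T": 1}   # 5% of reads show T

print(is_colored(likely_variant, "A"))
print(is_colored(likely_seq_error, "A"))
```

A single discordant read stays below the threshold, which is how isolated sequencing errors are kept from cluttering the coverage view.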
Visualization using a Genome Browser (JBrowse)
JBrowse Tool: toolshed.g2.bx.psu.edu/repos/iuc/jbrowse/jbrowse/1.16.11+galaxy0 is an alternative, web-based genome browser. Whereas IGV is a piece of software you must download and run, JBrowse instances are websites hosted online that provide an interface to browse genomics data. We'll use it to visualise the mapped reads.
Hands-on: Visualization of the reads in JBrowse
- JBrowse tool genome browser, with the following parameters:
- "Reference genome to display": Use a built-in genome
- "Select a reference genome": mm10
- "JBrowse-in-Galaxy Action": New JBrowse Instance
- "Insert Track Group":
- "Insert Annotation Track":
- "Track Type": BAM Pileups
- "BAM Track Data": aligned reads (output of Bowtie2 tool)
- "Autogenerate SNP Track": Yes
- "Track Visibility": On for new users
- Visualise the dataset (galaxy-eye icon)
- Zoom on the chr2:98,666,236-98,667,473
The reads have a direction: they are mapped to the forward or reverse strand, respectively. When clicking on a read, extra information is displayed.
Questions
- What do the teardrop shape and line mean in the autogenerated SNP track?
- What do differently coloured reads mean?
Solution
- If enough reads have a different value, then it is marked with a teardrop icon. The coverage plot is marked in height with the percentage of reads with a different call at that position
- Colour Codes:

| Colour | Meaning |
|---|---|
| Original red | Forward strand |
| Original blue | Reverse strand |
| Hard red | Forward strand, missing mate |
| Hard blue | Reverse strand, missing mate |
| Light red | Forward strand, not proper |
| Light blue | Reverse strand, not proper |
| Black | Forward, diff chr |
| Grey | Reverse, diff chr |
| Grey | No strand |
Conclusion
After quality control, mapping is an important step of most analyses of sequencing data (RNA-Seq, ChIP-Seq, etc) to determine where in the genome our reads originated from and use this information for downstream analyses.
Key points
Know your data!
Mapping is not trivial
There are many mapping algorithms; which one to choose depends on your data
Frequently Asked Questions
Have questions about this tutorial? Check out the tutorial FAQ page or the FAQ page for the Sequence analysis topic to see if your question is listed there. If not, please ask your question on the GTN Gitter Channel or the Galaxy Help Forum
Useful literature
Further information, including links to documentation and original publications, regarding the tools, analysis techniques and the interpretation of results described in this tutorial can be found here.
Citing this Tutorial
- Joachim Wolff, Bérénice Batut, Helena Rasche, 2021 Mapping (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/mapping/tutorial.html Online; accessed TODAY
- Batut et al., 2018 Community-Driven Data Analysis Training for Biology. Cell Systems 10.1016/j.cels.2018.05.012
BibTeX
@misc{sequence-analysis-mapping,
  author = "Joachim Wolff and Bérénice Batut and Helena Rasche",
  title = "Mapping (Galaxy Training Materials)",
  year = "2021",
  month = "11",
  day = "15",
  url = "\url{https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/mapping/tutorial.html}",
  note = "[Online; accessed TODAY]"
}
@article{Batut_2018,
  doi = {10.1016/j.cels.2018.05.012},
  url = {https://doi.org/10.1016%2Fj.cels.2018.05.012},
  year = 2018,
  month = {jun},
  publisher = {Elsevier {BV}},
  volume = {6},
  number = {6},
  pages = {752--758.e1},
  author = {B{\'{e}}r{\'{e}}nice Batut and Saskia Hiltemann and Andrea Bagnacani and Dannon Baker and Vivek Bhardwaj and Clemens Blank and Anthony Bretaudeau and Loraine Brillet-Gu{\'{e}}guen and Martin {\v{C}}ech and John Chilton and Dave Clements and Olivia Doppelt-Azeroual and Anika Erxleben and Mallory Ann Freeberg and Simon Gladman and Youri Hoogstrate and Hans-Rudolf Hotz and Torsten Houwaart and Pratik Jagtap and Delphine Larivi{\`{e}}re and Gildas Le Corguill{\'{e}} and Thomas Manke and Fabien Mareuil and Fidel Ram{\'{\i}}rez and Devon Ryan and Florian Christoph Sigloch and Nicola Soranzo and Joachim Wolff and Pavankumar Videm and Markus Wolfien and Aisanjiang Wubuli and Dilmurat Yusuf and James Taylor and Rolf Backofen and Anton Nekrutenko and Björn Grüning},
  title = {Community-Driven Data Analysis Training for Biology},
  journal = {Cell Systems}
}
Congratulations on successfully completing this tutorial!
Do you want to extend your knowledge? Follow one of our recommended follow-up trainings:
- Transcriptomics
- Reference-based RNA-Seq data analysis: tutorial hands-on
- Epigenetics
- Formation of the Super-Structures on the Inactive X: slides - tutorial hands-on
Source: https://galaxyproject.github.io/training-material/topics/sequence-analysis/tutorials/mapping/tutorial.html