Time - 3 hrs
Difficulty - Quite
As Alan explained in his talk. There are basically only two things you can do with your reads - Alignment or assembly. Running the programs that perform these tasks is relatively simple, the hard part is extracting and interpreting the information afterwards.
CONFUSED??? An excellent point resource for all things relating to microbial sequencing is here. These are a series of easy to follow slides relating to microbial genomics. I would particularly recommend these slides as they cover a lot of things in this tutorial.
Go to where your data is located
In here you will find your reads in fastq format (usually denoted as fastq.gz). There are two files (read1 and it’s partner read2). When you receive sequencing data it is often raw. This means that nothing has been done to it. It is quite common that you need to trim adapters and possibly quality trim (based on Q-score, see Alan’s talk). The data that you have been given here is adaptor trimmed and quality trimmed for your convenience. It is something to bear in mind when you receive data as untrimmed reads might affect your results. To trim reads we use Trimmomatic (which is installed on your GVL) but there are many tools available.
In an ideal world all sequenced genomes would assemble into one contig and their respective plasmids. Most of the reference genomes that are available like this have either been sanger sequenced, closed (using sanger sequencing), or sequenced using long read technology.
Sanger sequencing of a whole genome is costly and time consuming. Closing a fragmented genome comprised of multiple contigs can require quite a lot of work, designing primers for the edges of the contigs and performing PCR across the gaps. Long read technology is available using PacBio or Oxford nanopore instruments (for example), however, the former is at least 10 times more expensive than short read sequencing and the latter hasn’t been rolled out for public use…yet.
Most sequencing projects involve Illumina technology (discussed in Alan’s talk). This produces short reads, ranging between 36-250bp. Assembly using this type of data often results in many fragments rather than a circularised genome. These fragments are called contigs.
You tend to get a break in the assembly at repetitive regions (such as tandem repeats or genes that have multiple copies in the genome, such as 16s rRNA). The reads just aren’t long enough to span these regions.
The data that you are going to play with today is actually 250bp paired-end. This means that each read is 250bp and has a partner, this can improve your assembly because you have two reads that you know approximately know how far they are from one another. This sometimes allows you to span certain repetitive regions, thus improving the assembly.
- Firstly log into your terminal as described here
Most bioinformatics tools don’t have a fancy interface and require you to run them from the command line.
We are going to start by using a piece of software called SPAdes. This does not need a reference genome, it shreds the reads into smaller pieces called k-mers and aligns these to one another using a de bruijn graph. If you are interested in how this works look here. An assembly without a reference is called a de novo assembly.
There are many parameters that can be used with spades, the link to SPAdes above elaborates on these, or you can type:
Try running this command in the terminal
spades.py -1 set_a/3771_1_trimmed.fastq.gz -2 set_a/3771_2_trimmed.fastq.gz --careful -o 3771_assembly
The parameters that we used are:
-t #this tells SPAdes how many cores in you VM to use.
-1 #this tells SPAdes where the read1 file is located
-2 #this tells SPAdes where the read2 file is located
–careful #this is telling SPAdes to map the reads back to the assembly to error correct
-o #this is telling SPAdes what to call the new directory that it is making.
Once the assembly has finished lets have a look at the output that SPAdes spit out.
For the purposes of this tutorial we are just going to look at contigs.fasta
contigs.fasta, as the name suggests, contains your assembled contigs in multifasta format.
- Can you use grep to figure out how many contigs this assembly is comprised of?
- How are the contigs ordered? # clue: use grep and look at the description in the headers (SPAdes refers to contigs as NODEs)
ANSWER AFTER TUTORIAL
There are a few things that we look at to assess the quality of an assembly (other than the number of reads). There is a piece of software called quast which gives some general assembly statistics. Unfortunately…it isn’t installed on your machine
But wait, this is your machine, you can install software yourselves! Let’s have a go.
Ensure that you are in the right directory (pwd command). You should be in /home/ubuntu/Desktop/NGS, from here use wget to download the software.
# this command is just saying use wget to download this file and put it in the current directory) wget 'https://sourceforge.net/projects/quast/files/quast-4.3.tar.gz' . # tar.gz files are compressed (a bit like zip files in Windows), tar is extracting the files into a directory called quast-4.3.tar.gz tar -zxvf quast-4.3.tar.gz # go into the new directory cd quast-4.3 #The documentation for this software (if you want to read it type less README.md) tells you to install the software with this command, so let's try (it make take a little while). ./setup.py install_full
WELL DONE! You installed your first piece of software!
Now let’s have a look at the quality of your genome, we’ll include another assembly sem_assembly/sem.fasta so that we can compare their stats.
#This command is saying run quast against these two assemblies and output into a folder called check_assemblies quast.py 3771_assembly/contigs.fasta sem_assembly/sem.fasta -o check_assemblies
Open your virtual desktop /vnc (e.g mine is http://18.104.22.168/vnc/), you can see a folder called NGS open this and then open check_assemblies/report.html
Which assembly is better? Why?
I have downloaded the entire nucleotide database that BLAST uses and put it in /home/ubuntu/Desktop/NGS/blast_db
# DON'T TRY THIS NOW # but in the future when you want to ensure that your database is completely up to date try this cd /home/ubuntu/NGS/blast_db /mnt/galaxy/tools/blast+/2.2.28/iuc/package_blast_plus_2_2_28/767d3d3b5cbe/update_blastdb.pl nt
You have already tried web blast, now we are going to perform it on the command line to try to figure out whatspecies your assembly belongs to.
# USE THIS - For the sake of this tutorial we will take the first 10 lines from each contig and blast them grep -A10 ">" 3771_assembly/contigs.fasta | sed 's/-//'g | head -n110 | blastn -query - -db blast_db/nt -outfmt '6 qseqid length sseqid stitle pident' -num_alignments 1 > 3771_blast_contigs.out # DON'T RUN THIS NOW - if you wanted to balst the whole contigs you would use this command. blastn -query 3771_assembly/contigs.fasta -db blast_db/nt -outfmt '6 qseqid length sseqid stitle pident' -num_alignments 1 > 3771_blast_contigs.out
This command has taken each contig and blasted it against the entire nucleotide database and then taken the top hit and the first hsp. This has been saved in a file called 3771_blast_contigs.out. Here is an example of the output, you might find that your top hits are somewhat different…
NODE_1_length_546572_cov_29.0065_ID_1330 87240 gi|669634860|gb|CP008787.1| Campylobacter jejuni subsp. jejuni strain MTVDSCj20, complete genome
NODE_2_length_321420_cov_28.8952_ID_1332 120287 gi|669634860|gb|CP008787.1| Campylobacter jejuni subsp. jejuni strain MTVDSCj20, complete genome
NODE_3_length_237887_cov_30.7473_ID_1334 111251 gi|627770052|gb|CP005388.1| Campylobacter jejuni subsp. jejuni CG8421, complete genome
NODE_4_length_207494_cov_31.5446_ID_1336 73182 gi|315057441|gb|CP001960.1| Campylobacter jejuni subsp. jejuni S3, complete genome
NODE_5_length_181116_cov_34.0493_ID_1338 58718 gi|284925303|gb|CP001876.1| Campylobacter jejuni subsp. jejuni IA3902, complete genome
NODE_6_length_102153_cov_32.7344_ID_1340 40644 gi|756009372|gb|CP010306.1| Campylobacter jejuni subsp. jejuni strain 00-1597, complete genome
NODE_7_length_51041_cov_27.3065_ID_1342 21267 gi|998213231|gb|CP014345.1| Campylobacter jejuni strain RM3194 plasmid, complete sequence
NODE_8_length_42496_cov_26.7404_ID_1344 23866 gi|669634860|gb|CP008787.1| Campylobacter jejuni subsp. jejuni strain MTVDSCj20, complete genome
NODE_9_length_42052_cov_21.2926_ID_1346 19597 gi|943353879|gb|CP013033.1| Campylobacter coli strain CO2-160 plasmid pccdm3, complete sequence
NODE_10_length_19638_cov_19.8489_ID_1348 17419 gi|1004803719|gb|CP014745.1| Campylobacter jejuni strain OD267 plasmid pCJDM67 L, complete sequence
We can get a summary of what’s in there using uniq -c (the -c shows that count for each uniq occurrence)
ANSWER AFTER TUTORIAL - we'll work through this together
Now we know what our genome is, let’s blast against it’s reference and compare. I have downloaded the reference for your convenience.
Let’s run blast to compare the reference against our contigs.
blastn -query 3771_assembly/contigs.fasta -subject refs/Escherichia_coli_K_12_substr__W3110_uid161931.fna -outfmt 6 > 3771_vs_W3110.out
We can visualise this in ACT. On your desktop click the button in the bottom left corner and then select Run (second from bottom of list)
Type in act and then press return.
You will now see the menu for act.
- Open these three files in this order in act 3771_assembly/contigs.fasta , 3771_vs_W3110.out, refs/Escherichia_coli_K_12_substr__W3110_uid161931.fna
How does that look? Probably not useful!
We can run a piece of software called abacas to order these contigs
./abacas.1.3.1.pl -r refs/Escherichia_coli_K_12_substr__W3110_uid161931.fna -q 3771_assembly/contigs.fasta -p nucmer -m -b -o 3771_vs_W3110_ordered
This command tells abacas to align and order the contigs in contigs.fasta against the reference.
-r flag tells abacas where the reference fasta file is
-q points to your contigs
-b flag tells abacas to keep any contigs that don’t align to the reference and
-a puts them on the end of the ordered contigs
-m tells abacas to output into multifasta
-o is the prefix on all output files.
Now lets blast this ordered version of the contigs against the reference
blastn -query 3771_vs_W3110_ordered.fasta -subject refs/Escherichia_coli_K_12_substr__W3110_uid161931.fna -outfmt 6 > 3771_vs_W3110_ordered.out
- Open these files in act now, 3771_vs_W3110_ordered.fasta, 3771_vs_W3110_ordered.out, Escherichia_coli_K_12_substr__W3110_uid161931.fna
How does the genome comparison look now?
Now that we have a reasonable looking assembly let’s annotate it. Luckily there is a tool to do this too!
prokka --locustag 3771 --outdir 3771_ordered_PROKKA --prefix 3771 --rawproduct 3771_vs_W3110_ordered.fasta
Prokka takes the contigs, predicts features such as RNAs and CDS. The CDS are then compared to a database of existing proteins to get the annotation.
–locustag assign the locus_tag prefix 3771
–outdir call the output directory 3771_ordered_PROKKA
–prefix assign 3771 as a prefix to all output files
–rawproduct this keeps the original annotation not the cleaned one for genbank submission (the raw products tend to have more information)
- This output directory contains lots of files, take a look and figure out what they are.
You can now open the annotation in artemis. Try opening 3771_PROKKA/3771.gbk in artemis (the program art). You can find artemis the same way that you found the act program (above) but type art.
What’s wrong with this annotation?
Now try opening the 3771_PROKKA/3771.gff
Alignment and variant calling
At MicrobesNG we always sequence a control (E.coli) on every sequencing run. The researcher that gets does the extractions doesn’t know exactly which strain this is derived from (they think it’s one of two). We want to know which reference is closest to our sample. Further to this, periodically, when we run out of DNA we need to re-extract more DNA. We have samples from two different extractions. we want to see if there are differences between these two controls, are more mutations introduced over time?
The normal method of calling SNPs is to align to the reference (using something like bwa or tophat), sort and manipulate the alignment file (.bam), look for SNPs/indels using a variant caller (like freebayes or varscan). Nick is going to show you some worked examples with these tools.
Today, we are going to use SNIPPY which is essentially a wrapper for the tasks mentioned above (it runs all of the software for you). SNIPPY not only does your alignment but performs variant calling too! We are going to use default parameters when rerun SNIPPY with the exception that we want to save any reads that didn’t map to the reference. Parameters that users might consider tweaking are:
- –mincov is the minimum number of reads covering the variant position.
- –minfrac is the minimum proportion of those reads which must differ from the reference.
Note: You can actually use SNIPPY to create a core genome and create a phylogenetic tree too! If you have multiple samples.
The half of the room next to the window are going to use set_a and the other half of the room will use set_1
If you have set_A
# compare to ecoli ref1 snippy --outdir MG1655_snippy --ref refs/Escherichia_coli_K_12_substr__MG1655_uid57779.fna --unmapped --R1 set_a/3771_1_trimmed.fastq.gz --R2 set_a/3771_2_trimmed.fastq.gz # compare to ecoli ref2 snippy --outdir W3110_snippy --ref refs/Escherichia_coli_K_12_substr__W3110_uid161931.fna --unmapped --R1 set_a/3771_1_trimmed.fastq.gz --R2 set_a/3771_2_trimmed.fastq.gz
If you have set_1
snippy --outdir MG1655_snippy --ref refs/Escherichia_coli_K_12_substr__MG1655_uid57779.fna --unmapped --R1 set_1/9752_1_trimmed.fastq.gz --R2 set_1/9752_2_trimmed.fastq.gz
# compare to ecoli ref2
snippy --outdir W3110_snippy --ref refs/Escherichia_coli_K_12_substr__W3110_uid161931.fna --unmapped --R1 set_1/9752_1_trimmed.fastq.gz --R2 set_1/9752_2_trimmed.fastq.gz
Look in folder W3110_snippy, there are lots of file formats. The formats of interest in this tutorial are:
Bam - is the alignment of all the reads
name.unmapped.fastq.gz is all the reads that did not map to the reference.
snps.tab shows all of your snps in the context of the genome.
Which is a better reference?
- How many SNPs are there?
- How big are the unmapped reads files?
What genes are the SNPs in and are there any differences between
Finally, let’s have a go at opening these in IGV.