Salmonella outbreak
Outbreak scenario
We start with the outbreak scenario and data analysis tutorial, kindly contributed by Phil Ashton (previously of Public Health England).
Jump over to the tutorial documentation.
NGS data analysis using Nullarbor
We will try to reconstruct some of the analysis for this paper using the Nullarbor pipeline:
The Nullarbor pipeline has the following steps:
- Read trimming
- De novo assembly
- MLST gene calling
- Mapping to reference
- Variant calling
- Antibiotic resistance gene detection
- Tree building
- Pan-genome construction
Please also refer to this tutorial for more details.
Salmonella
The first task is to get read files from the short read archive.
Find the project accession number from the paper (see link above), and then find the project page on the European Nucleotide Archive - Short Read Archive.
I'm stuck!
The link to the project page is: http://www.ebi.ac.uk/ena/data/view/PRJEB7465
What format are the files in?
Download one to the server. How many reads does it contain?
Hint: wget is useful for downloading files directly to the server.
Hint: wc -l is good for counting lines, but remember to unzip first. A FASTQ record normally spans four lines, so divide the line count by four.
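One way to put the hint above into practice is to stream the compressed file through wc -l rather than unzipping it on disk. This is only a sketch: the file demo_1.fastq.gz and its three reads are made up so the commands run offline, and the division by four assumes the usual four-line FASTQ layout.

```shell
# Count reads in a gzipped FASTQ file without leaving an unzipped copy on disk.
# Assumes exactly four lines per record (header, sequence, "+", qualities).
count_reads() {
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Tiny made-up file (three reads); substitute one of the downloaded files.
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' \
    | gzip > "$demo_dir/demo_1.fastq.gz"

n_reads=$(count_reads "$demo_dir/demo_1.fastq.gz")
echo "reads: $n_reads"
```

If the line count is not a multiple of four, the file may be truncated or wrapped, and the result should not be trusted.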
Can you think of a way of downloading all the files in one go?
Hint: Use the TEXT view. UNIX scripting might help you. Consider using cut to pull out columns of interest.
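As a sketch of the cut approach, the column holding the download links can be looked up by name in the header row instead of being counted by hand. The three-column report, the ftp.example.org addresses and the run names below are invented so the example runs offline; the column name fastq_ftp is an assumption about the report header, so check your own file.

```shell
# Made-up stand-in for the TEXT report (tab-separated, with a header row).
report=$(mktemp)
printf 'run_accession\tfastq_ftp\tfastq_bytes\n' > "$report"
printf 'RUN_A\tftp.example.org/a/1_A_1.fastq.gz;ftp.example.org/a/1_A_2.fastq.gz\t10;11\n' >> "$report"
printf 'RUN_B\tftp.example.org/b/2_B_1.fastq.gz;ftp.example.org/b/2_B_2.fastq.gz\t12;13\n' >> "$report"

# Find the column number from the header rather than hard-coding it.
col=$(head -n 1 "$report" | tr '\t' '\n' | grep -n -x 'fastq_ftp' | cut -d: -f1)

# Skip the header, keep that column, and put each address on its own line.
urls=$(tail -n +2 "$report" | cut -f "$col" | tr ';' '\n')
echo "$urls"
```

Each line of the output could then be handed to wget, one address at a time.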
What if I only wanted to download the files from run 2 (these are denoted by file names beginning with ‘2_’)?
Hint: use grep.
Download the files
Hint
You could just do this the slow way …
Advanced way
cut -f12 filereport.txt | grep '/2_' | tr ';' '\n' | xargs -L 1 wget
What do these commands do?
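A safe way to explore a pipeline like this is a dry run, with echo placed in front of wget so the commands are printed instead of executed. The sketch below uses an invented two-line report with twelve tab-separated columns and made-up ftp.example.org addresses; only the twelfth column matters here.

```shell
# Made-up report: eleven placeholder columns, then the file addresses.
report=$(mktemp)
printf 'c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc10\tc11\t%s\n' \
    'ftp.example.org/a/1_A_1.fastq.gz;ftp.example.org/a/1_A_2.fastq.gz' \
    'ftp.example.org/b/2_B_1.fastq.gz;ftp.example.org/b/2_B_2.fastq.gz' > "$report"

# cut   keeps column 12 only
# grep  keeps lines whose file names start with "2_" (after a "/")
# tr    splits the ";"-separated pair onto separate lines
# xargs runs one command per line; echo prints it instead of downloading
planned=$(cut -f12 "$report" | grep '/2_' | tr ';' '\n' | xargs -L 1 echo wget)
echo "$planned"
```

Removing echo turns the dry run into the real download.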
To run Nullarbor, you need a reference file.
Where will you find one from?
The reference file should be in FASTA (or GenBank) format. How do you convert it?
Hint: seqret is good for this.
Make an input file - refer to the Nullarbor documentation first
Nullarbor requires a specially formatted ‘input file’. Can you figure out how to make one?
Hint: The file needs three tab-separated columns: isolate name, first read file of the pair, and second read file of the pair.
Advanced way
ls *_1.fastq.gz | sed 's/_1\.fastq\.gz$//' | awk '{ printf("%s\t%s_1.fastq.gz\t%s_2.fastq.gz\n", $1, $1, $1) }' > allinput.tab
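A slightly more defensive variant of the same idea is sketched below. It writes a row only when both files of a pair are present, and it records full paths. The directory, the empty placeholder files and the output name checked_input.tab are all invented for illustration.

```shell
# Made-up reads directory: isolate 2_A has both files, 2_B is missing its mate.
reads_dir=$(mktemp -d)
touch "$reads_dir/2_A_1.fastq.gz" "$reads_dir/2_A_2.fastq.gz" "$reads_dir/2_B_1.fastq.gz"

input_tab="$reads_dir/checked_input.tab"
for r1 in "$reads_dir"/*_1.fastq.gz; do
    isolate=$(basename "$r1" _1.fastq.gz)      # strip the read-1 suffix
    r2="$reads_dir/${isolate}_2.fastq.gz"
    if [ -e "$r2" ]; then
        # isolate name, first read file, second read file (tab-separated)
        printf '%s\t%s\t%s\n' "$isolate" "$r1" "$r2"
    else
        echo "skipping $isolate: $r2 not found" >&2
    fi
done > "$input_tab"

cat "$input_tab"
```

Skipped isolates are reported on standard error, so the table itself stays clean.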
Now run Nullarbor
Hint
nullarbor.pl --name salmonella10 --mlst senterica --ref GCF_000006945.1_ASM694v1_genomic.fna --input 10input.tab --outdir 10genomes
Now run the Makefile as instructed.
How will you ensure that Nullarbor will continue to run even if you lose connection?
Hint: screen or nohup are good for this.
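The nohup route might look like the sketch below. A one-second stand-in job replaces the real pipeline so the example finishes quickly; in practice you would substitute whatever command Nullarbor tells you to run. The log file name is arbitrary.

```shell
log=$(mktemp)

# Stand-in for the long-running step (assumption: replace with the real command).
# nohup stops the job being killed on hangup; "&" puts it in the background;
# the redirection collects both output and errors in the log file.
nohup sh -c 'sleep 1; echo "pipeline finished"' > "$log" 2>&1 &
job_pid=$!
echo "started job $job_pid, logging to $log"

# Only for this demo: wait so the log can be shown. Normally you would log out
# and check the log later, for example with: tail -f "$log"
wait "$job_pid"
tail -n 1 "$log"
```

With screen, the equivalent is to start a session, launch the job inside it, detach, and reattach later to check progress.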
Interpreting Nullarbor output
Here are the Nullarbor results from 10 Salmonella genomes from this project:
Shigella genomes - examples of AMR genes
Emily also has some Pseudomonas examples from the burns unit!
Sequence data
What is the average depth of coverage across the runs?
How does yield relate to depth of coverage?
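As a rough guide, depth of coverage is the yield (total sequenced bases) divided by the genome length. The figures below are hypothetical round numbers chosen for illustration, not values from the report; use the yield Nullarbor reports and the length of your own reference.

```shell
# depth = total sequenced bases (yield) / genome length
yield_bases=250000000    # hypothetical yield for one isolate
genome_bases=5000000     # assumed round figure for a Salmonella-sized genome

depth=$(awk -v y="$yield_bases" -v g="$genome_bases" 'BEGIN { printf "%.1f", y / g }')
echo "approximate depth: ${depth}x"
```

This is an average, so individual regions can still be covered much more or less deeply.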
Why are some marked "?" for quality?
Species identification
Are the results as expected?
Why are multiple species identified? Are they expected?
Are they the same for each isolate? Why are they like this?
MLST
What do the MLST results tell us?
Antibiogram
Is there evidence of antibiotic resistance?
Genome sizes
Are the genome assemblies the same length? Why might they be different?
Core genome
Does the % aligned bases vary? Why?
Core SNP phylogeny
How do you interpret the phylogenetic tree? Are any isolates very different? How different?
SNP distance
Do any regions of the genome have more mutations than others? When might this occur?
Pan genome
How do you interpret the pan genome plot? What would you like to know now? What files produced by Nullarbor help interpret this output?
Why does the pan genome phylogeny differ from the core genome phylogeny? Does that affect your interpretation of the outbreak?
Software versions
Why is it important to record the software versions?