Salmonella outbreak

Outbreak scenario

First we will start with the outbreak scenario and data analysis tutorial, kindly contributed by Phil Ashton (previously of Public Health England).

Jump over to the tutorial documentation.

NGS data analysis using Nullarbor

We will try to reconstruct some of the analysis from this paper using the Nullarbor pipeline:

Quick, Ashton, et al.

The Nullarbor pipeline has the following steps:

  1. Read trimming
  2. De novo assembly
  3. MLST gene calling
  4. Mapping to reference
  5. Variant calling
  6. Antibiotic resistance gene detection
  7. Tree building
  8. Pan-genome construction

Please also refer to this tutorial for more details.


The first task is to get read files from the short read archive.

Find the project accession number from the paper (see link above), and then find the project page on the European Nucleotide Archive - Short Read Archive.

I'm stuck!

The link to the project page is:

What format are the files in?

Download one to the server. How many reads are there?


Hint: wget is useful for downloading files directly to the server.


wc -l is good for counting lines, but remember to unzip first. Each FASTQ read takes up four lines.
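As a minimal sketch of the read-counting hint, the snippet below decompresses a gzipped FASTQ file to a pipe (so nothing extra is written to disk), counts the lines and divides by four. The function name count_reads and the tiny two-read demo file are made up for illustration; point the function at a file you have downloaded instead.

```shell
# Count reads in a gzipped FASTQ file.
# Assumes the standard layout of four lines per read.
count_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Demo on a made-up file containing two reads (hypothetical data).
demo_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip > "$demo_dir/demo_1.fastq.gz"

count_reads "$demo_dir/demo_1.fastq.gz"
```

Streaming through gzip -dc avoids keeping an uncompressed copy of each read file on the server.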

Can you think of a way of downloading all the files in one go?


Hint: Use the TEXT view. UNIX scripting might help you. Consider using cut to pull out columns of interest.

What if I just wanted to download the files from run 2 (these are denoted by file names beginning with ‘2_’)?


Hint: use grep

Download the files


You could just do this the slow way …

Advanced way

Or a more advanced way: cut -f12 filereport.txt | grep '/2_' | tr ';' '\n' | xargs -L 1 wget. What does each command do?
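To see what each stage of such a pipeline contributes, it can help to run it on a small mock file first. The sketch below is only an illustration: the two-column report, the accessions RUN_A and RUN_B, the ftp.example.org addresses and the function name list_run2_urls are all made up, and the real report keeps the FASTQ locations in a different column (column 12 in the command above). The last line prints the wget commands rather than running them.

```shell
demo_dir=$(mktemp -d)
report="$demo_dir/filereport.txt"

# Hypothetical report: run accession, then ';'-separated FASTQ locations.
printf 'run_accession\tfastq_ftp\n' > "$report"
printf 'RUN_A\tftp.example.org/a/1_S1_1.fastq.gz;ftp.example.org/a/1_S1_2.fastq.gz\n' >> "$report"
printf 'RUN_B\tftp.example.org/b/2_S2_1.fastq.gz;ftp.example.org/b/2_S2_2.fastq.gz\n' >> "$report"

list_run2_urls() {
    # cut:  keep only the column holding the file locations
    # grep: keep rows whose file names start with 2_
    # tr:   put each ';'-separated location on its own line
    cut -f2 "$1" | grep '/2_' | tr ';' '\n'
}

list_run2_urls "$report"

# Dry run: xargs -L 1 hands one line at a time to the command that follows.
list_run2_urls "$report" | xargs -L 1 echo wget
```

Once the printed commands look right, dropping the echo would start the downloads.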

To run Nullarbor, you need a reference file.

Where will you find one from?


The reference file should be in FASTA (or Genbank) format - how do you convert it?


seqret is good for this

Make an input file - refer to the Nullarbor documentation first

Nullarbor requires a specially formatted ‘input file’. Can you figure out how to make one?


The file needs three tab-separated columns: isolate name, first read file (R1) and second read file (R2).

Advanced way

ls *_1.fastq.gz | sed 's/_1\.fastq\.gz$//' | awk '{ printf("%s\t%s_1.fastq.gz\t%s_2.fastq.gz\n", $1, $1, $1) }' >
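The same three-column table can also be built with a plain loop, which makes it easy to warn about an isolate whose second read file is missing. This is a sketch under assumptions: the isolate names isoA and isoB, the empty placeholder files, the function name make_input_table and the output name samples.tab are all hypothetical, and the files are assumed to follow the NAME_1.fastq.gz / NAME_2.fastq.gz naming used above.

```shell
demo_dir=$(mktemp -d)

# Empty placeholder read files for two hypothetical isolates.
touch "$demo_dir/isoA_1.fastq.gz" "$demo_dir/isoA_2.fastq.gz" \
      "$demo_dir/isoB_1.fastq.gz" "$demo_dir/isoB_2.fastq.gz"

make_input_table() {
    for r1 in "$1"/*_1.fastq.gz; do
        # Isolate name = file name without the _1.fastq.gz suffix.
        name=$(basename "$r1" _1.fastq.gz)
        r2="$1/${name}_2.fastq.gz"
        if [ -e "$r2" ]; then
            printf '%s\t%s\t%s\n' "$name" "$r1" "$r2"
        else
            echo "warning: no second read file for $r1" >&2
        fi
    done
}

# samples.tab is a made-up output name.
make_input_table "$demo_dir" > "$demo_dir/samples.tab"
cat "$demo_dir/samples.tab"
```

Checking the result with cat (or cut -f1) before starting the pipeline is a cheap way to catch a missing or misnamed file.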

Now run Nullarbor

Hint: --name salmonella10 --mlst senterica --ref GCF_000006945.1_ASM694v1_genomic.fna --input --outdir 10genomes

Now run the Makefile as instructed.

How will you ensure that Nullarbor will continue to run even if you lose connection?


screen or nohup are good for this
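A minimal sketch of the nohup approach is shown below. The sleep-and-echo command is only a stand-in for the long-running step (in practice, the command Nullarbor tells you to run), and the log file name run.log is made up. Output is redirected to a log so it can be inspected after reconnecting.

```shell
demo_dir=$(mktemp -d)

# Stand-in for the long-running pipeline step (hypothetical command).
# nohup keeps the job alive after the terminal closes; '&' runs it in the background.
nohup sh -c 'sleep 1; echo "pipeline finished"' > "$demo_dir/run.log" 2>&1 &
job_pid=$!

# In a real session you would log out here and check the log later.
wait "$job_pid"
cat "$demo_dir/run.log"
```

With screen the idea is similar: start a named session, launch the job inside it, detach, and reattach later to check progress.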

Interpreting Nullarbor output

Here are the Nullarbor results from 10 Salmonella genomes from this project:

10 Salmonella genomes

Shigella genomes - examples of AMR genes

Emily also has some Pseudomonas examples from the burns unit!

Sequence data

What is the average depth of coverage across the runs?

How does yield relate to depth of coverage?
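As a rough guide, average depth is approximately the yield (total sequenced bases) divided by the genome size. The sketch below uses made-up round numbers, not values from the report: a yield of 250 million bases and a genome of 5 million bases. The function name depth_of_coverage is also hypothetical.

```shell
# Approximate average depth = yield (bases) / genome size (bases).
depth_of_coverage() {
    awk -v yield="$1" -v genome="$2" 'BEGIN { printf "%.1f\n", yield / genome }'
}

# Hypothetical numbers: 250 Mb of sequence over a 5 Mb genome.
depth_of_coverage 250000000 5000000
```

This is only an estimate, since reads that fail to map or that come from contaminants add to yield without adding to depth on the reference.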

Why are some marked with a ? for quality?

Species identification

Are the results as expected?

Why are multiple species identified? Are they expected?

Are they the same for each isolate? Why are they like this?


What do the MLST results tell us?


Is there evidence of antibiotic resistance?

Genome sizes

Are the genome assemblies the same length? Why might they be different?

Core genome

Does the percentage of aligned bases vary? Why?

Core SNP phylogeny

How do you interpret the phylogenetic tree? Are any isolates very different? How different?

SNP distance

Do any regions of the genome have more mutations than others? When might this occur?

Pan genome

How do you interpret the pan genome plot? What would you like to know now? What files produced by Nullarbor help interpret this output?

Why does the pan genome phylogeny differ from the core genome phylogeny? Does that affect your interpretation of the outbreak?

Software versions

Why is it important to record the software versions?