Genome-wide Association Studies (GWAS) - John Lees and Sam Sheppard

SEERS GWAS tutorial
John Lees and Sam Sheppard

Using 603 pneumococcal genomes isolated from the nasopharynx of US (MA) children. 224 were resistant or intermediately resistant to penicillin.
Croucher et. al. Nat Gen 6(45) 656–663 http://dx.doi.org/10.1038/ng.2625

We start with genomes assembled by velvet (in seer_data/assemblies) and the list of which were resistant and susceptible to penicillin (seer_data/penicillin.pheno)

  1. counts kmers
    (not run)
    fsm-lite -v -l fsm_files.txt -t tmp_idx -s 10 -S 593 > fsm_kmers.txt
    split -d -n l/16 fsm_kmers.txt fsm_out
    gzip fsm_out*

These first steps count the kmers from the assemblies and split the output into files to facilitate parallelisation later. We use kmers to test for association as these allow us to look for both gene and SNP variation simultaneously, and they don’t rely on a single reference genome.

As this step takes about 4 hours the results have been pre-calculated in seer_data/kmers

  1. correct for population structure, weighted by sequence identity of isolate genomes (without the use of a tree).
    (population structure)

cd seer_data/population_structure

mash sketch -o reference …/assemblies/*.fa

mash dist reference.msh reference.msh > mash_distances.txt

perl ~/seer/scripts/mash2matrix.pl mash_distances.txt > all_distances.csv

perl ~/seer/scripts/R_mds.pl -d all_distances.csv -p …/penicillin.pheno -o projection

  1. Create a minhash of all the assemblies
  2. Calculate the pairwise distance between all samples
  3. Convert the mash output into a distance matrix
  4. Project the 603x603 distance matrix into 3 dimensions

Bacteria are clonal, and so a simple test of association will bring up all variants in a lineage associated with the phenotype (penicillin resistance) as well as the causal variants (SNPs in PVP genes). By assessing the similarity between samples we can approximate this population structure, and control for it in seer.

  1. Run all regressions kmer presence/absence Vs Phenotype
    (seer)
    cd …/run_seer
    seer -k …/kmers/fsm_out00.gz -p …/penicillin.pheno --struct …/population_structure/projection --threads 8 --maf 0.05 > seer00.txt

This runs the association. It uses the counted kmers -k, the phenotype -p and the population structure calculated above --struct. We also filter out low frequency kmers, and multithread on the vm for speed.

This command can be run in a reasonable amount of time (<20 minutes) to get the output from one of the kmer files. Using less -S, have a look in the output file to see what seer generates as output.

This can then be filtered to just those kmers positively associated with resistance
filter_seer -k seer00.txt --pos_beta > seer00_filtered.txt

(or use answers in demo: filter_seer –k answers/00/stout –pos_beta

(not run)
parallel --results answers -j 8 seer -k …/kmers/fsm_out{}.gz -p …/penicillin.pheno --struct …/population_structure/answers/projection --maf 0.05 ::: 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15
for i in $(find . -name stdout); do filter_seer -k $i --pos_beta | sed ‘1d’ >> seer_filtered.txt; done

This runs all the seer jobs in parallel on the climb vm and then filters them. This takes about two hours so results are provided in answers/seer_filtered.txt

(downstream)
cd …/downstream

~/seer/scripts/hits_to_fastq.pl -k …/run_seer/answers/seer_filtered.txt -b 10e-8 > seer_filtered.fastq

bowtie2 -p 8 --reorder -q -U seer_filtered.fastq --ignore-quals -D 24 -R 3 -N 0 -L 7 -i S,1,0.50 -x Streptococcus_pneumoniae_ATCC_700669_v1 -S seer_filtered.sam

~/seer/scripts/mapping_to_phandango.pl -s seer_filtered.sam -k …/run_seer/answers/seer_filtered.txt -m 20 > climb_phandango.plot

Less climb_phandango.plot

  1. Converts significant kmers to a fastq for mapping
  2. Maps significant kmers to a reference genome
  3. Visualise as a Manhattan plot in phandango

Note the message from bowtie2 -> if kmers are unmapped another reference can easily be mapped to without repeating any of the long steps above. Therefore the pan-genome has been tested, not just the core

  1. on local machine
    (visualisation - local)
    Copy the gff and plot file locally and drag onto http://jameshadfield.github.io/phandango/
    scp ubuntu@137.205.69.99:~/seer_data/downstream/climb_phandango.plot Desktop
    scp ubuntu@137.205.69.99:~/seer_data/downstream/Streptococcus_pneumoniae_ATCC_700669_v1.gff Desktop

The four biggest peaks respectively are pbp2x, orf15 (in the ICE), pbp1a, pbp2b
All three penicillin binding proteins!
Bonus question for fans of the pneumococcus: why does the ICE appear to be associated with penicillin resistance?

Where is the seer executable?

It’s on the GWAS tutorial image - you need a custom server rather than GVL for this tutorial. The name of the image is CLIMB-GWAS-Tutorial.

1 Like

I think I need a username and password to get into the server once it’s launched…?

Is anyone else getting an error status when launching this?

Username will be ubuntu and you should use the keypair that you supplied to login. Follow the tutorial here for more details on SSH keys.

David – please try again later and/or launch in Birmingham while we free up some space in Warwick!

1 Like

Did you get it sorted?

No luck today from either bham or warwick. Server launches successfully but the status remains as “Error”

Try again on Warwick, should be good now!

Better. I can launch a server ok but now I’m getting prompted for a password? To confirm it should be:

ssh ubuntu@ipaddress
as the login and then it should automatically use the ssh key right? I managed a GVL without any issue.

Cheers
D

I’ve just had a look at this instance and there’s definitely a key present, likely one you specified as the attached info looks similar to your username!

You can specify an ssh key using the -i flag with ssh, so:

HTH

Perfect! Worked a charm. Thanks for that.

The ssh command also automagically searches for keys in a hidden folder in your home directory named “.ssh”
(the preprended . hides it from the standard ls - list directory contents - command, you can see hidden files/folders with ls -a)

To be found by ssh, the filename of your key has to be id_rsa. If a file called id_rsa is found in the .ssh directory, its automatically tried first - no need to specify an identity (key) file with ssh -i.

If you’re into that kind of thing, you can observe the interaction between your client and the host (VM) if you add the -v (verbose) flag to your ssh command. So, with your key called id_rsa in the .ssh directory in your home directory, you can run:

to use your default key and work out exactly what’s happening (or going wrong!).

Cheers Matt, Thought i’d deleted the old ones but it turned out I had a second key in the known_hosts file that was causing the issue. Like the verbose flag, definitely into that sort of thing!
Thanks again!
David

I can’t even get the disc image to load. Any suggestions?

Thanks

Mark

Hi Mark, can you explain a bit more about the behaviour you’re seeing and the steps I need to take to reproduce it?

With the settings as for below:

Nothing happens when I press the Launch Server button. Sometimes the graphic in the button spins thereafter, other times it spins once and then stops. Either way nothing else happens. I was able to get a GVL started but terminated it when I realised I needed the tutorial image.

I should add that I use Safari on a Mac. I tried Google Chrome in case it was a browser issue.

Mark

Great, thanks for the updated info Mark.

Based on the screenshot above, it looks like you haven’t created an SSH key for accessing your new instance. This goes in the final dropdown in the “Launch a custom server” dialog.

Thanks for bringing this to our attention, we’ll add some better signposting for this issue in Bryn shortly.