Bacterial sequence reads supporting two nucleotides at same location

ambi1999 · February 24, 2017, 2:27pm

Hi,

We have sequenced four different mutated forms of a bacterium (Pseudomonas aeruginosa), and the task is to find the SNPs that are exclusive to each sample. As a first attempt, we aligned the reads to a reference genome. We then did a de novo assembly (using SPAdes) and aligned the reads to that assembly. In both cases, at quite a few locations more than one nucleotide is supported by many reads. At these locations the reads only ever support two nucleotides, never three. For example, below is the read count from IGV for one such location.

CP000744.1:54,471

Total count: 439; A: 265 (60%, 121+, 144-); C: 0; G: 0; T: 174 (40%, 98+, 76-); N: 0

How should we interpret this considering that bacteria are haploid?

The first interpretation could be a sequencing error, with the correct base being A. A sequencing error seems unlikely to me because the counts are so high (265 and 174) and because the same pattern is repeated at other locations as well.
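To put a rough number on the sequencing-error argument, here is a minimal sketch. It assumes a per-base error rate of 1% (an illustrative figure, not something measured for this run) and treats every read as an independent trial, with all errors producing the same wrong base, which is generous to the error hypothesis. The helper name `log10_binom_tail` is made up for this example.

```python
import math

def log10_binom_tail(n, k, p):
    """Return log10 of P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    log_terms = []
    for i in range(k, n + 1):
        # log of C(n, i) via log-gamma, to avoid huge intermediate integers
        log_comb = math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        log_terms.append(log_comb + i * math.log(p) + (n - i) * math.log1p(-p))
    # log-sum-exp keeps the tiny probabilities from underflowing to zero
    m = max(log_terms)
    log_sum = m + math.log(sum(math.exp(t - m) for t in log_terms))
    return log_sum / math.log(10)

# 174 of 439 reads show T; ASSUMED error rate of 1% per base
print(log10_binom_tail(439, 174, 0.01))
```

Under these assumptions the result is a very large negative exponent, so random base-calling error alone is not a plausible explanation for a 60/40 split at this depth. Systematic artifacts (for example mismapped reads from a repeated region) are not covered by this calculation.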

The second interpretation could be contamination, with more than one type of cell present in the sample. This may be a possibility, but I first want to make sure that I am not missing some other explanation.

PS: I have also asked this question on Biostars.
https://www.biostars.org/p/238739/

Thanks for reading the post,
Ambi.

mbeale · February 24, 2017, 5:17pm

Was this sequencing done on a single colony pick or on a population? In other words, option 3: could you have intrasample diversity?

ambi1999 · February 28, 2017, 1:20am

Hi mbeale,

Thanks for your reply. Due to the unstable nature of the small colony variants, we had to use more than one colony for each isolate. This means that even though we used a pre culture, there might be some clonal variation within that pure culture because several colonies went into it. So intrasample diversity is one possibility, but I am not sure how to interpret the results with regard to SNPs exclusive to a sample. As an example, the following are the read counts at one location for the four samples. The reference at this location is T. My task is to find the SNPs that are exclusive to each sample.

Reference at Location 54453 is T

SAMPLE 1: WILD TYPE

CP000744.1:54,453

Total count: 294; A: 0; C: 131 (45%, 53+, 78-); G: 1 (0%, 0+, 1-); T: 162 (55%, 95+, 67-); N: 0

SAMPLE 2: MUTATED FORM 1

Total count: 440; A: 2 (0%, 2+, 0-); C: 264 (60%, 119+, 145-); G: 0; T: 174 (40%, 100+, 74-); N: 0

SAMPLE 3: MUTATED FORM 2

Total count: 239; A: 0; C: 86 (36%, 42+, 44-); G: 0; T: 153 (64%, 92+, 61-); N: 0

SAMPLE 4: MUTATED FORM 3

Total count: 231; A: 0; C: 91 (39%, 48+, 43-); G: 0; T: 140 (61%, 73+, 67-); N: 0

In “MUTATED FORM 1” C is 60% and T is 40%, while in all other samples (wild type, mutated form 2 and mutated form 3) T is in the majority (55-64%) and C in the minority (36-45%). How should we interpret these results? Could we say that at this location “MUTATED FORM 1” has a SNP while the other three samples do not?

Could the interpretation be that at this location a pure cell of “MUTATED FORM 1” should have C, while the other samples should have T? The T reads present in “MUTATED FORM 1” would then be contamination, meaning they come from cells that actually belong to one of the other types (wild type, MUTATED FORM 2 or MUTATED FORM 3).
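One way to check whether the C/T split in “MUTATED FORM 1” really differs from the other samples is a 2x2 chi-square test on the read counts listed above. The following is only a sketch: it assumes reads are independent observations (not strictly true for sequencing data), ignores the handful of A and G reads, and uses a hand-written helper `chi2_2x2` (a name made up here) instead of a statistics library.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (1 d.f., no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# (C reads, T reads) at CP000744.1:54,453, copied from the IGV counts above
counts = {
    "wild type": (131, 162),
    "mutated form 1": (264, 174),
    "mutated form 2": (86, 153),
    "mutated form 3": (91, 140),
}

ref_c, ref_t = counts["mutated form 1"]
for name, (c_reads, t_reads) in counts.items():
    if name == "mutated form 1":
        continue
    chi2, p = chi2_2x2(ref_c, ref_t, c_reads, t_reads)
    print(f"mutated form 1 vs {name}: chi2={chi2:.2f}, p={p:.2g}")
```

A small p-value here would only say that the allele proportions differ between two samples. It would not by itself show that C is fixed in “MUTATED FORM 1”, since both alleles are present at substantial frequency in all four samples, including the wild type.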

Thx,
Ambi.