Salmonella enterica Enteritidis surveillance

Quality control

All sequences submitted to AusTrakka undergo quality assessment to determine the suitability of the sequences for further analysis.

| QC metric | Criteria | Bioinformatics tool |
| --- | --- | --- |
| Species observed in sequence data | Most abundant species Salmonella enterica | kraken2 with latest pluspf database |
| Serotype detected | Enteritidis | sistr |
| Estimated genome size (reads) | 4.7 Mbp +/- 10% | kmc |
| Assembled genome size | 4.7 Mbp +/- 10% | seqkit stats |
| Estimated average depth of coverage | 40x | number reads / estimated genome size (reads) |

Sequences which pass these criteria are used in downstream analysis.
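
The criteria above can be expressed as a simple pass/fail check. The following is a minimal sketch rather than the pipeline's own code: the field names (`top_species`, `serotype`, `estimated_genome_size`, `assembled_genome_size`, `depth`) are hypothetical, and it assumes that the 40x criterion is a minimum depth.

```python
EXPECTED_GENOME_SIZE = 4_700_000  # 4.7 Mbp, from the criteria table
SIZE_TOLERANCE = 0.10             # +/- 10%
MIN_DEPTH = 40.0                  # assumption: 40x is treated as a minimum


def within_tolerance(size_bp, expected=EXPECTED_GENOME_SIZE, tol=SIZE_TOLERANCE):
    """True if size_bp lies within expected +/- tol * expected."""
    return abs(size_bp - expected) <= expected * tol


def passes_qc(sample):
    """Return (passed, failures) for one sample.

    `sample` is a dict with hypothetical keys; the real pipeline output
    may be structured differently.
    """
    failures = []
    if sample["top_species"] != "Salmonella enterica":
        failures.append("top_species")
    if sample["serotype"] != "Enteritidis":
        failures.append("serotype")
    if not within_tolerance(sample["estimated_genome_size"]):
        failures.append("estimated_genome_size")
    if not within_tolerance(sample["assembled_genome_size"]):
        failures.append("assembled_genome_size")
    if sample["depth"] < MIN_DEPTH:
        failures.append("depth")
    return (not failures, failures)


good = {
    "top_species": "Salmonella enterica",
    "serotype": "Enteritidis",
    "estimated_genome_size": 4_650_000,
    "assembled_genome_size": 4_710_000,
    "depth": 62.5,
}
# Same sample, but with an oversized assembly and low depth.
bad = dict(good, assembled_genome_size=5_400_000, depth=25.0)

print(passes_qc(good))
print(passes_qc(bad))
```

Returning the list of failed metrics, rather than a bare boolean, makes it easier to report why a sequence was excluded.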

Single sample results

MLST and detection of AMR mechanisms (using abritamr) are also undertaken on all sequences which pass the quality assessment.

Core genome MLST

Core genome MLST (cgMLST) is used to group sequences into cgMLST groups for further detailed analysis. Grouping sequences in this way maximises core genome recovery within each group, which supports high-resolution interpretation of relatedness.

cgMLST, as implemented by the ANAT, uses a scheme of 3016 genes and the allele calling tool chewBBACA to detect profiles.

Only sequences which have >90% of alleles detected are used in further analysis.
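
As an illustration of this filter, the sketch below computes the fraction of loci with a detected allele for one profile. It assumes a hypothetical in-memory representation in which a profile is a list with one entry per locus and `None` marks a locus with no allele detected; chewBBACA's actual output format is not reproduced here.

```python
MIN_FRACTION_DETECTED = 0.90  # ">90% of alleles detected"
SCHEME_SIZE = 3016            # number of genes in the scheme


def fraction_detected(profile):
    """Fraction of loci in the profile with a detected allele."""
    if not profile:
        raise ValueError("profile has no loci")
    called = sum(1 for allele in profile if allele is not None)
    return called / len(profile)


def passes_allele_filter(profile):
    """True if strictly more than 90% of loci have an allele detected."""
    return fraction_detected(profile) > MIN_FRACTION_DETECTED


# With 3016 loci, at least 2715 must be detected to exceed 90%.
profile_ok = [1] * 2715 + [None] * 301
profile_low = [1] * 2714 + [None] * 302

print(passes_allele_filter(profile_ok))
print(passes_allele_filter(profile_low))
```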

Distances between profiles are calculated with cgmlst-dists (v1.2.0) and these distances are clustered using hierarchical complete linkage with a 50 allele threshold. This method identifies groups where all sequences within the group are <= 50 alleles from every other sequence in the group.
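
To illustrate what complete linkage with a threshold means in practice, here is a small pure-Python sketch. It is not the code used in the pipeline: the function name and the toy distance matrix are made up for the example, and a production analysis would more likely use an established clustering library.

```python
from itertools import combinations


def complete_linkage_clusters(names, dist, threshold=50):
    """Agglomerative complete-linkage clustering cut at `threshold`.

    `dist` is a symmetric matrix (list of lists) of allele distances,
    indexed in the same order as `names`.
    """
    clusters = [[i] for i in range(len(names))]

    def linkage(a, b):
        # Complete linkage: distance between the two furthest members.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters under complete linkage.
        (x, y), d = min(
            (
                ((x, y), linkage(clusters[x], clusters[y]))
                for x, y in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if d > threshold:
            break  # no remaining merge keeps every pair within the threshold
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y

    return [sorted(names[i] for i in c) for c in clusters]


names = ["A", "B", "C", "D"]
dist = [
    [0, 10, 45, 200],
    [10, 0, 60, 200],
    [45, 60, 0, 200],
    [200, 200, 200, 0],
]
groups = complete_linkage_clusters(names, dist, threshold=50)
print(groups)  # [['A', 'B'], ['C'], ['D']]
```

In this toy example C is within 50 alleles of A but 60 alleles from B, so it cannot join the A/B group: every member of a group must be within the threshold of every other member.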

Core SNP analysis

For each cgMLST group (cgT) which contains >= 5 sequences, a comparative core genome SNP analysis is undertaken using the reference genome NC_011294.1 (Salmonella enterica subsp. enterica serovar Enteritidis str. P125109).

Paired-end reads are aligned, variants identified and a core genome calculated using snippy. Pairwise SNP distances and phylogenetic trees are calculated using snp-dists.

Interpretation of genomic relatedness

For each cgT, hierarchical single-linkage clustering with a 5 SNP threshold is used to determine the degree of genomic relatedness. This threshold has been shown to identify epidemiologically relevant relationships. Single-linkage clustering identifies groups of sequences which are <= 5 SNPs from at least one other sequence in the cgT.
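
Single-linkage clustering at a fixed threshold is equivalent to finding connected components in a graph where two sequences are joined if they are within the threshold. The sketch below shows this with a small union-find structure; the function name and the toy SNP distance matrix are illustrative assumptions, not part of the pipeline.

```python
def single_linkage_clusters(names, dist, threshold=5):
    """Group sequences linked by chains of pairwise distances <= threshold.

    `dist` is a symmetric matrix (list of lists) of SNP distances,
    indexed in the same order as `names`.
    """
    parent = list(range(len(names)))

    def find(i):
        # Follow parent pointers to the root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if dist[i][j] <= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return sorted(sorted(g) for g in groups.values())


names = ["s1", "s2", "s3", "s4"]
dist = [
    [0, 3, 7, 30],
    [3, 0, 4, 30],
    [7, 4, 0, 30],
    [30, 30, 30, 0],
]
clusters = single_linkage_clusters(names, dist, threshold=5)
print(clusters)  # [['s1', 's2', 's3'], ['s4']]
```

Here s1 and s3 are 7 SNPs apart but fall in the same cluster because each is within 5 SNPs of s2. This chaining behaviour is what distinguishes single linkage from the complete linkage used for cgMLST grouping.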

Software versions

All tools (except cgMLST) are implemented within the bohra pipeline v 3.4.3

| Tool | Version | Reference | Database |
| --- | --- | --- | --- |
| chewBBACA | v 2.16.0 | Publication - https://academic.oup.com/nar/article/49/D1/D660/5929238?login=false; Tool - https://chewbbaca.readthedocs.io/en/latest/user/getting_started/overview.html | https://zenodo.org/records/1323684 downloaded 2023-10-01 |
| seqkit | seqkit v2.12.0 | Publication - https://doi.org/10.1002/imt2.191; Tool - https://bioinf.shenwei.me/seqkit/ | |
| kmc | K-Mer Counter (KMC) ver. 3.2.4 (2024-02-09) | Publication - https://doi.org/10.1093/bioinformatics/btx304; Tool - https://github.com/refresh-bio/KMC | |
| shovill | shovill 1.4.2 | Tool - https://github.com/tseemann/shovill | |
| kraken2 | Kraken version 2.17.1 | Publication - https://doi.org/10.1186/s13059-019-1891-0; Tool - https://github.com/DerrickWood/kraken2 | /opt/resources/k2_pluspfp_20240605 |
| abritamr | abritamr 1.0.20 | Publication - https://doi.org/10.1038/s41467-022-35713-4; Tool - https://zenodo.org/records/12514579 | |
| mobsuite | mob_recon 3.1.9 | Publication - https://doi.org/10.1099/mgen.0.000435; Tool - https://github.com/phac-nml/mob-suite | |
| mlst | mlst 2.19 | Tool - https://github.com/tseemann/mlst | |
| sistr | sistr 1.1.1 | Tool - https://github.com/phac-nml/sistr_cmd | |