Salmonella enterica Enteritidis surveillance

Quality control

All sequences submitted to AusTrakka undergo quality assessment to determine the suitability of the sequences for further analysis.

QC metric	Criteria	Bioinformatics tool
Species observed in sequence data	Most abundant species Salmonella enterica	kraken2 with latest pluspf database
Serotype detected	Enteritidis	sistr
Estimated genome size (reads)	4.7 Mbp +/- 10%	kmc
Assembled genome size	4.7 Mbp +/- 10%	seqkit stats
Estimated average depth of coverage	40x	number of reads / estimated genome size (reads)

Sequences which pass these criteria are used in downstream analysis.
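
The criteria in the table above can be sketched as a simple pass/fail check. This is a minimal illustration only, not the pipeline's implementation: the `SampleQC` record and its field names are hypothetical, the values are assumed to have already been extracted from the kraken2, sistr, kmc and seqkit outputs, and the 40x depth criterion is assumed to be a minimum.

```python
from dataclasses import dataclass

EXPECTED_GENOME_SIZE = 4_700_000  # 4.7 Mbp, from the QC table
SIZE_TOLERANCE = 0.10             # +/- 10%, from the QC table
MIN_DEPTH = 40.0                  # assumption: 40x is treated as a minimum


@dataclass
class SampleQC:
    """Hypothetical per-sample summary of the QC metrics."""
    top_species: str          # most abundant species reported by kraken2
    serotype: str             # serotype reported by sistr
    est_genome_size: int      # genome size estimated from reads (kmc), in bp
    assembled_size: int       # assembled genome size (seqkit stats), in bp
    depth: float              # estimated average depth of coverage


def size_ok(size_bp):
    """True if a genome size lies within 4.7 Mbp +/- 10%."""
    low = EXPECTED_GENOME_SIZE * (1 - SIZE_TOLERANCE)
    high = EXPECTED_GENOME_SIZE * (1 + SIZE_TOLERANCE)
    return low <= size_bp <= high


def failed_criteria(sample):
    """Return the names of the QC criteria that a sample fails."""
    failures = []
    if sample.top_species != "Salmonella enterica":
        failures.append("species")
    if sample.serotype != "Enteritidis":
        failures.append("serotype")
    if not size_ok(sample.est_genome_size):
        failures.append("estimated genome size")
    if not size_ok(sample.assembled_size):
        failures.append("assembled genome size")
    if sample.depth < MIN_DEPTH:
        failures.append("depth")
    return failures


good = SampleQC("Salmonella enterica", "Enteritidis", 4_650_000, 4_720_000, 62.5)
bad = SampleQC("Salmonella enterica", "Typhimurium", 4_650_000, 5_300_000, 25.0)
print(failed_criteria(good))
print(failed_criteria(bad))
```

A sample with an empty failure list would go on to downstream analysis. Returning the names of the failed criteria, rather than a single boolean, makes it easier to report why a sequence was excluded.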

Single sample results

MLST and detection of AMR mechanisms (using abritamr) are also undertaken on all sequences which pass the quality assessment.

Core genome MLST

Core genome MLST (cgMLST) is used to group sequences into cgMLST groups for further detailed analysis. Grouping closely related sequences before SNP analysis maximises core genome recovery and allows high-resolution interpretation of the degree of relatedness.

cgMLST, as implemented by the ANAT, uses a scheme of 3016 genes and the allele calling tool chewBBACA to determine allelic profiles.

Only sequences with >90% of alleles detected are used in further analysis.
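
A minimal sketch of this filter follows. It assumes that a locus counts as "detected" when its cell in the allele profile is an allele number, optionally prefixed with `INF-` for a newly inferred allele, and that any other code (for example `LNF`) counts as not detected. The function names are hypothetical, and the exact rule used in production may differ.

```python
def fraction_detected(profile):
    """Fraction of loci in one profile that carry an allele call."""
    def is_called(cell):
        cell = cell.strip()
        if cell.startswith("INF-"):   # newly inferred allele, still a call
            cell = cell[4:]
        return cell.isdigit()         # anything else is treated as missing

    return sum(is_called(cell) for cell in profile) / len(profile)


def passes_allele_filter(profile, min_fraction=0.90):
    """True if strictly more than min_fraction of loci were detected."""
    return fraction_detected(profile) > min_fraction


# Toy profiles with 100 loci instead of the 3016 in the real scheme.
mostly_called = ["1"] * 90 + ["INF-7"] * 5 + ["LNF"] * 5   # 95% detected
borderline = ["1"] * 90 + ["LNF"] * 10                     # exactly 90%
print(passes_allele_filter(mostly_called))
print(passes_allele_filter(borderline))
```

Because the criterion is ">90%", a profile with exactly 90% of alleles detected does not pass in this sketch. With the 3016-gene scheme this corresponds to at least 2715 detected alleles.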

Distances between profiles are calculated with cgmlst-dists (v1.2.0), and these distances are clustered using hierarchical complete-linkage clustering with a 50 allele threshold. This method identifies groups in which every sequence is <= 50 alleles from every other sequence in the group.
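
The effect of complete linkage with a 50 allele threshold can be illustrated with a small pure-Python sketch. It assumes the pairwise allele distances are already available (for example from cgmlst-dists). The function name and the toy distances are hypothetical, and a production workflow would more likely rely on an established clustering library.

```python
def complete_linkage_clusters(names, pair_dist, threshold=50):
    """Agglomerative complete-linkage clustering cut at `threshold`.

    pair_dist maps frozenset({a, b}) to the allele distance between a and b.
    """
    clusters = [[name] for name in names]

    def cluster_distance(c1, c2):
        # Complete linkage: distance between the two most distant members.
        return max(pair_dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the closest pair of clusters under complete linkage.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # no remaining merge keeps all members within threshold
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy allele distances between four sequences.
raw = {("A", "B"): 10, ("A", "C"): 60, ("B", "C"): 45,
       ("A", "D"): 200, ("B", "D"): 200, ("C", "D"): 200}
pair_dist = {frozenset(pair): d for pair, d in raw.items()}
groups = complete_linkage_clusters(["A", "B", "C", "D"], pair_dist)
print(groups)  # [['A', 'B'], ['C'], ['D']]
```

In this example C is only 45 alleles from B, but it is not added to the group because it is 60 alleles from A. This is the property described above: every member of a group is within the threshold of every other member.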

Core SNP analysis

For each cgMLST group (cgT) which contains >= 5 sequences, a comparative core genome SNP analysis is undertaken using the reference genome NC_011294.1 (Salmonella enterica subsp. enterica serovar Enteritidis str. P125109).

Paired-end reads are aligned, variants identified and a core genome calculated using snippy. Pairwise SNP distances and phylogenetic trees are calculated using snp-dists.
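
As an illustration of what a pairwise SNP distance means, the sketch below counts differing positions between two sequences from a core genome alignment. It assumes that only positions where both sequences have an unambiguous base (A, C, G or T) are compared, so gaps and ambiguous bases are not counted as differences. This is understood to match the default behaviour of snp-dists, but it is an assumption here. The function name is hypothetical.

```python
VALID_BASES = set("ACGT")


def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences differ.

    Positions where either sequence has a gap or an ambiguous base
    are skipped (assumption, see the text above).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    return sum(
        1
        for x, y in zip(seq_a.upper(), seq_b.upper())
        if x != y and x in VALID_BASES and y in VALID_BASES
    )


# One true difference (G vs T); the N and gap positions are ignored.
print(snp_distance("ACGTN-A", "ACTTAAA"))  # 1
```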

Interpretation of genomic relatedness

For each cgT, hierarchical single-linkage clustering with a 5 SNP threshold is used to determine the degree of genomic relatedness. This threshold has been shown to identify epidemiologically relevant relationships. Single-linkage clustering identifies groups of sequences which are <= 5 SNPs from at least one other sequence in the cgT.
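
Single-linkage clustering at a fixed threshold is equivalent to finding connected components in a graph where two sequences are linked if they are <= 5 SNPs apart. The sketch below shows this with a small union-find structure. It assumes the pairwise SNP distances are already available (for example from snp-dists), and the function name and toy distances are hypothetical.

```python
def single_linkage_clusters(names, snp_dist, threshold=5):
    """Group sequences linked by chains of pairs <= threshold SNPs apart.

    snp_dist maps (a, b) tuples to the SNP distance between a and b.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent pointers to the group representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), d in snp_dist.items():
        if d <= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(groups.values())


# Toy SNP distances within one cgT.
snp_dist = {("s1", "s2"): 3, ("s2", "s3"): 4, ("s1", "s3"): 7,
            ("s1", "s4"): 30, ("s2", "s4"): 30, ("s3", "s4"): 30}
clusters = single_linkage_clusters(["s1", "s2", "s3", "s4"], snp_dist)
print(clusters)  # [['s1', 's2', 's3'], ['s4']]
```

Here s1 and s3 are 7 SNPs apart but still fall in the same cluster, because each is within 5 SNPs of s2. This chaining behaviour is what distinguishes single linkage from the complete linkage used for cgMLST grouping.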

Software versions

All tools (except those used for cgMLST) are implemented within the bohra pipeline v3.4.3.

Tool	Version	Reference	Database
chewBBACA	v 2.16.0	Publication - https://academic.oup.com/nar/article/49/D1/D660/5929238?login=false Tool - https://chewbbaca.readthedocs.io/en/latest/user/getting_started/overview.html	https://zenodo.org/records/1323684 downloaded 2023-10-01
seqkit	seqkit v2.12.0	Publication - https://doi.org/10.1002/imt2.191 Tool - https://bioinf.shenwei.me/seqkit/
kmc	K-Mer Counter (KMC) ver. 3.2.4 (2024-02-09)	Publication - https://doi.org/10.1093/bioinformatics/btx304 Tool - https://github.com/refresh-bio/KMC
shovill	shovill 1.4.2	Tool - https://github.com/tseemann/shovill
kraken2	Kraken version 2.17.1	Publication - https://doi.org/10.1186/s13059-019-1891-0 Tool - https://github.com/DerrickWood/kraken2	/opt/resources/k2_pluspfp_20240605
abritamr	abritamr 1.0.20	Publication - https://doi.org/10.1038/s41467-022-35713-4 Tool - https://zenodo.org/records/12514579
mobsuite	mob_recon 3.1.9	Publication - https://doi.org/10.1099/mgen.0.000435 Tool - https://github.com/phac-nml/mob-suite
mlst	mlst 2.19	Tool - https://github.com/tseemann/mlst
sistr	(sistr 1.1.1)	Tool - https://github.com/phac-nml/sistr_cmd