Running ClusTRace

Test ClusTRace and print options

perl clustrace.

Test ClusTRace by running all steps on sample data

perl clustrace.pl --fasta data/samples/delta-s1.fasta --res results/delta-s1 --pipe all -t 16 -v

Assign lineage to fasta sequences with Pangolin

perl clustrace.pl --fasta my/data/january.fasta --res results/January --pipe pangolin

Collect consensus sequences into multi-fasta files by assigned lineage (--res dir must contain the lineage_report.csv generated in the previous step). Restrict the analysis to the Alpha and Beta variants of concern (VOC):

perl clustrace.pl --fasta my/data/january.fasta --res results/January --pipe collect --target B.1.1.7,B.1.351

Analyse collected multi-fasta: remove outliers + create MSAs + create trees

perl clustrace.pl --res results/January --pipe filter,align,tree -t 16 -v

Control outlier filtering: filter by sequence length (more than 5% deviation from the median) and by gap content (>10% gaps)

perl clustrace.pl --res results/January --minlen 95 --maxlen 105 --maxgap 10 --pipe f,a,t
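
The thresholds in the command above can be read as follows. This is a minimal Python sketch of the filtering rule as described in this README, not the code ClusTRace runs. The function name filter_outliers is hypothetical, and counting "-" and "N" as gap characters is an assumption, since the README does not say which characters are treated as gaps.

```python
from statistics import median


def filter_outliers(seqs, minlen=90, maxlen=110, maxgap=10, gap_chars="-N"):
    """Split sequences into (kept, excluded) by relative length and gap content.

    seqs maps sequence id -> sequence string and must not be empty.
    minlen/maxlen are percentages of the median sequence length,
    maxgap is the maximum allowed gap percentage.
    gap_chars is an ASSUMPTION, not taken from ClusTRace.
    """
    med = median(len(s) for s in seqs.values())
    shortest = med * minlen / 100  # sequences shorter than this are excluded
    longest = med * maxlen / 100   # sequences longer than this are excluded
    kept, excluded = {}, {}
    for name, seq in seqs.items():
        n_gaps = sum(seq.upper().count(c) for c in gap_chars)
        gap_pct = 100 * n_gaps / len(seq) if seq else 100.0
        if len(seq) < shortest or len(seq) > longest or gap_pct > maxgap:
            excluded[name] = seq
        else:
            kept[name] = seq
    return kept, excluded


# Toy data: median length is 100, so with --minlen 95 --maxlen 105 the
# accepted length range is 95..105.
seqs = {
    "a": "A" * 100,
    "b": "A" * 100,
    "c": "A" * 90,              # too short
    "d": "A" * 80 + "N" * 20,   # 20% gaps
    "e": "A" * 120,             # too long
}
kept, excluded = filter_outliers(seqs, minlen=95, maxlen=105, maxgap=10)
```

In this sketch the excluded sequences correspond to what the pipeline writes to lineage.fa.flt, and the kept ones to the filtered lineage.fa.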

Extract clusters for the Alpha variant with TreeCluster. Clusters are extracted with the max-clade method at different maximum mutation rates. The pipeline also creates a summary Excel table with cluster statistics and growth rates, and Nexus trees in which clusters are identified by node color and label.

perl clustrace.pl --res results/January --pipe cl --tperiod week --target B.1.1.7 -v

Create cluster MSA(s), VCF (Variant Call Format) files and VCF summaries. These include both nucleotide and amino acid variants.

perl clustrace.pl --res results/January --pipe vcall --target B.1.1.7 --refvar data/lineage_variants.tab

Create lineage VCF files and summaries

perl clustrace.pl --res results/January --pipe vclineage --refvar data/lineage_variants.tab

Clean up temporary files and pack results into a tarball

perl clustrace.pl --res results/January --pipe cleen,pack
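
The step-by-step commands above can also be chained from a script. The following Python sketch is only an illustration: the helper name clustrace_cmd is hypothetical, it uses only options that appear in this README, and it assumes clustrace.pl is in the current working directory. It builds and prints the commands without running them.

```python
import shlex


def clustrace_cmd(res, pipe, fasta=None, target=None, threads=None, verbose=False):
    """Build the argument list for one ClusTRace run (nothing is executed)."""
    cmd = ["perl", "clustrace.pl"]
    if fasta:
        cmd += ["--fasta", fasta]
    cmd += ["--res", res, "--pipe", ",".join(pipe)]
    if target:
        cmd += ["--target", ",".join(target)]
    if threads:
        # Mirrors the -t form used in the examples above; the options
        # table lists a thread count option as --numth.
        cmd += ["-t", str(threads)]
    if verbose:
        cmd.append("-v")
    return cmd


steps = [
    clustrace_cmd("results/January", ["pangolin"], fasta="my/data/january.fasta"),
    clustrace_cmd("results/January", ["collect"], fasta="my/data/january.fasta",
                  target=["B.1.1.7", "B.1.351"]),
    clustrace_cmd("results/January", ["filter", "align", "tree"],
                  threads=16, verbose=True),
]

for cmd in steps:
    print(shlex.join(cmd))
    # To actually run a step: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file paths, and check=True would stop the chain at the first failing step.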

ClusTRace command line options

Option	Value	[Default]	Function
--fasta	file		Input multifasta (.fa or .fasta)
--res	dir	results	Output directory
--log	dir	log	Directory for logging
--colpan	str	dark2	Color scheme for coloring clusters: rgb|paired|dark2 (for a preview see https://colorbrewer2.org/)
--minseqn	int	10	Lineage filtering: exclude lineages with seqn < minseqn
--minlen	int	90	Sequence outlier filtering: exclude sequences shorter than median_length * minlen%
--maxlen	int	110	Sequence outlier filtering: exclude sequences longer than median_length * maxlen%
--maxgap	int	10	Sequence outlier filtering: exclude sequences with gaps% > maxgap%
--tree	str	iqtree	Run iqtree (IQ-Tree2 --mset GTR+F) or vftree (VeryFastTree --gtr -nt)
--ufboot		false	Run iqtree with ultrafast bootstrap and create consensus tree (IQ-Tree2 -B 1000 -bnni)
--trimal_gt	num	0.9	trimal -gt threshold. Used to trim MSAs before tree construction
--tperiod	str	month	Time period for cluster analysis. Accepted values: month|week
--outgroup	file	data/NC_045512.fa	Fasta with outgroup sequence
--refgen	file	data/NC_045512.fa	Fasta with reference genome
--refvar	file		File with reference lineage variants. Format: lineageid \t gene1: var1,var2,..[,varn]; gene2: var1,var2,..[,varn] \n. GISAID characteristic mutations for some SARS-CoV-2 lineages are available in data/lineage_variants.tab
--pipe	str	all	Comma-separated list of steps to perform, e.g. --pipe p,c,f,a
	p|pangolin		Assign lineages with Pangolin. Lineage report is printed to --res dir
	c|collect		Collect sequences for each lineage into multi-fasta
	f|filter		Filter lineage multi-fasta
	a|align		Create MSAs for each lineage multifasta in --res dir
	t|tree		Create ML-trees for each lineage MSA in --res dir. Use --ufboot option to create consensus trees.
	cl|clust		Extract clusters at various mutation rates
	vc|vcall		Create MSA and VCF files for all clusters. Add VCF variants to the cluster summary Excel file.
	vclineage		Create VCF files for all lineage MSAs in --res dir. Create an Excel summary with VCF variants for each lineage.
	pack		Pack results into a tarball. Tarball will be created in the root directory of --res dir.
	cleen		Clean up space by removing all intermediate and temporary files.
	all		Run all steps
--target	str	false	Comma-separated list of target lineages to analyze (e.g. --target B.1.1.7). When omitted, all lineages are analyzed
--numth	int	8	Number of threads
--short		true	Truncate sequence names to the first occurrence of "_"
-v		false	Run in verbose mode
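
The --refvar format listed in the table above (lineage id, a tab, then semicolon-separated "gene: variants" blocks) can be read with a few lines of Python. This is a minimal sketch under that description only: the function name parse_refvar_line is hypothetical, and the example line is illustrative, not copied from data/lineage_variants.tab.

```python
def parse_refvar_line(line):
    """Parse one --refvar line into (lineage_id, {gene: [variants]})."""
    lineage, _, rest = line.rstrip("\n").partition("\t")
    genes = {}
    for block in rest.split(";"):
        gene, sep, variants = block.partition(":")
        if not sep:
            continue  # skip empty or malformed blocks without a "gene:" prefix
        genes[gene.strip()] = [v.strip() for v in variants.split(",") if v.strip()]
    return lineage.strip(), genes


# Illustrative input line (an assumption, not taken from the shipped data file)
example = "B.1.1.7\tS: N501Y,P681H; N: D3L\n"
lineage, genes = parse_refvar_line(example)
```

A quick check like this can help confirm that a custom reference variant file follows the expected layout before it is passed to --pipe vcall or --pipe vclineage.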

What is the output of ClusTRace?

Results will be printed to --res dir.

File	--pipe step	Description
lineage.fa	collect	Multi-fasta with sequences for each lineage. After running --pipe filter, outliers are excluded from these files.
lineage.fa.flt	filter	Multi-fasta with filtered (i.e. excluded) sequences for each lineage
lineage.fa.stats	filter	Statistics (length, gap content, ..) and applied filters for sequences in each lineage (tab-delimited format).
lineage.msa	align	MSA for each analysed lineage/multifasta
lineage.ml.tree	tree	Maximum likelihood tree for each analysed lineage/multifasta, newick format.
lineage.con.tree	tree	Bootstrap consensus tree for each analysed lineage/multifasta, newick format. Required options: --tree iqtree --ufboot
lineage.mr=x.nex	clust	Clusters for consensus (or ml) tree at mutation rate X highlighted in different colors, nexus tree file
lineage.cl.xlsx	clust	Clusters for consensus (or ml) tree at different mutation rates, Excel table
lineage.cluster_summary.xlsx	clust	Cluster summary for each lineage. Includes data sheets clustSeqN, clustSeqID, clustGR_MR=X, clustMutations_MR=X and clustMutationTable_MR=X.
sheet: clustSeqN	clust	Reports the number of sequences in each cluster for each time period
sheet: clustSeqID	clust	Reports sequence ids assigned to each cluster at each time period
sheet: clustGR_MR=X	clust	Reports cluster growth rates and support values
sheet: clustMutations_MR=X	vcall	Reports nt, aa, reference aa and non-reference aa mutations for each cluster. Reporting non-reference aa mutations requires option --refvar file.
sheet: clustMutationTable_MR=X	vcall	Reports aa mutations for the 10 fastest growing clusters in a binary matrix. Top row lists aa mutations in genomic order with non-reference mutations highlighted in bold.
		LEGEND: "period", date period (data from the first date to this date), "mr", mutation rate, "cluster", cluster id, "seqn", number of sequences assigned to this cluster, "subclustern", number of subclusters for this cluster, "support", bootstrap support
lineage.vcf	vclineage	Variant Call Format file (VCF) with nt and aa variants for each analysed lineage
lineageSummary.xlsx	vclineage	Variant summary for analysed lineages. Includes data sheet lineageMutations.
sheet: lineageMutations	vclineage	Reports nt, aa, reference aa and non-reference aa mutations for each lineage. Reporting non-reference aa mutations requires option --refvar file.

Linking in-house data to the pipeline trees

In-house data can be exported from an Excel file and displayed in any Nexus or Newick tree using a simple procedure:

Start by setting the first column in your Excel sheet to the sequence/sample label displayed in the tree (shown as "Tip Labels" in FigTree). Set the header name of the first column to "taxa".
Export your Excel sheet to a tab-delimited annotation file: select "File->Export->Change File Type", select "Text (Tab delimited)" and click "Save As".
Open the tree you wish to annotate in FigTree. Select "File -> Import Annotations", select your exported annotation file and click "Open".
To view the imported annotations, select "Tip Labels" in the FigTree menu and choose the field you wish to display (e.g. "location") from the "Display" drop-down menu.
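
If the in-house data is not kept in Excel, the same annotation file can be produced from a script. The Python sketch below writes a tab-delimited table whose first column is headed "taxa", as the procedure above requires. The helper name write_annotations, the sample labels and the "location" values are all hypothetical.

```python
import csv
import io


def write_annotations(rows, fields, handle):
    """Write a tab-delimited annotation table with 'taxa' as the first column."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["taxa"] + fields)
    for row in rows:
        # Missing fields are written as empty cells
        writer.writerow([row["taxa"]] + [row.get(f, "") for f in fields])


# Hypothetical sample metadata; "taxa" must match the tip labels in the tree
rows = [
    {"taxa": "sample001", "location": "Helsinki"},
    {"taxa": "sample002", "location": "Turku"},
]

buf = io.StringIO()
write_annotations(rows, ["location"], buf)
print(buf.getvalue(), end="")
```

To write to disk instead, pass a handle opened with open("annotations.txt", "w", newline=""). If the trees were built with the default --short setting, the tip labels may be truncated at the first "_", so the "taxa" values should match the labels as they appear in the tree.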

More information at https://bitbucket.org/plyusnin/clustrace/src/master/