Running ClusTRace

Test ClusTRace and print options

perl clustrace.pl

Test ClusTRace by running all steps on sample data

perl clustrace.pl --fasta data/samples/delta-s1.fasta --res results/delta-s1 --pipe all -t 16 -v

Assign lineage to fasta sequences with Pangolin

perl clustrace.pl --fasta my/data/january.fasta --res results/January --pipe pangolin

Collect consensus sequences into multi-fasta files by assigned lineage (the --res dir must contain the lineage_report.csv generated in the previous step). Restrict the analysis to the Alpha and Beta variants of concern (VOC):

perl clustrace.pl --fasta my/data/january.fasta --res results/January --pipe collect --target B.1.1.7,B.1.351
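
Before collecting, it can be useful to check how many sequences each lineage received, since lineages with fewer than --minseqn sequences are excluded. The sketch below is a hypothetical helper, not part of ClusTRace. It assumes a Pangolin-style report with a header line and the lineage in the second comma-separated column; check this against the lineage_report.csv produced by your Pangolin version. The demo input is made up and abbreviated to two columns.

```shell
#!/bin/sh
# count_lineages [MINSEQN]
# Reads a Pangolin-style report on stdin (comma-separated, one header line,
# lineage assumed to be in column 2) and prints "lineage count" for every
# lineage with at least MINSEQN sequences (default 10).
count_lineages() {
    awk -F ',' -v min="${1:-10}" '
        NR > 1 { n[$2]++ }                       # skip header, count per lineage
        END { for (l in n) if (n[l] >= min) print l, n[l] }
    ' | sort
}

# Demo with made-up input and a threshold of 2 sequences
printf 'taxon,lineage\nseq1,B.1.1.7\nseq2,B.1.1.7\nseq3,B.1.351\n' | count_lineages 2
```

With a real run the input would be redirected from the report instead, e.g. `count_lineages 10 < results/January/lineage_report.csv`.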

Analyse collected multi-fasta: remove outliers + create MSAs + create trees

perl clustrace.pl --res results/January --pipe filter,align,tree -t 16 -v

Control outlier filtering: filter by sequence length (more than 5% deviation from the median) and by gap content (>10% gaps)

perl clustrace.pl --res results/January --minlen 95 --maxlen 105 --maxgap 10 --pipe f,a,t
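
To make the percentage thresholds concrete, the following sketch computes the length window implied by --minlen and --maxlen for a given median length, using the median_length*minlen% and median_length*maxlen% bounds given in the options table. It only illustrates the arithmetic and is not ClusTRace's own code. The function name `length_window` is hypothetical, and the median of 29903 nt is an assumption chosen for the example.

```shell
#!/bin/sh
# length_window MEDIAN MINLEN MAXLEN
# Prints the lower and upper length bounds, assuming they are computed as
# median*minlen% and median*maxlen% (hypothetical helper).
length_window() {
    LC_ALL=C awk -v med="$1" -v lo="$2" -v hi="$3" 'BEGIN {
        printf "%.2f %.2f\n", med * lo / 100, med * hi / 100
    }'
}

# Median length 29903 with --minlen 95 --maxlen 105
length_window 29903 95 105
# prints: 28407.85 31398.15
```

Under these assumptions, sequences shorter than about 28408 nt or longer than about 31398 nt would be treated as outliers.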

Extract clusters for the Alpha variant with TreeCluster. Clusters are extracted with the max-clade method at different maximum mutation rates. The pipeline also creates a summary Excel table with cluster statistics and growth rates, and Nexus trees in which clusters are identified by node color and label.

perl clustrace.pl --res results/January --pipe cl --tperiod week --target B.1.1.7 -v

Create cluster MSAs, VCF (Variant Call Format) files and VCF summaries. These include both nucleotide and amino acid variants.

perl clustrace.pl --res results/January --pipe vcall --target B.1.1.7 --refvar data/lineage_variants.tab

Create lineage VCF files and summaries

perl clustrace.pl --res results/January --pipe vclineage --refvar data/lineage_variants.tab

Clean up temporary files and pack results into a tarball

perl clustrace.pl --res results/January --pipe cleen,pack

ClusTRace command line options

Option        Value  [Default]           Function
--fasta       file                       Input multifasta (*.fa or *.fasta)
--res         dir    results             Output directory
--log         dir    log                 Directory for logging
--colpan      str    dark2               Color scheme for coloring clusters: rgb|paired|dark2 (for a preview see https://colorbrewer2.org/)
--minseqn     int    10                  Lineage filtering: exclude lineages with seqn < minseqn
--minlen      int    90                  Sequence outlier filtering: exclude sequences shorter than median_length*minlen%
--maxlen      int    110                 Sequence outlier filtering: exclude sequences longer than median_length*maxlen%
--maxgap      int    10                  Sequence outlier filtering: exclude sequences with gaps% > maxgap%
--tree        str    iqtree              Run iqtree (IQ-Tree2 --mset GTR+F) or vftree (VeryFastTree --gtr -nt)
--ufboot             false               Run iqtree with ultrafast bootstrap and create consensus tree (IQ-Tree2 -B 1000 -bnni)
--trimal_gt   num    0.9                 trimal -gt threshold. Used to trim MSAs before tree construction
--tperiod     str    month               Time period for cluster analysis. Accepted values: month|week
--outgroup    file   data/NC_045512.fa   Fasta with outgroup sequence
--refgen      file   data/NC_045512.fa   Fasta with reference genome
--refvar      file                       File with reference lineage variants.
                                         Format: lineageid \t gene1: var1,var2,..[,varn]; gene2: var1,var2,..[,varn] \n
                                         GISAID characteristic mutations for some SARS-CoV-2 lineages are available in data/lineage_variants.tab
--pipe        str    all                 Comma-separated list of steps to perform, e.g. --pipe p,c,f,a
                                         p|pangolin  Assign lineages with Pangolin. Lineage report is printed to --res dir
                                         c|collect   Collect sequences for each lineage into multi-fasta
                                         f|filter    Filter lineage multi-fasta
                                         a|align     Create MSAs for each lineage multifasta in --res dir
                                         t|tree      Create ML-trees for each lineage MSA in --res dir. Use the --ufboot option to create consensus trees.
                                         cl|clust    Extract clusters at various mutation rates
                                         vc|vcall    Create MSA and VCF files for all clusters. Add VCF variants to the cluster summary Excel file.
                                         vclineage   Create VCF files for all lineage MSAs in --res dir. Create an Excel summary with VCF variants for each lineage.
                                         pack        Pack results into a tarball. The tarball is created in the root directory of --res dir.
                                         cleen       Clean up space by removing all intermediate and temporary files.
                                         all         Run all steps
--target      str    false               Comma-separated list of target lineages to analyze (e.g. --target B.1.1.7). When omitted, all lineages are analyzed
--numth       int    8                   Number of threads
--short              true                Truncate sequence names to the first occurrence of "_"
-v                   false               Run in verbose mode
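
The --refvar layout described in the table can be illustrated with a small sketch. The example line is an assumption written to match the described format (lineage id, a tab, then semicolon-separated "gene: variants" groups); it lists a few well-known B.1.1.7 amino acid substitutions and is not copied from data/lineage_variants.tab. The function `list_variants` is a hypothetical helper for inspecting such a file, not part of ClusTRace.

```shell
#!/bin/sh
# list_variants
# Reads lines of the form
#   lineageid <TAB> gene1: var1,var2; gene2: var1,var2
# on stdin and prints one "lineage gene variant" triple per line.
list_variants() {
    awk -F '\t' '{
        ngroups = split($2, groups, /; */)       # one group per gene
        for (i = 1; i <= ngroups; i++) {
            split(groups[i], kv, /: */)          # kv[1] = gene, kv[2] = variant list
            nvars = split(kv[2], vars, /,/)
            for (j = 1; j <= nvars; j++)
                print $1, kv[1], vars[j]
        }
    }'
}

# Illustrative input line (not taken from data/lineage_variants.tab)
printf 'B.1.1.7\tS: N501Y,A570D,P681H; N: D3L,S235F\n' | list_variants
```

For the example line this prints five triples, starting with `B.1.1.7 S N501Y`.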

What is the output of ClusTRace?

Results will be printed to --res dir.

File                              --pipe step  Description
lineage.fa                        collect      Multi-fasta with sequences for each lineage. After running --pipe filter, outliers are excluded from these files.
lineage.fa.flt                    filter       Multi-fasta with filtered (i.e. excluded) sequences for each lineage
lineage.fa.stats                  filter       Statistics (length, gap content, ..) and applied filters for sequences in each lineage (tab-delimited format).
lineage.msa                       align        MSA for each analysed lineage/multifasta
lineage.ml.tree                   tree         Maximum likelihood tree for each analysed lineage/multifasta, Newick format.
lineage.con.tree                  tree         Bootstrap consensus tree for each analysed lineage/multifasta, Newick format. Required options: --tree iqtree --ufboot
lineage.mr=x.nex                  clust        Clusters for consensus (or ML) tree at mutation rate X highlighted in different colors, Nexus tree file
lineage.cl.xlsx                   clust        Clusters for consensus (or ML) tree at different mutation rates, Excel table
lineage.cluster_summary.xlsx      clust        Cluster summary for each lineage. Includes data sheets clustSeqN, clustSeqID, clustGR_MR=X, clustMutations_MR=X and clustMutationTable_MR=X.
  sheet: clustSeqN                clust        Reports the number of sequences in each cluster for each time period
  sheet: clustSeqID               clust        Reports sequence ids assigned to each cluster at each time period
  sheet: clustGR_MR=X             clust        Reports cluster growth rates and support values
  sheet: clustMutations_MR=X      vcall        Reports nt, aa, reference aa and non-reference aa mutations for each cluster. Reporting non-reference aa mutations requires option --refvar file.
  sheet: clustMutationTable_MR=X  vcall        Reports aa mutations for the 10 fastest growing clusters in a binary matrix. Top row lists aa mutations in genomic order with non-reference mutations highlighted in bold.
  LEGEND: "period", date period (data from the first date to this date), "mr", mutation rate, "cluster", cluster id, "seqn", number of sequences assigned to this cluster, "subclustern", number of subclusters for this cluster, "support", bootstrap support
lineage.vcf                       vclineage    Variant Call Format file (VCF) with nt and aa variants for each analysed lineage
lineageSummary.xlsx               vclineage    Variant summary for analysed lineages. Includes data sheet lineageMutations.
  sheet: lineageMutations         vclineage    Reports nt, aa, reference aa and non-reference aa mutations for each lineage. Reporting non-reference aa mutations requires option --refvar file.

Linking in-house data to the pipeline trees

In-house data can be exported from an Excel file and displayed in any Nexus or Newick tree using a simple procedure:

  1. Set the first column in your Excel sheet to the sequence/sample label displayed in the tree (shown as "Tip Labels" in FigTree). Set the header of the first column to "taxa".
  2. Export your Excel sheet to a tab-delimited annotation file: select "File -> Export -> Change File Type", select "Text (Tab delimited)" and click "Save As".
  3. Open the tree you wish to annotate in FigTree. Select "File -> Import Annotations", select your exported annotation file and click "Open".
  4. To view the imported annotations, select "Tip Labels" in the FigTree menu and choose the field you wish to display (e.g. "location") from the "Display" drop-down menu.
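
Before importing, the exported annotation file can be sanity-checked against the requirements in the steps above. The sketch below is a hypothetical helper, not part of ClusTRace or FigTree. It only checks that the first header is "taxa" and that every row has as many tab-separated fields as the header. The sample rows are made up, and any further requirements FigTree may have are not covered.

```shell
#!/bin/sh
# check_annotations
# Reads a tab-delimited annotation table on stdin. Prints "ok" and returns 0
# if the first header is "taxa" and all rows have the same number of fields
# as the header; otherwise prints the problems found and returns 1.
check_annotations() {
    awk -F '\t' '
        { sub(/\r$/, "") }                       # tolerate Windows line endings
        NR == 1 {
            ncols = NF
            if ($1 != "taxa") { print "first header is " $1 ", expected taxa"; bad = 1 }
            next
        }
        NF != ncols { print "line " NR ": " NF " fields, expected " ncols; bad = 1 }
        END { if (!bad) print "ok"; exit (bad ? 1 : 0) }
    '
}

# Demo with made-up sample labels and a "location" column
printf 'taxa\tlocation\nsample_1\tHelsinki\nsample_2\tTurku\n' | check_annotations
```

With a real export, the file would be redirected into the function, e.g. `check_annotations < annotations.txt`, where the file name is a placeholder for your exported file.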

More information at https://bitbucket.org/plyusnin/clustrace/src/master/