Running ClusTRace

Test ClusTRace and print options

perl clustrace.pl

Assign lineage to fasta sequences with Pangolin

perl clustrace.pl --fasta fasta_dir --res res_January --pipe pangolin

Collect consensus sequences into multi-fasta files by assigned lineage (lineage and fasta filenames must partially match). Restrict the analysis to the Alpha and Beta variants of concern (VOCs)

perl clustrace.pl --fasta fasta_dir --res res_January --pipe collect --target B.1.1.7,B.1.351
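
Before running the collect step it can help to check that every fasta file has a lineage file with the same filename prefix (the part before the first "_"). The shell sketch below is not part of ClusTRace. The function names name_prefix and unmatched_fasta are hypothetical, and the sketch assumes that lineage files are named <prefix>_...lineage.csv and sit in the same directory as the fasta files.

```shell
# Print the part of a file's basename that precedes the first "_".
name_prefix() {
    base=${1##*/}
    printf '%s\n' "${base%%_*}"
}

# List fasta files in a directory that have no lineage file with the same prefix.
# Assumption: lineage files are named <prefix>_...lineage.csv.
unmatched_fasta() {
    dir=$1
    for fa in "$dir"/*.fasta "$dir"/*.fa; do
        [ -e "$fa" ] || continue              # skip patterns that matched nothing
        prefix=$(name_prefix "$fa")
        set -- "$dir"/"$prefix"_*lineage.csv  # candidate lineage files for this prefix
        [ -e "$1" ] || printf '%s\n' "$fa"
    done
}

# Example call: unmatched_fasta fasta_dir
```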

Analyse collected multi-fasta: remove outliers + create MSAs + create trees

perl clustrace.pl --res res_January --pipe filter,align,tree

Control outlier filtering: filter by sequence length (at most 5% deviation from the median) and by gap content (exclude sequences with >10% gaps)

perl clustrace.pl --res res_January --minlen 95 --maxlen 105 --maxgap 10 --pipe f,a,t
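
The thresholds in the command above can be read as follows: a sequence is kept when its length lies between 95% and 105% of the median length and its gap content is at most 10%. The shell sketch below illustrates that rule and is not the ClusTRace implementation. passes_filter is a hypothetical helper, the numbers in the example call are made up, and measuring gap content relative to the sequence's own length is an assumption.

```shell
# Decide whether one sequence passes the length and gap filters.
# Usage: passes_filter MEDIAN_LEN SEQ_LEN GAP_COUNT MINLEN MAXLEN MAXGAP
passes_filter() {
    awk -v med="$1" -v len="$2" -v gaps="$3" \
        -v minlen="$4" -v maxlen="$5" -v maxgap="$6" 'BEGIN {
        if (len == 0) { print "exclude"; exit }   # empty sequence
        lo = med * minlen / 100                   # shortest accepted length
        hi = med * maxlen / 100                   # longest accepted length
        gap_pct = 100 * gaps / len                # gap content in percent (assumed definition)
        if (len < lo || len > hi || gap_pct > maxgap) print "exclude"
        else print "keep"
    }'
}

# Example call with made-up numbers (median 29900 nt, sequence 29850 nt, 120 gaps):
# passes_filter 29900 29850 120 95 105 10
```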

Extract clusters for the Alpha variant with TreeCluster. Clusters are extracted with the max-clade method at different maximum mutation rates. The pipeline also creates a summary Excel table with cluster statistics and growth rates, and Nexus trees in which clusters are identified by node color and label.

perl clustrace.pl --res res_January --pipe cl --tperiod week --target B.1.1.7

Update the January analysis with new sequences. MSA(s) and tree(s) will be updated. Clusters from January will be matched to the new tree(s) to cover the new sequences.

perl clustrace.pl --ref res_January --res res_February --fasta fasta_February --numth 16 --pipe c,f,a,t,cl

Clean up temporary files and pack results into a tarball

perl clustrace.pl --res res_January --pipe cleen,pack

ClusTRace command line options

Option Value [Default] Function
--fasta dir   Directory with fasta and lineage-files. Fasta format: one sequence per fasta file with suffix *.fasta/*.fa. Lineage format: each lineage file must contain column header labelled "lineage" and have filename with suffix *lineage.csv. NOTE: lineage and fasta filenames must have matching prefixes up to the first "_" char.
--res dir   Output directory.
--ref dir   Reference directory with reference MSA(s), tree(s) and clusters (see --pipe options).
--col str #FF0000 Hexadecimal color code for highlighting new sequences when run in reference mode
--colpan str dark2 Color scheme for coloring clusters: rgb|paired|dark2 (for a preview see https://colorbrewer2.org/)
--minseqn int 10 Lineage filtering: exclude lineages with seqn<minseqn
--minlen int 90 Sequence outlier filtering: exclude sequences shorter than median_length*minlen%
--maxlen int 110 Sequence outlier filtering: exclude sequences longer than median_length*maxlen%
--maxgap int 10 Sequence outlier filtering: exclude sequences with gaps% > maxgap%
--tree str iqtree Run iqtree (IQ-Tree2 --mset GTR+F) or vftree (VeryFastTree --gtr -nt)
--ufboot   false Run iqtree with ultrafast bootstrap and create consensus tree (IQ-Tree2 -B 1000 -bnni)
--trimal_gt num 0.9 trimal -gt threshold. Used to trim MSAs before tree construction.
--tperiod str month Time period for cluster analysis. Accepted values: month|week
--outgroup file or dir   A file or a directory with outgroup sequence(s). If --outgroup is a directory the pipeline will look for an outgroup fasta file for each lineage. This can be an exact match (e.g. dir/B.1.1.7.fasta) or a less specific match (e.g. dir/B.1.fasta for B.1.1.7 lineage). If there is no matching outgroup file for a lineage, a default outgroup file is used. Each outgroup file must contain exactly one reference sequence. Outgroup files must be specified and identical for both --pipe align and --pipe tree calls
--refgen file data/NC_045512.fa Fasta file to be used as the reference genome in variant calling
--refvar file   File with reference lineage variants.
Format: lineageid \t gene1: var1,var2,..[,varn]; gene2: var1,var2,..[,varn] \n
--pipe str all Comma-separated list of steps to perform, e.g. --pipe p,c,f,a
  p|pangolin   Assign lineages with Pangolin. Lineage files are printed to --fasta dir
  c|collect   Collect sequences for each lineage into multi-fasta. WARNING: this will clean up --res dir.
  f|filter   Filter multi-fasta
  a|align   Create MSA(s). In reference mode will add multi-fasta in --res dir to MSA for the same lineage in --ref dir (if available).
  t|tree   Create tree(s). In reference mode will constrain new trees by trees for the same lineage in the --ref dir.
  cl|clust   Extract clusters at various mutation rates. In reference mode will match clusters between --res and --ref dirs.
  vc|vcall   Create MSA and VCF files for all clusters. Add VCF variants to the cluster summary Excel file.
  vclineage   Create VCF files for all lineage MSA(s) in --res dir. Create an Excel summary with VCF variants for each lineage.
  pack   Pack results into a tarball. Tarball will be created to the root directory of --res dir.
  cleen   Clean up space by removing all intermediate and temporary files.
  all   Run all steps
--target str false Comma-separated list of target lineages to analyze (e.g. --target B.1.1.7). When omitted, all lineages are analyzed
--numth int 8 Number of threads
--short   true Truncate sequence names to the first occurrence of "_"
-v   false Run in verbose mode
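
The outgroup lookup described for --outgroup dir can be pictured as walking up the lineage name one level at a time until a matching file is found. The shell sketch below is one possible reading of "exact match or less specific match", not the ClusTRace code. find_outgroup and its third argument (the default outgroup file) are hypothetical, and the .fasta suffix is taken from the examples in the option description.

```shell
# Print the outgroup file chosen for a lineage: the exact match if present,
# otherwise the closest parent lineage, otherwise the supplied default.
# Usage: find_outgroup OUTGROUP_DIR LINEAGE DEFAULT_FILE
find_outgroup() {
    dir=$1 name=$2 default=$3
    while :; do
        if [ -f "$dir/$name.fasta" ]; then
            printf '%s\n' "$dir/$name.fasta"
            return 0
        fi
        case $name in
            *.*) name=${name%.*} ;;   # drop the last level: B.1.1.7 -> B.1.1
            *)   break ;;             # no parent lineage left to try
        esac
    done
    printf '%s\n' "$default"
}

# Example call: find_outgroup outgroup_dir B.1.1.7 default_outgroup.fasta
```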

What is the output of ClusTRace?

Results will be printed to --res dir.

File --pipe step Description
lineage.fa collect Multi-fasta with sequences for each lineage. After running "--pipe filter" outliers are excluded from these files.
lineage.fa.flt filter Multi-fasta with filtered (i.e. excluded) sequences for each lineage
lineage.fa.stats filter Statistics (length, gap content, ..) and applied filters for sequences in each lineage (tab-delimited format).
lineage.msa align MSA for each analysed lineage/multifasta
lineage.ml.tree tree Maximum likelihood tree for each analysed lineage/multifasta, newick format.
lineage.con.tree tree Bootstrap consensus tree for each analysed lineage/multifasta, newick format. Required options: --tree iqtree --ufboot
lineage.ml.tree.nex tree lineage.ml.tree with new sequences highlighted, nexus format. Required options: --ref dir
lineage.con.tree.nex tree lineage.con.tree with new sequences highlighted, nexus format. Required options: --ref dir
lineage.mr=x.nex clust Clusters for consensus (or ml) tree at mutation rate x highlighted in different colors, nexus tree file
lineage.cl.xlsx clust Clusters for consensus (or ml) tree at different mutation rates, Excel table
lineage.cluster_summary.xlsx clust Cluster summary for each lineage. Includes data sheets clustSeqN, clustSeqID, clustGR_MR=X, clustMutations_MR=X and clustMutationTable_MR=X.
sheet: clustSeqN clust Reports the number of sequences in each cluster for each time period
sheet: clustSeqID clust Reports sequence ids assigned to each cluster at each time period
sheet: clustGR_MR=X clust Reports cluster growth rates and support values
sheet: clustMutations_MR=X vcall Reports nt, aa, reference aa and non-reference aa mutations for each cluster. Reporting non-reference aa mutations requires option --refvar file.
sheet: clustMutationTable_MR=X vcall Reports aa mutations for the 10 fastest growing clusters in a binary matrix. Top row lists aa mutations in genomic order with non-reference mutations highlighted in bold.
    LEGEND: "period", date period (data from the first date to this date), "mr", mutation rate, "cluster", cluster id, "seqn", number of sequences assigned to this cluster, "subclustern", number of subclusters for this cluster, "support", bootstrap support
lineage.vcf vclineage Variant Call Format file (VCF) with nt and aa variants for each analysed lineage
lineageSummary.xlsx vclineage Variant summary for analysed lineages. Includes data sheet lineageMutations.
sheet: lineageMutations vclineage Reports nt, aa, reference aa and non-reference aa mutations for each lineage. Reporting non-reference aa mutations requires option --refvar file.
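
Because the non-reference mutation columns depend on the --refvar file, it can be useful to inspect that file before a run. The shell sketch below expands a file in the documented format (lineageid, a tab, then "gene: var1,var2; gene: var1,...") into one row per variant. refvar_to_long is a hypothetical helper, not a ClusTRace command, and the variant strings in the example comment are placeholders.

```shell
# Print one "lineage<TAB>gene<TAB>variant" row per variant in a --refvar style file.
# Example input line (placeholder values): B.1.1.7<TAB>S: N501Y,P681H; N: D3L
refvar_to_long() {
    awk -F '\t' '{
        n = split($2, genes, ";")
        for (i = 1; i <= n; i++) {
            pos = index(genes[i], ":")            # separator between gene and variants
            if (pos == 0) continue                # skip entries without a gene name
            gene = substr(genes[i], 1, pos - 1)
            gsub(/^[ \t]+|[ \t]+$/, "", gene)     # trim surrounding whitespace
            m = split(substr(genes[i], pos + 1), vars, ",")
            for (j = 1; j <= m; j++) {
                gsub(/^[ \t]+|[ \t]+$/, "", vars[j])
                if (vars[j] != "") print $1 "\t" gene "\t" vars[j]
            }
        }
    }' "$1"
}
```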

Linking in-house data to the pipeline trees

In-house data can be exported from an Excel file and displayed in any Nexus or Newick tree using a simple procedure:

  1. Start by setting the first column in your Excel sheet to the sequence/sample label displayed in the tree (displayed as "Tip Labels" in FigTree). Set the header name of the first column to "taxa".
  2. Export your Excel sheet to a tab-delimited annotation file: select "File -> Export -> Change File Type", choose "Text (Tab delimited)" and click "Save As".
  3. Open the tree you wish to annotate in FigTree. Select "File -> Import Annotations", select your exported annotation file and click "Open".
  4. To view imported annotations, select "Tip Labels" from the FigTree menu and choose the field you wish to display (e.g. "location") from the "Display" drop-down menu.
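
A quick check of the exported annotation file can catch a wrong header before importing it into FigTree. The shell sketch below only verifies that the first column header is "taxa", as required in step 1. has_taxa_header is a hypothetical helper, and stripping carriage returns assumes the file may have been exported with Windows line endings.

```shell
# Succeed (exit status 0) if the first column header of a tab-delimited file is "taxa".
has_taxa_header() {
    head -n 1 "$1" | tr -d '\r' |
        awk -F '\t' 'NR == 1 { ok = ($1 == "taxa") } END { exit !ok }'
}

# Example call:
# has_taxa_header annotations.txt && echo "header ok" || echo "first column must be named taxa"
```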

More information at https://bitbucket.org/plyusnin/clustrace/src/master/