Examples

Quickstart Example

Lazypipe is available as a preinstalled module on the Puhti server at the Finnish Center for Scientific Computing (CSC). To start using Lazypipe, log in to Puhti and type:

module load r-env-deprecated
module load biokit
module load lazypipe
sbatch-lazypipe -1 mydata/forward_reads.fastq -2 mydata/reverse_reads.fastq --hostgen mydata/host_genome.fna.gz --res result_directory --label result_subdirectory --pipe 1:10

The script will ask you for the accounting project, the maximum duration of the job, the memory reservation (at least 4 GB × number_of_cores is recommended) and the number of cores to reserve. It will then create a job script and submit it to the batch job system with sbatch.

For more details, please see the CSC documentation for the Lazypipe module.
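The memory recommendation is simple arithmetic, and the following minimal sketch shows it for a 32-core job. The helper name min_mem_gb is hypothetical and is not part of Lazypipe or sbatch-lazypipe; only the 4 GB per core figure comes from the text above.

```shell
#!/bin/bash
# Hypothetical helper (not part of Lazypipe): minimum total memory in GB
# for a given number of cores, using the 4 GB-per-core recommendation.
min_mem_gb() {
    local cores=$1
    local gb_per_core=${2:-4}   # default: 4 GB per core
    echo $(( cores * gb_per_core ))
}

cores=32
echo "Cores: ${cores}, minimum memory: $(min_mem_gb "$cores")G"
```

For 32 cores this prints a minimum of 128G, which is the value to enter when the script asks for the memory reservation.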

Example 1

Download data

In this example we will analyze a 75k paired-end Illumina library sequenced from a mink feces sample. Download and unpack the sample library to your data directory.

cd /scratch/my_project/data/
wget https://bitbucket.org/plyusnin/lazypipe/downloads/M15.tar.gz
tar -xzvf M15.tar.gz

Download the host genome for filtering host reads (no need to unpack). For this sample we will use the Neovison vison (American mink) assembly.

cd /scratch/my_project/genomes_host
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/108/605/GCA_900108605.1_NNQGG.v01/GCA_900108605.1_NNQGG.v01_genomic.fna.gz
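Because the host genome is kept compressed, a truncated download may go unnoticed until the pipeline fails. The following is a minimal sketch of an integrity check based on gzip -t; the function name check_gz is hypothetical and the check is not something Lazypipe requires.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if FILE exists, is non-empty
# and passes gzip's built-in integrity test.
check_gz() {
    local file=$1
    if [ ! -s "$file" ]; then
        echo "missing or empty: $file" >&2
        return 1
    fi
    if ! gzip -t "$file" 2>/dev/null; then
        echo "corrupt or truncated: $file" >&2
        return 1
    fi
    echo "OK: $file"
}

# Example call, using the path from the download step above:
# check_gz /scratch/my_project/genomes_host/GCA_900108605.1_NNQGG.v01_genomic.fna.gz
```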

Run Lazypipe

Now create an example1.bash file that will execute pipeline steps 1 to 7 and 9 to 11. Here we intentionally skip step 8 (i.e. the IGV report), which requires installation of viral reference genomes. If you wish to include it, install the viral reference genomes and igv-reports, and change --pipe 1:7,9:11 to --pipe 1:11.

#!/bin/bash -l
#SBATCH --job-name=lazytest
#SBATCH --account=my_project
#SBATCH --time=02:00:00
#SBATCH --mem-per-cpu=4G
#SBATCH --cpus-per-task=32
#SBATCH --partition=small
module load r-env
echo "TMPDIR=/scratch/my_project/lazypipe/wrkdir" > .Renviron
module load python-env
module load biokit
srun /projappl/my_project/lazypipe/pipeline.pl \
-1 /scratch/my_project/data/M15/M15_R1.fastq \
--hostgen /scratch/my_project/genomes_host/GCA_900108605.1_NNQGG.v01_genomic.fna.gz \
--res /scratch/my_project/results --label M15 \
--numth $SLURM_CPUS_PER_TASK --inlen 300 \
--pipe 1:7,9:11 &> /scratch/my_project/results/M15.log

Now you can execute the Lazypipe analysis by calling sbatch:

sbatch example1.bash

Note that on the first run Lazypipe will call bwa index for your host genome. For large genomes this may take up to 2 hours and cannot be parallelized. However, indexing is done only once for each host genome.
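To judge whether a run needs the extra indexing time, you can look for existing index files. The sketch below assumes that the index is written next to the genome file using the default bwa index prefix, i.e. the input file name plus the suffixes .amb, .ann, .bwt, .pac and .sa. Where Lazypipe actually stores the index is not stated here, so treat the location as an assumption; has_bwa_index is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: succeed if all five files written by "bwa index"
# exist for the given prefix. ASSUMPTION: the index sits next to the genome
# file and uses the default prefix (the genome file name itself).
has_bwa_index() {
    local prefix=$1 ext
    for ext in amb ann bwt pac sa; do
        [ -s "${prefix}.${ext}" ] || return 1
    done
    return 0
}

genome=/scratch/my_project/genomes_host/GCA_900108605.1_NNQGG.v01_genomic.fna.gz
if has_bwa_index "$genome"; then
    echo "complete bwa index found for $genome"
else
    echo "no complete bwa index found next to $genome"
fi
```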

Results for example 1

All results are printed to the directory specified by --res and --label, in this case /scratch/my_project/results/M15/. After completion this directory should include the following files:

This Excel file contains tables with taxon abundances. Abundances are displayed in separate tables for viruses (excluding bacteriophages), bacteria, bacteriophages and eukarya. For each domain, abundances are displayed at three taxonomic levels: species, genus and family.
Tables in this file contain information on database homologs found for assembled contigs. Tables are displayed separately for viruses, bacteria and eukarya. The columns displayed depend on the applied homology search (sans/blastp/centrifuge).
summary-M15.html

IGV reports are interactive graphical reports that display location and variation in viral contigs relative to reference genomes. They are printed by the "--pipe 8" option and require installed reference genomes. IGV reports are printed for contigs with homologs matching reference genomes in your local database.

krona_graph.html

Taxonomy profiles are also displayed as an interactive Krona graph.

Taxonomic abundances in CAMI Profiling Output Format (Sczyrba A. et al. 2017). This output format supports standardised evaluation and integration with automated workflows.
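As an illustration of such integration, the sketch below lists taxa at one rank from a CAMI profile. It assumes the standard tab-separated layout of the format (header lines starting with @, then the columns TAXID, RANK, TAXPATH, TAXPATHSN and PERCENTAGE). The function name cami_rank_table is hypothetical, and the name of the profile file written by Lazypipe is not given here, so the path is left as a parameter.

```shell
#!/bin/bash
# Hypothetical helper: print "name<TAB>percentage" for every taxon of a given
# rank in a CAMI profiling file. ASSUMPTION: standard CAMI column order
# TAXID, RANK, TAXPATH, TAXPATHSN, PERCENTAGE, separated by tabs.
cami_rank_table() {
    local profile=$1 rank=${2:-species}
    awk -F'\t' -v rank="$rank" '
        /^@/ || NF < 5 { next }          # skip header and blank lines
        $2 == rank {
            n = split($4, names, "|")    # TAXPATHSN holds the full lineage
            print names[n] "\t" $5       # last element is the taxon itself
        }
    ' "$profile"
}

# Example call (replace the path with the CAMI profile in your result directory):
# cami_rank_table /path/to/cami_profile.tsv species
```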

qc.r1hist.jpeg
qc.r2hist.jpeg
qc.conhits.jpeg
qc.rsurv.jpeg

Quality Control (QC) plots include length histograms for reads and contigs, and survival plots. The survival plots track how much of the read data is retained after each step in the pipeline: preprocessing, host filtering, assembly and gene prediction.

contigs

This directory will contain assembled contigs in fasta format, sorted into a directory structure that matches the assigned taxonomy. In the M15 sample Lazypipe found 3 contigs assigned to viral species (excluding bacteriophages): 2 contigs were assigned to Mamastrovirus 10 and 1 to Mink circovirus.

contigs/Viruses/Astroviridae/Mamastrovirus.fa
This file will contain all contigs assigned to the genus Mamastrovirus. In this example Mamastrovirus.fa will contain the two contigs assigned to Mamastrovirus 10.

contigs/Viruses/Circoviridae/Circovirus.fa
This file will contain all contigs assigned to the genus Circovirus. In this example Circovirus.fa will contain the single contig assigned to Mink circovirus.

Contigs for bacteria, bacteriophages and eukarya will be assigned to their respective subdirectories. Contigs for which no homologs were found will be assigned to the "unknown" subdirectory.
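To get a quick overview of this directory tree, you can count the fasta records in each file. The following is a minimal sketch; count_contigs is a hypothetical helper, and it assumes that all contig files use the .fa extension shown in the examples above.

```shell
#!/bin/bash
# Hypothetical helper: for every *.fa file below DIR, print the number of
# fasta records (lines starting with ">") and the path relative to DIR.
count_contigs() {
    local dir=$1
    find "$dir" -type f -name '*.fa' | sort | while IFS= read -r fasta; do
        printf '%s\t%s\n' "$(grep -c '^>' "$fasta")" "${fasta#"$dir"/}"
    done
}

# Example call for the M15 results:
# count_contigs /scratch/my_project/results/M15/contigs
```

For the M15 sample, the line for Viruses/Astroviridae/Mamastrovirus.fa should then report the two contigs assigned to Mamastrovirus 10.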

    Example 2

    In this example we will analyze public Illumina HiSeq/MiSeq libraries sequenced from five patients at the early stage of the SARS-CoV-2 outbreak in Wuhan, China. For more information see NCBI BioProject PRJNA605983.

    Download data

    In this example we will use the NCBI SRA Toolkit to download NGS libraries. SRA Toolkit is available on CSC as part of the biokit module. Other users can install SRA Toolkit from the NCBI website.

    Start by configuring SRA Toolkit with the vdb-config utility (included in the kit). Set the SRA Toolkit download directory to /scratch/my_project/my_sra/ or any other convenient location:

    module load biokit
    vdb-config -i

    Now download any SRA library for project PRJNA605983 (NCBI accession numbers SRR11092056-SRR11092064). In the following example code we will use SRR11092062, sequenced from sample WIV04-2. After downloading, dump the fastq files to /scratch/my_project/my_sra/reads or any other convenient location.

    module load biokit
    prefetch SRR11092062
    mkdir -p /scratch/my_project/my_sra/reads/
    fastq-dump --split-files --outdir /scratch/my_project/my_sra/reads/ SRR11092062
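If you want all libraries of the project rather than a single one, the same two commands can be run in a loop. The sketch below generates the accession range given above (SRR11092056 to SRR11092064) and, by default, only prints the commands. The function list_accessions and the DRY_RUN switch are hypothetical conveniences, not part of SRA Toolkit.

```shell
#!/bin/bash
# Hypothetical helper: print the nine accessions SRR11092056..SRR11092064.
list_accessions() {
    local i
    for i in $(seq 56 64); do
        echo "SRR110920${i}"
    done
}

# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN=${DRY_RUN:-1}
outdir=/scratch/my_project/my_sra/reads

for acc in $(list_accessions); do
    if [ "$DRY_RUN" = 1 ]; then
        echo "prefetch $acc && fastq-dump --split-files --outdir $outdir $acc"
    else
        prefetch "$acc" && fastq-dump --split-files --outdir "$outdir" "$acc"
    fi
done
```

Run it once with the default setting to check the generated commands, then rerun with DRY_RUN=0 after loading the biokit module.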

    Now create an example2.bash file that will execute pipeline steps 1 to 7 and 9 to 11.

    #!/bin/bash -l
    #SBATCH --job-name=sars2_wiv04_2
    #SBATCH --account=my_project
    #SBATCH --time=06:00:00
    #SBATCH --mem-per-cpu=8G
    #SBATCH --cpus-per-task=32
    #SBATCH --partition=small
    module load r-env
    echo "TMPDIR=/scratch/my_project/lazypipe/wrkdir" > .Renviron
    module load python-env
    module load biokit
    srun /projappl/my_project/lazypipe/pipeline.pl \
    -1 /scratch/my_project/my_sra/reads/SRR11092062_1.fastq \
    -2 /scratch/my_project/my_sra/reads/SRR11092062_2.fastq \
    --hostgen /scratch/my_project/genomes_host/GCA_000001405.15_GRCh38_genomic.fna.gz \
    --res /scratch/my_project/results --label sars2_wiv04_2 \
    --numth $SLURM_CPUS_PER_TASK \
    --pipe 1:7,9:11 &> /scratch/my_project/results/sars2_wiv04_2.log

    Now you can execute the analysis by calling sbatch:

    sbatch example2.bash

    Results for example 2

    All results are printed to /scratch/my_project/results/sars2_wiv04_2/.

    For the list of printed reports, please see example 1.

    The following table displays virus abundances reported by Lazypipe for SARS-CoV-2-positive libraries in project PRJNA605983.

    SRA run | SRA experiment | Platform | Library | Virus | Taxid | readn | readn% | csumq | contign
    SRR11092063 | SRX7730880 | RNA-Seq Illumina HiSeq 3000 | WIV02-2 | Severe acute respiratory syndrome-related coronavirus | 694009 | 559 | 0.3685% | 1 | 23
    SRR11092057 | SRX7730886 | RNA-Seq Illumina MiSeq | WIV04 | Severe acute respiratory syndrome-related coronavirus | 694009 | 732 | 13.0878% | 1 | 15
    SRR11092062 | SRX7730881 | RNA-Seq Illumina HiSeq 1000 | WIV04-2 | Severe acute respiratory syndrome-related coronavirus | 694009 | 5918 | 3.0027% | 1 | 1
    SRR11092062 | SRX7730881 | RNA-Seq Illumina HiSeq 1000 | WIV04-2 | Influenza A virus | 11320 | 274 | 0.1390% | 1 | 2
    SRR11092062 | SRX7730881 | RNA-Seq Illumina HiSeq 1000 | WIV04-2 | Autographa californica multiple nucleopolyhedrovirus | 307456 | 205 | 0.1040% | 1 | 2
    SRR11092061 | SRX7730882 | RNA-Seq Illumina HiSeq 3000 | WIV05 | Severe acute respiratory syndrome-related coronavirus | 694009 | 234 | 0.0510% | 1 | 20
    SRR11092061 | SRX7730882 | RNA-Seq Illumina HiSeq 3000 | WIV05 | Saccharomyces 20S RNA narnavirus | 186772 | 135 | 0.0294% | 2 | 1
    SRR11092060 | SRX7730883 | RNA-Seq Illumina HiSeq 3000 | WIV06-2 | Severe acute respiratory syndrome-related coronavirus | 694009 | 525 | 0.1417% | 1 | 22
    SRR11092060 | SRX7730883 | RNA-Seq Illumina HiSeq 3000 | WIV06-2 | Spodoptera frugiperda rhabdovirus | 1481139 | 165 | 0.0445% | 1 | 1
    SRR11092060 | SRX7730883 | RNA-Seq Illumina HiSeq 3000 | WIV06-2 | Saccharomyces 20S RNA narnavirus | 186772 | 103 | 0.0278% | 2 | 3
    SRR11092059 | SRX7730884 | RNA-Seq Illumina HiSeq 3000 | WIV07-2 | Influenza A virus | 11320 | 9063 | 0.0974% | 1 | 4
    SRR11092059 | SRX7730884 | RNA-Seq Illumina HiSeq 3000 | WIV07-2 | Saccharomyces 20S RNA narnavirus | 186772 | 3386 | 0.0364% | 1 | 1
    SRR11092059 | SRX7730884 | RNA-Seq Illumina HiSeq 3000 | WIV07-2 | Severe acute respiratory syndrome-related coronavirus | 694009 | 819 | 0.0088% | 2 | 16
    SRR11092059 | SRX7730884 | RNA-Seq Illumina HiSeq 3000 | WIV07-2 | Bamboo mosaic virus | 35286 | 325 | 0.0035% | 2 | 1
    SRR11092059 | SRX7730884 | RNA-Seq Illumina HiSeq 3000 | WIV07-2 | Spodoptera frugiperda rhabdovirus | 1481139 | 168 | 0.0018% | 2 | 1

    References:

    [1] Sczyrba A, Hofmann P, Belmann P, et al. Critical assessment of metagenome interpretation—a benchmark of metagenomics software. Nature methods. 2017;14:1063.