Getting started
Requirements
- Nextflow ≥ 23.0
- Singularity (or Apptainer) — all pipeline software runs in containers
- A POSIX-compatible file system accessible from all compute nodes (for SLURM execution)
- An NCBI API key (highly recommended when using --SRA_index)
Step 1: Download the latest release
curl -sL https://github.com/crolllab/genomepanel_nf/archive/refs/tags/latest.tar.gz | tar xz
mv genomepanel_nf-latest genomepanel_nf
cd genomepanel_nf
The latest tag is automatically updated with every new release.
Alternative
You can also browse all releases at github.com/crolllab/genomepanel_nf/releases and download a specific version.
Step 2: Set up Nextflow
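A minimal sketch for checking an existing installation against the Nextflow ≥ 23.0 requirement listed above. The helper name version_ge is hypothetical, and the comparison assumes GNU sort with the -V option. If Nextflow is not installed yet, the installer described in the official Nextflow documentation is one common route (shown as a comment only).

```shell
# One common install route from the Nextflow docs (requires Java; not run here):
#   curl -s https://get.nextflow.io | bash

# Return success if version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

REQUIRED="23.0"   # minimum version from the requirements list

if command -v nextflow >/dev/null 2>&1; then
    # `nextflow -version` prints a banner; extract the first dotted version number.
    installed=$(nextflow -version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1)
    if version_ge "$installed" "$REQUIRED"; then
        echo "Nextflow $installed is recent enough"
    else
        echo "Nextflow $installed is older than $REQUIRED; consider: nextflow self-update"
    fi
else
    echo "Nextflow not found on PATH"
fi
```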
Step 3: Singularity images
All container images are pulled automatically by Nextflow on the first run. No manual download is required. Images are fetched from quay.io/biocontainers and cached locally so subsequent runs reuse them without re-downloading.
By default, the cache is stored in $HOME/.singularity/cache. On HPC clusters where /home is not shared across worker nodes, point the cache to a shared filesystem in nextflow.config.
Example nextflow.config snippet if only /scratch is shared:
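A minimal sketch of such a setting, written from the shell into a separate config file so that the shipped nextflow.config stays untouched. The cache path under /scratch and the file name scratch_cache.config are assumptions for illustration; singularity.cacheDir is the Nextflow option that controls where pulled images are stored.

```shell
# Hypothetical cache location on the shared /scratch filesystem; adjust to your cluster.
CACHE_DIR="/scratch/${USER:-$(id -un)}/singularity_cache"

# Write a small extra config file (the file name is an assumption).
cat > scratch_cache.config <<EOF
singularity {
    cacheDir = '${CACHE_DIR}'
}
EOF

# Show the generated fragment for review.
cat scratch_cache.config
```

The extra file can then be passed with -c scratch_cache.config on the nextflow run command line, which merges it with the pipeline's own configuration.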
See HPC usage & utilities for further Singularity/Apptainer configuration notes.
Step 4: Try the example dataset
The repository ships with a small E. coli LTEE dataset that lets you verify your setup end-to-end before working with your own data. The input files live in example/.
Run from the repository root:
nextflow run main.nf \
--reference example/ecoli_REL606.fasta \
--reads "example/fastq/SRR*_{1,2}.fastq.gz" \
--ploidy 1 \
--outdir example/output
The run completes in roughly 10–60 minutes on a local machine and produces output in example/output/.
Step 5: Prepare your inputs
Reference genome
The reference genome must be in FASTA format (.fasta, .fa, .fna, or .fas extensions are all accepted).
Local FASTQ files
The --reads glob pattern must be quoted (single quotes are the safest choice) so that the shell passes it to Nextflow unexpanded, and it must capture both files of each pair. Common patterns:
# All files ending with 1.fq.gz and 2.fq.gz (e.g. sample_1.fq.gz and sample_2.fq.gz)
--reads '/path/to/reads/*{1,2}.fq.gz'
# Including sub-directories
--reads '/path/to/reads/**{1,2}.fq.gz'
# Files ending with _1.fq.gz, _2.fq.gz OR _R1.fq.gz, _R2.fq.gz
--reads '/path/to/reads/*_{,R}{1,2}.fq.gz'
# Flexible pattern covering most Illumina naming conventions
--reads '/path/to/reads/**_{,R}{1,2}{,_001,_001_*}.{fq,fastq}.gz'
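Before launching, it can help to preview what a pattern captures. The sketch below assumes bash and uses throwaway files with hypothetical sample names; it expands the third pattern above with the shell's own brace and glob expansion. Nextflow evaluates the quoted pattern itself, and its handling of ** across sub-directories differs from bash, so treat this only as a sanity check for the simple single-directory patterns.

```shell
# Create a scratch directory with dummy read files (names are illustrative only).
demo_dir=$(mktemp -d)
touch "$demo_dir"/sampleA_1.fq.gz "$demo_dir"/sampleA_2.fq.gz \
      "$demo_dir"/sampleB_R1.fq.gz "$demo_dir"/sampleB_R2.fq.gz \
      "$demo_dir"/sampleC.fq.gz

shopt -s nullglob                            # unmatched patterns expand to nothing
matches=( "$demo_dir"/*_{,R}{1,2}.fq.gz )    # left unquoted here so bash expands it
shopt -u nullglob

# List the matched file names without the directory prefix.
printf '%s\n' "${matches[@]##*/}" | sort
```

The four paired files are listed, while sampleC.fq.gz is left out because it carries no _1/_2 or _R1/_R2 suffix.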
SRA/ENA accessions
Create a plain-text file with one accession per line. The pipeline accepts PRJNA, SRP, SRX, SRR, and ERR accessions.
The repository includes an example file SRA_accessions.txt.
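A minimal sketch of building and checking such a file from the shell. The accessions are placeholders rather than real records, the file name my_accessions.txt is hypothetical, and the pattern (an accepted prefix followed by digits) is an assumption about what a well-formed line looks like; the pipeline itself may be more lenient, for example about blank lines.

```shell
# Placeholder accessions for illustration only; replace them with real ones.
cat > my_accessions.txt <<'EOF'
PRJNA000000
SRR0000001
ERR0000002
EOF

# Collect (with line numbers) any line that is not an accepted prefix followed by digits.
bad_lines=$(grep -nvE '^(PRJNA|SRP|SRX|SRR|ERR)[0-9]+$' my_accessions.txt || true)

if [ -n "$bad_lines" ]; then
    echo "Unexpected lines in my_accessions.txt:"
    echo "$bad_lines"
else
    echo "All lines look like accepted accessions"
fi
```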
Sample name mapping (optional)
Use a CSV file to assign custom sample names and to merge multiple SRR runs per sample.
Pass it with --SRR_sample_map sample_map.csv. The repository includes an example sample_map.csv.
Step 6: Run the pipeline
Start the pipeline inside a tmux or screen session. The Nextflow process must stay alive until completion, even with -profile slurm, because it is the process that submits and monitors the SLURM jobs.
# Run with SLURM, mixed local fastq (e.g. named _1.fq.gz and _2.fq.gz) + SRA input
export NCBI_API_KEY=your_ncbi_api_key_here
nextflow run main.nf -config nextflow.config -profile slurm \
--NCBI_API_key "$NCBI_API_KEY" \
--reference /path/to/reference_genome.fasta \
--ploidy 2 \
--reads '/path/to/reads/*{1,2}.fq.gz' \
--SRA_index './SRA_accessions.txt'
To resume a previous run (if the work-dir is still intact), simply add -resume to the command. Nextflow will skip completed steps and only execute missing ones.
Storage
The pipeline can require many TB of temporary storage during execution. Point -work-dir to a fast scratch filesystem. The pipeline aggressively cleans up temporary files to minimise final storage, but this makes -resume less effective after variants have been called.
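As a quick pre-flight check, the free space on the intended work directory can be inspected before launching. This is a sketch only: the helper name free_gib and the WORK_DIR variable are hypothetical, and how much space a given run needs depends on the dataset.

```shell
# Print the free space, in whole GiB, on the filesystem that holds directory $1.
free_gib() {
    # -P forces POSIX single-line output; -k reports sizes in 1024-byte blocks.
    # Column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR==2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Directory intended for -work-dir; defaults to the current directory here.
WORK_DIR="${WORK_DIR:-$PWD}"

echo "Free space under $WORK_DIR: $(free_gib "$WORK_DIR") GiB"
```

Setting WORK_DIR to the scratch path that will be passed to -work-dir reports the space available there.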
See the Configuration page for a full list of all available parameters.