Home

genomepanel_nf is a Nextflow pipeline for highly efficient reference genome mapping, variant (SNP/indel) calling and quality control of large genome panels. It accepts Illumina paired-end reads from local files, NCBI/ENA SRA accessions, or pre-processed BAM files, and produces fully genotyped and filtered VCF files along with tabulated sample statistics and an HTML report.

genomepanel_nf has been tested on hundreds to thousands of samples from plant, animal and fungal species, with reference genomes in the single-digit Gb range. Please report any issues or share feature requests on the GitHub Issues page.

The main design goals were to parallelize tasks as much as possible, by splitting reference genomes into segments for variant calling and downstream processing, and to keep SLURM resource requests small by scaling them dynamically to the dataset. Even very large datasets typically peak at single-digit TB of temporary storage.
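The idea of splitting a reference genome into segments can be sketched in a few lines of Python. This is only an illustration of the general technique, not the pipeline's actual code: the function name, the toy contig lengths and the 10 bp segment size are all hypothetical.

```python
from typing import Iterator


def split_into_segments(
    contig_lengths: dict[str, int], segment_size: int
) -> Iterator[tuple[str, int, int]]:
    """Yield (contig, start, end) intervals, 1-based and inclusive, covering every contig."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, segment_size):
            # The last segment of a contig may be shorter than segment_size.
            yield contig, start, min(start + segment_size - 1, length)


# Hypothetical toy reference: two contigs, split into 10 bp segments.
toy_reference = {"chr1": 25, "chr2": 8}
segments = list(split_into_segments(toy_reference, segment_size=10))
for contig, start, end in segments:
    print(f"{contig}:{start}-{end}")  # interval strings such as chr1:1-10
```

Each interval can then be processed as an independent task, which is what makes per-segment variant calling easy to parallelize.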

By default, genomepanel_nf emits only the filtered and unfiltered VCF files to save disk space. Intermediate files (BAMs, GVCFs) can optionally be retained. The pipeline can also be switched to emit confidence scores at invariant sites.

The pipeline is designed to run either on HPC clusters via SLURM or on single machines, and uses Singularity containers so that all software dependencies are reproducible and portable.


Pipeline Summary

%%{init: {"theme": "base", "themeVariables": {"fontSize": "16px", "primaryColor": "#f0f7f5", "primaryBorderColor": "#1a7f6e", "primaryTextColor": "#1a1a1a", "lineColor": "#555"}}}%%
flowchart TD
    classDef io fill:#1a7f6e,color:#fff,stroke:#0d5c50
    classDef step fill:#f0f7f5,color:#1a1a1a,stroke:#1a7f6e,stroke-width:1.5px

    A1([Local FASTQ\n--reads]):::io
    A2([SRA / ENA\n--SRA_index]):::io
    A3([BAM files\n--bam_input]):::io
    REF([Reference genome\n--reference]):::io

    T[fastp\ntrimming & QC]:::step
    M[bwa-mem2\nread mapping]:::step
    D[Picard\nread groups & deduplication]:::step
    HC[GATK HaplotypeCaller\nper-sample GVCF]:::step
    JG[GATK joint genotyping\nCombineGVCFs + GenotypeGVCFs]:::step
    VF[VariantFiltration + bcftools\nflagging & filtering]:::step
    PG[vcftools\npop-gen VCF]:::step
    QC[R\npipeline_report.html]:::step

    O1([final_variants.vcf.gz]):::io
    O2([final_variants.clean.vcf.gz]):::io
    O3([pop-gen VCF]):::io
    O4([pipeline_report.html]):::io

    A1 --> T
    A2 --> T
    T --> M
    REF --> M
    REF --> HC
    M --> D
    D --> HC
    A3 --> HC
    HC --> JG
    JG --> VF
    VF --> O1
    VF --> O2
    O1 --> PG
    PG --> O3
    O2 --> QC
    QC --> O4

The pipeline accepts three input modes that converge at the variant calling step:

Local FASTQ files (--reads): raw paired-end Illumina reads are trimmed with fastp, mapped with bwa-mem2, sorted with samtools, assigned read groups and deduplicated with Picard.
SRA/ENA accessions (--SRA_index): accessions are resolved to run IDs and download URLs; paired-end and single-end runs are handled automatically before joining the same read-processing path.
Pre-processed BAM files (--bam_input): coordinate-sorted, read-group annotated BAMs skip directly to variant calling.
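The way the three input modes converge can be written down as a small lookup table. The sketch below only restates the stage order described above; the Python names (`READ_PROCESSING`, `STAGES_BY_INPUT`, `stages_for`) and the stage labels are illustrative and do not correspond to identifiers in the pipeline.

```python
# Stage labels follow the pipeline summary; the data layout is illustrative only.
READ_PROCESSING = ["fastp", "bwa-mem2", "samtools sort", "Picard"]
VARIANT_CALLING = ["HaplotypeCaller", "CombineGVCFs", "GenotypeGVCFs"]

STAGES_BY_INPUT = {
    "--reads": READ_PROCESSING + VARIANT_CALLING,
    "--SRA_index": ["SRA/ENA download"] + READ_PROCESSING + VARIANT_CALLING,
    "--bam_input": VARIANT_CALLING,  # BAMs skip trimming, mapping and deduplication
}


def stages_for(option: str) -> list[str]:
    """Return the ordered stages a sample passes through for a given input option."""
    try:
        return STAGES_BY_INPUT[option]
    except KeyError:
        raise ValueError(f"unknown input option: {option}") from None


for option in STAGES_BY_INPUT:
    print(option, "->", " > ".join(stages_for(option)))
```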

After read processing, the pipeline calls variants per sample with GATK HaplotypeCaller (GVCF mode) and then performs joint genotyping across all samples with CombineGVCFs and GenotypeGVCFs. Variant filtration follows GATK best practices. A population-genetics VCF (thinned, MAF-filtered) is generated with vcftools. All QC metrics are collected into a single pipeline_report.html.
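To make "thinned, MAF-filtered" concrete, here is a minimal Python sketch of the two operations on a toy list of biallelic sites. It is written in the spirit of distance-based thinning and minor-allele-frequency filtering, not as a reimplementation of vcftools. The thresholds (MAF 0.05, 1000 bp), the `Site` type and the order in which the two filters are applied are assumptions made for illustration; the pipeline's actual settings are not stated here.

```python
from typing import NamedTuple


class Site(NamedTuple):
    contig: str
    pos: int
    alt_freq: float  # frequency of the alternate allele at a biallelic site


def minor_allele_freq(alt_freq: float) -> float:
    """The minor allele is whichever of the two alleles is rarer."""
    return min(alt_freq, 1.0 - alt_freq)


def popgen_subset(sites: list[Site], min_maf: float, thin_distance: int) -> list[Site]:
    """Drop sites below min_maf, then drop sites closer than thin_distance bp
    to the previously kept site on the same contig.

    Sites are expected to be sorted by contig and position.
    """
    kept: list[Site] = []
    last_kept_pos: dict[str, int] = {}
    for site in sites:
        if minor_allele_freq(site.alt_freq) < min_maf:
            continue
        previous = last_kept_pos.get(site.contig)
        if previous is not None and site.pos - previous < thin_distance:
            continue
        kept.append(site)
        last_kept_pos[site.contig] = site.pos
    return kept


# Hypothetical toy data and thresholds.
toy_sites = [
    Site("chr1", 100, 0.50),
    Site("chr1", 150, 0.40),   # too close to chr1:100
    Site("chr1", 1200, 0.01),  # too rare
    Site("chr1", 1300, 0.90),
    Site("chr2", 120, 0.20),
]
subset = popgen_subset(toy_sites, min_maf=0.05, thin_distance=1000)
print([(s.contig, s.pos) for s in subset])
```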

Key features

Flexible input

Local FASTQ, SRA/ENA accessions, or pre-processed BAM files, all in the same run.

HPC-ready

SLURM profile with optimised resource requests. Parallel variant calling across genome segments.

Fully containerised

All tools run inside Singularity images pulled from the Galaxy Project depot.

Built-in QC report

A single pipeline_report.html with fastp, BWA-mem2 and VCF quality plots.

Software used by genomepanel_nf

Please cite the underlying tools if you use them through this pipeline.

Tool	Version	Role	Reference
fastp	1.3.1	Adapter trimming, read QC	Chen et al. 2018, Bioinformatics doi:10.1093/bioinformatics/bty560
bwa-mem2	2.3	Read mapping	Vasimuddin et al. 2019, IPDPS doi:10.1109/IPDPS.2019.00041
SAMtools	1.23.1	BAM manipulation & sorting	Danecek et al. 2021, GigaScience doi:10.1093/gigascience/giab008
Picard	3.4.0	Read group assignment, duplicate removal	Broad Institute 2019
GATK	4.6.2.0	Haplotype calling, genotyping, variant filtration	Van der Auwera & O'Connor 2020
BCFtools	1.23.1	VCF filtering and manipulation	Danecek et al. 2021, GigaScience doi:10.1093/gigascience/giab008
vcftools	0.1.17	Population-genetics VCF processing	Danecek et al. 2011, Bioinformatics doi:10.1093/bioinformatics/btr330
sra-tools	3.2.1	NCBI SRA data download	NCBI
entrez-direct	24.0	NCBI SRA accession resolution	NCBI
R / tidyverse	1.2.1	QC visualisation	R Core Team; Wickham et al. 2019, JOSS doi:10.21105/joss.01686

Release notes

v1.0.5 (April 2026)

Improved module resource requests: CleanVCFs base memory raised from 2 GB to 4 GB, preventing GATK JVM out-of-memory failures on large intervals.
Improved process label readability: shortened tag strings across multiple modules for cleaner Nextflow log output.
Fixed HTML QC report generation: R plotting and summary scripts are now supplied as external files rather than here-documents, resolving character-escaping issues that could silently corrupt the report.
SRA downloader: pinned sra-tools to 3.2.1 (3.4.1 has known segfaults); replaced deprecated --output-file flag; added timeouts and increasing retry delays (10 / 30 / 60 min).
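The retry delays listed for the SRA downloader (10 / 30 / 60 min) describe a simple retry-with-increasing-wait pattern. The Python sketch below shows that pattern in isolation. It is not the pipeline's Nextflow configuration: the function names are hypothetical, and the assumption that one initial attempt is followed by up to three retries is ours.

```python
import time
from typing import Callable, Sequence

RETRY_DELAYS_MIN = (10, 30, 60)  # delays taken from the release note above


def run_with_retries(
    task: Callable[[], str],
    delays_min: Sequence[float] = RETRY_DELAYS_MIN,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run task; after each failure wait the next delay (in minutes) and retry.

    Assumption: one initial attempt plus one retry per delay. The error from
    the final attempt is allowed to propagate.
    """
    for delay in delays_min:
        try:
            return task()
        except OSError:
            sleep(delay * 60)  # convert minutes to seconds
    return task()


# Demo with a simulated download that fails twice; sleep is replaced by a
# recorder so the example finishes immediately.
attempts: list[int] = []
waits: list[float] = []


def flaky_download() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise OSError("simulated network failure")
    return "ok"


result = run_with_retries(flaky_download, sleep=waits.append)
print(result, waits)
```

Passing the sleep function as a parameter keeps the waiting logic testable without actually pausing for an hour.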

v1.0.4b (April 2026)

Fixed invalid retryStrategy process directive.

v1.0.4 (April 2026)

Added GitHub Pages documentation site at crolllab.github.io/genomepanel_nf.

v1.0.3 (April 2026)

Updated all Singularity container images to the latest Galaxy depot releases: sra-tools 3.4.1, fastp 1.3.1, bwa-mem2 2.3, samtools 1.23.1, bcftools 1.23.1, gatk4-spark 4.6.2.0 build 1.
Redesigned QC report (pipeline_report.html): unified fastp, BWA and variant quality sections into a single HTML report with inline PDF plots (font-independent, works in all containers and browsers).
Optimised SLURM resource requests across all pipeline modules.
Added --bam_input alternative entry point for pre-processed BAM files.
Added --min_contig_length option to filter short reference contigs.
Added --bwa_index option to supply pre-built BWA-mem2 index files.
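As an illustration of what filtering short reference contigs (the --min_contig_length option above) amounts to, the following Python sketch selects contigs from the lines of a FASTA index (.fai, as written by samtools faidx, where the first two tab-separated columns are contig name and length). The function name and the toy index are hypothetical, and whether the pipeline treats the threshold as inclusive is an assumption.

```python
from typing import Iterable


def contigs_above_length(fai_lines: Iterable[str], min_length: int) -> list[str]:
    """Return names of contigs that are at least min_length bp long.

    Assumption: the threshold is inclusive (length >= min_length).
    """
    kept = []
    for line in fai_lines:
        if not line.strip():
            continue  # skip blank lines
        name, length, *_ = line.rstrip("\n").split("\t")
        if int(length) >= min_length:
            kept.append(name)
    return kept


# Hypothetical toy index: name, length, offset, line bases, line width.
toy_fai = [
    "chr1\t5000000\t6\t60\t61",
    "scaffold_42\t800\t5083340\t60\t61",
    "chr2\t1000\t5084160\t60\t61",
]
kept_contigs = contigs_above_length(toy_fai, min_length=1000)
print(kept_contigs)
```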

Questions and feature requests

Questions, bug reports and feature requests are welcome on the GitHub Issues page.

How to cite

If you use genomepanel_nf in your research, please cite:

Croll, D. (2026). genomepanel_nf - a highly efficient Nextflow pipeline for reference genome variant calling of large genome panels (v1.0.5). Zenodo. https://doi.org/10.5281/zenodo.19392838

Please also cite the underlying tools listed in the Software table above.