Home

genomepanel_nf is a Nextflow pipeline for highly efficient reference genome mapping, variant (SNP/indel) calling and quality control of large genome panels. It accepts Illumina paired-end reads from local files, NCBI/ENA SRA accessions, or pre-processed BAM files, and produces fully genotyped and filtered VCF files along with tabulated sample statistics and an HTML report.

We tested genomepanel_nf on hundreds to thousands of samples from plant, animal and fungal species with reference genomes in the single-digit Gb range. Please report any issues or share feature requests on the GitHub Issues page.

The main design goals were to parallelize tasks as much as possible by splitting reference genomes into segments for variant calling and downstream processing, and to minimize SLURM resource requests dynamically depending on the dataset. Even very large datasets typically peak at single-digit TBs of temporary storage needs.
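
The segment-based parallelism described above can be pictured with a small sketch. It is not the pipeline's actual implementation: the function name `split_into_segments`, the 1-based inclusive `contig:start-end` interval format, and the segment size are assumptions chosen for illustration.

```python
from typing import Dict, List


def split_into_segments(contig_lengths: Dict[str, int], segment_size: int) -> List[str]:
    """Split each contig into consecutive intervals of at most `segment_size` bp.

    Intervals are 1-based and inclusive, written as 'contig:start-end'.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    segments = []
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, segment_size):
            # The last segment of a contig may be shorter than segment_size.
            end = min(start + segment_size - 1, length)
            segments.append(f"{contig}:{start}-{end}")
    return segments


# Hypothetical toy genome: two contigs, split into 10 bp segments.
toy_genome = {"chr1": 25, "chr2": 8}
segments = split_into_segments(toy_genome, segment_size=10)
print(segments)
```

Each resulting interval could then be handed to an independent variant-calling task, which is what allows many small jobs to run in parallel instead of one large job per sample.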

By default, genomepanel_nf emits only the filtered and unfiltered VCF files to save disk space. Intermediate files (BAMs, GVCFs) can optionally be retained. The pipeline can also be switched to emit confidence scores at invariant sites.

The pipeline is designed to run both on HPC clusters via SLURM and on single machines, and it uses Singularity containers so that all software dependencies are reproducible and portable.

Pipeline Summary

%%{init: {"theme": "base", "themeVariables": {"fontSize": "16px", "primaryColor": "#f0f7f5", "primaryBorderColor": "#1a7f6e", "primaryTextColor": "#1a1a1a", "lineColor": "#555"}}}%%
flowchart TD
    classDef io fill:#1a7f6e,color:#fff,stroke:#0d5c50
    classDef step fill:#f0f7f5,color:#1a1a1a,stroke:#1a7f6e,stroke-width:1.5px

    A1([Local FASTQ\n--reads]):::io
    A2([SRA / ENA\n--SRA_index]):::io
    A3([BAM files\n--bam_input]):::io
    REF([Reference genome\n--reference]):::io

    T[fastp\ntrimming & QC]:::step
    M[bwa-mem2\nread mapping]:::step
    D[Picard\nread groups & deduplication]:::step
    HC[GATK HaplotypeCaller\nper-sample GVCF]:::step
    JG[GATK joint genotyping\nGenomicsDBImport + GenotypeGVCFs]:::step
    VF[VariantFiltration + bcftools\nflagging & filtering]:::step
    PG[vcftools\npop-gen VCF]:::step
    QC[R\npipeline_report.html]:::step

    O1([final_variants.vcf.gz]):::io
    O2([final_variants.clean.vcf.gz]):::io
    O3([pop-gen VCF]):::io
    O4([pipeline_report.html]):::io

    A1 --> T
    A2 --> T
    T --> M
    REF --> M
    REF --> HC
    M --> D
    D --> HC
    A3 --> HC
    HC --> JG
    JG --> VF
    VF --> O1
    VF --> O2
    O1 --> PG
    PG --> O3
    O2 --> QC
    QC --> O4

The pipeline accepts three input modes that converge at the variant calling step:

Local FASTQ files (--reads): raw paired-end Illumina reads are trimmed with fastp, mapped with bwa-mem2, sorted with samtools, assigned read groups and deduplicated with Picard.
SRA/ENA accessions (--SRA_index): accessions are resolved to run IDs and download URLs; paired-end and single-end runs are handled automatically before joining the same read-processing path.
Pre-processed BAM files (--bam_input): coordinate-sorted, read-group annotated BAMs skip directly to variant calling.
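
As a rough illustration of how the three input flags might be combined on the command line, the sketch below assembles a `nextflow run` invocation. Only the flags themselves (`--reads`, `--SRA_index`, `--bam_input`, `--reference`, `--slurm_queue`, `-profile slurm`) come from this documentation; the entry script name `main.nf`, the helper `build_command`, and all paths and values are hypothetical, and the expected format of each value is not specified here.

```python
import shlex
from typing import List, Optional


def build_command(
    reference: str,
    reads: Optional[str] = None,
    sra_index: Optional[str] = None,
    bam_input: Optional[str] = None,
    slurm_queue: Optional[str] = None,
    pipeline: str = "main.nf",  # hypothetical entry script name
) -> List[str]:
    """Assemble a `nextflow run` argument list from the chosen input modes."""
    inputs = {"--reads": reads, "--SRA_index": sra_index, "--bam_input": bam_input}
    if not any(inputs.values()):
        raise ValueError("provide at least one of --reads, --SRA_index, --bam_input")
    cmd = ["nextflow", "run", pipeline]
    if slurm_queue is not None:
        # The SLURM profile requires --slurm_queue (see release notes, v1.0.9).
        cmd += ["-profile", "slurm", "--slurm_queue", slurm_queue]
    cmd += ["--reference", reference]
    for flag, value in inputs.items():
        if value:  # input modes can be combined in the same run
            cmd += [flag, value]
    return cmd


# Hypothetical paths and partition name, for illustration only.
cmd = build_command(
    reference="ref/genome.fa",
    reads="data/fastq",
    bam_input="data/bam",
    slurm_queue="long",
)
print(shlex.join(cmd))
```

The command is only built and printed, not executed, so the sketch can be adapted to the actual entry script and input formats before use.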

After read processing, the pipeline calls variants per sample with GATK HaplotypeCaller (GVCF mode), consolidates the GVCFs with GenomicsDBImport (which replaced CombineGVCFs in v1.0.7), and jointly genotypes all samples with GenotypeGVCFs. Variants are then flagged with GATK VariantFiltration and filtered with bcftools, following GATK best practices. A population-genetics VCF (thinned, MAF-filtered) is generated with vcftools. All QC metrics are collected into a single pipeline_report.html.
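
To make "thinned, MAF-filtered" concrete, the following sketch applies both filters to a toy list of biallelic sites. It is a simplified stand-in, not the vcftools code and not the pipeline's settings: the `Site` type, the function `popgen_filter`, the thresholds (MAF of at least 0.05, at least 1000 bp between kept sites) and the example sites are all assumptions.

```python
from typing import List, NamedTuple, Optional


class Site(NamedTuple):
    chrom: str
    pos: int
    alt_freq: float  # frequency of the alternate allele at a biallelic site


def popgen_filter(sites: List[Site], min_maf: float, min_distance: int) -> List[Site]:
    """Keep sites passing a MAF threshold, then thin them greedily by distance.

    `sites` is assumed to be sorted by chromosome and position.
    """
    kept: List[Site] = []
    last_chrom: Optional[str] = None
    last_pos = 0
    for site in sites:
        # Minor allele frequency of a biallelic site.
        maf = min(site.alt_freq, 1.0 - site.alt_freq)
        if maf < min_maf:
            continue
        # Thinning: skip sites too close to the previously kept site.
        if site.chrom == last_chrom and site.pos - last_pos < min_distance:
            continue
        kept.append(site)
        last_chrom, last_pos = site.chrom, site.pos
    return kept


# Hypothetical example sites.
sites = [
    Site("chr1", 100, 0.30),
    Site("chr1", 150, 0.40),   # within 1000 bp of the previous kept site
    Site("chr1", 1200, 0.01),  # minor allele too rare
    Site("chr1", 1300, 0.90),
    Site("chr2", 200, 0.50),
]
kept = popgen_filter(sites, min_maf=0.05, min_distance=1000)
print([(s.chrom, s.pos) for s in kept])
```

Thinning reduces linkage between neighbouring markers, and the MAF filter removes rare variants, both of which are common preparation steps for population-genetic analyses.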

Key features

Flexible input

Local FASTQ, SRA/ENA accessions, or pre-processed BAM files, all in the same run.

HPC-ready

SLURM profile with optimised resource requests. Parallel variant calling across genome segments.

Fully containerised

All tools run inside Singularity images pulled from the Galaxy Project depot.

Built-in QC report

A single pipeline_report.html with fastp, bwa-mem2 and VCF quality plots.

Software used by genomepanel_nf

Please cite the underlying tools if you use them through this pipeline.

Tool	Version	Role	Reference
fastp	1.3.1	Adapter trimming, read QC	Chen et al. 2018, Bioinformatics doi:10.1093/bioinformatics/bty560
bwa-mem2	2.3	Read mapping	Vasimuddin et al. 2019, IPDPS doi:10.1109/IPDPS.2019.00041
SAMtools	1.23.1	BAM manipulation & sorting	Danecek et al. 2021, GigaScience doi:10.1093/gigascience/giab008
Picard	3.4.0	Read group assignment, duplicate removal	Broad Institute 2019
GATK	4.6.2.0	Haplotype calling, genotyping, variant filtration	Van der Auwera & O'Connor 2020
BCFtools	1.23.1	VCF filtering and manipulation	Danecek et al. 2021, GigaScience doi:10.1093/gigascience/giab008
vcftools	0.1.17	Population-genetics VCF processing	Danecek et al. 2011, Bioinformatics doi:10.1093/bioinformatics/btr330
sra-tools	3.2.1	NCBI SRA data download	NCBI
entrez-direct	24.0	NCBI SRA accession resolution	NCBI
R / tidyverse	1.2.1	QC visualisation	R Core Team; Wickham et al. 2019, JOSS doi:10.21105/joss.01686

Release notes

v1.0.11 (July 2026)

Stopped SAMtools and Picard from filtering out low-quality or duplicate reads; all reads are now retained in BAM files.
New --use_duplicate_reads option to use duplicate reads in GATK HaplotypeCaller (e.g. for ddRAD or RAD-seq data).

v1.0.10 (June 2026)

Fixed VCF sample names with --SRR_sample_map: GenomicsDBImport was reassigning names from gVCF filenames, discarding biological names set by Picard.
Rewrote filterReference as a streaming parser to avoid loading the entire genome into RAM.
Fixed QC report crash when sample names share a common prefix.
Broadened memory scaling; GenomicsDBImport and GenotypeGVCFs now start at 16 GB.
Raised the --genomicsdb_batch_size default to 200. The pipeline now exits immediately on unrecognised parameters.
Added MergeGVCFs module for --keep_gvcf runs with sub-chromosomal segmentation.

v1.0.9 (May 2026)

Added required --slurm_queue parameter when using -profile slurm. The pipeline now validates this at startup and exits with a clear error if it is missing. The specified partition should allow a walltime of at least 7 days.

v1.0.8 (May 2026)

BAM files are now deleted mid-run once all GATKHC tasks have finished, freeing disk space without risking premature removal.
Fixed zero-length vector crash in r_process_summary_fastp.R for single-end samples (duplication$rate / insert_size$peak null guard).

v1.0.7 (May 2026)

Replaced CombineGVCFs with GenomicsDBImport for GVCF consolidation. Eliminates in-memory GC-thrashing at large sample counts; batch size tunable via --genomicsdb_batch_size.
Nextflow ≥ 26.x compatibility: moved log.info inside the workflow block, replaced C-style loops with Groovy functional expressions, and updated nextflow.config env-var syntax.
Added JVM flags -XX:-UsePerfData --enable-native-access=ALL-UNNAMED to all GATK processes to suppress Java 17+ warnings.
Fixed gatkIndex memory to prevent a -Xmx0g crash on first attempt.
Fixed single-end sample crash in r_process_summary_fastp.R (insert_size$peak is absent for SE data).
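
The idea behind --genomicsdb_batch_size in this release can be sketched as follows: GATK's GenomicsDBImport imports samples in batches so that only a limited number of GVCF readers are open at once. The snippet only illustrates the batching arithmetic; the helper `batches` and the file names are hypothetical, and how the pipeline passes the value on to GATK is not shown here.

```python
from typing import Iterator, List, Sequence


def batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical panel of 450 per-sample GVCFs, imported 200 at a time.
gvcfs = [f"sample_{i:03d}.g.vcf.gz" for i in range(1, 451)]
batch_list = list(batches(gvcfs, batch_size=200))
print([len(b) for b in batch_list])
```

A smaller batch size lowers peak memory at the cost of more import rounds, which is the trade-off the parameter exposes.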

v1.0.6 (April 2026)

Switched to automatic Singularity image pulling. No manual action required.
Included a fully functional example dataset.

v1.0.5 (April 2026)

Fixed HTML QC report generation.
Improved stability of the SRA downloader.

v1.0.4b (April 2026)

Fixed invalid retryStrategy process directive.

v1.0.4 (April 2026)

Added GitHub Pages documentation site at crolllab.github.io/genomepanel_nf.

v1.0.3 (April 2026)

Updated all Singularity container images to the latest Galaxy depot releases.
Redesigned QC report (pipeline_report.html): unified fastp, BWA and variant quality sections into a single HTML report with inline PDF plots (font-independent, works in all containers and browsers).
Optimised SLURM resource requests across all pipeline modules.

Questions and feature requests

Questions, bug reports and feature requests are welcome on the GitHub Issues page.

How to cite

If you use genomepanel_nf in your research, please cite:

Croll, D. (2026). genomepanel_nf - a highly efficient Nextflow pipeline for reference genome variant calling of large genome panels (v1.0.5). Zenodo. https://doi.org/10.5281/zenodo.19392838

Please also cite the underlying tools listed in the Software table above.