HPC usage & utilities

Notes on HPC execution

Storage requirements

The pipeline aggressively deletes intermediate files after each step to minimise disk usage (cleanup = true in nextflow.config). Despite this, large runs can temporarily require many TB of scratch space. Always point -work-dir to a fast, high-capacity scratch filesystem:

nextflow run main.nf ... -work-dir '/path/to/scratch/genomepanel_work'
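
Because the peak scratch footprint is hard to predict, it can be worth checking free space before launching. The following is a minimal sketch and not part of the pipeline: the helper names avail_kb and require_free_tb are made up for illustration, and the sketch assumes the scratch path is already mounted on the node where Nextflow is started.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; not part of the pipeline.

# Print the space available (in KiB) on the filesystem holding directory $1.
# Relies on POSIX `df -P` output, where column 4 is "Available".
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Return 0 if directory $1 has at least $2 TiB free, 1 otherwise.
require_free_tb() {
    local dir=$1 min_tb=$2
    local have_kb need_kb
    have_kb=$(avail_kb "$dir") || return 1
    [ -n "$have_kb" ] || return 1
    need_kb=$(( min_tb * 1024 * 1024 * 1024 ))
    if [ "$have_kb" -lt "$need_kb" ]; then
        echo "Only $(( have_kb / 1024 / 1024 )) GiB free in $dir; wanted ${min_tb} TiB" >&2
        return 1
    fi
}
```

For example, require_free_tb /path/to/scratch 5 && nextflow run main.nf ... would launch only if roughly 5 TiB is free. The threshold here is an arbitrary illustration; the right value depends on the size of the run.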

Warning

Because intermediate files are removed on task completion, -resume may not skip many steps once variant calling is underway. It is most useful for resuming after the reference indexing stage.

Keep the Nextflow process alive in a tmux or screen session. Even with -profile slurm, Nextflow itself only submits and monitors the jobs, so the head process must remain running until all jobs complete.

SLURM time limits

All SLURM tasks request a 7-day time limit by default. If your cluster enforces shorter queue limits, edit the time directive in the relevant module files (e.g. modules/bwa_mapping.nf).
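
To see which module files set a time limit before editing them, a recursive grep is usually enough. This sketch assumes the directive is written as time at the start of a line in .nf files under modules/, which the text above suggests but does not guarantee. The function name list_time_directives is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every line that looks like a Nextflow `time`
# directive in the .nf files under a directory (default: modules/).
list_time_directives() {
    local dir=${1:-modules}
    # -r recurse, -n print line numbers, -E extended regular expressions.
    # Matches lines such as:   time '7d'   or   time = '168h'
    grep -rnE --include='*.nf' '^[[:space:]]*time([[:space:]]|=)' "$dir"
}
```

Running list_time_directives from the pipeline root prints file:line:content for each match, which shows where the 7-day default would need to be lowered.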

Singularity / Apptainer

Image cache — by default, images are pulled to $HOME/.singularity/cache. Worker nodes must have read access to this path. If /home is not shared across nodes, set cacheDir in the singularity {} block of nextflow.config to a path on a shared filesystem:

singularity {
    cacheDir = "/scratch/$USER/.singularity/cache"
}

Bind mounts — on clusters running Apptainer (the successor to Singularity), automatic bind-mounting can be disabled by the sysadmin via /etc/apptainer/apptainer.conf (mount hostfs = no). If tasks fail with "file not found" errors inside the container, add explicit bind paths in nextflow.config:

singularity.runOptions = "--bind /scratch,/data"
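
Binding a path that does not exist on a node can make the container fail to start, so it may help to build the option string from directories that are actually present. This is a minimal sketch: build_bind_opts is a hypothetical helper, and /scratch and /data are only the example paths used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a `--bind a,b,c` option containing only those
# directories (from the arguments) that exist on this machine.
build_bind_opts() {
    local paths="" p
    for p in "$@"; do
        if [ -d "$p" ]; then
            # Append with a comma separator once the list is non-empty.
            paths="${paths:+$paths,}$p"
        fi
    done
    # Print nothing when no candidate directory exists.
    [ -n "$paths" ] && printf -- '--bind %s\n' "$paths"
    return 0
}
```

Run build_bind_opts /scratch /data on a compute node rather than the login node, since the two may mount different filesystems. The printed string can then be copied into singularity.runOptions.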

Concurrency and rate limits

The following maxForks settings in nextflow.config prevent overloading external connections and storage capabilities:

| Process | Default maxForks | Notes |
| --- | --- | --- |
| SRA download | 10 | NCBI will throttle or block connections with too many parallel downloads (PE and SE separately) |
| fastp trimming | 20 | Very I/O-intensive; reduce if storage I/O is a bottleneck (PE and SE separately) |
| GATK HaplotypeCaller | 150 | High parallelism; reduce if the scheduler struggles under load |

The executor is also capped at 300 queued tasks and rate-limited to 240 submissions per minute (executor.queueSize and executor.submitRateLimit, respectively).


Utilities

Download SRA files manually

If the pipeline's SRA download step fails with connection resets or other SRA-side errors, download the files manually with fastq-dump (part of sra-tools) and pass them to the pipeline with --reads instead:

# Install sra-tools via micromamba
micromamba install -c bioconda sra-tools

# Create a download script (one accession per fastq-dump call)
cat > SRA_download.sh << 'EOF'
#!/bin/bash
fastq-dump --split-files --gzip SRR24910574
fastq-dump --split-files --gzip SRR24910575
fastq-dump --split-files --gzip SRR25074049
EOF

# Parallelise the download (limit to 10 concurrent)
parallel -j 10 < SRA_download.sh

Tip

Do not exceed ~10 parallel downloads; NCBI will stall the connection.
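
Interrupted downloads can leave accessions missing or incomplete, so a quick inventory before starting the pipeline may save a failed run. The sketch below assumes the file names that fastq-dump --split-files --gzip normally writes (ACCESSION_1.fastq.gz and, for paired-end runs, ACCESSION_2.fastq.gz). The function name check_downloads is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for each accession, report whether the files written by
# `fastq-dump --split-files --gzip` are present and non-empty in directory $1.
# Prints one line per accession: "<accession> paired|single|missing".
check_downloads() {
    local dir=$1 acc
    shift
    for acc in "$@"; do
        if [ -s "$dir/${acc}_1.fastq.gz" ] && [ -s "$dir/${acc}_2.fastq.gz" ]; then
            echo "$acc paired"
        elif [ -s "$dir/${acc}_1.fastq.gz" ]; then
            echo "$acc single"
        else
            echo "$acc missing"
        fi
    done
}
```

For instance, check_downloads . SRR24910574 SRR24910575 SRR25074049 lists the state of the three accessions from the script above. The check only looks at file presence and size; gzip -t can additionally be run on each file to catch truncated archives.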

Then run the pipeline pointing to the downloaded files:

nextflow run main.nf ... --reads '/path/to/downloads/*{1,2}.fastq.gz'

Rename samples in the final VCF

Custom sample names can also be defined up front through the pipeline's configuration options, which avoids renaming afterwards.

If you need to change sample names in the final VCF, use bcftools reheader with a two-column whitespace-delimited lookup table:

bcftools reheader --samples id_lookup.txt -o output_reheadered.vcf.gz input.vcf.gz

Format of id_lookup.txt (whitespace-delimited, one old/new mapping per line):

oldname1 newname1
oldname2 newname2
oldname3 newname3
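
When many samples share a common suffix, the lookup table can be generated rather than typed. The sketch below is one possible approach, not something the pipeline provides: make_id_lookup is a hypothetical helper, and the _RG_dedup suffix in the usage note is an assumption based on the BAM naming used elsewhere on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read sample names (one per line) on stdin and print
# "old new" pairs, where new is old with the trailing suffix $1 removed.
# Names that do not end in the suffix are skipped and therefore stay unchanged.
make_id_lookup() {
    local suffix=$1 old
    while IFS= read -r old; do
        case $old in
            # ?* requires at least one character before the suffix,
            # so the new name can never be empty.
            ?*"$suffix") printf '%s %s\n' "$old" "${old%"$suffix"}" ;;
        esac
    done
}
```

The current sample names can be listed with bcftools query -l, so bcftools query -l input.vcf.gz | make_id_lookup _RG_dedup > id_lookup.txt would write the table. Sample names containing whitespace would not work with this format.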

Use a BAM entry point for existing data

If you have already-processed BAM files (e.g., from a previous pipeline run), you can skip all read-processing steps:

nextflow run main.nf -config nextflow.config \
  --reference $REF --ploidy 1 \
  --bam_input '/path/to/bams/*_RG_dedup.bam'

Requirements for input BAMs:

  • Coordinate-sorted (samtools sort, which sorts by coordinate by default, or equivalent)
  • @RG read group in the header (picard AddOrReplaceReadGroups)
  • .bai index in the same directory (samtools index)
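
The first two requirements can be checked from the BAM header before launching. This is a minimal sketch under the assumption that the sorting tool recorded the sort order in the @HD line, as samtools sort does. The function name check_bam_header is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a SAM/BAM header (text) on stdin and check two of
# the requirements above. Prints one line per problem; returns 1 if any found.
check_bam_header() {
    local header status=0
    header=$(cat)
    # Coordinate-sorted files carry SO:coordinate on the @HD line.
    if ! printf '%s\n' "$header" | grep -qE '^@HD.*[[:space:]]SO:coordinate'; then
        echo "not marked as coordinate-sorted (no SO:coordinate in @HD)"
        status=1
    fi
    # At least one @RG line must be present.
    if ! printf '%s\n' "$header" | grep -q '^@RG'; then
        echo "no @RG read group line"
        status=1
    fi
    return "$status"
}
```

A typical call would be samtools view -H sample.bam | check_bam_header, where sample.bam is a placeholder file name. The index can be checked separately with a plain file test such as [ -e sample.bam.bai ].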