HPC usage & utilities

Notes on HPC execution

Storage requirements

The pipeline aggressively deletes intermediate files after each step to minimise disk usage (cleanup = true in nextflow.config). Despite this, large runs can temporarily require many TB of scratch space. Always point -work-dir to a fast, high-capacity scratch filesystem:

nextflow run main.nf ... -work-dir '/path/to/scratch/genomepanel_work'

Warning

Because intermediate files are removed on task completion, -resume may not skip many steps once variant calling is underway. It is most useful for resuming after the reference indexing stage.

Memory for the Nextflow process itself

For large runs, the Java VM managing Nextflow can require significant memory. Set the heap size before running:

export NXF_OPTS='-Xms8g -Xmx64g'

Keep the Nextflow process alive in a tmux or screen session — even when using --profile slurm, the Nextflow process must remain running until all jobs complete.

SLURM time limits

All SLURM tasks request a 7-day time limit by default. If your cluster enforces shorter queue limits, edit the time directive in the relevant module files (e.g. modules/bwa_mapping.nf).
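As an alternative to editing every module, time limits can also be overridden from a small custom config passed with -c. The sketch below is an assumption, not taken from the pipeline: the 48h/24h values and the BWA_MAPPING process name are placeholders you should replace after checking the module files.

```groovy
// custom_time.config -- pass with: nextflow run main.nf -c custom_time.config ...
process {
    time = '48h'                 // assumed cluster-wide cap; match your queue limit
    withName: 'BWA_MAPPING' {    // hypothetical process name; check the module files
        time = '24h'
    }
}
```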

Concurrency and rate limits

The following maxForks settings in nextflow.config prevent overloading storage and compute:

Process              | Default maxForks | Notes
---------------------|------------------|-----------------------------------------------------------------
SRA download         | 10               | NCBI will throttle or block connections with too many parallel downloads (PE and SE separately)
fastp trimming       | 20               | Very I/O-intensive; reduce if storage I/O is a bottleneck (PE and SE separately)
GATK HaplotypeCaller | 150              | High parallelism; reduce if the scheduler struggles under load

The executor is also rate-limited at 300 queued tasks and 120 submissions per minute (executor.queueSize, executor.submitRateLimit).
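In nextflow.config, that rate limiting corresponds to a block along these lines (a sketch reflecting the values stated above; adjust to your scheduler's tolerance):

```groovy
executor {
    queueSize       = 300         // max tasks queued with the scheduler at once
    submitRateLimit = '120/1min'  // at most 120 job submissions per minute
}
```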


Utilities

Download SRA files manually

If sra-tools encounters connection resets or other SRA-side errors, download files manually with fastq-dump and use --reads instead:

# Install sra-tools via micromamba
micromamba install -c bioconda sra-tools

# Create a download script (one accession per fastq-dump call)
cat > SRA_download.sh << 'EOF'
#!/bin/bash
fastq-dump --split-files --gzip SRR24910574
fastq-dump --split-files --gzip SRR24910575
fastq-dump --split-files --gzip SRR25074049
EOF

# Parallelise the download (limit to 10 concurrent)
parallel -j 10 < SRA_download.sh

Tip

Do not exceed ~10 parallel downloads; NCBI will stall the connection.
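For larger cohorts, the download script can be generated from an accession list instead of written by hand. In this sketch, accessions.txt is a hypothetical file you create, one accession per line:

```shell
# accessions.txt is a hypothetical list file, one SRA accession per line
printf 'SRR24910574\nSRR24910575\nSRR25074049\n' > accessions.txt

# Prefix every accession with the fastq-dump invocation to build the script
sed 's/^/fastq-dump --split-files --gzip /' accessions.txt > SRA_download.sh
```

The result is identical to the hand-written SRA_download.sh above and can be fed to parallel -j 10 as before.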

Then run the pipeline pointing to the downloaded files:

nextflow run main.nf ... --reads '/path/to/downloads/*{1,2}.fastq.gz'

Rename samples in the final VCF

Use bcftools reheader with a two-column whitespace-delimited lookup table:

bcftools reheader --samples id_lookup.txt -o output_reheadered.vcf.gz input.vcf.gz

Format of id_lookup.txt (space-delimited, one mapping per line):

oldname1 newname1
oldname2 newname2
oldname3 newname3
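One way to build id_lookup.txt is to pair the VCF's current sample names with a list of new names. The printf lines below are stand-ins for real name lists; with bcftools installed, the old names would come from bcftools query -l:

```shell
# Stand-in name lists; in practice the old names come from:
#   bcftools query -l input.vcf.gz > old_names.txt
printf 'oldname1\noldname2\noldname3\n' > old_names.txt
printf 'newname1\nnewname2\nnewname3\n' > new_names.txt

# Join the two lists column-wise with a single space per line
paste -d' ' old_names.txt new_names.txt > id_lookup.txt
```

Keep new_names.txt in the same order as the old names, since paste pairs lines positionally.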

Use a BAM entry point for existing data

If you have already-processed BAM files (e.g., from a previous pipeline run), you can skip all read-processing steps:

nextflow run main.nf -config nextflow.config -profile slurm \
  --reference $REF --ploidy 1 \
  --bam_input '/path/to/bams/*_RG_dedup.bam'

Requirements for input BAMs:

  • Coordinate-sorted (use samtools sort -o or equivalent)
  • @RG read group in the header (picard AddOrReplaceReadGroups)
  • .bai index in the same directory (samtools index)