HPC usage & utilities¶
Notes on HPC execution¶
Storage requirements¶
The pipeline aggressively deletes intermediate files after each step to minimise disk usage (cleanup = true in nextflow.config). Despite this, large runs can temporarily require many TB of scratch space. Always point -work-dir to a fast, high-capacity scratch filesystem:
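For example, a launch on a SLURM cluster might look like this (the scratch path is illustrative; the other options follow the examples elsewhere on this page):

```shell
# Keep the work directory on fast scratch storage, not on home or project space
nextflow run main.nf -config nextflow.config -profile slurm \
    --reference $REF --ploidy 1 \
    -work-dir /scratch/$USER/nextflow_work
```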
Warning
Because intermediate files are removed on task completion, -resume may not skip many steps once variant calling is underway. It is most useful for resuming after the reference indexing stage.
Memory for the Nextflow process itself¶
For large runs, the Java VM managing Nextflow can require significant memory. Set the heap size before running:
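Nextflow reads JVM options from the `NXF_OPTS` environment variable; the heap values below are only a suggestion for large runs:

```shell
# Give the Nextflow head process a larger Java heap (values are illustrative)
export NXF_OPTS='-Xms500m -Xmx8g'
nextflow run main.nf -config nextflow.config -profile slurm \
    --reference $REF --ploidy 1
```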
Keep the Nextflow process alive in a tmux or screen session — even when using --profile slurm, the Nextflow process must remain running until all jobs complete.
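A minimal tmux workflow (the session name is arbitrary):

```shell
tmux new -s nf_run    # start a named session, then launch the pipeline inside it
nextflow run main.nf -config nextflow.config -profile slurm \
    --reference $REF --ploidy 1
# Detach with Ctrl-b d; reattach later with:
tmux attach -t nf_run
```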
SLURM time limits¶
All SLURM tasks request a 7-day time limit by default. If your cluster enforces shorter queue limits, edit the time directive in the relevant module files (e.g. modules/bwa_mapping.nf).
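As a sketch, the `time` directive inside a module's process block could be lowered like this (the process name follows the example module above; the replacement limit is illustrative):

```groovy
// modules/bwa_mapping.nf
process BWA_MAPPING {
    time '48h'   // lowered from the 7-day default to fit the queue limit
    // ... remainder of the process definition unchanged ...
}
```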
Concurrency and rate limits¶
The following maxForks settings in nextflow.config prevent overloading storage and compute:
| Process | Default maxForks | Notes |
|---|---|---|
| SRA download | 10 | NCBI will throttle or block connections with too many parallel downloads (PE and SE separately) |
| fastp trimming | 20 | Very I/O-intensive; reduce if storage I/O is a bottleneck (PE and SE separately) |
| GATK HaplotypeCaller | 150 | High parallelism; reduce if the scheduler struggles under load |
The executor is also rate-limited at 300 queued tasks and 120 submissions per minute (executor.queueSize, executor.submitRateLimit).
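In nextflow.config these limits look roughly like the following (the `withName` selectors are hypothetical; the numeric values match the table and text above):

```groovy
process {
    withName: 'SRA_DOWNLOAD.*'       { maxForks = 10 }   // selector names are illustrative
    withName: 'FASTP_TRIM.*'         { maxForks = 20 }
    withName: 'GATK_HAPLOTYPECALLER' { maxForks = 150 }
}

executor {
    queueSize       = 300          // max tasks queued at once
    submitRateLimit = '120/1min'   // max submissions per minute
}
```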
Utilities¶
Download SRA files manually¶
If sra-tools encounters connection resets or other SRA-side errors, download files manually with fastq-dump and use --reads instead:
```shell
# Install sra-tools via micromamba
micromamba install -c bioconda sra-tools

# Create a download script (one accession per fastq-dump call)
cat > SRA_download.sh << 'EOF'
#!/bin/bash
fastq-dump --split-files --gzip SRR24910574
fastq-dump --split-files --gzip SRR24910575
fastq-dump --split-files --gzip SRR25074049
EOF

# Parallelise the download (limit to 10 concurrent)
parallel -j 10 < SRA_download.sh
```
Tip
Do not exceed ~10 parallel downloads; NCBI will stall the connection.
Then run the pipeline pointing to the downloaded files:
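For example (the glob pattern assumes fastq-dump's default `_1`/`_2` suffixes; the path is illustrative):

```shell
nextflow run main.nf -config nextflow.config -profile slurm \
    --reference $REF --ploidy 1 \
    --reads '/path/to/downloads/*_{1,2}.fastq.gz'
```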
Rename samples in the final VCF¶
Use bcftools reheader with a two-column whitespace-delimited lookup table:
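A typical invocation (file names are illustrative):

```shell
# Replace sample IDs according to the lookup table, then re-index
bcftools reheader --samples id_lookup.txt -o renamed.vcf.gz final.vcf.gz
tabix -p vcf renamed.vcf.gz
```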
Format of id_lookup.txt (space-delimited, one mapping per line):
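For example, mapping SRA accessions to readable sample names (the names on the right are hypothetical):

```
SRR24910574 isolate_A
SRR24910575 isolate_B
SRR25074049 isolate_C
```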
Use a BAM entry point for existing data¶
If you have already-processed BAM files (e.g., from a previous pipeline run), you can skip all read-processing steps:
```shell
nextflow run main.nf -config nextflow.config -profile slurm \
    --reference $REF --ploidy 1 \
    --bam_input '/path/to/bams/*_RG_dedup.bam'
```
Requirements for input BAMs:

- Coordinate-sorted (use `samtools sort -o` or equivalent)
- `@RG` read group in the header (`picard AddOrReplaceReadGroups`)
- `.bai` index in the same directory (`samtools index`)
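A quick way to check these requirements on an existing BAM (the filename is illustrative):

```shell
samtools view -H sample_RG_dedup.bam | grep -m1 '^@RG'           # read group present?
samtools view -H sample_RG_dedup.bam | grep -m1 'SO:coordinate'  # coordinate-sorted?
ls sample_RG_dedup.bam.bai                                       # index alongside the BAM?
```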