BAM File Format
BAM is the compressed binary form of SAM, used for aligned sequencing reads.
TL;DR
- BAM is binary/compressed SAM, so it is smaller and faster to process at scale.
- Coordinate-sorted BAM plus index (.bai or .csi) enables fast random-access region queries.
- Core SAM fields (especially CIGAR) describe how each read aligns to the reference.
- Optional tags carry rich metadata, including alignment metrics (NM, MD, AS) and modified bases (MM, ML).
- RG tags on reads map to header read groups, where SM defines the sample name used by many downstream tools.
- samtools is the standard toolkit for sorting, indexing, filtering, and summary stats.
Structure
A BAM file contains:
- Header (SAM-style metadata, binary encoded)
- Alignment records (one per read/alignment)
Conceptually, BAM represents the same fields as SAM (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, etc.), but in compressed binary form.
Minimal SAM-style example:
@HD VN:1.6 SO:coordinate
@SQ SN:chr1 LN:248956422
read0001 99 chr1 10001 60 10M1I15M2D24M = 10120 180 ACGTT... IIII... NM:i:3
- @HD and @SQ are header lines (format version, sort order, reference dictionary).
- The alignment line is tab-delimited and includes core fields plus optional tags (for example NM:i:3).
CIGAR quick explanation
The CIGAR string encodes how the read aligns to the reference.
- M: alignment match/mismatch block
- I: insertion in the read relative to the reference
- D: deletion in the read relative to the reference (reference bases with no corresponding read bases)
- S: soft clipping (bases present in the read, not aligned)
- H: hard clipping (bases removed from the stored sequence)
- N: skipped region on the reference (common in RNA-seq introns)
For 10M1I15M2D24M:
- 10 aligned bases, then 1 inserted base, then 15 aligned bases, then 2 reference bases deleted from the read, then 24 aligned bases. The read contributes 50 bases and the alignment spans 51 reference positions.
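As a rough illustration of how each operation consumes read and reference bases, the sketch below tallies both lengths for a CIGAR string with grep and awk. The function name cigar_lengths is made up for this example. It also counts = and X, which the SAM specification defines as explicit sequence match and mismatch.

```shell
# Hypothetical helper: print "query_len ref_len" for a CIGAR string.
# M, =, X consume both read and reference; I and S consume the read only;
# D and N consume the reference only; H and P consume neither.
cigar_lengths() {
  printf '%s\n' "$1" |
    grep -oE '[0-9]+[MIDNSHP=X]' |
    awk '{
      op = substr($0, length($0), 1)       # last character is the operation
      n  = substr($0, 1, length($0) - 1)   # leading digits are the length
      if (op ~ /[MIS=X]/) q += n           # read (query) bases
      if (op ~ /[MDN=X]/) r += n           # reference bases
    } END { print q + 0, r + 0 }'
}

cigar_lengths 10M1I15M2D24M   # prints: 50 51
```

The reference length is what determines the end coordinate of an alignment: a read starting at POS with reference length r covers POS through POS + r - 1.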
Modified bases in SAM/BAM tags
Modified bases are typically stored in optional tags, most commonly:
- MM: modification type + positions (delta encoded)
- ML: per-site modification probabilities (byte scale, 0-255)
Example (schematic):
read0002 0 chr1 20501 50 30M * 0 0 ACGTC... IIII... MM:Z:C+m,5,12; ML:B:C,220,180
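- C+m means 5mC calls on cytosines.
- The position list (5,12) is delta encoded: each number is how many cytosines to skip before the next call, so here 5 cytosines are skipped, the 6th is called, 12 more are skipped, and the next one is called.
- ML values (here 220,180) are probabilities scaled to 0-255, one per MM call in the same order (roughly 0.86 and 0.70 here).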
Exact interpretation can vary by basecaller/tool version, so always confirm against your caller’s spec when parsing modified-base tags.
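To make the delta encoding concrete, here is a minimal awk-based sketch that turns MM skip counts into 1-based read positions and ML bytes into probabilities. The helper names (mm_positions, ml_to_prob) and the toy sequence are invented for illustration. It assumes a forward-strand read and a single modification type, with the deltas passed as the bare comma-separated list (no C+m prefix or trailing semicolon). For reverse-strand alignments the counts refer to the originally sequenced orientation, so the stored sequence would need reverse-complementing first.

```shell
# Hypothetical helper: resolve MM skip counts to 1-based read positions.
# Usage: mm_positions SEQ BASE DELTAS   (DELTAS is a comma-separated list)
mm_positions() {
  awk -v seq="$1" -v base="$2" -v deltas="$3" 'BEGIN {
    n = split(deltas, skip, ",")
    k = 1; left = skip[1]; out = ""
    for (i = 1; i <= length(seq) && k <= n; i++) {
      if (substr(seq, i, 1) != base) continue
      if (left > 0) { left--; continue }      # still skipping this base type
      out = out (out == "" ? "" : " ") i      # this occurrence is a call
      k++; left = skip[k]
    }
    print out
  }'
}

# ML bytes map to probabilities in 1/256 steps; report the bin midpoint.
ml_to_prob() {
  awk -v v="$1" 'BEGIN { printf "%.3f\n", (v + 0.5) / 256 }'
}

# Toy read with cytosines at positions 2, 3, 5 and 7 (made-up sequence).
mm_positions ACCGCTCA C 1,0   # prints: 3 5
ml_to_prob 220                # prints: 0.861
ml_to_prob 180                # prints: 0.705
```

In the toy call, the first C is skipped and the second is called, then zero are skipped and the third is called.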
RG and SM tags (sample metadata)
- RG:Z:<id> appears on alignment records and points to a read group ID.
- SM:<sample_name> appears in header @RG lines and defines the biological sample for that read group.
Minimal header + record example:
@RG ID:rg001 SM:NA12878 PL:ILLUMINA
read0003 99 chr1 30001 60 50M = 30100 149 ACGT... IIII... RG:Z:rg001
In practice, many downstream tools use SM for per-sample aggregation and RG for lane/library/platform-aware processing (for example duplicate marking and BQSR).
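One way to see the RG-to-SM mapping a file actually carries is to pull the ID and SM fields out of the @RG header lines. The sketch below does this with awk on header text read from stdin. The function name rg_to_sample is hypothetical, and the commented samtools line assumes samtools is installed and sample.bam exists.

```shell
# Hypothetical helper: read SAM header text on stdin, print "RG_ID<TAB>SM".
rg_to_sample() {
  awk -F '\t' '$1 == "@RG" {
    id = ""; sm = ""
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^ID:/) id = substr($i, 4)
      if ($i ~ /^SM:/) sm = substr($i, 4)
    }
    print id "\t" sm
  }'
}

# Typical use (assumes samtools is installed and sample.bam exists):
#   samtools view -H sample.bam | rg_to_sample
# Self-contained demo using the header line from the example above:
printf '@RG\tID:rg001\tSM:NA12878\tPL:ILLUMINA\n' | rg_to_sample
```

An empty second column flags a read group without SM, which some per-sample tools reject.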
Practical Conventions
- BAM files are usually coordinate-sorted for downstream analysis workflows.
- Indexed BAM (samtools index) is expected by genome browsers and region-based tools.
- Headers often include @SQ (reference names/lengths), @RG (read groups), and @PG (pipeline provenance).
- Large or highly fragmented references may use .csi indexes instead of .bai (.bai cannot index contigs longer than 2^29 bp, about 537 million bases).
Header sanity checks
Before downstream analysis, header inspection is a fast sanity check for common integration issues.
- @HD confirms sort order (SO:coordinate expected in many workflows).
- @SQ verifies contig names and lengths against your reference FASTA.
- @RG confirms read-group/sample metadata (for example SM) is present.
- @PG provides pipeline provenance (what tools/versions touched the BAM).
# inspect BAM header sections
samtools view -H sample.bam | rg '^@HD|^@SQ|^@RG|^@PG'
# quick check for coordinate-sorted header
samtools view -H sample.bam | rg '^@HD'
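The @SQ check against the reference can be scripted. The sketch below compares contig names and lengths from a saved header against a FASTA index (.fai, whose first two columns are name and length). The function name check_sq_against_fai and the file names are placeholders, and the commented setup lines assume samtools is available.

```shell
# Hypothetical helper: compare @SQ lines (header text file) with a .fai index.
# Prints one line per contig that is missing from the .fai or differs in length.
# Usage: check_sq_against_fai header.sam ref.fa.fai
check_sq_against_fai() {
  awk -F '\t' '
    NR == FNR { len[$1] = $2; next }            # first file: .fai (name, length)
    $1 == "@SQ" {
      sn = ""; ln = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^SN:/) sn = substr($i, 4)
        if ($i ~ /^LN:/) ln = substr($i, 4)
      }
      if (!(sn in len))        print "MISSING\t" sn
      else if (len[sn] != ln)  print "LENGTH\t" sn "\t" ln "\t" len[sn]
    }
  ' "$2" "$1"
}

# Typical setup (assumes samtools is installed; file names are placeholders):
#   samtools view -H sample.bam > header.sam
#   samtools faidx ref.fa            # writes ref.fa.fai
#   check_sq_against_fai header.sam ref.fa.fai
```

No output means every @SQ contig was found in the index with a matching length.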
Why indexed BAM is useful
- Fast random access to specific loci/regions without scanning the entire file.
- Lower I/O and compute for region-based QC, visualization, and variant workflows.
- Enables interactive browsing in tools like IGV/JBrowse with near-instant jumps.
- Essential for cloud/remote workflows where minimizing transferred bytes matters.
Common Pitfalls
- Using an unsorted BAM where a sorted BAM is required (causes tool errors or wrong assumptions).
- Missing or stale index after replacing/updating a BAM file.
- Header/reference mismatch between BAM and reference FASTA used downstream.
- Unexpected duplicate/secondary/supplementary alignments when counting reads naively.
- Ignoring mapping quality (MAPQ) or duplicate flags in variant/QC analyses.
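Several of these pitfalls come down to FLAG bits. The sketch below decodes a FLAG value into named bits using shell arithmetic, following the bit order in the SAM specification. The function name decode_flag and the short bit labels are made up for this example. samtools also ships a flags subcommand for the same kind of conversion.

```shell
# Hypothetical helper: list the SAM flag bits set in a decimal or hex FLAG.
# Names follow bit order 0x1, 0x2, 0x4, ... 0x800 from the SAM specification.
decode_flag() {
  flag=$(( $1 ))
  names="paired proper_pair unmapped mate_unmapped reverse mate_reverse read1 read2 secondary qcfail duplicate supplementary"
  bit=1; out=""
  for name in $names; do
    if [ $(( flag & bit )) -ne 0 ]; then out="${out:+$out }$name"; fi
    bit=$(( bit * 2 ))
  done
  printf '%s\n' "$out"
}

decode_flag 99      # prints: paired proper_pair mate_reverse read1
decode_flag 0x904   # prints: unmapped secondary supplementary
```

The second call shows why a 0x904 exclusion mask leaves only mapped primary alignments: it removes unmapped, secondary, and supplementary records in one pass.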
Common Uses
- Read alignment storage
- Variant calling workflows
- Coverage and mapping quality analysis
Useful BAM Tools
samtools view
Inspect, filter, and convert between BAM/SAM/CRAM.
# quick header + first alignments (SAM text view)
samtools view -h sample.bam | head
# keep mapped primary alignments with MAPQ >= 20
samtools view -b -F 0x904 -q 20 sample.bam > sample.mapq20.primary.bam
samtools sort
Sort BAM records (typically by coordinate).
samtools sort -o sample.sorted.bam sample.bam
samtools index
Build BAM index for random access.
samtools index sample.sorted.bam
samtools flagstat
Generate quick mapping/alignment summary statistics.
samtools flagstat sample.sorted.bam
cramino
Fast long-read alignment QC summaries (commonly used with ONT BAM/CRAM files).
# generate a basic cramino report from a BAM
cramino sample.sorted.bam > sample.cramino.tsv
nanocov
Per-base coverage statistics and plots from BAM files.
# basic run
nanocov \
--input sample.sorted.bam \
--output-dir nanocov_out
# coverage on target regions from BED
nanocov \
--input sample.sorted.bam \
--bed targets.bed \
--output-dir nanocov_targets \
--prefix sample1
2026-03-06 19:00 (Last updated: 2026-03-11 02:45)