BAM is the compressed binary form of SAM, used for aligned sequencing reads.

TL;DR

  • BAM is binary/compressed SAM, so it is smaller and faster to process at scale.
  • Coordinate-sorted BAM plus index (.bai or .csi) enables fast random-access region queries.
  • Core SAM fields (especially CIGAR) describe how each read aligns to the reference.
  • Optional tags carry rich metadata, including alignment metrics (NM, MD, AS) and modified bases (MM, ML).
  • RG tags on reads map to header read groups, where SM defines the sample name used by many downstream tools.
  • samtools is the standard toolkit for sorting, indexing, filtering, and summary stats.

Structure

A BAM file contains:

  1. Header (SAM-style metadata text plus a binary reference dictionary)
  2. Alignment records (one per read/alignment)

Conceptually, BAM represents the same fields as SAM (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, etc.), but in compressed binary form.

Minimal SAM-style example:

@HD	VN:1.6	SO:coordinate
@SQ	SN:chr1	LN:248956422
read0001	99	chr1	10001	60	10M1I15M2D24M	=	10120	180	ACGTT...	IIII...	NM:i:3

  • @HD and @SQ are header lines (format version, sort order, reference dictionary).
  • The alignment line is tab-delimited and includes core fields plus optional tags (for example NM:i:3).
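
As a rough illustration of that layout, the sketch below splits one SAM alignment line into its eleven mandatory fields plus optional tags and decodes the FLAG value into named bits. The helper names (parse_sam_line, decode_flag) are hypothetical, and the SEQ/QUAL strings are shortened placeholders because the example record elides them. The bit values follow the SAM specification.

```python
# Bit values come from the SAM specification; the helper names are illustrative only.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse", 0x20: "mate_reverse", 0x40: "read1", 0x80: "read2",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate", 0x800: "supplementary",
}

CORE_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_sam_line(line):
    """Split a tab-delimited SAM record into core fields and a tag dict."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(CORE_FIELDS, cols[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    tags = {}
    for item in cols[11:]:
        tag, typ, value = item.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    record["tags"] = tags
    return record


# SEQ and QUAL are placeholders; the original example truncates them.
line = "read0001\t99\tchr1\t10001\t60\t10M1I15M2D24M\t=\t10120\t180\tACGTT\tIIIII\tNM:i:3"
rec = parse_sam_line(line)
print(rec["CIGAR"], rec["tags"]["NM"])
print(decode_flag(rec["FLAG"]))
```

FLAG 99 is 1 + 2 + 32 + 64, so the record is paired, in a proper pair, has its mate on the reverse strand, and is the first read of the pair.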

CIGAR quick explanation

The CIGAR string encodes how the read aligns to the reference.

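  • M: aligned block (bases may match or mismatch the reference)
  • I: insertion in read relative to reference (bases present in the read, absent from the reference)
  • D: deletion in read relative to reference (bases present in the reference, absent from the read)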
  • S: soft clipping (bases present in read, not aligned)
  • H: hard clipping (bases removed from stored sequence)
  • N: skipped region on reference (common in RNA-seq introns)

For 10M1I15M2D24M:

  • 10 aligned bases, then 1 base inserted in the read, then 15 aligned bases, then 2 reference bases deleted from the read, then 24 aligned bases. The read therefore contributes 50 bases and spans 51 reference positions.
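
The bookkeeping behind that walk-through can be sketched as follows. This is a minimal example, not a reference implementation: the function name cigar_lengths is hypothetical, and which operators consume read versus reference bases is taken from the SAM specification.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")   # operators that advance along the read
CONSUMES_REF = set("MDN=X")     # operators that advance along the reference


def cigar_lengths(cigar):
    """Return (read_bases, reference_span) implied by a CIGAR string."""
    ops = CIGAR_RE.findall(cigar)
    # Reject strings with characters the pattern did not account for (e.g. "*").
    if "".join(n + op for n, op in ops) != cigar:
        raise ValueError(f"unparseable CIGAR: {cigar!r}")
    query_len = sum(int(n) for n, op in ops if op in CONSUMES_QUERY)
    ref_len = sum(int(n) for n, op in ops if op in CONSUMES_REF)
    return query_len, ref_len


query_len, ref_len = cigar_lengths("10M1I15M2D24M")
pos = 10001                      # 1-based POS from the example record
last_ref_pos = pos + ref_len - 1
print(query_len, ref_len, last_ref_pos)
```

With POS 10001 and a reference span of 51, the last aligned reference position is 10051.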

Modified bases in SAM/BAM tags

Modified bases are typically stored in optional tags, most commonly:

  • MM: modification type + positions (delta encoded)
  • ML: per-site modification probabilities (byte scale)

Example (schematic):

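read0002	0	chr1	20501	50	30M	*	0	0	ACGTC...	IIII...	MM:Z:C+m,5,12;	ML:B:C,220,180

  • C+m means 5-methylcytosine (5mC) calls on C bases; the + indicates the strand as sequenced.
  • The numbers after the code (5,12) are delta-encoded skip counts, not absolute offsets: skip 5 C bases and the next C is called, then skip 12 more C bases and the next C is called.
  • ML values (here 220,180) are 0-255 scaled probabilities, one per MM call in order. A value N covers the interval N/256 to (N+1)/256, so 220 is about 0.86 and 180 is about 0.70.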

Exact interpretation can vary by basecaller/tool version, so always confirm against your caller’s spec when parsing modified-base tags.
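
To make the delta encoding concrete, here is a minimal decoding sketch under several stated assumptions: the function name decode_mm_ml is hypothetical, each MM entry carries a single modification code, the read maps to the forward strand (as with FLAG 0 in the example), and the toy sequence is invented because the example SEQ is truncated. Probabilities are reported as the midpoint of each ML interval, which is one possible convention.

```python
def decode_mm_ml(mm, ml, seq):
    """Decode MM/ML into (base, strand, code, [(read_pos, prob), ...]) per entry.

    Positions are 0-based offsets into seq. Assumes one modification code per
    MM entry and a forward-strand read.
    """
    probs = iter(ml)
    calls = []
    for entry in mm.rstrip(";").split(";"):
        head, *deltas = entry.split(",")
        head = head.rstrip("?.")          # drop the optional skip-mode suffix
        base, strand, code = head[0], head[1], head[2:]
        # Read positions of every candidate base (every C for "C+m").
        candidates = [i for i, b in enumerate(seq) if base == "N" or b == base]
        idx = -1
        sites = []
        for delta in deltas:
            idx += int(delta) + 1         # skip `delta` candidates, take the next
            sites.append((candidates[idx], (next(probs) + 0.5) / 256))
        calls.append((base, strand, code, sites))
    return calls


toy_seq = "AC" * 20                       # hypothetical read with 20 C bases
for base, strand, code, sites in decode_mm_ml("C+m,5,12;", [220, 180], toy_seq):
    print(base, strand, code, sites)
```

For reverse-strand alignments the MM offsets refer to the read as originally sequenced, so SEQ would need to be reverse-complemented before applying this logic.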

RG and SM tags (sample metadata)

  • RG:Z:<id> appears on alignment records and points to a read group ID.
  • SM:<sample_name> appears in header @RG lines and defines the biological sample for that read group.

Minimal header + record example:

@RG	ID:rg001	SM:NA12878	PL:ILLUMINA
read0003	99	chr1	30001	60	50M	=	30100	149	ACGT...	IIII...	RG:Z:rg001

In practice, many downstream tools use SM for per-sample aggregation and RG for lane/library/platform-aware processing (for example duplicate marking and BQSR).
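
The RG-to-SM lookup can be sketched in a few lines. The function names read_group_samples and sample_for_record are hypothetical, and the SEQ/QUAL strings are shortened placeholders for the truncated values in the example.

```python
def read_group_samples(header_lines):
    """Map each @RG ID to its SM value (None when SM is absent)."""
    samples = {}
    for line in header_lines:
        if not line.startswith("@RG"):
            continue
        fields = dict(f.split(":", 1) for f in line.rstrip("\n").split("\t")[1:])
        samples[fields["ID"]] = fields.get("SM")
    return samples


def sample_for_record(sam_line, samples):
    """Resolve a record's RG:Z tag to a sample name via the header mapping."""
    for tag in sam_line.rstrip("\n").split("\t")[11:]:
        if tag.startswith("RG:Z:"):
            return samples.get(tag[5:])
    return None  # record carries no RG tag


header = ["@RG\tID:rg001\tSM:NA12878\tPL:ILLUMINA"]
record = "read0003\t99\tchr1\t30001\t60\t50M\t=\t30100\t149\tACGT\tIIII\tRG:Z:rg001"
samples = read_group_samples(header)
print(sample_for_record(record, samples))
```

A record without an RG tag, or with an RG that is missing from the header, resolves to None here; whether that should be an error depends on the downstream tool.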

Practical Conventions

  • BAM files are usually coordinate-sorted for downstream analysis workflows.
  • Indexed BAM (samtools index) is expected by genome browsers and region-based tools.
  • Headers often include @SQ (reference names/lengths), @RG (read groups), and @PG (pipeline provenance).
  • References with contigs longer than 2^29 bp (about 537 Mbp) need .csi indexes (samtools index -c), because .bai cannot address coordinates beyond that limit.

Header sanity checks

Before downstream analysis, header inspection is a fast sanity check for common integration issues.

  • @HD confirms sort order (SO:coordinate expected in many workflows).
  • @SQ verifies contig names and lengths against your reference FASTA.
  • @RG confirms read-group/sample metadata (for example SM) is present.
  • @PG provides pipeline provenance (what tools/versions touched the BAM).

# inspect BAM header sections
samtools view -H sample.bam | grep -E '^@(HD|SQ|RG|PG)'

# quick check for coordinate-sorted header (look for SO:coordinate)
samtools view -H sample.bam | grep '^@HD'
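
The same checks can be automated on the header text. The sketch below is one possible approach, with a hypothetical function name (check_header); it assumes the header text was captured from samtools view -H and that fasta_lengths is a name-to-length dict built elsewhere, for example from the first two columns of a .fai file.

```python
def check_header(header_text, fasta_lengths):
    """Return a list of human-readable problems found in a SAM header."""
    problems = []
    sort_order = None
    contigs = {}
    read_groups = []
    for line in header_text.splitlines():
        parts = line.split("\t")
        fields = dict(p.split(":", 1) for p in parts[1:] if ":" in p)
        if parts[0] == "@HD":
            sort_order = fields.get("SO")
        elif parts[0] == "@SQ":
            contigs[fields.get("SN")] = int(fields.get("LN", 0))
        elif parts[0] == "@RG":
            read_groups.append(fields)
    if sort_order != "coordinate":
        problems.append(f"sort order is {sort_order!r}, expected 'coordinate'")
    for name, length in contigs.items():
        if fasta_lengths.get(name) != length:
            problems.append(f"contig {name} (LN={length}) does not match the FASTA")
    if not read_groups:
        problems.append("no @RG lines")
    for rg in read_groups:
        if "SM" not in rg:
            problems.append(f"read group {rg.get('ID')} has no SM")
    return problems


header_text = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@RG\tID:rg001\tSM:NA12878\tPL:ILLUMINA\n"
)
print(check_header(header_text, {"chr1": 248956422}))
```

A contig named chr1 in the BAM and 1 in the FASTA shows up as a mismatch, which is the usual symptom of mixing reference naming conventions.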

Why indexed BAM is useful

  • Fast random access to specific loci/regions without scanning the entire file.
  • Lower I/O and compute for region-based QC, visualization, and variant workflows.
  • Enables interactive browsing in tools like IGV/JBrowse with near-instant jumps.
  • Essential for cloud/remote workflows where minimizing transferred bytes matters.
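
For a feel of how a .bai index narrows a region query, the sketch below is a Python port of the reg2bin function given in the SAM specification. It maps a 0-based, half-open interval to the smallest index bin that fully contains it; a reader then only needs to visit the file chunks registered under the relevant bins. The usage values are derived from the example record earlier in this document and are illustrative only.

```python
def reg2bin(beg, end):
    """Smallest BAI bin containing the 0-based half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)   # 16 kbp bins
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)   # 128 kbp bins
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)    # 1 Mbp bins
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)    # 8 Mbp bins
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)    # 64 Mbp bins
    return 0                                        # whole-contig bin (512 Mbp)


# read0001: POS 10001 (1-based) with a 51 bp reference span -> [10000, 10051)
print(reg2bin(10000, 10051))
# An interval that straddles a 16 kbp boundary falls back to a larger bin.
print(reg2bin(16000, 17000))
```

The top-level bin covers 2^29 bp, which is why contigs longer than that need a .csi index instead.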

Common Pitfalls

  • Using an unsorted BAM where a sorted BAM is required (causes tool errors or wrong assumptions).
  • Missing or stale index after replacing/updating a BAM file.
  • Header/reference mismatch between BAM and reference FASTA used downstream.
  • Counting records naively: duplicate, secondary, and supplementary alignments inflate read counts unless they are filtered by FLAG.
  • Ignoring mapping quality (MAPQ) or duplicate flags in variant/QC analyses.
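
The missing or stale index pitfall can be caught with a simple pre-flight check. The sketch below is a heuristic under stated assumptions: the function name index_status is hypothetical, it looks for the common sidecar names (sample.bam.bai, sample.bam.csi, sample.bai), and it treats an index older than its BAM as stale based on file modification times only.

```python
import os


def index_status(bam_path):
    """Return 'missing', 'stale', or 'ok' for the index next to a BAM file."""
    candidates = [bam_path + ".bai", bam_path + ".csi"]
    if bam_path.endswith(".bam"):
        candidates.append(bam_path[:-4] + ".bai")
    existing = [p for p in candidates if os.path.exists(p)]
    if not existing:
        return "missing"
    newest_index = max(os.path.getmtime(p) for p in existing)
    # An index written before the BAM was last modified is suspicious.
    return "stale" if newest_index < os.path.getmtime(bam_path) else "ok"


# Placeholder path; reports "missing" when no index file is found.
print(index_status("sample.sorted.bam"))
```

Modification times can be misleading after copies or archive extraction, so rebuilding the index with samtools index is the safer fix when in doubt.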

Common Uses

  • Read alignment storage
  • Variant calling workflows
  • Coverage and mapping quality analysis

Useful BAM Tools

samtools view

Inspect, filter, and convert between BAM/SAM/CRAM.

# quick header + first alignments (SAM text view)
samtools view -h sample.bam | head

# keep mapped primary alignments with MAPQ >= 20
samtools view -b -F 0x904 -q 20 sample.bam > sample.mapq20.primary.bam
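
The 0x904 mask in that command is the union of three FLAG bits: unmapped (0x4), secondary (0x100), and supplementary (0x800). The sketch below reproduces the same keep/drop decision in Python to show the logic; the function name keep is hypothetical, and the (FLAG, MAPQ) pairs are made-up values for illustration.

```python
UNMAPPED, SECONDARY, SUPPLEMENTARY = 0x4, 0x100, 0x800
EXCLUDE = UNMAPPED | SECONDARY | SUPPLEMENTARY   # 0x904, as passed to -F


def keep(flag, mapq, exclude=EXCLUDE, min_mapq=20):
    """Mirror `-F exclude -q min_mapq`: no excluded bit set and MAPQ high enough."""
    return (flag & exclude) == 0 and mapq >= min_mapq


# Hypothetical (FLAG, MAPQ) pairs, not taken from a real file.
records = [(99, 60), (147, 60), (2048, 60), (256, 30), (4, 0), (99, 10)]
kept = [r for r in records if keep(*r)]
print(hex(EXCLUDE), kept)
```

Counting only the kept records avoids double-counting reads that also have secondary or supplementary alignments.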

samtools sort

Sort BAM records (typically by coordinate).

samtools sort -o sample.sorted.bam sample.bam

samtools index

Build a BAM index (.bai by default) for random access; the input must be coordinate-sorted.

samtools index sample.sorted.bam

samtools flagstat

Generate quick mapping/alignment summary statistics.

samtools flagstat sample.sorted.bam

cramino

Fast long-read alignment QC summaries (commonly used with ONT BAM/CRAM files).

# generate a basic cramino report from a BAM
cramino sample.sorted.bam > sample.cramino.tsv

nanocov

Per-base coverage statistics and plots from BAM files.

# basic run
nanocov \
  --input sample.sorted.bam \
  --output-dir nanocov_out

# coverage on target regions from BED
nanocov \
  --input sample.sorted.bam \
  --bed targets.bed \
  --output-dir nanocov_targets \
  --prefix sample1