BAM File Format

BAM is the compressed binary form of SAM, used for aligned sequencing reads.

TL;DR

BAM is binary/compressed SAM, so it is smaller and faster to process at scale.
Coordinate-sorted BAM plus index (.bai or .csi) enables fast random-access region queries.
Core SAM fields (especially CIGAR) describe how each read aligns to the reference.
Optional tags carry rich metadata, including alignment metrics (NM, MD, AS) and modified bases (MM, ML).
RG tags on reads map to header read groups, where SM defines the sample name used by many downstream tools.
samtools is the standard toolkit for sorting, indexing, filtering, and summary stats.

Structure

A BAM file contains:

Header (SAM-style metadata, binary encoded)
Alignment records (one per read/alignment)

BAM stores the same fields as SAM (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, etc.), encoded in binary and compressed in independent BGZF blocks (a gzip-compatible block format), which is what makes indexed random access possible.

Minimal SAM-style example:

@HD	VN:1.6	SO:coordinate
@SQ	SN:chr1	LN:248956422
read0001	99	chr1	10001	60	10M1I15M2D24M	=	10120	180	ACGTT...	IIII...	NM:i:3

@HD and @SQ are header lines (format version, sort order, reference dictionary).
The alignment line is tab-delimited and includes core fields plus optional tags (for example NM:i:3).
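
To make the column layout concrete, the sketch below splits a SAM text line into a few named core fields and lists the optional tags. The helper name sam_fields is made up for illustration, and it only prints the first six core columns; it assumes plain tab-delimited SAM text on stdin.

```shell
# sam_fields (hypothetical helper): print selected core fields and all optional
# tags for each alignment line read from stdin.
sam_fields() {
  awk -F '\t' '
    /^@/ { next }                      # skip header lines
    {
      printf "QNAME=%s FLAG=%s RNAME=%s POS=%s MAPQ=%s CIGAR=%s\n", $1, $2, $3, $4, $5, $6
      for (i = 12; i <= NF; i++)       # columns 12 and up are TAG:TYPE:VALUE fields
        printf "TAG=%s\n", $i
    }'
}

# The example record from above, with its elided SEQ/QUAL kept as-is.
printf 'read0001\t99\tchr1\t10001\t60\t10M1I15M2D24M\t=\t10120\t180\tACGTT...\tIIII...\tNM:i:3\n' | sam_fields
```

On a real file the same function could be fed from samtools view sample.bam | head | sam_fields.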

CIGAR quick explanation

The CIGAR string encodes how the read aligns to the reference.

M: alignment match/mismatch block
I: insertion (bases present in the read, absent from the reference)
D: deletion (bases present in the reference, absent from the read)
S: soft clipping (bases present in read, not aligned)
H: hard clipping (bases removed from stored sequence)
N: skipped region on reference (common in RNA-seq introns)

For 10M1I15M2D24M:

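10 aligned bases, then 1 inserted base, then 15 aligned bases, then 2 reference bases missing from the read (a deletion), then 24 aligned bases. The read is therefore 50 bases long (10+1+15+24) and covers 51 reference positions (10+15+2+24); with POS 10001, the alignment ends at 10051.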
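
The same bookkeeping can be done mechanically. The sketch below is a minimal illustration, not a validated parser: cigar_lengths is a hypothetical helper name, and it relies on the SAM specification's rule that M, I, S, = and X consume read bases while M, D, N, = and X consume reference bases.

```shell
# cigar_lengths (hypothetical helper): print "query_length reference_span"
# for one CIGAR string passed as the first argument.
cigar_lengths() {
  printf '%s\n' "$1" | awk '
    {
      cigar = $0; q = 0; r = 0
      # Peel off one length+operator pair at a time from the front.
      while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
        len = substr(cigar, 1, RLENGTH - 1) + 0
        op  = substr(cigar, RLENGTH, 1)
        if (op ~ /[MIS=X]/) q += len     # operators that consume read bases
        if (op ~ /[MDN=X]/) r += len     # operators that consume reference bases
        cigar = substr(cigar, RLENGTH + 1)
      }
      print q, r
    }'
}

cigar_lengths 10M1I15M2D24M   # prints: 50 51
```

The alignment end coordinate follows as POS + reference_span - 1.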

Modified bases in SAM/BAM tags

Modified bases are typically stored in optional tags, most commonly:

MM: base, strand, and modification code, followed by delta-encoded positions (how many bases of that type to skip before each call)
ML: one probability per MM call, stored as an integer from 0 to 255

Example (schematic):

read0002	0	chr1	20501	50	30M	*	0	0	ACGTC...	IIII...	MM:Z:C+m,5,12;	ML:B:C,220,180

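C+m means 5-methylcytosine (code m) calls on cytosines of the sequenced strand (+).
The list (5,12) is delta encoded, not a set of read coordinates: skip 5 cytosines and the next one carries a call, then skip 12 more cytosines and the following one carries a call.
ML values (here 220,180) pair one-to-one with the MM calls; a value N covers the probability range N/256 to (N+1)/256, so 220 is roughly 0.86 and 180 roughly 0.70.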

Exact interpretation can vary by basecaller/tool version, so always confirm against your caller’s spec when parsing modified-base tags.
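
As a rough illustration of the delta encoding, the sketch below converts one MM entry into 1-based read positions and approximate probabilities. Everything named here is an assumption for the example: mm_decode is a hypothetical helper, the toy sequence and values are invented, and the code handles only a single modification type on a forward-strand read (reverse-strand records, the ? and . mode flags, and multi-code entries are ignored).

```shell
# mm_decode (hypothetical helper): mm_decode SEQ BASE DELTAS ML_VALUES
# Prints "read_position<TAB>probability" for each call in one MM entry.
mm_decode() {
  awk -v seq="$1" -v base="$2" -v deltas="$3" -v ml="$4" 'BEGIN {
    n = split(deltas, d, ",")
    split(ml, p, ",")
    k = 1              # index of the next MM call to resolve
    skip = d[1] + 0    # bases of this type still to skip before that call
    for (i = 1; i <= length(seq) && k <= n; i++) {
      if (substr(seq, i, 1) != base) continue
      if (skip > 0) { skip--; continue }
      printf "%d\t%.3f\n", i, (p[k] + 0.5) / 256   # midpoint of the ML bin
      k++
      skip = d[k] + 0
    }
  }'
}

# Toy read with cytosines at positions 1, 3, 4, 6 and 8.
# Deltas 1,2: skip one C (position 1), call position 3; skip two C, call position 8.
mm_decode CACCGCTC C 1,2 220,180
```

For production parsing, a library or tool that implements the full tag specification is the safer choice; this only shows why the numbers in MM cannot be read as coordinates.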

RG and SM tags (sample metadata)

RG:Z:<id> appears on alignment records and points to a read group ID.
SM:<sample_name> appears in header @RG lines and defines the biological sample for that read group.

Minimal header + record example:

@RG	ID:rg001	SM:NA12878	PL:ILLUMINA
read0003	99	chr1	30001	60	50M	=	30100	149	ACGT...	IIII...	RG:Z:rg001

In practice, many downstream tools use SM for per-sample aggregation and RG for lane/library/platform-aware processing (for example duplicate marking and BQSR).
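
The link between a record's RG tag and its sample goes through the header, which can be sketched as a small lookup table. The helper name rg_to_sample is hypothetical; it assumes SAM header text on stdin and prints an empty sample column when an @RG line has no SM field.

```shell
# rg_to_sample (hypothetical helper): print "read_group_id<TAB>sample_name"
# for every @RG header line read from stdin.
rg_to_sample() {
  awk -F '\t' '
    $1 == "@RG" {
      id = ""; sm = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^ID:/) id = substr($i, 4)
        if ($i ~ /^SM:/) sm = substr($i, 4)
      }
      print id "\t" sm
    }'
}

# The header line from the example above.
printf '@RG\tID:rg001\tSM:NA12878\tPL:ILLUMINA\n' | rg_to_sample
```

On a real file this would typically be run as samtools view -H sample.bam | rg_to_sample, and an empty second column flags a read group without sample metadata.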

Practical Conventions

BAM files are usually coordinate-sorted for downstream analysis workflows.
Indexed BAM (samtools index) is expected by genome browsers and region-based tools.
Headers often include @SQ (reference names/lengths), @RG (read groups), and @PG (pipeline provenance).
Large references may need .csi indexes (samtools index -c) instead of .bai: the .bai format cannot address contigs longer than 2^29 bases (about 537 Mbp).

Header sanity checks

Before downstream analysis, header inspection is a fast sanity check for common integration issues.

@HD confirms sort order (SO:coordinate expected in many workflows).
@SQ verifies contig names and lengths against your reference FASTA.
@RG confirms read-group/sample metadata (for example SM) is present.
@PG provides pipeline provenance (what tools/versions touched the BAM).

# inspect BAM header sections
samtools view -H sample.bam | grep -E '^@(HD|SQ|RG|PG)'

# quick check for coordinate-sorted header
samtools view -H sample.bam | grep '^@HD'
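
The @SQ check against the reference can be automated by comparing the header with a FASTA index. This is a sketch under assumptions: check_sq is a hypothetical helper, the header has been saved to a text file, and the reference has a .fai index (as written by samtools faidx) whose first two columns are contig name and length.

```shell
# check_sq (hypothetical helper): check_sq HEADER.sam REFERENCE.fai
# Prints one line per @SQ entry that is absent from the .fai or differs in length.
# No output means every @SQ entry matched.
check_sq() {
  awk -F '\t' '
    NR == FNR { fai[$1] = $2; next }          # first file: .fai (name, length, ...)
    $1 == "@SQ" {
      sn = ""; ln = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^SN:/) sn = substr($i, 4)
        if ($i ~ /^LN:/) ln = substr($i, 4)
      }
      if (!(sn in fai))        print "missing in reference: " sn
      else if (fai[sn] != ln)  print "length mismatch: " sn " (BAM " ln ", FASTA " fai[sn] ")"
    }' "$2" "$1"
}
```

A typical call sequence would be samtools view -H sample.bam > header.sam, then check_sq header.sam ref.fa.fai, where header.sam and ref.fa.fai are placeholder file names.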

Why indexed BAM is useful

Fast random access to specific loci/regions without scanning the entire file.
Lower I/O and compute for region-based QC, visualization, and variant workflows.
Enables interactive browsing in tools like IGV/JBrowse with near-instant jumps.
Essential for cloud/remote workflows where minimizing transferred bytes matters.

Common Pitfalls

Using an unsorted BAM where a sorted BAM is required (tools may fail outright or silently produce wrong results).
Missing or stale index after replacing/updating a BAM file.
Header/reference mismatch between BAM and reference FASTA used downstream.
Counting records as if they were reads: one read can produce several records, so secondary, supplementary, and duplicate alignments inflate naive counts.
Ignoring mapping quality (MAPQ) or duplicate flags in variant/QC analyses.
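
To see why record counts and read counts diverge, the sketch below classifies records by FLAG bits. The bit values (4 unmapped, 256 secondary, 1024 duplicate, 2048 supplementary) come from the SAM specification; the helper name count_flags and the toy input are assumptions for illustration, and only the first two columns of each line are used.

```shell
# count_flags (hypothetical helper): classify SAM records read from stdin by FLAG bits.
count_flags() {
  awk -F '\t' '
    # Test a single bit without bitwise operators; mask must be a power of two.
    function bit(flag, mask) { return int(flag / mask) % 2 }
    /^@/ { next }
    {
      total++
      if (bit($2, 4))         unmapped++
      else if (bit($2, 256))  secondary++
      else if (bit($2, 2048)) supplementary++
      else { primary++; if (bit($2, 1024)) duplicate++ }
    }
    END {
      printf "records=%d primary_mapped=%d secondary=%d supplementary=%d unmapped=%d primary_duplicates=%d\n",
             total, primary, secondary, supplementary, unmapped, duplicate
    }'
}

# Toy input: five records, of which only two are primary mapped alignments.
printf 'r1\t99\nr1\t355\nr2\t2064\nr3\t1123\nr4\t4\n' | count_flags
```

samtools flagstat reports comparable categories directly from a BAM; the sketch only makes the underlying bit tests visible.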

Common Uses

Read alignment storage
Variant calling workflows
Coverage and mapping quality analysis

Useful BAM Tools

samtools view

Inspect, filter, and convert between BAM/SAM/CRAM.

# quick header + first alignments (SAM text view)
samtools view -h sample.bam | head

# keep mapped primary alignments with MAPQ >= 20
samtools view -b -F 0x904 -q 20 sample.bam > sample.mapq20.primary.bam
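
The 0x904 mask in the command above is the sum of three FLAG bits. The sketch below rebuilds it from named values and mimics the filter on a single FLAG/MAPQ pair; the variable names and the keep_read helper are made up for the example, while the bit values are those of the SAM specification.

```shell
# Named FLAG bits (values from the SAM specification).
UNMAPPED=4          # 0x4
SECONDARY=256       # 0x100
SUPPLEMENTARY=2048  # 0x800

mask=$((UNMAPPED | SECONDARY | SUPPLEMENTARY))
printf '0x%X\n' "$mask"   # prints: 0x904

# keep_read FLAG MAPQ (hypothetical helper): succeeds when a record would pass
# an exclusion filter on the mask combined with a MAPQ threshold of 20.
keep_read() {
  [ $(( $1 & mask )) -eq 0 ] && [ "$2" -ge 20 ]
}
```

For instance, keep_read 99 60 succeeds, while keep_read 355 60 fails because FLAG 355 has the secondary bit set.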

samtools sort

Sort BAM records (typically by coordinate).

samtools sort -o sample.sorted.bam sample.bam

samtools index

Build BAM index for random access.

samtools index sample.sorted.bam

samtools flagstat

Generate quick mapping/alignment summary statistics.

samtools flagstat sample.sorted.bam

cramino

Fast long-read alignment QC summaries (commonly used with ONT BAM/CRAM files).

# generate a basic cramino report from a BAM
cramino sample.sorted.bam > sample.cramino.tsv

nanocov

Per-base coverage statistics and plots from BAM files.

# basic run
nanocov \
  --input sample.sorted.bam \
  --output-dir nanocov_out

# coverage on target regions from BED
nanocov \
  --input sample.sorted.bam \
  --bed targets.bed \
  --output-dir nanocov_targets \
  --prefix sample1