VCF File Format

VCF (Variant Call Format) stores genomic variants and metadata.

TL;DR

VCF is a tab-delimited text format for sequence variants.
Core columns are fixed (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), with optional sample genotype columns.
Coordinates are 1-based, and indel alleles include the preceding reference base as an anchor.
Most production workflows use bgzipped VCF (.vcf.gz) with a tabix index (.tbi/.csi).
bcftools is the standard toolkit for filtering, querying, normalization, and stats.

Structure

A VCF file includes:

metadata lines beginning with ##
one header line beginning with #CHROM
one row per variant record

Minimal example:

##fileformat=VCFv4.3
##contig=<ID=chr1,length=248956422>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample1
chr1	10177	rs367896724	A	AC	100	PASS	DP=14	GT	0/1

Fixed columns

CHROM: chromosome/contig
POS: 1-based position
ID: variant identifier (or .)
REF: reference allele
ALT: alternate allele(s), comma-separated if multiallelic
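QUAL: Phred-scaled variant quality (or .)
FILTER: PASS, a semicolon-separated list of failed filter labels, or . if no filters were applied
INFO: semicolon-delimited annotations, either key=value pairs or bare flags (or .)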

Optional sample columns begin with FORMAT, then one column per sample.
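
To make the INFO layout concrete, the sketch below pulls one key out of column 8 with awk. The helper name vcf_info_get is hypothetical (it is not part of any toolkit), and it assumes uncompressed, tab-delimited records on stdin. For real pipelines, bcftools query is the more robust choice.

```shell
# Hypothetical helper: print CHROM, POS, and the value of one INFO key.
# Flag-style keys (no "=value") print "true"; absent keys print ".".
vcf_info_get() {
  awk -F'\t' -v key="$1" '
    /^#/ { next }                          # skip metadata and header lines
    {
      val = "."
      n = split($8, kv, ";")               # column 8 is INFO
      for (i = 1; i <= n; i++) {
        eq = index(kv[i], "=")
        if (eq == 0) {
          if (kv[i] == key) { val = "true" }
        } else if (substr(kv[i], 1, eq - 1) == key) {
          val = substr(kv[i], eq + 1)
        }
      }
      print $1 "\t" $2 "\t" val
    }'
}

# usage: the record from the minimal example above
printf 'chr1\t10177\trs367896724\tA\tAC\t100\tPASS\tDP=14\tGT\t0/1\n' | vcf_info_get DP
```

For the example record this prints chr1, 10177, and 14, separated by tabs.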

Genotype fields

FORMAT defines per-sample subfields (for example GT:DP:AD:GQ).

Example:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	s1	s2
chr2	200123	.	G	A	60	PASS	DP=52	GT:DP:AD:GQ	0/1:25:12,13:99	0/0:27:27,0:99

GT: genotype as allele indexes (0/0, 0/1, 1/1, ./. for missing); / separates unphased alleles and | separates phased ones
DP: sample read depth
AD: allele depths (REF,ALT1,ALT2,...)
GQ: genotype quality
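
The pairing between FORMAT keys and per-sample values is positional, which a short awk sketch can illustrate. The helper name vcf_sample_field is hypothetical, and the code assumes uncompressed VCF text on stdin that includes the #CHROM header line (used only to label samples). Subfields omitted at the end of a sample column are reported as missing (.).

```shell
# Hypothetical helper: print CHROM, POS, sample name, and one FORMAT subfield.
vcf_sample_field() {
  awk -F'\t' -v key="$1" '
    /^##/ { next }                                   # skip metadata lines
    /^#CHROM/ { for (i = 10; i <= NF; i++) name[i] = $i; next }
    {
      n = split($9, fmt, ":")                        # column 9 is FORMAT
      idx = 0
      for (i = 1; i <= n; i++) if (fmt[i] == key) idx = i
      for (s = 10; s <= NF; s++) {                   # one column per sample
        m = split($s, parts, ":")
        val = (idx > 0 && idx <= m) ? parts[idx] : "."
        print $1 "\t" $2 "\t" name[s] "\t" val
      }
    }'
}

# usage: allele depths for the two-sample example above
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\nchr2\t200123\t.\tG\tA\t60\tPASS\tDP=52\tGT:DP:AD:GQ\t0/1:25:12,13:99\t0/0:27:27,0:99\n' \
  | vcf_sample_field AD
```

With the example record, this reports 12,13 for s1 and 27,0 for s2.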

Coordinate and allele conventions

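SNP example: REF=A, ALT=G at POS.
Insertion and deletion alleles include the reference base immediately before the event, and POS refers to that anchor base (for example, REF=A, ALT=AC at POS 10177 inserts a C after position 10177).
For indels and multiallelic sites, normalization (left-aligning and trimming indels, splitting multiallelic records) is often required before comparing or merging call sets.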
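
As a rough illustration of the anchoring convention, the sketch below labels each ALT allele by comparing its length with REF and checking that both start with the same anchor base. The helper name vcf_allele_type is hypothetical, and the length-based rule is a simplifying assumption: it is not a substitute for normalization, and it lumps MNPs, symbolic alleles, and unanchored representations together as OTHER.

```shell
# Hypothetical helper: label each ALT allele as SNP, INS, DEL, or OTHER.
vcf_allele_type() {
  awk -F'\t' '
    /^#/ { next }                                    # skip metadata and header
    {
      n = split($5, alt, ",")                        # ALT may be multiallelic
      for (i = 1; i <= n; i++) {
        r = length($4)
        a = length(alt[i])
        same_anchor = (substr($4, 1, 1) == substr(alt[i], 1, 1))
        if (alt[i] !~ /^[ACGTNacgtn]+$/) { type = "OTHER" }   # symbolic, *, or .
        else if (r == 1 && a == 1)       { type = "SNP" }
        else if (a > r && same_anchor)   { type = "INS" }
        else if (a < r && same_anchor)   { type = "DEL" }
        else                             { type = "OTHER" }
        print $1 "\t" $2 "\t" $4 "\t" alt[i] "\t" type
      }
    }'
}

# usage: an anchored insertion (A -> AC) is reported as INS
printf 'chr1\t10177\trs367896724\tA\tAC\t100\tPASS\tDP=14\n' | vcf_allele_type
```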

Practical Conventions

Compress with bgzip (not plain gzip) and index for fast random access: .vcf.gz + .tbi/.csi.
Keep contig names consistent across reference, BAM/CRAM, and VCF files.
Use normalized, decomposed records for robust comparisons.
Preserve full metadata (##INFO, ##FORMAT, ##FILTER) when transforming files.
Validate sample order and identity before cohort merges.
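
A quick way to check the contig-naming convention in a single file is to scan the first column. The sketch below is one possible approach, using a hypothetical helper named vcf_contig_style. It assumes uncompressed VCF text on stdin and only distinguishes a chr prefix from its absence.

```shell
# Hypothetical helper: report whether records use chr-prefixed contigs,
# bare contigs (1, 2, X, ...), or a mix of both.
vcf_contig_style() {
  awk -F'\t' '
    /^#/ { next }                    # skip metadata and header lines
    $1 ~ /^chr/ { pref++; next }     # chr-prefixed contig name
    { bare++ }                       # anything else counts as bare
    END {
      if (pref && bare) { print "mixed" }
      else if (pref)    { print "chr" }
      else if (bare)    { print "bare" }
      else              { print "empty" }
    }'
}

# usage: two records with inconsistent naming
printf 'chr1\t10177\t.\tA\tAC\t100\tPASS\t.\n2\t200123\t.\tG\tA\t60\tPASS\t.\n' | vcf_contig_style
```

Running the same check on each input before a merge makes a chr/bare mismatch visible early.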

Header sanity checks

VCF headers are a quick preflight check before filtering, merging, or annotation.

##fileformat states the VCF version (for example VCFv4.3) and must be the first line, so parsers know which rules to apply.
##contig should match your reference naming/length conventions.
##INFO, ##FORMAT, and ##FILTER definitions should exist for used fields.
#CHROM ... sample columns should match expected sample IDs/order.

# inspect key header metadata
bcftools view -h input.vcf.gz | rg '^##(fileformat|contig|INFO|FORMAT|FILTER)'

# list sample names in header order
bcftools query -l input.vcf.gz
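
The check that every used INFO key has a header definition can also be sketched in plain awk. The helper name vcf_undefined_info is hypothetical. It assumes uncompressed VCF text on stdin (for example piped from bcftools view) and reads only the ID of each ##INFO line, ignoring Number and Type.

```shell
# Hypothetical helper: list INFO keys used in records but not declared
# in any ##INFO header line (each key is printed once).
vcf_undefined_info() {
  awk -F'\t' '
    /^##INFO=<ID=/ {
      id = $0
      sub(/^##INFO=<ID=/, "", id)    # drop the prefix
      sub(/[,>].*$/, "", id)         # drop everything after the ID
      defined[id] = 1
      next
    }
    /^#/ { next }                    # other metadata and the #CHROM line
    $8 != "." {
      n = split($8, kv, ";")
      for (i = 1; i <= n; i++) {
        key = kv[i]
        sub(/=.*$/, "", key)         # keep the key, drop "=value"
        if (!(key in defined) && !(key in seen)) { seen[key] = 1; print key }
      }
    }'
}

# usage: AF is used in the record but never declared in the header
printf '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr1\t10177\t.\tA\tAC\t100\tPASS\tDP=14;AF=0.5\n' \
  | vcf_undefined_info
```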

Common Pitfalls

Mixing chr and non-chr contig naming across inputs.
Comparing unnormalized indels and getting false discordance.
Dropping header metadata during manual edits/reformatting.
Misinterpreting missing values (.) as true zero values.
Treating QUAL (site-level variant quality) as equivalent to GQ (per-sample genotype quality).
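
The missing-value pitfall is easy to hit in ad hoc scripts, because awk coerces . to 0 in arithmetic. The sketch below shows one way to keep the two cases apart when thresholding on QUAL. The helper name vcf_qual_filter and the keep/drop/missing labels are hypothetical, and the input is assumed to be uncompressed VCF text on stdin.

```shell
# Hypothetical helper: classify records by QUAL against a minimum,
# reporting a missing QUAL (.) separately instead of treating it as 0.
vcf_qual_filter() {
  awk -F'\t' -v min="$1" '
    /^#/ { next }                                  # skip metadata and header
    $6 == "." { print $1 "\t" $2 "\tmissing"; next }
    { print $1 "\t" $2 "\t" (($6 + 0 >= min + 0) ? "keep" : "drop") }'
}

# usage: QUAL of ., 0, and 60 against a threshold of 30
printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\nchr1\t200\t.\tA\tG\t0\tPASS\t.\nchr1\t300\t.\tA\tG\t60\tPASS\t.\n' \
  | vcf_qual_filter 30
```

The three records are labeled missing, drop, and keep respectively, so an unknown QUAL is never silently handled as a low score.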

Common Uses

SNP/indel call representation
Variant filtering and annotation
Cohort and population analyses

Useful VCF Tools

Header provenance (run logs)

Of the two tools listed below (bcftools and tabix), only one commonly writes command provenance into the header when producing VCF/BCF output:

bcftools (for example view, norm)

It typically adds a pair of lines per run, named after the subcommand, such as ##bcftools_viewVersion and ##bcftools_viewCommand. Subcommands such as query and stats usually produce tabular/text output rather than VCF, so they leave no header lines, and tabix creates an index without editing header metadata.

# inspect provenance lines in a VCF header
bcftools view -h input.vcf.gz | rg '^##(bcftools|source)'

bcftools (view, query, norm, stats)

bcftools is one tool with multiple subcommands for filtering, querying, normalization, and QC.

bcftools view

Filter and subset variant records.

# keep PASS variants only
bcftools view -f PASS input.vcf.gz -Oz -o pass.vcf.gz

# keep only biallelic SNPs
bcftools view -m2 -M2 -v snps input.vcf.gz -Oz -o snps.biallelic.vcf.gz

bcftools query

Extract tabular fields for reporting.

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL\n' input.vcf.gz > variants.tsv

bcftools norm

Left-align and trim indels against the reference, and split multiallelic records into biallelic ones.

bcftools norm \
  -f reference.fasta \
  -m -any input.vcf.gz -Oz -o input.norm.vcf.gz

bcftools stats

Generate summary statistics for QC.

bcftools stats input.vcf.gz > input.vcf.stats.txt

tabix

Index a bgzipped, coordinate-sorted VCF for region queries; this writes input.vcf.gz.tbi.

tabix -p vcf input.vcf.gz