VCF (Variant Call Format) stores genomic variants and metadata.

TL;DR

  • VCF is a tab-delimited text format for sequence variants.
  • Core columns are fixed (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), with optional sample genotype columns.
  • Coordinates are 1-based, and indel alleles include an anchoring reference base (normally the base immediately before the event).
  • Most production workflows use bgzipped VCF (.vcf.gz) with a tabix index (.tbi/.csi).
  • bcftools is the standard toolkit for filtering, querying, normalization, and stats.

Structure

A VCF file includes:

  • metadata lines beginning with ##
  • one header line beginning with #CHROM
  • one row per variant record

Minimal example:

##fileformat=VCFv4.3
##contig=<ID=chr1,length=248956422>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample1
chr1	10177	rs367896724	A	AC	100	PASS	DP=14	GT	0/1
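The three line classes listed above can be told apart by their prefixes alone. The following is a minimal sketch, assuming an uncompressed VCF and standard grep; the scratch file and variable names are illustrative only.

```shell
# Write the minimal example to a scratch file (printf keeps the tabs explicit).
vcf="$(mktemp)"
{
  printf '##fileformat=VCFv4.3\n'
  printf '##contig=<ID=chr1,length=248956422>\n'
  printf '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">\n'
  printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n'
  printf 'chr1\t10177\trs367896724\tA\tAC\t100\tPASS\tDP=14\tGT\t0/1\n'
} > "$vcf"

# Count each line class: ## metadata, the single #CHROM header, and records.
n_meta=$(grep -c '^##' "$vcf")
n_header=$(grep -c '^#CHROM' "$vcf")
n_records=$(grep -vc '^#' "$vcf")
echo "metadata=$n_meta header=$n_header records=$n_records"
rm -f "$vcf"
```

A file with zero or more than one #CHROM line is usually a sign of a truncated or concatenated VCF.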

Fixed columns

  1. CHROM: chromosome/contig
  2. POS: 1-based position
  3. ID: variant identifier (or .)
  4. REF: reference allele
  5. ALT: alternate allele(s), comma-separated if multiallelic
  6. QUAL: variant quality (or .)
  7. FILTER: PASS, one or more semicolon-separated filter labels, or . if filters were not applied
  8. INFO: semicolon-delimited key-value annotations

Optional sample columns begin with FORMAT, then one column per sample.
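Because INFO is a semicolon-delimited list, it can be unpacked with plain awk. This is a rough sketch rather than a full parser: info_pairs is a hypothetical helper name, and the AF and DB entries in the sample record are added purely for illustration.

```shell
# Hypothetical helper: print one "KEY<TAB>VALUE" line per INFO entry of each record.
info_pairs() {
  awk -F'\t' '
    /^#/ { next }                 # skip metadata and header lines
    $8 == "." { next }            # "." means the record has no INFO annotations
    {
      n = split($8, entries, ";")
      for (i = 1; i <= n; i++) {
        eq = index(entries[i], "=")
        if (eq == 0) {
          # flag-style key that carries no value
          print entries[i] "\t" "(flag)"
        } else {
          print substr(entries[i], 1, eq - 1) "\t" substr(entries[i], eq + 1)
        }
      }
    }'
}

# Illustrative record: the INFO column holds two key-value entries and one flag.
printf 'chr1\t10177\trs367896724\tA\tAC\t100\tPASS\tDP=14;AF=0.5;DB\tGT\t0/1\n' | info_pairs
```

For production use, bcftools query is the more robust way to extract INFO fields; the sketch only shows how the column is laid out.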

Genotype fields

FORMAT defines per-sample subfields (for example GT:DP:AD:GQ).

Example:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	s1	s2
chr2	200123	.	G	A	60	PASS	DP=52	GT:DP:AD:GQ	0/1:25:12,13:99	0/0:27:27,0:99

  • GT: genotype (0/0, 0/1, 1/1, ./.)
  • DP: sample read depth
  • AD: allele depths (REF,ALT1,ALT2,...)
  • GQ: genotype quality
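The pairing between FORMAT keys and each sample's colon-separated values can be sketched with awk. format_pairs is a hypothetical helper name, and the sketch assumes an uncompressed VCF stream with a #CHROM line; subfields dropped from the end of a sample column are reported as ".".

```shell
# Hypothetical helper: print SAMPLE, KEY, VALUE for every FORMAT subfield.
format_pairs() {
  awk -F'\t' '
    /^##/ { next }                                   # skip metadata lines
    /^#CHROM/ { for (c = 10; c <= NF; c++) name[c] = $c; next }
    {
      nk = split($9, key, ":")                       # FORMAT keys, in order
      for (c = 10; c <= NF; c++) {
        nv = split($c, val, ":")                     # this sample, same order
        for (i = 1; i <= nk; i++) {
          v = (i <= nv) ? val[i] : "."               # dropped trailing subfield
          print name[c] "\t" key[i] "\t" v
        }
      }
    }'
}

# The two-sample example from this section.
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\n'
  printf 'chr2\t200123\t.\tG\tA\t60\tPASS\tDP=52\tGT:DP:AD:GQ\t0/1:25:12,13:99\t0/0:27:27,0:99\n'
} | format_pairs
```

Each output row is one sample, one key, and one value, which makes it easy to see that AD for s1 is 12,13 (12 REF reads and 13 ALT reads).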

Coordinate and allele conventions

  • SNP example: REF=A, ALT=G at POS.
  • Insertion/deletion alleles share an anchor base on the left, and POS refers to that base (for example REF=A, ALT=AC for an inserted C).
  • For indels and multiallelic sites, normalization (left-align + split) is often required before comparison/merge.
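The conventions above imply that REF and ALT lengths give a first-pass variant type. The sketch below is a deliberate simplification that assumes records are already normalized; classify_alleles is a hypothetical helper name, and all records other than the chr1:10177 insertion are made up for illustration.

```shell
# Hypothetical helper: label each ALT allele by comparing its length with REF.
classify_alleles() {
  awk -F'\t' '
    /^#/ { next }
    {
      n = split($5, alt, ",")                 # one entry per ALT allele
      for (i = 1; i <= n; i++) {
        a = alt[i]
        if (a !~ /^[ACGTNacgtn]+$/)                  kind = "other"          # symbolic, "*", "."
        else if (length($4) == 1 && length(a) == 1)  kind = "snp"
        else if (length(a) > length($4))             kind = "insertion"
        else if (length(a) < length($4))             kind = "deletion"
        else                                         kind = "mnp_or_complex"
        print $1 "\t" $2 "\t" $4 "\t" a "\t" kind
      }
    }'
}

{
  printf 'chr1\t100\t.\tA\tG\t.\t.\t.\n'        # SNP
  printf 'chr1\t10177\t.\tA\tAC\t.\t.\t.\n'     # insertion of C after the anchor A
  printf 'chr1\t200\t.\tGTC\tG\t.\t.\t.\n'      # deletion of TC after the anchor G
  printf 'chr1\t300\t.\tT\tC,TA\t.\t.\t.\n'     # multiallelic: one SNP, one insertion
} | classify_alleles
```

On unnormalized input the labels can mislead (a padded SNP such as REF=AT, ALT=AC would be reported as mnp_or_complex), which is one reason normalization comes first.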

Practical Conventions

  • Compress and index VCFs for fast random access: .vcf.gz + .tbi/.csi.
  • Keep contig names consistent across reference, BAM/CRAM, and VCF files.
  • Use normalized, decomposed records for robust comparisons.
  • Preserve full metadata (##INFO, ##FORMAT, ##FILTER) when transforming files.
  • Validate sample order and identity before cohort merges.
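Contig-name consistency can be checked before any heavy processing. The sketch below compares ##contig IDs in an uncompressed VCF with the first column of a FASTA index (.fai); contig_mismatches is a hypothetical helper name, and the file contents are invented for the example.

```shell
# Hypothetical helper: list contigs declared in the VCF header ($1)
# that are absent from the reference FASTA index ($2, a .fai file).
contig_mismatches() {
  vcf_names=$(mktemp)
  ref_names=$(mktemp)
  sed -n 's/^##contig=<ID=\([^,>]*\).*/\1/p' "$1" | sort -u > "$vcf_names"
  cut -f1 "$2" | sort -u > "$ref_names"
  comm -23 "$vcf_names" "$ref_names"     # names only in the VCF header
  rm -f "$vcf_names" "$ref_names"
}

# Illustrative inputs: the VCF uses "chr1", the reference index uses "1".
demo_vcf=$(mktemp)
demo_fai=$(mktemp)
printf '##fileformat=VCFv4.3\n##contig=<ID=chr1,length=248956422>\n##contig=<ID=chr2,length=242193529>\n' > "$demo_vcf"
printf '1\t248956422\t3\t60\t61\nchr2\t242193529\t253105700\t60\t61\n' > "$demo_fai"
contig_mismatches "$demo_vcf" "$demo_fai"
```

An empty result means every declared contig exists in the reference index; it does not confirm that lengths match.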

Header sanity checks

VCF headers are a quick preflight check before filtering, merging, or annotation.

  • ##fileformat confirms parser compatibility expectations.
  • ##contig should match your reference naming/length conventions.
  • ##INFO, ##FORMAT, and ##FILTER definitions should exist for used fields.
  • #CHROM ... sample columns should match expected sample IDs/order.

# inspect key header metadata
bcftools view -h input.vcf.gz | rg '^##(fileformat|contig|INFO|FORMAT|FILTER)'

# list sample names in header order
bcftools query -l input.vcf.gz
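The check that every used INFO field has a header definition can be approximated without bcftools. This sketch assumes an uncompressed VCF on stdin with the header before the records; undefined_info_keys is a hypothetical helper name, and the AF entry is added only to trigger the check.

```shell
# Hypothetical helper: print INFO keys that appear in records
# but have no matching ##INFO=<ID=...> line in the header.
undefined_info_keys() {
  awk -F'\t' '
    /^##INFO=<ID=/ {
      id = $0
      sub(/^##INFO=<ID=/, "", id)       # strip the prefix
      sub(/[,>].*$/, "", id)            # keep only the ID value
      defined[id] = 1
      next
    }
    /^#/ { next }
    $8 != "." {
      n = split($8, entries, ";")
      for (i = 1; i <= n; i++) {
        key = entries[i]
        sub(/=.*$/, "", key)            # drop "=value" if present
        if (!(key in defined) && !(key in reported)) {
          reported[key] = 1             # report each key once
          print key
        }
      }
    }'
}

# DP is defined in the header; AF is used in the record but not defined.
{
  printf '##fileformat=VCFv4.3\n'
  printf '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t10177\t.\tA\tAC\t100\tPASS\tDP=14;AF=0.5\n'
} | undefined_info_keys
```

The same pattern extends to FORMAT and FILTER by changing the header prefix and the column that is split.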

Common Pitfalls

  • Mixing chr and non-chr contig naming across inputs.
  • Comparing unnormalized indels and getting false discordance.
  • Dropping header metadata during manual edits/reformatting.
  • Misinterpreting missing values (.) vs true zero values.
  • Treating QUAL as equivalent to genotype quality (GQ).
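The missing-versus-zero pitfall is easy to hit in ad hoc scripts, because awk evaluates "." as 0 in arithmetic. The sketch below handles the missing marker explicitly; qual_bins is a hypothetical helper name, and the threshold and records are illustrative.

```shell
# Hypothetical helper: count records with QUAL >= $1, below $1, and missing.
qual_bins() {
  awk -F'\t' -v min="$1" '
    /^#/ { next }
    $6 == "." { missing++; next }        # missing is not the same as QUAL=0
    ($6 + 0) >= min { pass++; next }     # force a numeric comparison
    { low++ }
    END { printf "pass=%d low=%d missing=%d\n", pass, low, missing }'
}

# Three illustrative records: QUAL of 60, a true 0, and a missing value.
{
  printf 'chr1\t100\t.\tA\tG\t60\tPASS\t.\n'
  printf 'chr1\t200\t.\tC\tT\t0\tPASS\t.\n'
  printf 'chr1\t300\t.\tG\tA\t.\tPASS\t.\n'
} | qual_bins 30
```

Keeping a separate count for missing values makes it visible when a caller simply did not emit QUAL, rather than silently treating those records as low quality.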

Common Uses

  • SNP/indel call representation
  • Variant filtering and annotation
  • Cohort and population analyses

Useful VCF Tools

Header provenance (run logs)

Of the two tools covered below (bcftools and tabix), only bcftools commonly writes command provenance into VCF headers, and only when a subcommand produces VCF/BCF output:

  • bcftools (for example view, norm)

It typically adds one pair of lines per subcommand run, such as ##bcftools_viewVersion and ##bcftools_viewCommand. Subcommands such as query and stats usually produce tabular/text output rather than VCF, so they add no header lines, and tabix creates indexes without editing header metadata.

# inspect provenance lines in a VCF header
bcftools view -h input.vcf.gz | rg '^##(bcftools|source)'

bcftools (view, query, norm, stats)

bcftools is one tool with multiple subcommands for filtering, querying, normalization, and QC.

bcftools view

Filter and subset variant records.

# keep PASS variants only
bcftools view -f PASS input.vcf.gz -Oz -o pass.vcf.gz

# keep only biallelic SNPs
bcftools view -m2 -M2 -v snps input.vcf.gz -Oz -o snps.biallelic.vcf.gz

bcftools query

Extract tabular fields for reporting.

bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL\n' input.vcf.gz > variants.tsv

bcftools norm

Left-align and normalize indels against the reference, and split multiallelic records into biallelic ones.

bcftools norm \
  -f reference.fasta \
  -m -any input.vcf.gz -Oz -o input.norm.vcf.gz

bcftools stats

Generate summary statistics for QC.

bcftools stats input.vcf.gz > input.vcf.stats.txt

tabix

Index a bgzipped VCF for region queries. The file must be compressed with bgzip (not plain gzip) and sorted by position.

tabix -p vcf input.vcf.gz