VCF File Format
VCF (Variant Call Format) stores genomic variants and metadata.
TL;DR
- VCF is a tab-delimited text format for sequence variants.
- Core columns are fixed (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), with optional sample genotype columns.
- Coordinates are 1-based, and complex alleles are anchored with a left reference base.
- Most production workflows use bgzipped VCF (.vcf.gz) with a tabix index (.tbi/.csi).
- bcftools is the standard toolkit for filtering, querying, normalization, and stats.
Structure
A VCF file includes:
- metadata lines beginning with ##
- one header line beginning with #CHROM
- one row per variant record
Minimal example:
##fileformat=VCFv4.3
##contig=<ID=chr1,length=248956422>
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sample1
chr1 10177 rs367896724 A AC 100 PASS DP=14 GT 0/1
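The three line classes can be told apart by prefix alone. A minimal sketch, assuming uncompressed VCF text on standard input (the function name vcf_line_counts is hypothetical, not part of any toolkit):

```shell
# Count metadata lines (##), header lines (#CHROM), and variant records.
# vcf_line_counts is an illustrative name; input is plain VCF text on stdin.
vcf_line_counts() {
  awk '
    /^##/     { meta++; next }      # metadata lines
    /^#CHROM/ { header++; next }    # the single column-header line
    NF > 0    { records++ }         # any other non-empty line is a record
    END { printf "meta=%d header=%d records=%d\n", meta, header, records }
  '
}

# Example: the minimal file above, written with real tab separators.
{
  printf '##fileformat=VCFv4.3\n'
  printf '##contig=<ID=chr1,length=248956422>\n'
  printf '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">\n'
  printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n'
  printf 'chr1\t10177\trs367896724\tA\tAC\t100\tPASS\tDP=14\tGT\t0/1\n'
} | vcf_line_counts
```

A well-formed file should report exactly one header line; zero or more than one is worth investigating before any downstream step.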
Fixed columns
- CHROM: chromosome/contig
- POS: 1-based position
- ID: variant identifier (or .)
- REF: reference allele
- ALT: alternate allele(s), comma-separated if multiallelic
- QUAL: variant quality (or .)
- FILTER: PASS or filter labels
- INFO: semicolon-delimited key-value annotations
Optional sample columns begin with FORMAT, then one column per sample.
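Because INFO packs several annotations into one column, it usually needs a second split. The sketch below assumes plain VCF text on standard input and reports flag-style keys (those without "=") with the placeholder value "true". The name vcf_info_pairs is hypothetical.

```shell
# Print the INFO annotations of each record as "key<TAB>value" lines.
# Flag-type keys (no "=") are reported with the value "true".
# vcf_info_pairs is an illustrative name; input is plain VCF text on stdin.
vcf_info_pairs() {
  awk -F '\t' '
    /^#/      { next }                  # skip metadata and header lines
    $8 == "." { next }                  # "." means no INFO annotations
    {
      n = split($8, items, ";")         # INFO is column 8, ";"-delimited
      for (i = 1; i <= n; i++) {
        eq = index(items[i], "=")
        if (eq == 0)
          print items[i] "\ttrue"
        else
          print substr(items[i], 1, eq - 1) "\t" substr(items[i], eq + 1)
      }
    }
  '
}

# Example record with two key=value entries and one flag.
printf 'chr1\t10177\t.\tA\tAC\t100\tPASS\tDP=14;AF=0.5;DB\n' | vcf_info_pairs
```

For real pipelines, bcftools query with an explicit format string is the more robust route; this sketch only illustrates the column layout.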
Genotype fields
FORMAT defines per-sample subfields (for example GT:DP:AD:GQ).
Example:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT s1 s2
chr2 200123 . G A 60 PASS DP=52 GT:DP:AD:GQ 0/1:25:12,13:99 0/0:27:27,0:99
- GT: genotype (0/0, 0/1, 1/1, ./.)
- DP: sample read depth
- AD: allele depths (REF, ALT1, ALT2, ...)
- GQ: genotype quality
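The FORMAT column acts as a key list that every sample column follows positionally. A rough sketch of that pairing, assuming plain VCF text on standard input that includes the #CHROM line (the name vcf_sample_fields is hypothetical):

```shell
# For each record, print "sample<TAB>key<TAB>value" for every FORMAT subfield.
# vcf_sample_fields is an illustrative name; input is plain VCF text on stdin
# and must include the #CHROM header line so sample names are known.
vcf_sample_fields() {
  awk -F '\t' '
    /^##/     { next }
    /^#CHROM/ { for (c = 10; c <= NF; c++) name[c] = $c; next }
    {
      nkeys = split($9, keys, ":")            # FORMAT is column 9
      for (c = 10; c <= NF; c++) {
        nvals = split($c, vals, ":")
        for (k = 1; k <= nkeys; k++) {
          # trailing subfields may be omitted; report them as missing (".")
          v = (k <= nvals) ? vals[k] : "."
          print name[c] "\t" keys[k] "\t" v
        }
      }
    }
  '
}

# Example: the two-sample record above, written with real tab separators.
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\n'
  printf 'chr2\t200123\t.\tG\tA\t60\tPASS\tDP=52\tGT:DP:AD:GQ\t0/1:25:12,13:99\t0/0:27:27,0:99\n'
} | vcf_sample_fields
```

Emitting one row per sample and key keeps multi-valued fields such as AD intact as a single comma-separated value.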
Coordinate and allele conventions
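- SNP example: REF=A, ALT=G at POS.
- Insertion/deletion alleles are left-anchored by convention: both REF and ALT start with the reference base immediately before the event, so REF=A, ALT=AC at POS describes an insertion of C after POS.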
- For indels and multiallelic sites, normalization (left-align + split) is often required before comparison/merge.
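Since POS is 1-based and REF is spelled out, the reference interval of a record is POS through POS + length(REF) - 1. The sketch below computes that interval and assigns a coarse class to each ALT allele by length alone. It assumes ALT alleles are plain sequences (no symbolic alleles, breakends, or "*"), and the name vcf_ref_span is hypothetical.

```shell
# Print CHROM, 1-based start, 1-based inclusive end of the REF allele, the ALT
# allele, and a coarse length-based class. vcf_ref_span is an illustrative name.
# Assumption: ALT alleles are plain sequences (no symbolic alleles or breakends).
vcf_ref_span() {
  awk -F '\t' '
    /^#/ { next }
    {
      last = $2 + length($4) - 1          # REF occupies POS .. POS+len(REF)-1
      n = split($5, alts, ",")            # one output row per ALT allele
      for (i = 1; i <= n; i++) {
        if (length($4) == 1 && length(alts[i]) == 1) kind = "snp"
        else if (length(alts[i]) > length($4))       kind = "ins"
        else if (length(alts[i]) < length($4))       kind = "del"
        else                                         kind = "mnp_or_complex"
        print $1 "\t" $2 "\t" last "\t" alts[i] "\t" kind
      }
    }
  '
}

# Example: a multiallelic site with one deletion and one insertion allele.
printf 'chr1\t100\t.\tACG\tA,ACGT\t.\t.\t.\n' | vcf_ref_span
```

This classification is only meaningful on normalized records; an unnormalized indel can carry extra padding bases that change the apparent lengths.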
Practical Conventions
- Compress and index VCFs for fast random access: .vcf.gz + .tbi/.csi.
- Keep contig names consistent across reference, BAM/CRAM, and VCF files.
- Use normalized, decomposed records for robust comparisons.
- Preserve full metadata (##INFO, ##FORMAT, ##FILTER) when transforming files.
- Validate sample order and identity before cohort merges.
Header sanity checks
VCF headers are a quick preflight check before filtering, merging, or annotation.
- ##fileformat confirms parser compatibility expectations.
- ##contig should match your reference naming/length conventions.
- ##INFO, ##FORMAT, and ##FILTER definitions should exist for used fields.
- #CHROM ... sample columns should match expected sample IDs/order.
# inspect key header metadata
bcftools view -h input.vcf.gz | rg '^##(fileformat|contig|INFO|FORMAT|FILTER)'
# list sample names in header order
bcftools query -l input.vcf.gz
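The "definitions should exist for used fields" check can be automated in one pass, since header lines always precede records. A minimal sketch for INFO keys only, assuming plain VCF text on standard input (the name vcf_undefined_info is hypothetical):

```shell
# Report INFO keys that appear in records but have no ##INFO definition.
# vcf_undefined_info is an illustrative name; input is plain VCF text on stdin.
vcf_undefined_info() {
  awk -F '\t' '
    /^##INFO=<ID=/ {
      id = $0
      sub(/^##INFO=<ID=/, "", id)         # drop the prefix
      sub(/[,>].*$/, "", id)              # keep text up to the first "," or ">"
      defined[id] = 1
      next
    }
    /^#/ { next }                         # other metadata and the header line
    $8 != "." {
      n = split($8, items, ";")
      for (i = 1; i <= n; i++) {
        key = items[i]
        sub(/=.*$/, "", key)              # strip "=value" if present
        if (key != "" && !(key in defined) && !(key in seen)) {
          seen[key] = 1                   # report each undefined key once
          print key
        }
      }
    }
  '
}

# Example: DP is defined, AF and DB are not.
{
  printf '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t1\t.\tA\tG\t50\tPASS\tDP=14;AF=0.5;DB\n'
} | vcf_undefined_info
```

On a compressed file, piping the output of bcftools view input.vcf.gz into the function should work, since that command writes the header followed by the records as text.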
Common Pitfalls
- Mixing chr and non-chr contig naming across inputs.
- Comparing unnormalized indels and getting false discordance.
- Dropping header metadata during manual edits/reformatting.
- Misinterpreting missing values (.) vs true zero values.
- Treating QUAL as equivalent to genotype quality (GQ).
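The contig-naming pitfall is cheap to detect before a merge. As a rough sketch, the function below labels the naming style used in the CHROM column of one input; running it on each file and comparing the labels flags a mismatch. It assumes plain VCF text on standard input, and the name vcf_contig_naming is hypothetical.

```shell
# Classify the CHROM naming style of a VCF stream as one of:
# chr-prefixed, unprefixed, mixed, or no-records.
# vcf_contig_naming is an illustrative name; input is plain VCF text on stdin.
vcf_contig_naming() {
  awk -F '\t' '
    /^#/        { next }                # skip metadata and header lines
    $1 ~ /^chr/ { pre++; next }         # e.g. chr1, chrX
                { bare++ }              # e.g. 1, X, MT
    END {
      if (pre > 0 && bare > 0) print "mixed"
      else if (pre > 0)        print "chr-prefixed"
      else if (bare > 0)       print "unprefixed"
      else                     print "no-records"
    }
  '
}

# Example: two records with inconsistent naming.
printf 'chr1\t5\t.\tA\tG\t.\t.\t.\n1\t9\t.\tC\tT\t.\t.\t.\n' | vcf_contig_naming
```

Matching labels do not prove that two files use the same reference build; the ##contig lengths in the header remain the stronger check.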
Common Uses
- SNP/indel call representation
- Variant filtering and annotation
- Cohort and population analyses
Useful VCF Tools
Header provenance (run logs)
Of the two tools listed below (bcftools and tabix), only bcftools routinely writes command provenance into the header when it produces VCF/BCF output (for example via view or norm).
It adds header lines named after the subcommand, such as ##bcftools_viewVersion and ##bcftools_viewCommand, unless run with --no-version.
Within bcftools, subcommands such as query and stats usually produce tabular/text output rather than VCF, so they leave no header provenance; tabix builds an index file and does not edit header metadata.
# inspect provenance lines in a VCF header
bcftools view -h input.vcf.gz | rg '^##(bcftools|source)'
bcftools (view, query, norm, stats)
bcftools is one tool with multiple subcommands for filtering, querying, normalization, and QC.
bcftools view
Filter and subset variant records.
# keep PASS variants only
bcftools view -f PASS input.vcf.gz -Oz -o pass.vcf.gz
# keep only biallelic SNPs
bcftools view -m2 -M2 -v snps input.vcf.gz -Oz -o snps.biallelic.vcf.gz
bcftools query
Extract tabular fields for reporting.
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%QUAL\n' input.vcf.gz > variants.tsv
bcftools norm
Normalize and split multiallelic records.
bcftools norm \
-f reference.fasta \
-m -any input.vcf.gz -Oz -o input.norm.vcf.gz
bcftools stats
Generate summary statistics for QC.
bcftools stats input.vcf.gz > input.vcf.stats.txt
tabix
Index bgzipped VCF for region queries.
tabix -p vcf input.vcf.gz
2026-03-08 19:00 (Last updated: 2026-03-11 02:45)