FASTA File Format
FASTA is a plain-text sequence format used to store nucleotide or protein sequences.
TL;DR
- FASTA records use > headers followed by sequence lines.
- Most tools use the first header token as the sequence ID.
- Keep IDs unique and sequence lines clean (no spaces/symbol noise).
- samtools faidx enables fast indexed region extraction.
- seqkit, seqtk, and bioawk cover common filtering, sampling, and inspection tasks.
Structure
A FASTA record begins with a header line starting with >, followed by one or more sequence lines.
>seq1 Homo sapiens example
ATGCGTACGTAGCTAGCTAG
Most FASTA files contain many records:
>chr1
NNNNATGCGT...
>chr2
TTGCAAGT...
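Because every record has exactly one header line, the record structure can be inspected with standard Unix tools. The following is a minimal sketch, not a full parser: demo.fasta and the helper names fasta_count and fasta_ids are made up for illustration, and the ID is assumed to be the header text up to the first whitespace.

```shell
# Create a tiny demo file (hypothetical name) so the commands run as-is.
printf '>seq1 Homo sapiens example\nATGCGTACGT\nAGCTAGCTAG\n>seq2\nTTGCAAGT\n' > demo.fasta

# Count records: one header line (starting with '>') per record.
fasta_count() {
  grep -c '^>' "$1"
}

# List sequence IDs: the header text after '>' up to the first whitespace.
fasta_ids() {
  awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

fasta_count demo.fasta
fasta_ids demo.fasta
```

The same two helpers work on any uncompressed FASTA file, since they only look at lines that begin with >.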
Practical Conventions
- Header lines (>...) are record identifiers plus optional description.
- Sequence is typically uppercase letters (A/C/G/T/N for DNA; amino-acid alphabet for proteins).
- Line wrapping is common (for example 60 or 80 chars/line), but not required by many tools.
- Large references are often distributed as compressed files (.fa.gz / .fasta.gz).
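The conventions above can be illustrated with a small normalization sketch that joins wrapped sequence lines, uppercases them, and reads compressed input as a stream. The file names and the helper fasta_unwrap_upper are hypothetical. Uppercasing is shown only for illustration: some references use lowercase to mark soft-masked regions, and that information would be lost.

```shell
# Demo input (hypothetical name): one record wrapped over two lines, in lowercase.
printf '>seq1 demo\nacgtac\ngtnn\n>seq2\nTTGC\n' > wrapped.fasta

# Join wrapped sequence lines into one line per record and uppercase them.
# Header lines are passed through unchanged. Reads files or standard input.
fasta_unwrap_upper() {
  awk '
    /^>/ { if (seq != "") print seq; seq = ""; print; next }
         { seq = seq toupper($0) }
    END  { if (seq != "") print seq }
  ' "$@"
}

fasta_unwrap_upper wrapped.fasta

# Compressed input can be streamed without unpacking it to disk.
gzip -c wrapped.fasta > wrapped.fasta.gz
gzip -dc wrapped.fasta.gz | fasta_unwrap_upper
```

Both invocations print the same four lines, which shows that wrapping and compression change the file layout but not the records themselves.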
Common Pitfalls
- Duplicate sequence IDs in headers can break downstream tools.
- Unexpected characters in sequence lines (spaces, digits, symbols) can cause parser errors.
- Some tools use only the first token in the header (the text before the first whitespace) as the ID, so headers that differ only in their description can collide.
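These pitfalls can be checked before a file is handed to downstream tools. The sketch below is one possible pre-flight check, not a validator for any particular tool: messy.fasta and the helper names are hypothetical, and the set of allowed sequence characters (letters plus * and -) is an assumption that may be too loose or too strict for a given pipeline.

```shell
# Demo input (hypothetical name): seq1 appears twice, and one sequence line
# contains a space and a digit.
printf '>seq1 first copy\nACGT\n>seq2\nAC GT1\n>seq1 second copy\nTTGA\n' > messy.fasta

# Report IDs (first header token) that occur more than once.
fasta_dup_ids() {
  awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort | uniq -d
}

# Report sequence lines containing anything other than letters, '*' or '-'.
# Output format: line_number:offending line
fasta_bad_lines() {
  awk '!/^>/ && /[^A-Za-z*-]/ { print NR ":" $0 }' "$1"
}

fasta_dup_ids messy.fasta
fasta_bad_lines messy.fasta
```

An empty result from both helpers means no duplicate first-token IDs and no unexpected characters were found under the stated assumptions.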
Common Uses
- Reference genomes
- Transcript/protein databases
- Input for alignment and search tools
Useful FASTA Tools
seqkit
Fast toolkit for everyday FASTA/FASTQ operations.
# stats summary
seqkit stats sequences.fasta
# filter by minimum sequence length
seqkit seq -m 1000 sequences.fasta > sequences.min1k.fasta
samtools faidx
Index FASTA and pull subregions quickly.
# build FASTA index (.fai)
samtools faidx reference.fasta
# extract a region (coordinates are 1-based and inclusive)
samtools faidx reference.fasta chr1:1000-2000
seqtk
Lightweight FASTA/FASTQ processing.
# sample 100 sequences (reproducible with seed)
seqtk sample -s42 sequences.fasta 100 > subset.fasta
bioawk
AWK-style filtering with FASTA awareness.
# print IDs and sequence lengths
bioawk -c fastx '{print $name, length($seq)}' sequences.fasta
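Where bioawk is not installed, a similar ID-and-length listing can be approximated with plain awk. This is a rough sketch under the assumption that the ID is the first header token; lengths.fasta and fasta_lengths are hypothetical names, and the output is tab-separated by choice rather than to match bioawk.

```shell
# Demo input (hypothetical name): seq1 is wrapped over two lines.
printf '>seq1 demo\nACGTAC\nGTNN\n>seq2\nTTGC\n' > lengths.fasta

# Print "ID<TAB>length" per record, summing over wrapped sequence lines.
fasta_lengths() {
  awk '
    /^>/ { if (id != "") print id "\t" len
           id = $1; sub(/^>/, "", id); len = 0; next }
         { len += length($0) }
    END  { if (id != "") print id "\t" len }
  ' "$1"
}

fasta_lengths lengths.fasta
```

Unlike bioawk, this version has no FASTQ awareness and assumes clean input, so it is best treated as a fallback for quick inspection.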