FASTA is a plain-text format for storing nucleotide or protein sequences.

TL;DR

  • FASTA records use > headers followed by sequence lines.
  • Most tools use the first header token as the sequence ID.
  • Keep IDs unique and sequence lines clean (no spaces, digits, or stray symbols).
  • samtools faidx enables fast indexed region extraction.
  • seqkit, seqtk, and bioawk cover common filtering, sampling, and inspection tasks.

Structure

A FASTA record consists of a header line that starts with >, followed by one or more sequence lines.

>seq1 Homo sapiens example
ATGCGTACGTAGCTAGCTAG

Most FASTA files contain many records:

>chr1
NNNNATGCGT...
>chr2
TTGCAAGT...
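
Because every record starts with a > header, counting header lines gives the number of records. A minimal sketch using standard grep; the function name count_records is hypothetical, and the file is assumed to be uncompressed:

```shell
# Hypothetical helper: count FASTA records by counting header lines.
# Assumes an uncompressed file in which every record starts with ">" in column 1.
count_records() {
  grep -c '^>' "$1"
}
```

Usage: count_records sequences.fasta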

Practical Conventions

  • Header lines (>...) hold the record identifier plus an optional description.
  • Sequences are typically written in uppercase letters (A/C/G/T/N for DNA; the amino-acid alphabet for proteins).
  • Line wrapping is common (for example, 60 or 80 characters per line), and most tools also accept unwrapped sequences. samtools faidx does require consistent line lengths within each record.
  • Large references are often distributed as compressed files (.fa.gz / .fasta.gz).
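
When a tool or a quick grep needs one sequence per line, wrapped records can be joined with plain awk. The sketch below is one way to do it, not a standard utility; unwrap_fasta is a hypothetical name, and it reads either a file argument or standard input:

```shell
# Hypothetical helper: join wrapped sequence lines so each record
# occupies exactly two lines (header, then the full sequence).
# Reads the named file, or standard input when no file is given.
unwrap_fasta() {
  awk '
    /^>/ { if (seq != "") print seq; print; seq = ""; next }  # header: flush previous sequence
         { seq = seq $0 }                                       # sequence line: append
    END  { if (seq != "") print seq }                           # flush the last record
  ' "$@"
}
```

Usage: unwrap_fasta sequences.fasta > sequences.unwrapped.fasta
For compressed input: gzip -dc sequences.fasta.gz | unwrap_fasta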

Common Pitfalls

  • Duplicate sequence IDs in headers can break downstream tools.
  • Unexpected characters in sequence lines (spaces, digits, symbols) cause parser errors.
  • Most tools use only the first header token (the text before the first whitespace) as the ID, so headers that differ only in their description still count as duplicates.
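
The first two pitfalls can be checked before running downstream tools. A minimal sketch with plain awk; both function names are hypothetical, and the allowed alphabet (letters plus * and -) is an assumption chosen to be permissive, so tighten it for your data:

```shell
# Hypothetical helper: print each ID that occurs more than once.
# The ID is taken as the first whitespace-delimited token after ">".
duplicate_ids() {
  awk '/^>/ { id = substr($1, 2); if (++seen[id] == 2) print id }' "$@"
}

# Hypothetical helper: report sequence lines that contain anything other
# than letters, "*" or "-" (an assumed alphabet, not a formal standard).
# Output format: <line number>: <offending line>
bad_sequence_lines() {
  awk '!/^>/ && /[^A-Za-z*-]/ { print NR ": " $0 }' "$@"
}
```

Usage: duplicate_ids sequences.fasta; bad_sequence_lines sequences.fasta
Both print nothing when the file passes the check.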

Common Uses

  • Reference genomes
  • Transcript/protein databases
  • Input for alignment and search tools

Useful FASTA Tools

seqkit

Fast toolkit for everyday FASTA/FASTQ operations.

# stats summary
seqkit stats sequences.fasta

# filter by minimum sequence length
seqkit seq -m 1000 sequences.fasta > sequences.min1k.fasta

samtools faidx

Index a FASTA file and extract subregions without reading the whole file.

# build FASTA index (.fai)
samtools faidx reference.fasta

# extract region (coordinates are 1-based and inclusive)
samtools faidx reference.fasta chr1:1000-2000
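
The .fai index is a tab-separated text file whose first two columns are the sequence name and its length, so it doubles as a quick source of sequence sizes. A minimal sketch, assuming the index sits next to the FASTA as <fasta>.fai (the default location); the function names are hypothetical:

```shell
# Hypothetical helper: print "name<TAB>length" for every indexed sequence.
# Assumes the index was already built with: samtools faidx <fasta>
fai_lengths() {
  cut -f1,2 "$1.fai"
}

# Hypothetical helper: total number of bases across all indexed sequences.
# printf "%.0f" avoids exponent notation for very large totals.
fai_total_length() {
  awk -F '\t' '{ total += $2 } END { printf "%.0f\n", total }' "$1.fai"
}
```

Usage: fai_lengths reference.fasta; fai_total_length reference.fasta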

seqtk

Lightweight FASTA/FASTQ processing.

# sample 100 sequences (reproducible with seed)
seqtk sample -s42 sequences.fasta 100 > subset.fasta

bioawk

AWK-style filtering with FASTA awareness.

# print IDs and sequence lengths
bioawk -c fastx '{print $name, length($seq)}' sequences.fasta
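
If bioawk is not installed, similar ID and length information can be produced with plain awk. This is a sketch rather than a drop-in replacement: seq_lengths is a hypothetical name, the tab-separated output is a choice made here, and only FASTA (not FASTQ) input is handled:

```shell
# Hypothetical helper: print "ID<TAB>length" for each FASTA record.
# The ID is the first whitespace-delimited token after ">";
# the length is summed over all sequence lines of the record.
seq_lengths() {
  awk '
    function emit() { if (id != "") printf "%s\t%d\n", id, len }
    /^>/ { emit(); id = substr($1, 2); len = 0; next }  # new record: report the previous one
         { len += length($0) }                            # sequence line: add its length
    END  { emit() }                                       # report the last record
  ' "$@"
}
```

Usage: seq_lengths sequences.fasta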