FASTA File Format

FASTA is a plain-text sequence format used to store nucleotide or protein sequences.

TL;DR

FASTA records use > headers followed by sequence lines.
Most tools use the first header token as the sequence ID.
Keep IDs unique and sequence lines clean (no spaces, digits, or stray symbols).
samtools faidx enables fast indexed region extraction.
seqkit, seqtk, and bioawk cover common filtering, sampling, and inspection tasks.

A FASTA record begins with a header line starting with > followed by one or more sequence lines.

>seq1 Homo sapiens example
ATGCGTACGTAGCTAGCTAG

Most FASTA files contain many records:

>chr1
NNNNATGCGT...
>chr2
TTGCAAGT...

Each header line (>...) holds a record identifier, optionally followed by a description.
Sequence lines typically contain uppercase letters (A/C/G/T/N for DNA; the amino-acid alphabet for proteins).
Line wrapping is common (for example, 60 or 80 characters per line), but many tools do not require it.
Large references are often distributed as compressed files (.fa.gz / .fasta.gz).
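
The record layout described above can be sketched as a small reader. This is a minimal illustration in Python, not the parser of any tool mentioned here; the names read_fasta and _make_record are hypothetical, and the sketch assumes well-formed input in which every sequence line follows a header.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (seq_id, description, sequence) for each record in `lines`."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line.strip():
            # Wrapped sequence lines are joined back into one string.
            chunks.append(line.strip())
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header: str, chunks: list) -> Tuple[str, str, str]:
    # First whitespace-delimited token is the ID; the rest is the description.
    parts = header.split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description, "".join(chunks)


example = """>seq1 Homo sapiens example
ATGCGTACGT
AGCTAGCTAG
>seq2
TTGCAAGT
"""
records = list(read_fasta(example.splitlines()))
for seq_id, description, sequence in records:
    print(seq_id, len(sequence), description)
```

For a compressed file, the same function can be fed from gzip.open(path, "rt") in the standard library, since it only needs an iterable of text lines.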

Duplicate sequence IDs in headers can break downstream tools.
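Unexpected characters in sequence lines (spaces, digits, symbols) can cause parser errors.
Most tools use only the first token of the header (the text before the first whitespace) as the ID, so headers that share a first token count as duplicates even when their descriptions differ.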
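
A quick pre-flight check for these pitfalls might look like the sketch below. It is an illustration only: check_fasta and DNA_ALPHABET are hypothetical names, and the strict A/C/G/T/N alphabet is an assumption that would need extending for IUPAC ambiguity codes or protein sequences.

```python
from collections import Counter

# Assumption: strict DNA alphabet; extend for IUPAC codes or proteins.
DNA_ALPHABET = set("ACGTN")


def check_fasta(lines, alphabet=DNA_ALPHABET):
    """Return (duplicate_ids, bad_lines) for an iterable of FASTA lines."""
    id_counts = Counter()
    bad_lines = []  # (line_number, sorted unexpected characters)
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # Only the first header token is treated as the ID.
            tokens = line[1:].split()
            id_counts[tokens[0] if tokens else ""] += 1
        elif line:
            # Case-insensitive comparison against the allowed alphabet.
            unexpected = set(line.upper()) - alphabet
            if unexpected:
                bad_lines.append((number, sorted(unexpected)))
    duplicates = sorted(i for i, n in id_counts.items() if n > 1)
    return duplicates, bad_lines


example = [
    ">seq1 first copy",
    "ATGCGTAC",
    ">seq1 second copy",
    "ATG CGT",
    ">seq2",
    "ACGT12NN",
]
duplicates, bad_lines = check_fasta(example)
print("duplicate IDs:", duplicates)
print("lines with unexpected characters:", bad_lines)
```

Here the two seq1 headers collide despite their different descriptions, and the lines containing a space or digits are reported with their line numbers.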

seqkit: a fast toolkit for everyday FASTA/FASTQ operations.

# stats summary
seqkit stats sequences.fasta

# filter by minimum sequence length
seqkit seq -m 1000 sequences.fasta > sequences.min1k.fasta

samtools faidx: index a FASTA file and pull subregions quickly.

# build FASTA index (.fai)
samtools faidx reference.fasta

# extract region chr1:1000-2000 (1-based, inclusive coordinates)
samtools faidx reference.fasta chr1:1000-2000
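
Why an index makes region extraction fast can be sketched in Python. Each .fai row records a sequence's name, length, byte offset of its first base, bases per line, and bytes per line, which is enough to seek straight to a region. The code below is not samtools' implementation; build_index and fetch are hypothetical names, and the sketch assumes uncompressed input, non-empty IDs, and uniformly wrapped lines within each record.

```python
import io


def build_index(handle):
    """Map each ID to (length, offset, line_bases, line_width), like a .fai row."""
    index = {}
    name = None
    position = 0  # byte position of the start of the current line
    for raw in iter(handle.readline, b""):
        if raw.startswith(b">"):
            name = raw[1:].split()[0].decode()  # assumes a non-empty ID
            # offset = byte position of the first base, right after the header
            index[name] = [0, position + len(raw), 0, 0]
        elif name is not None and raw.strip():
            entry = index[name]
            bases = len(raw.rstrip(b"\r\n"))
            if entry[2] == 0:  # first sequence line sets the line geometry
                entry[2], entry[3] = bases, len(raw)
            entry[0] += bases
        position += len(raw)
    return {key: tuple(value) for key, value in index.items()}


def fetch(handle, index, name, start, end):
    """Return bases start..end (1-based, inclusive) of `name` using the index."""
    length, offset, line_bases, line_width = index[name]
    end = min(end, length)
    if start < 1 or start > end:
        return ""
    first, last = start - 1, end - 1  # 0-based base positions
    begin_byte = offset + (first // line_bases) * line_width + first % line_bases
    end_byte = offset + (last // line_bases) * line_width + last % line_bases
    handle.seek(begin_byte)
    chunk = handle.read(end_byte - begin_byte + 1)
    # Drop the line terminators that fall inside the requested span.
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()


data = b">chr1 test\nACGTACGTAC\nGGGGCCCCTT\nAA\n>chr2\nTTGCAAGT\n"
handle = io.BytesIO(data)
index = build_index(handle)
print(index["chr1"])
print(fetch(handle, index, "chr1", 8, 13))
```

Only the bytes covering the requested region are read, so the cost does not depend on how large the rest of the file is. The same functions accept a file opened in binary mode in place of the in-memory buffer.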

seqtk: lightweight FASTA/FASTQ processing.

# sample 100 sequences (reproducible with seed)
seqtk sample -s42 sequences.fasta 100 > subset.fasta
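
The idea of seeded, reproducible subsampling can be illustrated in Python. This sketch is not seqtk's algorithm and will not pick the same records as seqtk for a given seed; read_records and sample_records are hypothetical names, and the whole file is assumed to fit in memory.

```python
import random


def read_records(lines):
    """Collect (header, sequence) pairs from an iterable of FASTA lines."""
    records = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif line and records:
            records[-1][1] += line
    return [tuple(record) for record in records]


def sample_records(records, k, seed):
    """Pick k records without replacement; the same seed gives the same pick."""
    rng = random.Random(seed)  # private generator, independent of global state
    chosen = set(rng.sample(range(len(records)), min(k, len(records))))
    # Keep the original file order among the chosen records.
    return [record for i, record in enumerate(records) if i in chosen]


lines = []
for i in range(1, 11):
    lines += [f">seq{i}", "ACGT" * i]
records = read_records(lines)

subset = sample_records(records, 3, seed=42)
again = sample_records(records, 3, seed=42)
for header, sequence in subset:
    print(f">{header}\n{sequence}")
```

Fixing the seed matters when the subset has to be regenerated later, for example to rerun an analysis on exactly the same records.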

bioawk: AWK-style filtering with FASTA awareness.

# print IDs and sequence lengths
bioawk -c fastx '{print $name, length($seq)}' sequences.fasta