FASTQ stores sequencing reads and their quality scores in a single text format.

TL;DR

  • FASTQ uses 4 lines per read: header, sequence, separator, quality.
  • Sequence and quality strings must be the same length.
  • Most modern pipelines assume Sanger/Illumina 1.8+ encoding (Phred+33).
  • FASTQ is usually compressed as .fastq.gz.
  • fastqc, fastp, seqkit, and seqtk are common day-to-day tools.

Structure

Each read is represented by 4 lines:

  1. @ header
  2. sequence
  3. + separator
  4. quality string

Example record:

@read1
ACGTACGTACGT
+
IIIIIIIIIIII

A typical FASTQ file contains millions of these 4-line records, one after another.
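
The 4-line layout can be read with a few lines of code. The following is a minimal sketch, assuming the strict 4-line form described above (no wrapped sequence lines); FastqRecord, parse_fastq, and open_fastq are hypothetical names chosen for illustration, not part of any library.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One read (hypothetical container, for illustration only)."""
    name: str       # header text without the leading "@"
    sequence: str
    quality: str


def open_fastq(path: str) -> TextIO:
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from text lines, assuming exactly 4 lines per read."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        try:
            sequence = next(it).rstrip("\r\n")
            separator = next(it).rstrip("\r\n")
            quality = next(it).rstrip("\r\n")
        except StopIteration:
            raise ValueError(f"truncated record starting at {header!r}") from None
        if not header.startswith("@"):
            raise ValueError(f"header does not start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"separator does not start with '+': {separator!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Parse the example record shown above.
example = "@read1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n"
records = list(parse_fastq(example.splitlines()))
print(records[0].name, len(records[0].sequence))
```

For a file on disk, the same generator can be fed from open_fastq("sample.fastq.gz"), since a text-mode file object iterates line by line.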

Practical Conventions

  • Header lines begin with @ and may include instrument/run metadata.
  • The + line may repeat the read ID or be just +.
  • Quality characters encode Phred scores (commonly Phred+33).
  • Files are often gzip-compressed (.fastq.gz) to reduce storage.
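
To make the quality encoding concrete: each quality character maps to a Phred score by subtracting the offset from its ASCII code, and a Phred score Q corresponds to an error probability of 10^(-Q/10). A minimal sketch, assuming Phred+33 unless stated otherwise; the function and constant names are hypothetical.

```python
from typing import List

PHRED33_OFFSET = 33  # Sanger / Illumina 1.8+
PHRED64_OFFSET = 64  # older Phred+64 files


def phred_scores(quality: str, offset: int = PHRED33_OFFSET) -> List[int]:
    """Convert a quality string to integer Phred scores (hypothetical helper)."""
    scores = [ord(ch) - offset for ch in quality]
    if any(score < 0 for score in scores):
        raise ValueError("quality character below the offset; wrong encoding?")
    return scores


def error_probability(score: int) -> float:
    """Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


# "I" is ASCII 73, so under Phred+33 it encodes Q = 73 - 33 = 40.
scores = phred_scores("IIIIIIIIIIII")
print(scores[0], error_probability(scores[0]))
```

Decoding the same string with offset 64 would give Q = 9 for every base, which is why mixing encodings silently distorts quality filtering.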

Common Pitfalls

  • Sequence and quality lengths not matching (invalid record).
  • Mixing quality encodings (older Phred+64 vs modern Phred+33).
  • Truncated files from interrupted transfers/downloads.
  • Paired-end files getting out of sync (R1 and R2 order mismatch).
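
Most of these pitfalls can be caught with simple checks before a pipeline runs. The sketch below is illustrative only: validate_lines, guess_offset, read_id, and first_pair_mismatch are hypothetical helpers, the encoding check is a rough heuristic (characters with ASCII codes below 59 cannot occur in the +64 encodings, but higher characters are ambiguous), and read IDs are assumed to pair up after dropping any trailing /1 or /2 and any comment after the first whitespace.

```python
from itertools import zip_longest
from typing import Iterable, List, Optional


def validate_lines(lines: List[str]) -> List[str]:
    """Return a list of problems found in a strict 4-line FASTQ."""
    lines = [ln.rstrip("\r\n") for ln in lines]
    problems = []
    leftover = len(lines) % 4
    if leftover:
        problems.append(f"{leftover} trailing line(s): file may be truncated")
    for start in range(0, len(lines) - leftover, 4):
        header, sequence, separator, quality = lines[start:start + 4]
        record_no = start // 4 + 1
        if not header.startswith("@"):
            problems.append(f"record {record_no}: header lacks '@'")
        if not separator.startswith("+"):
            problems.append(f"record {record_no}: separator lacks '+'")
        if len(sequence) != len(quality):
            problems.append(f"record {record_no}: sequence/quality length mismatch")
    return problems


def guess_offset(quality: str) -> Optional[int]:
    """Return 33 if the string can only be Phred+33, else None (ambiguous)."""
    if quality and min(ord(ch) for ch in quality) < 59:
        return 33
    return None


def read_id(header: str) -> str:
    """Reduce a header to a pair-independent ID (assumed naming scheme)."""
    parts = header[1:].split()
    name = parts[0] if parts else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def first_pair_mismatch(r1_headers: Iterable[str],
                        r2_headers: Iterable[str]) -> Optional[int]:
    """Return the 0-based index of the first out-of-sync pair, or None."""
    pairs = zip_longest(r1_headers, r2_headers)
    for index, (h1, h2) in enumerate(pairs):
        if h1 is None or h2 is None or read_id(h1) != read_id(h2):
            return index
    return None


bad = ["@r1", "ACGT", "+", "III", "@r2", "AC"]
print(validate_lines(bad))
print(first_pair_mismatch(["@a/1", "@b/1"], ["@b/2", "@a/2"]))
```

For gzip-compressed input, a truncated download often also surfaces as a decompression error when the file is read to the end, so reading the whole file once is a useful check in itself.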

Common Uses

  • Raw sequencing output
  • Read-level QC
  • Input for alignment and assembly pipelines

Useful FASTQ Tools

fastqc

Standard read-level quality control report.

fastqc sample_R1.fastq.gz sample_R2.fastq.gz

fastp

Fast all-in-one filtering/trimming with QC outputs.

fastp \
  -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
  -o sample_R1.clean.fastq.gz -O sample_R2.clean.fastq.gz \
  -h fastp.html -j fastp.json

seqkit

Convenient FASTA/FASTQ stats and filtering.

# summary stats
seqkit stats sample.fastq.gz

# keep reads with minimum length 75
seqkit seq -m 75 sample.fastq.gz > sample.min75.fastq

seqtk

Lightweight toolkit for sampling and format conversion.

# subsample 100000 reads reproducibly (fixed seed; reuse the same seed for R1 and R2 to keep pairs matched)
seqtk sample -s42 sample.fastq.gz 100000 > subset.fastq
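
To illustrate why a fixed seed makes subsampling reproducible, here is a minimal reservoir-sampling sketch in Python. It is a generic textbook technique shown for intuition, not a description of seqtk's internals; reservoir_sample is a hypothetical name, and the reads are plain strings standing in for full records.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int) -> List[T]:
    """Pick k items uniformly at random in one pass, using a seeded RNG."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for index, item in enumerate(items):
        if index < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                reservoir[slot] = item    # replace with probability k/(index+1)
    return reservoir


# Stand-in read names for two paired files of equal length.
r1 = [f"read{i}/1" for i in range(1000)]
r2 = [f"read{i}/2" for i in range(1000)]

# The random choices depend only on the seed and the position in the stream,
# so the same seed selects the same positions from both files.
subset1 = reservoir_sample(r1, 10, seed=42)
subset2 = reservoir_sample(r2, 10, seed=42)
print(subset1[:3], subset2[:3])
```

Using a different seed for the second file would select different positions and break the pairing between the two subsets.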