FASTQ File Format
FASTQ is a plain-text format that stores sequencing reads together with their per-base quality scores.
TL;DR
- FASTQ uses 4 lines per read: header, sequence, separator, quality.
- Sequence and quality strings must be the same length.
- Most modern pipelines assume Sanger/Illumina 1.8+ encoding (Phred+33).
- FASTQ is usually compressed as .fastq.gz.
- fastqc, fastp, seqkit, and seqtk are common day-to-day tools.
Structure
Each read is represented by 4 lines:
1. @ header
2. sequence
3. + separator
4. quality string
@read1
ACGTACGTACGT
+
IIIIIIIIIIII
A typical FASTQ file contains millions of these 4-line records, one after another.
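The 4-line layout is simple enough to parse by hand. The following is a minimal sketch, assuming strict 4-line records with no wrapped sequence lines; `FastqRecord` and `read_fastq` are illustrative names, not part of any library.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str    # text after the leading "@"
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a strict 4-line-per-read FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # clean end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline()
        separator = handle.readline()
        quality = handle.readline()
        if not quality:
            raise ValueError("truncated record at end of file")
        header, sequence, separator, quality = (
            s.rstrip("\r\n") for s in (header, sequence, separator, quality)
        )
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"expected '+' separator, got {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"length mismatch in read {header[1:]!r}")
        yield FastqRecord(header[1:], sequence, quality)


# The example record shown above, parsed from an in-memory stream.
example = io.StringIO("@read1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n")
records = list(read_fastq(example))
print(records[0].header, len(records[0].sequence))
```

For real data, a maintained parser or one of the tools listed later is usually a better choice; the sketch only makes the record structure concrete.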
Practical Conventions
- Header lines begin with @ and may include instrument/run metadata.
- The + line may repeat the read ID or be just +.
- Quality characters encode Phred scores (commonly Phred+33).
- Files are often gzip-compressed (.fastq.gz) to reduce storage.
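To make the quality encoding concrete: under Phred+33, each character's ASCII code minus 33 is the Phred score Q, and Q relates to the error probability P by Q = -10 * log10(P). The sketch below decodes a quality string and opens plain or gzip-compressed files; the helper names (`phred_scores`, `error_probability`, `open_fastq`) are hypothetical, and choosing the reader by file suffix is an assumption of this example.

```python
import gzip
from typing import List

PHRED_OFFSET = 33  # Sanger / Illumina 1.8+; older Illumina files used 64


def phred_scores(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a quality string to integer Phred scores."""
    scores = [ord(ch) - offset for ch in quality]
    if scores and min(scores) < 0:
        raise ValueError("character below offset; wrong encoding?")
    return scores


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def open_fastq(path: str):
    """Open a plain or gzip-compressed FASTQ file as text, by suffix."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


# 'I' is ASCII 73, so under Phred+33 it encodes Q = 73 - 33 = 40.
print(phred_scores("IIIIIIIIIIII")[:3])  # [40, 40, 40]
print(error_probability(40))
```

A score of 40 corresponds to roughly one expected error in 10,000 base calls.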
Common Pitfalls
- Sequence and quality lengths not matching (invalid record).
- Mixing quality encodings (older Phred+64 vs modern Phred+33).
- Truncated files from interrupted transfers/downloads.
- Paired-end files getting out of sync (R1 and R2 order mismatch).
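Several of these pitfalls can be caught with quick checks before a pipeline runs. The sketch below is one possible set of checks, not a standard tool: all function names are hypothetical, the encoding guess is a heuristic based on the observed character range, and the read-ID normalization assumes the common conventions of a trailing /1 or /2 or a comment after whitespace.

```python
import gzip
import zlib
from itertools import zip_longest
from typing import Iterable, Optional


def guess_phred_offset(quality_strings: Iterable[str]) -> Optional[int]:
    """Heuristically guess 33 or 64 from the range of quality characters."""
    lowest = highest = None
    for quality in quality_strings:
        if not quality:
            continue
        lo, hi = ord(min(quality)), ord(max(quality))
        lowest = lo if lowest is None else min(lowest, lo)
        highest = hi if highest is None else max(highest, hi)
    if lowest is None:
        return None
    if lowest < 59:    # below ';': cannot occur under a +64 encoding
        return 33
    if highest > 74:   # above 'J': unusual for Illumina 1.8+ Phred+33
        return 64
    return None        # ambiguous range; inspect more reads


def gzip_is_complete(path: str) -> bool:
    """Return False if the gzip stream is truncated or corrupt."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(1 << 20):  # decompress in 1 MiB chunks
                pass
    except (EOFError, OSError, zlib.error):
        return False
    return True


def read_id(header: str) -> str:
    """Reduce a header to a mate-independent ID (assumed conventions)."""
    parts = header.lstrip("@").split(maxsplit=1)
    name = parts[0] if parts else ""
    if name.endswith(("/1", "/2")):  # older style: read/1 and read/2
        name = name[:-2]
    return name


def first_pair_mismatch(r1_headers: Iterable[str],
                        r2_headers: Iterable[str]) -> Optional[int]:
    """Return the index of the first out-of-sync pair, or None if in sync."""
    pairs = zip_longest(r1_headers, r2_headers)
    for index, (h1, h2) in enumerate(pairs):
        if h1 is None or h2 is None or read_id(h1) != read_id(h2):
            return index
    return None


print(guess_phred_offset(["IIII#"]))                  # '#' is below ';'
print(first_pair_mismatch(["@r1/1", "@r2/1"], ["@r1/2", "@r3/2"]))
```

The encoding guess is only as good as the reads it sees, so it is best run on a reasonably large sample; an ambiguous result (`None`) means the observed characters are valid under both offsets.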
Common Uses
- Raw sequencing output
- Read-level QC
- Input for alignment and assembly pipelines
Useful FASTQ Tools
fastqc
Standard read-level quality control report.
fastqc sample_R1.fastq.gz sample_R2.fastq.gz
fastp
Fast all-in-one filtering/trimming with QC outputs.
fastp \
-i sample_R1.fastq.gz -I sample_R2.fastq.gz \
-o sample_R1.clean.fastq.gz -O sample_R2.clean.fastq.gz \
-h fastp.html -j fastp.json
seqkit
Convenient FASTA/FASTQ stats and filtering.
# summary stats
seqkit stats sample.fastq.gz
# keep reads with minimum length 75
seqkit seq -m 75 sample.fastq.gz > sample.min75.fastq
seqtk
Lightweight toolkit for sampling and format conversion.
# subsample reads reproducibly
seqtk sample -s42 sample.fastq.gz 100000 > subset.fastq
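To illustrate what reproducible subsampling to a fixed read count involves, here is a sketch of reservoir sampling (Algorithm R) with a fixed seed. It is a generic technique shown for illustration and makes no claim about how seqtk is implemented; `reservoir_sample` is a hypothetical helper.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int = 42) -> List[T]:
    """Draw up to k items uniformly at random in a single pass."""
    rng = random.Random(seed)  # private generator, so results are repeatable
    reservoir: List[T] = []
    for index, item in enumerate(items):
        if index < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                reservoir[slot] = item    # replace with probability k/(index+1)
    return reservoir


reads = [f"read{i}" for i in range(1000)]
subset = reservoir_sample(reads, 10, seed=42)
print(len(subset))  # 10
```

In this sketch the random draws depend only on the seed and the record position, so sampling R1 and R2 streams of equal length with the same seed selects the same positions and keeps mates paired. The output is not in original file order.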