<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>Posts on The Bewildered Bioinformatician</title>
        <link>/posts/</link>
        <description>Recent content in Posts on The Bewildered Bioinformatician</description>
        <generator>Hugo -- gohugo.io</generator>
        <language>en</language>
        <copyright>&lt;a href=&#34;https://creativecommons.org/licenses/by-nc/4.0/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;CC BY-NC 4.0&lt;/a&gt;</copyright>
        <lastBuildDate>Tue, 10 Mar 2026 19:30:00 +0000</lastBuildDate>
        <atom:link href="/posts/index.xml" rel="self" type="application/rss+xml" />
        
        <item>
            <title>FASTA File Format</title>
            <link>/posts/2026/03/fasta-file-format/</link>
            <pubDate>Wed, 04 Mar 2026 19:00:00 +0000</pubDate>
            
            <guid>/posts/2026/03/fasta-file-format/</guid>
            <description>&lt;p&gt;FASTA is a plain-text sequence format used to store nucleotide or protein sequences.&lt;/p&gt;
&lt;h2 id=&#34;tldr&#34;&gt;
  TL;DR
  &lt;a class=&#34;heading-anchor&#34; href=&#34;#tldr&#34; data-anchor=&#34;tldr&#34; aria-label=&#34;Copy link to this section&#34; title=&#34;Copy section link&#34;&gt;
    &lt;svg class=&#34;heading-anchor-icon&#34; xmlns=&#34;http://www.w3.org/2000/svg&#34; width=&#34;16&#34; height=&#34;16&#34; viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34; stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; aria-hidden=&#34;true&#34; focusable=&#34;false&#34;&gt;
      &lt;path d=&#34;M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23&#34;&gt;&lt;/path&gt;
      &lt;path d=&#34;M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77&#34;&gt;&lt;/path&gt;
    &lt;/svg&gt;
  &lt;/a&gt;
&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;FASTA records use &lt;code&gt;&amp;gt;&lt;/code&gt; headers followed by sequence lines.&lt;/li&gt;
&lt;li&gt;Most tools use the first header token as the sequence ID.&lt;/li&gt;
&lt;li&gt;Keep IDs unique and sequence lines free of spaces, digits, and stray symbols.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;samtools faidx&lt;/code&gt; enables fast indexed region extraction.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;seqkit&lt;/code&gt;, &lt;code&gt;seqtk&lt;/code&gt;, and &lt;code&gt;bioawk&lt;/code&gt; cover common filtering, sampling, and inspection tasks.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;structure&#34;&gt;
  Structure
  &lt;a class=&#34;heading-anchor&#34; href=&#34;#structure&#34; data-anchor=&#34;structure&#34; aria-label=&#34;Copy link to this section&#34; title=&#34;Copy section link&#34;&gt;
    &lt;svg class=&#34;heading-anchor-icon&#34; xmlns=&#34;http://www.w3.org/2000/svg&#34; width=&#34;16&#34; height=&#34;16&#34; viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34; stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; aria-hidden=&#34;true&#34; focusable=&#34;false&#34;&gt;
      &lt;path d=&#34;M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23&#34;&gt;&lt;/path&gt;
      &lt;path d=&#34;M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77&#34;&gt;&lt;/path&gt;
    &lt;/svg&gt;
  &lt;/a&gt;
&lt;/h2&gt;
&lt;p&gt;A FASTA record begins with a header line starting with &lt;code&gt;&amp;gt;&lt;/code&gt; followed by one or more sequence lines.&lt;/p&gt;</description>
            <content type="html"><![CDATA[<p>FASTA is a plain-text sequence format used to store nucleotide or protein sequences.</p>
<h2 id="tldr">
  TL;DR
  <a class="heading-anchor" href="#tldr" data-anchor="tldr" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>FASTA records use <code>&gt;</code> headers followed by sequence lines.</li>
<li>Most tools use the first header token as the sequence ID.</li>
<li>Keep IDs unique and sequence lines free of spaces, digits, and stray symbols.</li>
<li><code>samtools faidx</code> enables fast indexed region extraction.</li>
<li><code>seqkit</code>, <code>seqtk</code>, and <code>bioawk</code> cover common filtering, sampling, and inspection tasks.</li>
</ul>
<h2 id="structure">
  Structure
  <a class="heading-anchor" href="#structure" data-anchor="structure" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<p>A FASTA record begins with a header line starting with <code>&gt;</code> followed by one or more sequence lines.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">&gt;seq1 Homo sapiens example
</span></span><span class="line"><span class="cl">ATGCGTACGTAGCTAGCTAG
</span></span></code></pre></div><p>Most FASTA files contain many records:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">&gt;chr1
</span></span><span class="line"><span class="cl">NNNNATGCGT...
</span></span><span class="line"><span class="cl">&gt;chr2
</span></span><span class="line"><span class="cl">TTGCAAGT...
</span></span></code></pre></div><h2 id="practical-conventions">
  Practical Conventions
  <a class="heading-anchor" href="#practical-conventions" data-anchor="practical-conventions" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>A header line (<code>&gt;...</code>) holds the record identifier, optionally followed by a description.</li>
<li>Sequences are typically written in uppercase letters (<code>A/C/G/T/N</code> for DNA; the amino-acid alphabet for proteins).</li>
<li>Line wrapping is common (for example, 60 or 80 characters per line), but many tools do not require it.</li>
<li>Large references are often distributed as compressed files (<code>.fa.gz</code> / <code>.fasta.gz</code>).</li>
</ul>
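<p>The conventions above can be sketched as a small reader. This is a minimal illustration, not how any particular tool parses FASTA: the function names <code>read_fasta</code> and <code>split_header</code> are hypothetical, and the sketch assumes the whole input fits in memory as a list of lines.</p>

```python
from typing import Iterable, Iterator, List, Tuple


def split_header(header: str) -> Tuple[str, str]:
    """Split header text (without the leading marker) into ID and description."""
    parts = header.split(None, 1)  # split on the first run of whitespace only
    record_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return record_id, description


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (record_id, description, sequence) for each FASTA record."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield (*split_header(header), "".join(chunks))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # wrapped sequence lines are joined back together
    if header is not None:
        yield (*split_header(header), "".join(chunks))


# Hypothetical two-record input with one wrapped sequence.
example = """>seq1 Homo sapiens example
ATGCGTACGT
AGCTAGCTAG
>seq2
TTGCAAGT
""".splitlines()

records = list(read_fasta(example))
for record_id, description, sequence in records:
    print(record_id, len(sequence), description)
```

<p>Joining the wrapped lines means the reader returns the same sequence whether the file is wrapped at 60 characters, 80 characters, or not at all.</p>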
<h2 id="common-pitfalls">
  Common Pitfalls
  <a class="heading-anchor" href="#common-pitfalls" data-anchor="common-pitfalls" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Duplicate sequence IDs in headers can break downstream tools.</li>
<li>Unexpected characters in sequence lines (spaces, digits, symbols) can cause parser errors.</li>
<li>Many tools use only the first header token (the text before the first whitespace) as the ID, so headers that differ only in their description still count as duplicates.</li>
</ul>
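<p>A quick pre-flight check can catch the first two pitfalls before a pipeline does. The sketch below is an assumption-laden illustration: <code>check_fasta</code> is a hypothetical helper, and the allowed alphabet (<code>ACGTN</code>, case-insensitive) is a deliberately strict choice for DNA that would need widening for IUPAC codes or proteins.</p>

```python
from collections import Counter
from typing import Iterable, List, Set

# Assumption: plain DNA only. Widen this set for IUPAC codes or proteins.
DNA_ALPHABET = set("ACGTN")


def check_fasta(lines: Iterable[str], alphabet: Set[str] = DNA_ALPHABET) -> List[str]:
    """Return human-readable problems found in an iterable of FASTA lines."""
    problems = []
    id_counts = Counter()
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            tokens = line[1:].split()
            # Only the first token counts as the ID, as many tools assume.
            id_counts[tokens[0] if tokens else ""] += 1
        elif line:
            bad = sorted(set(line.upper()) - alphabet)
            if bad:
                problems.append(f"line {number}: unexpected characters {bad}")
    for record_id, count in id_counts.items():
        if count > 1:
            problems.append(f"duplicate ID {record_id!r} appears {count} times")
    return problems


# Hypothetical input: one dirty sequence line and one repeated ID.
example = [
    ">seq1 first copy",
    "ATGC GT12",
    ">seq1 second copy",
    "ACGTN",
]
problems = check_fasta(example)
for problem in problems:
    print(problem)
```

<p>Here both headers collapse to the ID <code>seq1</code> even though their descriptions differ, which is exactly how the collision would appear to a tool that reads only the first token.</p>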
<h2 id="common-uses">
  Common Uses
  <a class="heading-anchor" href="#common-uses" data-anchor="common-uses" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Reference genomes</li>
<li>Transcript/protein databases</li>
<li>Input for alignment and search tools</li>
</ul>
<h2 id="useful-fasta-tools">
  Useful FASTA Tools
  <a class="heading-anchor" href="#useful-fasta-tools" data-anchor="useful-fasta-tools" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<h3 id="seqkit">
  seqkit
  <a class="heading-anchor" href="#seqkit" data-anchor="seqkit" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Fast toolkit for everyday FASTA/FASTQ operations.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># stats summary</span>
</span></span><span class="line"><span class="cl">seqkit stats sequences.fasta
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># filter by minimum sequence length</span>
</span></span><span class="line"><span class="cl">seqkit seq -m <span class="m">1000</span> sequences.fasta &gt; sequences.min1k.fasta
</span></span></code></pre></div><h3 id="samtools-faidx">
  samtools faidx
  <a class="heading-anchor" href="#samtools-faidx" data-anchor="samtools-faidx" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Index FASTA and pull subregions quickly.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># build FASTA index (.fai)</span>
</span></span><span class="line"><span class="cl">samtools faidx reference.fasta
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># extract region (coordinates are 1-based and inclusive)</span>
</span></span><span class="line"><span class="cl">samtools faidx reference.fasta chr1:1000-2000
</span></span></code></pre></div><h3 id="seqtk">
  seqtk
  <a class="heading-anchor" href="#seqtk" data-anchor="seqtk" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Lightweight FASTA/FASTQ processing.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># sample 100 sequences (reproducible with seed)</span>
</span></span><span class="line"><span class="cl">seqtk sample -s42 sequences.fasta <span class="m">100</span> &gt; subset.fasta
</span></span></code></pre></div><h3 id="bioawk">
  bioawk
  <a class="heading-anchor" href="#bioawk" data-anchor="bioawk" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>AWK-style filtering with FASTA awareness.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># print IDs and sequence lengths</span>
</span></span><span class="line"><span class="cl">bioawk -c fastx <span class="s1">&#39;{print $name, length($seq)}&#39;</span> sequences.fasta
</span></span></code></pre></div>]]></content>
        </item>
        
        <item>
            <title>FASTQ File Format</title>
            <link>/posts/2026/03/fastq-file-format/</link>
            <pubDate>Thu, 05 Mar 2026 19:00:00 +0000</pubDate>
            
            <guid>/posts/2026/03/fastq-file-format/</guid>
            <description>&lt;p&gt;FASTQ stores sequencing reads and their quality scores in a single text format.&lt;/p&gt;
&lt;h2 id=&#34;tldr&#34;&gt;
  TL;DR
  &lt;a class=&#34;heading-anchor&#34; href=&#34;#tldr&#34; data-anchor=&#34;tldr&#34; aria-label=&#34;Copy link to this section&#34; title=&#34;Copy section link&#34;&gt;
    &lt;svg class=&#34;heading-anchor-icon&#34; xmlns=&#34;http://www.w3.org/2000/svg&#34; width=&#34;16&#34; height=&#34;16&#34; viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34; stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; aria-hidden=&#34;true&#34; focusable=&#34;false&#34;&gt;
      &lt;path d=&#34;M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23&#34;&gt;&lt;/path&gt;
      &lt;path d=&#34;M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77&#34;&gt;&lt;/path&gt;
    &lt;/svg&gt;
  &lt;/a&gt;
&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;FASTQ uses 4 lines per read: header, sequence, separator, quality.&lt;/li&gt;
&lt;li&gt;Sequence and quality strings must be the same length.&lt;/li&gt;
&lt;li&gt;Most modern pipelines assume Sanger/Illumina 1.8+ encoding (&lt;code&gt;Phred+33&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;FASTQ is usually compressed as &lt;code&gt;.fastq.gz&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fastqc&lt;/code&gt;, &lt;code&gt;fastp&lt;/code&gt;, &lt;code&gt;seqkit&lt;/code&gt;, and &lt;code&gt;seqtk&lt;/code&gt; are common day-to-day tools.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;structure&#34;&gt;
  Structure
  &lt;a class=&#34;heading-anchor&#34; href=&#34;#structure&#34; data-anchor=&#34;structure&#34; aria-label=&#34;Copy link to this section&#34; title=&#34;Copy section link&#34;&gt;
    &lt;svg class=&#34;heading-anchor-icon&#34; xmlns=&#34;http://www.w3.org/2000/svg&#34; width=&#34;16&#34; height=&#34;16&#34; viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34; stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; aria-hidden=&#34;true&#34; focusable=&#34;false&#34;&gt;
      &lt;path d=&#34;M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23&#34;&gt;&lt;/path&gt;
      &lt;path d=&#34;M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77&#34;&gt;&lt;/path&gt;
    &lt;/svg&gt;
  &lt;/a&gt;
&lt;/h2&gt;
&lt;p&gt;Each read is represented by 4 lines:&lt;/p&gt;</description>
            <content type="html"><![CDATA[<p>FASTQ stores sequencing reads and their quality scores in a single text format.</p>
<h2 id="tldr">
  TL;DR
  <a class="heading-anchor" href="#tldr" data-anchor="tldr" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>FASTQ uses 4 lines per read: header, sequence, separator, quality.</li>
<li>Sequence and quality strings must be the same length.</li>
<li>Most modern pipelines assume Sanger/Illumina 1.8+ encoding (<code>Phred+33</code>).</li>
<li>FASTQ is usually compressed as <code>.fastq.gz</code>.</li>
<li><code>fastqc</code>, <code>fastp</code>, <code>seqkit</code>, and <code>seqtk</code> are common day-to-day tools.</li>
</ul>
<h2 id="structure">
  Structure
  <a class="heading-anchor" href="#structure" data-anchor="structure" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<p>Each read is represented by 4 lines:</p>
<ol>
<li><code>@</code> header</li>
<li>sequence</li>
<li><code>+</code> separator</li>
<li>quality string</li>
</ol>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">@read1
</span></span><span class="line"><span class="cl">ACGTACGTACGT
</span></span><span class="line"><span class="cl">+
</span></span><span class="line"><span class="cl">IIIIIIIIIIII
</span></span></code></pre></div><p>Most FASTQ files contain millions of repeated 4-line records.</p>
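<p>The 4-line layout can be read in fixed-size chunks. The following is a minimal sketch rather than a production parser: <code>read_fastq</code> is a hypothetical name, and the code assumes strictly 4-line records with no wrapped sequences.</p>

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from strictly 4-line FASTQ records."""
    iterator = iter(lines)
    while True:
        record = [line.rstrip("\r\n") for line in islice(iterator, 4)]
        if not record:
            return  # clean end of input
        if len(record) != 4:
            raise ValueError("truncated record: expected 4 lines")
        header, sequence, separator, quality = record
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        tokens = header[1:].split()
        yield (tokens[0] if tokens else ""), sequence, quality


# The record shown above, as a list of lines.
example = ["@read1", "ACGTACGTACGT", "+", "IIIIIIIIIIII"]
reads = list(read_fastq(example))
print(reads)
```

<p>Because the reader counts lines instead of searching for <code>@</code>, a quality string that happens to begin with <code>@</code> is not mistaken for a header.</p>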
<h2 id="practical-conventions">
  Practical Conventions
  <a class="heading-anchor" href="#practical-conventions" data-anchor="practical-conventions" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Header lines begin with <code>@</code> and may include instrument/run metadata.</li>
<li>The <code>+</code> separator line may repeat the read ID or consist of <code>+</code> alone.</li>
<li>Quality characters encode Phred scores (commonly <code>Phred+33</code>).</li>
<li>Files are often gzip-compressed (<code>.fastq.gz</code>) to reduce storage.</li>
</ul>
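<p>To make the quality encoding concrete, here is a small sketch of <code>Phred+33</code> decoding. The helper names <code>phred33_scores</code> and <code>error_probability</code> are hypothetical; the arithmetic relies on the standard definitions (score = ASCII code minus 33, and error probability = 10<sup>-Q/10</sup>).</p>

```python
from typing import List


def phred33_scores(quality: str) -> List[int]:
    """Decode a Phred+33 quality string into integer Phred scores."""
    return [ord(ch) - 33 for ch in quality]


def error_probability(score: int) -> float:
    """Convert a Phred score Q into the estimated base-call error probability."""
    return 10 ** (-score / 10)


# 'I' is ASCII 73, so each base in this string decodes to Q = 73 - 33 = 40.
scores = phred33_scores("IIIIIIIIIIII")
print(scores[0], error_probability(scores[0]))
```

<p>Decoding the same string with an offset of 64 would yield Q9 instead of Q40, which is why mixing encodings silently distorts quality filtering.</p>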
<h2 id="common-pitfalls">
  Common Pitfalls
  <a class="heading-anchor" href="#common-pitfalls" data-anchor="common-pitfalls" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Sequence and quality strings of different lengths (an invalid record).</li>
<li>Mixing quality encodings (older <code>Phred+64</code> vs modern <code>Phred+33</code>).</li>
<li>Truncated files from interrupted transfers/downloads.</li>
<li>Paired-end files out of sync (<code>R1</code> and <code>R2</code> records differing in order or number).</li>
</ul>
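<p>The paired-end pitfall can be checked by comparing read names record by record. This is a rough sketch under stated assumptions: <code>read_name</code> and <code>first_mismatch</code> are hypothetical helpers, both files are already loaded as lists of lines with strictly 4-line records, and mates share the first header token apart from an optional trailing <code>/1</code> or <code>/2</code>.</p>

```python
from itertools import zip_longest
from typing import List, Optional


def read_name(header: str) -> str:
    """Return the read name from an '@' header, without a trailing /1 or /2."""
    tokens = header[1:].split()
    name = tokens[0] if tokens else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def first_mismatch(r1_lines: List[str], r2_lines: List[str]) -> Optional[int]:
    """Return the 1-based index of the first out-of-sync record, or None."""
    r1_headers = r1_lines[0::4]  # every 4th line, starting at the header
    r2_headers = r2_lines[0::4]
    pairs = zip_longest(r1_headers, r2_headers)
    for index, (h1, h2) in enumerate(pairs, start=1):
        # A missing mate (None) means one file has fewer records.
        if h1 is None or h2 is None or read_name(h1) != read_name(h2):
            return index
    return None


# Hypothetical reads: the second record differs between the two files.
r1 = ["@read1/1", "ACGT", "+", "IIII", "@read2/1", "TTTT", "+", "IIII"]
r2 = ["@read1/2", "ACGT", "+", "IIII", "@read3/2", "GGGG", "+", "IIII"]
mismatch = first_mismatch(r1, r2)
print(mismatch)
```

<p>Using <code>zip_longest</code> instead of <code>zip</code> means a file that is merely shorter, for instance after a truncated download, is also reported as out of sync.</p>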
<h2 id="common-uses">
  Common Uses
  <a class="heading-anchor" href="#common-uses" data-anchor="common-uses" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Raw sequencing output</li>
<li>Read-level QC</li>
<li>Input for alignment and assembly pipelines</li>
</ul>
<h2 id="useful-fastq-tools">
  Useful FASTQ Tools
  <a class="heading-anchor" href="#useful-fastq-tools" data-anchor="useful-fastq-tools" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<h3 id="fastqc">
  fastqc
  <a class="heading-anchor" href="#fastqc" data-anchor="fastqc" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Standard read-level quality control report.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">fastqc sample_R1.fastq.gz sample_R2.fastq.gz
</span></span></code></pre></div><h3 id="fastp">
  fastp
  <a class="heading-anchor" href="#fastp" data-anchor="fastp" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Fast all-in-one filtering/trimming with QC outputs.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">fastp <span class="se">\
</span></span></span><span class="line"><span class="cl">  -i sample_R1.fastq.gz -I sample_R2.fastq.gz <span class="se">\
</span></span></span><span class="line"><span class="cl">  -o sample_R1.clean.fastq.gz -O sample_R2.clean.fastq.gz <span class="se">\
</span></span></span><span class="line"><span class="cl">  -h fastp.html -j fastp.json
</span></span></code></pre></div><h3 id="seqkit">
  seqkit
  <a class="heading-anchor" href="#seqkit" data-anchor="seqkit" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Convenient FASTA/FASTQ stats and filtering.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># summary stats</span>
</span></span><span class="line"><span class="cl">seqkit stats sample.fastq.gz
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># keep reads with minimum length 75</span>
</span></span><span class="line"><span class="cl">seqkit seq -m <span class="m">75</span> sample.fastq.gz &gt; sample.min75.fastq
</span></span></code></pre></div><h3 id="seqtk">
  seqtk
  <a class="heading-anchor" href="#seqtk" data-anchor="seqtk" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Lightweight toolkit for sampling and format conversion.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># subsample reads reproducibly</span>
</span></span><span class="line"><span class="cl">seqtk sample -s42 sample.fastq.gz <span class="m">100000</span> &gt; subset.fastq
</span></span></code></pre></div>]]></content>
        </item>
        
        <item>
            <title>BED File Format</title>
            <link>/posts/2026/03/bed-file-format/</link>
            <pubDate>Sat, 07 Mar 2026 19:00:00 +0000</pubDate>
            
            <guid>/posts/2026/03/bed-file-format/</guid>
            <description>&lt;p&gt;BED is a tab-delimited format for genomic intervals.&lt;/p&gt;
&lt;h2 id=&#34;tldr&#34;&gt;
  TL;DR
  &lt;a class=&#34;heading-anchor&#34; href=&#34;#tldr&#34; data-anchor=&#34;tldr&#34; aria-label=&#34;Copy link to this section&#34; title=&#34;Copy section link&#34;&gt;
    &lt;svg class=&#34;heading-anchor-icon&#34; xmlns=&#34;http://www.w3.org/2000/svg&#34; width=&#34;16&#34; height=&#34;16&#34; viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34; stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; aria-hidden=&#34;true&#34; focusable=&#34;false&#34;&gt;
      &lt;path d=&#34;M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23&#34;&gt;&lt;/path&gt;
      &lt;path d=&#34;M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77&#34;&gt;&lt;/path&gt;
    &lt;/svg&gt;
  &lt;/a&gt;
&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;BED uses 0-based, half-open coordinates: &lt;code&gt;[start, end)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Minimum valid BED record is 3 fields: &lt;code&gt;chrom&lt;/code&gt;, &lt;code&gt;start&lt;/code&gt;, &lt;code&gt;end&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;BED6 (&lt;code&gt;+ name, score, strand&lt;/code&gt;) is a common interoperable subset.&lt;/li&gt;
&lt;li&gt;BED12 extends BED for exon/block models (for example transcripts).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bedtools&lt;/code&gt; is the standard toolkit for overlap, merge, subtract, and coverage operations.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gxf2bed&lt;/code&gt; is useful for converting GTF/GFF annotations into BED coordinates.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;structure&#34;&gt;
  Structure
  &lt;a class=&#34;heading-anchor&#34; href=&#34;#structure&#34; data-anchor=&#34;structure&#34; aria-label=&#34;Copy link to this section&#34; title=&#34;Copy section link&#34;&gt;
    &lt;svg class=&#34;heading-anchor-icon&#34; xmlns=&#34;http://www.w3.org/2000/svg&#34; width=&#34;16&#34; height=&#34;16&#34; viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34; stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; aria-hidden=&#34;true&#34; focusable=&#34;false&#34;&gt;
      &lt;path d=&#34;M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23&#34;&gt;&lt;/path&gt;
      &lt;path d=&#34;M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77&#34;&gt;&lt;/path&gt;
    &lt;/svg&gt;
  &lt;/a&gt;
&lt;/h2&gt;
&lt;p&gt;A BED file is plain text, one interval per line, tab-delimited.&lt;/p&gt;</description>
            <content type="html"><![CDATA[<p>BED is a tab-delimited format for genomic intervals.</p>
<h2 id="tldr">
  TL;DR
  <a class="heading-anchor" href="#tldr" data-anchor="tldr" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>BED uses 0-based, half-open coordinates: <code>[start, end)</code>.</li>
<li>Minimum valid BED record is 3 fields: <code>chrom</code>, <code>start</code>, <code>end</code>.</li>
<li>BED6 (<code>+ name, score, strand</code>) is a common interoperable subset.</li>
<li>BED12 extends BED for exon/block models (for example transcripts).</li>
<li><code>bedtools</code> is the standard toolkit for overlap, merge, subtract, and coverage operations.</li>
<li><code>gxf2bed</code> is useful for converting GTF/GFF annotations into BED coordinates.</li>
</ul>
<h2 id="structure">
  Structure
  <a class="heading-anchor" href="#structure" data-anchor="structure" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<p>A BED file is plain text, one interval per line, tab-delimited.</p>
<p>Required BED3 fields:</p>
<ol>
<li>chromosome</li>
<li>start (0-based)</li>
<li>end (exclusive, so the interval is half-open)</li>
</ol>
<p>Simple BED3 example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">chr1	1000	2000
</span></span><span class="line"><span class="cl">chr1	2500	2600
</span></span><span class="line"><span class="cl">chr2	150	400
</span></span></code></pre></div><p>Common BED6 example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">chr1	1000	2000	peak_001	850	+
</span></span><span class="line"><span class="cl">chr1	2500	2600	peak_002	420	-
</span></span></code></pre></div><p>BED12 adds block structure (used for transcript/exon-style annotations), including fields like <code>thickStart</code>, <code>thickEnd</code>, <code>itemRgb</code>, <code>blockCount</code>, <code>blockSizes</code>, and <code>blockStarts</code>.</p>
<h3 id="coordinate-model-important">
  Coordinate model (important)
  <a class="heading-anchor" href="#coordinate-model-important" data-anchor="coordinate-model-important" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>BED uses 0-based start and end-exclusive coordinates.</p>
<ul>
<li>Interval <code>chr1 100 200</code> includes bases 101..200 in 1-based closed notation.</li>
<li>Length is <code>end - start</code> (here: <code>100</code> bp).</li>
</ul>
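<p>The arithmetic is small enough to get wrong by one, so here is a minimal Python sketch of the conversions. The function names are illustrative only and do not come from any library.</p>

```python
def bed_to_one_based(start, end):
    """Convert a 0-based half-open BED interval to 1-based closed coordinates."""
    return start + 1, end


def one_based_to_bed(first, last):
    """Convert a 1-based closed interval (GFF/GTF style) to BED coordinates."""
    return first - 1, last


def bed_length(start, end):
    """Number of bases covered by a BED interval."""
    return end - start


print(bed_to_one_based(100, 200))  # (101, 200)
print(one_based_to_bed(101, 200))  # (100, 200)
print(bed_length(100, 200))        # 100
```

<p>Only the start shifts by one in either direction; the end value is numerically the same in both systems.</p>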
<h2 id="practical-conventions">
  Practical Conventions
  <a class="heading-anchor" href="#practical-conventions" data-anchor="practical-conventions" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Use tabs, not spaces, between columns.</li>
<li>Keep chromosome naming consistent (<code>chr1</code> vs <code>1</code>) across all files in a workflow.</li>
<li>Ensure <code>start &lt; end</code> for every interval.</li>
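<li>Sort BED files (by chromosome, then start) for reproducible downstream behavior; some operations, such as <code>bedtools merge</code>, require sorted input.</li>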
<li>Prefer BED6 when strand-aware operations are needed.</li>
</ul>
<h2 id="common-pitfalls">
  Common Pitfalls
  <a class="heading-anchor" href="#common-pitfalls" data-anchor="common-pitfalls" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Mixing coordinate systems (BED is 0-based and half-open; GFF/GTF are 1-based and closed).</li>
<li>Unsorted BED files causing unexpected merge/intersect results.</li>
<li>Inconsistent chromosome naming between BED and reference/alignment files.</li>
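<li>Invalid intervals (<code>start &gt; end</code>) or negative starts; zero-length intervals (<code>start == end</code>) are allowed by the BED specification but can trip up some tools.</li>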
<li>Assuming all tools interpret optional columns the same way.</li>
</ul>
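<p>Several of these problems can be caught before any tool runs. The sketch below is a hypothetical line checker, not part of <code>bedtools</code>; it follows the stricter convention that <code>start</code> must be below <code>end</code>, and the name <code>check_bed_line</code> is made up for this post.</p>

```python
def check_bed_line(line, line_no=0):
    """Return a list of problems found in one BED line (empty if it looks fine)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        # space-delimited lines end up here because they do not split on tabs
        return [f"line {line_no}: fewer than 3 tab-separated fields"]
    try:
        start, end = int(fields[1]), int(fields[2])
    except ValueError:
        return [f"line {line_no}: start and end must be integers"]
    problems = []
    if start < 0:
        problems.append(f"line {line_no}: negative start {start}")
    if start >= end:
        problems.append(f"line {line_no}: start {start} is not below end {end}")
    if len(fields) >= 6 and fields[5] not in {"+", "-", "."}:
        problems.append(f"line {line_no}: unexpected strand {fields[5]!r}")
    return problems


print(check_bed_line("chr1\t1000\t2000\tpeak_001\t850\t+", 1))  # []
print(check_bed_line("chr1 1000 2000", 2))
print(check_bed_line("chr1\t2000\t1000", 3))
```

<p>It does not check sort order or chromosome naming, which need the whole file (and the reference) rather than a single line.</p>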
<h2 id="common-uses">
  Common Uses
  <a class="heading-anchor" href="#common-uses" data-anchor="common-uses" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Region annotation</li>
<li>Peak calls</li>
<li>Intersections and interval operations</li>
</ul>
<h2 id="useful-bed-tools">
  Useful BED Tools
  <a class="heading-anchor" href="#useful-bed-tools" data-anchor="useful-bed-tools" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<h3 id="bedtools-sort">
  bedtools sort
  <a class="heading-anchor" href="#bedtools-sort" data-anchor="bedtools-sort" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Sort BED intervals for stable downstream processing.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">bedtools sort -i regions.bed &gt; regions.sorted.bed
</span></span></code></pre></div><h3 id="bedtools-intersect">
  bedtools intersect
  <a class="heading-anchor" href="#bedtools-intersect" data-anchor="bedtools-intersect" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
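<p>Find overlaps between two interval sets. With <code>-wa -wb</code>, each output line holds the original <code>-a</code> record followed by the <code>-b</code> record it overlaps.</p>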
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">bedtools intersect <span class="se">\
</span></span></span><span class="line"><span class="cl">  -a peaks.sorted.bed <span class="se">\
</span></span></span><span class="line"><span class="cl">  -b genes.sorted.bed <span class="se">\
</span></span></span><span class="line"><span class="cl">  -wa -wb &gt; peaks_in_genes.tsv
</span></span></code></pre></div><h3 id="bedtools-merge">
  bedtools merge
  <a class="heading-anchor" href="#bedtools-merge" data-anchor="bedtools-merge" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Merge overlapping or adjacent intervals.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">bedtools merge -i regions.sorted.bed &gt; regions.merged.bed
</span></span></code></pre></div><h3 id="bedtools-coverage">
  bedtools coverage
  <a class="heading-anchor" href="#bedtools-coverage" data-anchor="bedtools-coverage" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Compute coverage of features over targets.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">bedtools coverage <span class="se">\
</span></span></span><span class="line"><span class="cl">  -a targets.sorted.bed <span class="se">\
</span></span></span><span class="line"><span class="cl">  -b alignments.bed &gt; target_coverage.tsv
</span></span></code></pre></div><h3 id="gxf2bed">
  gxf2bed
  <a class="heading-anchor" href="#gxf2bed" data-anchor="gxf2bed" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Convert GTF/GFF-style annotation files to BED.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># convert GTF to BED</span>
</span></span><span class="line"><span class="cl">gxf2bed -i annotations.gtf -o annotations.bed
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># convert GFF3 to BED</span>
</span></span><span class="line"><span class="cl">gxf2bed -i annotations.gff3 -o annotations.bed
</span></span></code></pre></div>]]></content>
        </item>
        
        <item>
            <title>BAM File Format</title>
            <link>/posts/2026/03/bam-file-format/</link>
            <pubDate>Fri, 06 Mar 2026 19:00:00 +0000</pubDate>
            
            <guid>/posts/2026/03/bam-file-format/</guid>
            <description>&lt;p&gt;BAM is the compressed binary form of SAM, used for aligned sequencing reads.&lt;/p&gt;
&lt;h2 id=&#34;tldr&#34;&gt;
  TL;DR
  &lt;a class=&#34;heading-anchor&#34; href=&#34;#tldr&#34; data-anchor=&#34;tldr&#34; aria-label=&#34;Copy link to this section&#34; title=&#34;Copy section link&#34;&gt;
    &lt;svg class=&#34;heading-anchor-icon&#34; xmlns=&#34;http://www.w3.org/2000/svg&#34; width=&#34;16&#34; height=&#34;16&#34; viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34; stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; aria-hidden=&#34;true&#34; focusable=&#34;false&#34;&gt;
      &lt;path d=&#34;M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23&#34;&gt;&lt;/path&gt;
      &lt;path d=&#34;M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77&#34;&gt;&lt;/path&gt;
    &lt;/svg&gt;
  &lt;/a&gt;
&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;BAM is binary/compressed SAM, so it is smaller and faster to process at scale.&lt;/li&gt;
&lt;li&gt;Coordinate-sorted BAM plus index (&lt;code&gt;.bai&lt;/code&gt; or &lt;code&gt;.csi&lt;/code&gt;) enables fast random-access region queries.&lt;/li&gt;
&lt;li&gt;Core SAM fields (especially &lt;code&gt;CIGAR&lt;/code&gt;) describe how each read aligns to the reference.&lt;/li&gt;
&lt;li&gt;Optional tags carry rich metadata, including alignment metrics (&lt;code&gt;NM&lt;/code&gt;, &lt;code&gt;MD&lt;/code&gt;, &lt;code&gt;AS&lt;/code&gt;) and modified bases (&lt;code&gt;MM&lt;/code&gt;, &lt;code&gt;ML&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;RG&lt;/code&gt; tags on reads map to header read groups, where &lt;code&gt;SM&lt;/code&gt; defines the sample name used by many downstream tools.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;samtools&lt;/code&gt; is the standard toolkit for sorting, indexing, filtering, and summary stats.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;structure&#34;&gt;
  Structure
  &lt;a class=&#34;heading-anchor&#34; href=&#34;#structure&#34; data-anchor=&#34;structure&#34; aria-label=&#34;Copy link to this section&#34; title=&#34;Copy section link&#34;&gt;
    &lt;svg class=&#34;heading-anchor-icon&#34; xmlns=&#34;http://www.w3.org/2000/svg&#34; width=&#34;16&#34; height=&#34;16&#34; viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34; stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; aria-hidden=&#34;true&#34; focusable=&#34;false&#34;&gt;
      &lt;path d=&#34;M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23&#34;&gt;&lt;/path&gt;
      &lt;path d=&#34;M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77&#34;&gt;&lt;/path&gt;
    &lt;/svg&gt;
  &lt;/a&gt;
&lt;/h2&gt;
&lt;p&gt;A BAM file contains:&lt;/p&gt;</description>
            <content type="html"><![CDATA[<p>BAM is the compressed binary form of SAM, used for aligned sequencing reads.</p>
<h2 id="tldr">
  TL;DR
  <a class="heading-anchor" href="#tldr" data-anchor="tldr" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>BAM is binary/compressed SAM, so it is smaller and faster to process at scale.</li>
<li>Coordinate-sorted BAM plus index (<code>.bai</code> or <code>.csi</code>) enables fast random-access region queries.</li>
<li>Core SAM fields (especially <code>CIGAR</code>) describe how each read aligns to the reference.</li>
<li>Optional tags carry rich metadata, including alignment metrics (<code>NM</code>, <code>MD</code>, <code>AS</code>) and modified bases (<code>MM</code>, <code>ML</code>).</li>
<li><code>RG</code> tags on reads map to header read groups, where <code>SM</code> defines the sample name used by many downstream tools.</li>
<li><code>samtools</code> is the standard toolkit for sorting, indexing, filtering, and summary stats.</li>
</ul>
<h2 id="structure">
  Structure
  <a class="heading-anchor" href="#structure" data-anchor="structure" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<p>A BAM file contains:</p>
<ol>
<li>Header (SAM-style metadata, binary encoded)</li>
<li>Alignment records (one per read/alignment)</li>
</ol>
<p>Conceptually, BAM represents the same fields as SAM (<code>QNAME</code>, <code>FLAG</code>, <code>RNAME</code>, <code>POS</code>, <code>MAPQ</code>, <code>CIGAR</code>, etc.), but in compressed binary form.</p>
<p>Minimal SAM-style example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">@HD	VN:1.6	SO:coordinate
</span></span><span class="line"><span class="cl">@SQ	SN:chr1	LN:248956422
</span></span><span class="line"><span class="cl">read0001	99	chr1	10001	60	10M1I15M2D24M	=	10120	180	ACGTT...	IIII...	NM:i:3
</span></span></code></pre></div><ul>
<li><code>@HD</code> and <code>@SQ</code> are header lines (format version, sort order, reference dictionary).</li>
<li>The alignment line is tab-delimited and includes core fields plus optional tags (for example <code>NM:i:3</code>).</li>
</ul>
<h3 id="cigar-quick-explanation">
  CIGAR quick explanation
  <a class="heading-anchor" href="#cigar-quick-explanation" data-anchor="cigar-quick-explanation" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>The CIGAR string encodes how the read aligns to the reference.</p>
<ul>
<li><code>M</code>: alignment match/mismatch block</li>
<li><code>I</code>: insertion in read relative to reference</li>
<li><code>D</code>: deletion relative to the reference (reference bases absent from the read)</li>
<li><code>S</code>: soft clipping (bases present in read, not aligned)</li>
<li><code>H</code>: hard clipping (bases removed from stored sequence)</li>
<li><code>N</code>: skipped region on reference (common in RNA-seq introns)</li>
</ul>
<p>For <code>10M1I15M2D24M</code>:</p>
<ul>
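<li>10 aligned bases, a 1-base insertion in the read, 15 aligned bases, a 2-base deletion (reference bases absent from the read), then 24 aligned bases. The read is 50 bases long (10+1+15+24) and spans 51 reference bases (10+15+2+24).</li>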
</ul>
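<p>The same bookkeeping can be done in a few lines of Python. This is a minimal sketch using the read-consuming and reference-consuming operation sets from the SAM specification; the function name <code>cigar_lengths</code> is illustrative.</p>

```python
import re

# operations that consume read bases / reference bases (per the SAM specification)
CONSUMES_READ = set("MIS=X")
CONSUMES_REF = set("MDN=X")


def cigar_lengths(cigar):
    """Return (read_length, reference_span) implied by a CIGAR string."""
    read_len = ref_span = 0
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        if op in CONSUMES_READ:
            read_len += int(count)
        if op in CONSUMES_REF:
            ref_span += int(count)
    return read_len, ref_span


print(cigar_lengths("10M1I15M2D24M"))  # (50, 51)
```

<p>Hard-clipped bases (<code>H</code>) count toward neither number, so the read length here is the length of the stored sequence, not necessarily of the original read.</p>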
<h3 id="modified-bases-in-sambam-tags">
  Modified bases in SAM/BAM tags
  <a class="heading-anchor" href="#modified-bases-in-sambam-tags" data-anchor="modified-bases-in-sambam-tags" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Modified bases are typically stored in optional tags, most commonly:</p>
<ul>
<li><code>MM</code>: modification type + positions (delta encoded)</li>
<li><code>ML</code>: per-site modification probabilities (byte scale)</li>
</ul>
<p>Example (schematic):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">read0002	0	chr1	20501	50	30M	*	0	0	ACGTC...	IIII...	MM:Z:C+m,5,12;	ML:B:C,220,180
</span></span></code></pre></div><ul>
<li><code>C+m</code> means 5mC calls on cytosines.</li>
<li>The list after the code (<code>5,12</code>) is delta-encoded: each value is the number of candidate bases of that type (here cytosines) to skip before the next call, not a read coordinate.</li>
<li><code>ML</code> values (here <code>220,180</code>) are probabilities scaled to 0-255, one per <code>MM</code> call and in the same order.</li>
</ul>
<p>Exact interpretation can vary by basecaller/tool version, so always confirm against your caller&rsquo;s spec when parsing modified-base tags.</p>
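<p>To make the encoding concrete, here is a simplified Python sketch that turns one <code>MM</code> entry into read positions and probabilities. It assumes a forward-strand record with a single <code>+</code> strand entry, and reports the midpoint of each <code>ML</code> bin as the probability; the function name <code>decode_mm_entry</code> and the toy sequence are invented for illustration. Real files should be parsed with a library that handles reverse-strand reads and multiple modification types.</p>

```python
def decode_mm_entry(seq, mm_entry, ml_values):
    """Map one MM entry such as 'C+m,5,12' to (read_position, probability) pairs.

    Assumes a forward-strand record and a '+' strand entry, so the skip
    counts apply directly to the stored sequence.
    """
    header, *deltas = mm_entry.rstrip(";").split(",")
    base = header[0]
    # 0-based read positions of every candidate base (every C for 'C+m')
    candidates = [i for i, b in enumerate(seq) if b == base]
    calls = []
    index = -1
    for delta, ml in zip(deltas, ml_values):
        # each number is how many candidate bases to skip before the next call
        index += int(delta) + 1
        # ML byte N covers probabilities N/256 to (N+1)/256; use the midpoint
        calls.append((candidates[index], (ml + 0.5) / 256))
    return calls


seq = "ACGCTCACC"  # toy read; cytosines at 0-based positions 1, 3, 5, 7, 8
for pos, prob in decode_mm_entry(seq, "C+m,1,0;", [220, 180]):
    print(pos, round(prob, 3))
# 3 0.861
# 5 0.705
```

<p>In the toy entry <code>C+m,1,0</code>, the first value skips one cytosine and calls the second; the second value skips none and calls the very next cytosine.</p>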
<h3 id="rg-and-sm-tags-sample-metadata">
  <code>RG</code> and <code>SM</code> tags (sample metadata)
  <a class="heading-anchor" href="#rg-and-sm-tags-sample-metadata" data-anchor="rg-and-sm-tags-sample-metadata" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<ul>
<li><code>RG:Z:&lt;id&gt;</code> appears on alignment records and points to a read group ID.</li>
<li><code>SM:&lt;sample_name&gt;</code> appears in header <code>@RG</code> lines and defines the biological sample for that read group.</li>
</ul>
<p>Minimal header + record example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">@RG	ID:rg001	SM:NA12878	PL:ILLUMINA
</span></span><span class="line"><span class="cl">read0003	99	chr1	30001	60	50M	=	30100	149	ACGT...	IIII...	RG:Z:rg001
</span></span></code></pre></div><p>In practice, many downstream tools use <code>SM</code> for per-sample aggregation and <code>RG</code> for lane/library/platform-aware processing (for example duplicate marking and BQSR).</p>
<h2 id="practical-conventions">
  Practical Conventions
  <a class="heading-anchor" href="#practical-conventions" data-anchor="practical-conventions" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>BAM files are usually coordinate-sorted for downstream analysis workflows.</li>
<li>Indexed BAM (<code>samtools index</code>) is expected by genome browsers and region-based tools.</li>
<li>Headers often include <code>@SQ</code> (reference names/lengths), <code>@RG</code> (read groups), and <code>@PG</code> (pipeline provenance).</li>
<li>Use <code>.csi</code> instead of <code>.bai</code> when any contig is longer than 512 Mbp (2^29 bases), the largest size a <code>.bai</code> index can address.</li>
</ul>
<h3 id="header-sanity-checks">
  Header sanity checks
  <a class="heading-anchor" href="#header-sanity-checks" data-anchor="header-sanity-checks" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Inspecting the header before downstream analysis is a quick way to catch common integration issues.</p>
<ul>
<li><code>@HD</code> confirms sort order (<code>SO:coordinate</code> expected in many workflows).</li>
<li><code>@SQ</code> verifies contig names and lengths against your reference FASTA.</li>
<li><code>@RG</code> confirms read-group/sample metadata (for example <code>SM</code>) is present.</li>
<li><code>@PG</code> provides pipeline provenance (what tools/versions touched the BAM).</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># inspect BAM header sections</span>
</span></span><span class="line"><span class="cl">samtools view -H sample.bam <span class="p">|</span> rg <span class="s1">&#39;^@HD|^@SQ|^@RG|^@PG&#39;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># quick check for coordinate-sorted header</span>
</span></span><span class="line"><span class="cl">samtools view -H sample.bam <span class="p">|</span> rg <span class="s1">&#39;^@HD&#39;</span>
</span></span></code></pre></div><h3 id="why-indexed-bam-is-useful">
  Why indexed BAM is useful
  <a class="heading-anchor" href="#why-indexed-bam-is-useful" data-anchor="why-indexed-bam-is-useful" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<ul>
<li>Fast random access to specific loci/regions without scanning the entire file.</li>
<li>Lower I/O and compute for region-based QC, visualization, and variant workflows.</li>
<li>Enables interactive browsing in tools like IGV/JBrowse with near-instant jumps.</li>
<li>Essential for cloud/remote workflows where minimizing transferred bytes matters.</li>
</ul>
<h2 id="common-pitfalls">
  Common Pitfalls
  <a class="heading-anchor" href="#common-pitfalls" data-anchor="common-pitfalls" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Using an unsorted BAM where a sorted BAM is required (tools may fail outright or silently produce wrong results).</li>
<li>Missing or stale index after replacing/updating a BAM file.</li>
<li>Header/reference mismatch between BAM and reference FASTA used downstream.</li>
<li>Unexpected duplicate/secondary/supplementary alignments when counting reads naively.</li>
<li>Ignoring mapping quality (<code>MAPQ</code>) or duplicate flags in variant/QC analyses.</li>
</ul>
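<p>Most counting mistakes come down to <code>FLAG</code> bits. The sketch below shows, in plain Python, how mapped primary records can be separated from the rest using the bit values defined in the SAM specification. The function names and the example flag values are made up for illustration.</p>

```python
# FLAG bits from the SAM specification
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800

# records with any of these bits set are not mapped primary alignments
NOT_PRIMARY_MAPPED = UNMAPPED | SECONDARY | SUPPLEMENTARY  # 0x904


def is_mapped_primary(flag):
    """True for a mapped, primary (non-secondary, non-supplementary) record."""
    return (flag & NOT_PRIMARY_MAPPED) == 0


def count_reads(flags, skip_duplicates=True):
    """Count mapped primary records, optionally ignoring marked duplicates."""
    total = 0
    for flag in flags:
        if not is_mapped_primary(flag):
            continue
        if skip_duplicates and (flag & DUPLICATE):
            continue
        total += 1
    return total


# hypothetical FLAG values: two primary mates, one unmapped, one secondary,
# one supplementary, and one duplicate-marked primary
flags = [99, 147, 4, 355, 2147, 1123]
print(count_reads(flags))                         # 2
print(count_reads(flags, skip_duplicates=False))  # 3
```

<p>Counting all six records as "reads" would overstate the total by a factor of two or three, depending on whether duplicates are wanted.</p>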
<h2 id="common-uses">
  Common Uses
  <a class="heading-anchor" href="#common-uses" data-anchor="common-uses" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Read alignment storage</li>
<li>Variant calling workflows</li>
<li>Coverage and mapping quality analysis</li>
</ul>
<h2 id="useful-bam-tools">
  Useful BAM Tools
  <a class="heading-anchor" href="#useful-bam-tools" data-anchor="useful-bam-tools" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<h3 id="samtools-view">
  samtools view
  <a class="heading-anchor" href="#samtools-view" data-anchor="samtools-view" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Inspect, filter, and convert between BAM/SAM/CRAM.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># quick header + first alignments (SAM text view)</span>
</span></span><span class="line"><span class="cl">samtools view -h sample.bam <span class="p">|</span> head
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># keep mapped primary alignments with MAPQ &gt;= 20</span>
</span></span><span class="line"><span class="cl">samtools view -b -F 0x904 -q <span class="m">20</span> sample.bam &gt; sample.mapq20.primary.bam
</span></span></code></pre></div><h3 id="samtools-sort">
  samtools sort
  <a class="heading-anchor" href="#samtools-sort" data-anchor="samtools-sort" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Sort BAM records by coordinate (the default); use <code>-n</code> to sort by read name instead.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">samtools sort -o sample.sorted.bam sample.bam
</span></span></code></pre></div><h3 id="samtools-index">
  samtools index
  <a class="heading-anchor" href="#samtools-index" data-anchor="samtools-index" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Build a BAM index for random access; the input must be coordinate-sorted.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">samtools index sample.sorted.bam
</span></span></code></pre></div><h3 id="samtools-flagstat">
  samtools flagstat
  <a class="heading-anchor" href="#samtools-flagstat" data-anchor="samtools-flagstat" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Generate quick mapping/alignment summary statistics.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">samtools flagstat sample.sorted.bam
</span></span></code></pre></div><h3 id="cramino">
  cramino
  <a class="heading-anchor" href="#cramino" data-anchor="cramino" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Fast long-read alignment QC summaries (commonly used with ONT BAM/CRAM files).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># generate a basic cramino report from a BAM</span>
</span></span><span class="line"><span class="cl">cramino sample.sorted.bam &gt; sample.cramino.tsv
</span></span></code></pre></div><h3 id="nanocov">
  nanocov
  <a class="heading-anchor" href="#nanocov" data-anchor="nanocov" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Per-base coverage statistics and plots from BAM files.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># basic run</span>
</span></span><span class="line"><span class="cl">nanocov <span class="se">\
</span></span></span><span class="line"><span class="cl">  --input sample.sorted.bam <span class="se">\
</span></span></span><span class="line"><span class="cl">  --output-dir nanocov_out
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># coverage on target regions from BED</span>
</span></span><span class="line"><span class="cl">nanocov <span class="se">\
</span></span></span><span class="line"><span class="cl">  --input sample.sorted.bam <span class="se">\
</span></span></span><span class="line"><span class="cl">  --bed targets.bed <span class="se">\
</span></span></span><span class="line"><span class="cl">  --output-dir nanocov_targets <span class="se">\
</span></span></span><span class="line"><span class="cl">  --prefix sample1
</span></span></code></pre></div>]]></content>
        </item>
        
        <item>
            <title>VCF File Format</title>
            <link>/posts/2026/03/vcf-file-format/</link>
            <pubDate>Sun, 08 Mar 2026 19:00:00 +0000</pubDate>
            
            <guid>/posts/2026/03/vcf-file-format/</guid>
            <description>&lt;p&gt;VCF (Variant Call Format) stores genomic variants and metadata.&lt;/p&gt;
&lt;h2 id=&#34;tldr&#34;&gt;
  TL;DR
  &lt;a class=&#34;heading-anchor&#34; href=&#34;#tldr&#34; data-anchor=&#34;tldr&#34; aria-label=&#34;Copy link to this section&#34; title=&#34;Copy section link&#34;&gt;
    &lt;svg class=&#34;heading-anchor-icon&#34; xmlns=&#34;http://www.w3.org/2000/svg&#34; width=&#34;16&#34; height=&#34;16&#34; viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34; stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; aria-hidden=&#34;true&#34; focusable=&#34;false&#34;&gt;
      &lt;path d=&#34;M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23&#34;&gt;&lt;/path&gt;
      &lt;path d=&#34;M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77&#34;&gt;&lt;/path&gt;
    &lt;/svg&gt;
  &lt;/a&gt;
&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;VCF is a tab-delimited text format for sequence variants.&lt;/li&gt;
&lt;li&gt;Core columns are fixed (&lt;code&gt;CHROM&lt;/code&gt;, &lt;code&gt;POS&lt;/code&gt;, &lt;code&gt;ID&lt;/code&gt;, &lt;code&gt;REF&lt;/code&gt;, &lt;code&gt;ALT&lt;/code&gt;, &lt;code&gt;QUAL&lt;/code&gt;, &lt;code&gt;FILTER&lt;/code&gt;, &lt;code&gt;INFO&lt;/code&gt;), with optional sample genotype columns.&lt;/li&gt;
&lt;li&gt;Coordinates are 1-based, and indel alleles are anchored with the preceding reference base.&lt;/li&gt;
&lt;li&gt;Most production workflows use bgzipped VCF (&lt;code&gt;.vcf.gz&lt;/code&gt;) with a tabix index (&lt;code&gt;.tbi&lt;/code&gt;/&lt;code&gt;.csi&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bcftools&lt;/code&gt; is the standard toolkit for filtering, querying, normalization, and stats.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;structure&#34;&gt;
  Structure
  &lt;a class=&#34;heading-anchor&#34; href=&#34;#structure&#34; data-anchor=&#34;structure&#34; aria-label=&#34;Copy link to this section&#34; title=&#34;Copy section link&#34;&gt;
    &lt;svg class=&#34;heading-anchor-icon&#34; xmlns=&#34;http://www.w3.org/2000/svg&#34; width=&#34;16&#34; height=&#34;16&#34; viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34; stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; aria-hidden=&#34;true&#34; focusable=&#34;false&#34;&gt;
      &lt;path d=&#34;M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23&#34;&gt;&lt;/path&gt;
      &lt;path d=&#34;M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77&#34;&gt;&lt;/path&gt;
    &lt;/svg&gt;
  &lt;/a&gt;
&lt;/h2&gt;
&lt;p&gt;A VCF file includes:&lt;/p&gt;</description>
            <content type="html"><![CDATA[<p>VCF (Variant Call Format) stores genomic variants and metadata.</p>
<h2 id="tldr">
  TL;DR
  <a class="heading-anchor" href="#tldr" data-anchor="tldr" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>VCF is a tab-delimited text format for sequence variants.</li>
<li>Core columns are fixed (<code>CHROM</code>, <code>POS</code>, <code>ID</code>, <code>REF</code>, <code>ALT</code>, <code>QUAL</code>, <code>FILTER</code>, <code>INFO</code>), with optional sample genotype columns.</li>
<li>Coordinates are 1-based, and indel alleles are anchored with the preceding reference base.</li>
<li>Most production workflows use bgzipped VCF (<code>.vcf.gz</code>) with a tabix index (<code>.tbi</code>/<code>.csi</code>).</li>
<li><code>bcftools</code> is the standard toolkit for filtering, querying, normalization, and stats.</li>
</ul>
<h2 id="structure">
  Structure
  <a class="heading-anchor" href="#structure" data-anchor="structure" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<p>A VCF file includes:</p>
<ul>
<li>metadata lines beginning with <code>##</code></li>
<li>one header line beginning with <code>#CHROM</code></li>
<li>one row per variant record</li>
</ul>
<p>Minimal example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">##fileformat=VCFv4.3
</span></span><span class="line"><span class="cl">##contig=&lt;ID=chr1,length=248956422&gt;
</span></span><span class="line"><span class="cl">##INFO=&lt;ID=DP,Number=1,Type=Integer,Description=&#34;Total Depth&#34;&gt;
</span></span><span class="line"><span class="cl">##FORMAT=&lt;ID=GT,Number=1,Type=String,Description=&#34;Genotype&#34;&gt;
</span></span><span class="line"><span class="cl">#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	sample1
</span></span><span class="line"><span class="cl">chr1	10177	rs367896724	A	AC	100	PASS	DP=14	GT	0/1
</span></span></code></pre></div><h3 id="fixed-columns">
  Fixed columns
  <a class="heading-anchor" href="#fixed-columns" data-anchor="fixed-columns" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<ol>
<li><code>CHROM</code>: chromosome/contig</li>
<li><code>POS</code>: 1-based position</li>
<li><code>ID</code>: variant identifier (or <code>.</code>)</li>
<li><code>REF</code>: reference allele</li>
<li><code>ALT</code>: alternate allele(s), comma-separated if multiallelic</li>
<li><code>QUAL</code>: variant quality (or <code>.</code>)</li>
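<li><code>FILTER</code>: <code>PASS</code>, a semicolon-separated list of failed filter labels, or <code>.</code> if no filters were applied</li>
<li><code>INFO</code>: semicolon-delimited annotations, either <code>key=value</code> pairs or bare flags</li>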
</ol>
<p>Optional sample columns begin with <code>FORMAT</code>, then one column per sample.</p>
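<p>As a rough illustration of that layout, the sketch below prints the sample names from the <code>#CHROM</code> line of uncompressed VCF text, one per line. The function name <code>list_samples</code> is hypothetical and only for illustration; for real files, <code>bcftools query -l</code> is the usual route.</p>

```shell
# Hypothetical helper: list sample names from the #CHROM header line.
# Reads uncompressed VCF text on stdin. Columns 1-8 are the fixed columns,
# column 9 is FORMAT, and columns 10 onward are samples.
list_samples() {
  awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Header line with two samples, s1 and s2.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\n' | list_samples
```

<p>A sites-only VCF (no <code>FORMAT</code> or sample columns) prints nothing.</p>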
<h3 id="genotype-fields">
  Genotype fields
  <a class="heading-anchor" href="#genotype-fields" data-anchor="genotype-fields" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p><code>FORMAT</code> defines per-sample subfields (for example <code>GT:DP:AD:GQ</code>).</p>
<p>Example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	s1	s2
</span></span><span class="line"><span class="cl">chr2	200123	.	G	A	60	PASS	DP=52	GT:DP:AD:GQ	0/1:25:12,13:99	0/0:27:27,0:99
</span></span></code></pre></div><ul>
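<li><code>GT</code>: genotype as allele indices (<code>0/0</code>, <code>0/1</code>, <code>1/1</code>, <code>./.</code> for missing); <code>/</code> separates unphased alleles and <code>|</code> phased alleles</li>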
<li><code>DP</code>: sample read depth</li>
<li><code>AD</code>: allele depths (<code>REF,ALT1,ALT2,...</code>)</li>
<li><code>GQ</code>: genotype quality</li>
</ul>
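<p>To make the pairing between <code>FORMAT</code> keys and sample values concrete, here is a minimal sketch that splits both on <code>:</code> and prints them side by side. The helper name <code>format_pairs</code> is hypothetical, and the sketch assumes uncompressed VCF text on stdin; it is not a substitute for <code>bcftools query</code>.</p>

```shell
# Hypothetical helper (not part of bcftools): pair FORMAT keys with one
# sample's values. Reads uncompressed VCF text on stdin.
# $1 = 1-based column of the sample (10 is the first sample column).
format_pairs() {
  awk -F'\t' -v col="${1:-10}" '
    /^#/ { next }                          # skip metadata and header lines
    {
      n = split($9, keys, ":")             # FORMAT column, e.g. GT:DP:AD:GQ
      split($col, vals, ":")               # the chosen sample column
      for (i = 1; i <= n; i++)             # dropped trailing values print as "."
        printf "%s=%s\n", keys[i], (vals[i] == "" ? "." : vals[i])
    }'
}

# Record from the example above; column 10 is s1, column 11 is s2.
printf 'chr2\t200123\t.\tG\tA\t60\tPASS\tDP=52\tGT:DP:AD:GQ\t0/1:25:12,13:99\t0/0:27:27,0:99\n' |
  format_pairs 10
```

<p>For <code>s1</code> this prints <code>GT=0/1</code>, <code>DP=25</code>, <code>AD=12,13</code>, and <code>GQ=99</code>, one per line.</p>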
<h3 id="coordinate-and-allele-conventions">
  Coordinate and allele conventions
  <a class="heading-anchor" href="#coordinate-and-allele-conventions" data-anchor="coordinate-and-allele-conventions" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<ul>
<li>SNP example: <code>REF=A</code>, <code>ALT=G</code> at <code>POS</code>.</li>
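<li>Insertion/deletion alleles are left-anchored by convention: <code>REF</code> and <code>ALT</code> both include the reference base immediately before the event, so <code>REF=A</code>, <code>ALT=AC</code> is an insertion of <code>C</code> after <code>POS</code>.</li>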
<li>For indels and multiallelic sites, normalization (left-align + split) is often required before comparison/merge.</li>
</ul>
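<p>The anchoring convention means a simple allele type can be read off the <code>REF</code> and <code>ALT</code> lengths. The sketch below is a simplified illustration under stated assumptions: the helper name <code>classify_alleles</code> is hypothetical, input is uncompressed VCF text, and anything that is not a plain biallelic record (multiallelic <code>ALT</code>, symbolic alleles, <code>*</code>, <code>.</code>) is labeled <code>other</code> rather than interpreted.</p>

```shell
# Hypothetical helper: label simple biallelic records by comparing REF and ALT.
# Prints CHROM, POS, REF, ALT and a label (snp, insertion, deletion, other).
classify_alleles() {
  awk -F'\t' -v OFS='\t' '
    /^#/ { next }                                   # skip metadata and header lines
    {
      ref = $4; alt = $5
      if (ref ~ /[^ACGTNacgtn]/ || alt ~ /[^ACGTNacgtn]/) kind = "other"
      else if (length(ref) == 1 && length(alt) == 1)      kind = "snp"
      else if (substr(ref, 1, 1) != substr(alt, 1, 1))    kind = "other"  # no shared anchor base
      else if (length(alt) > length(ref))                 kind = "insertion"
      else if (length(ref) > length(alt))                 kind = "deletion"
      else                                                kind = "other"
      print $1, $2, ref, alt, kind
    }'
}

# Record from the minimal example earlier in the post: A to AC at chr1:10177.
printf 'chr1\t10177\trs367896724\tA\tAC\t100\tPASS\tDP=14\tGT\t0/1\n' | classify_alleles
```

<p>This labels the example record as an insertion. It does not left-align or split anything; <code>bcftools norm</code> remains the tool for actual normalization.</p>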
<h2 id="practical-conventions">
  Practical Conventions
  <a class="heading-anchor" href="#practical-conventions" data-anchor="practical-conventions" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Compress and index VCFs for fast random access:
<code>.vcf.gz</code> + <code>.tbi</code>/<code>.csi</code>.</li>
<li>Keep contig names consistent across reference, BAM/CRAM, and VCF files.</li>
<li>Use normalized, decomposed records for robust comparisons.</li>
<li>Preserve full metadata (<code>##INFO</code>, <code>##FORMAT</code>, <code>##FILTER</code>) when transforming files.</li>
<li>Validate sample order and identity before cohort merges.</li>
</ul>
<h3 id="header-sanity-checks">
  Header sanity checks
  <a class="heading-anchor" href="#header-sanity-checks" data-anchor="header-sanity-checks" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>VCF headers are a quick preflight check before filtering, merging, or annotation.</p>
<ul>
<li><code>##fileformat</code> states the VCF specification version (for example <code>VCFv4.3</code>), which tells you what parsers should expect.</li>
<li><code>##contig</code> should match your reference naming/length conventions.</li>
<li><code>##INFO</code>, <code>##FORMAT</code>, and <code>##FILTER</code> definitions should exist for used fields.</li>
<li><code>#CHROM ...</code> sample columns should match expected sample IDs/order.</li>
</ul>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># inspect key header metadata</span>
</span></span><span class="line"><span class="cl">bcftools view -h input.vcf.gz <span class="p">|</span> rg <span class="s1">&#39;^##(fileformat|contig|INFO|FORMAT|FILTER)&#39;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># list sample names in header order</span>
</span></span><span class="line"><span class="cl">bcftools query -l input.vcf.gz
</span></span></code></pre></div><h2 id="common-pitfalls">
  Common Pitfalls
  <a class="heading-anchor" href="#common-pitfalls" data-anchor="common-pitfalls" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Mixing <code>chr</code> and non-<code>chr</code> contig naming across inputs.</li>
<li>Comparing unnormalized indels and getting false discordance.</li>
<li>Dropping header metadata during manual edits/reformatting.</li>
<li>Misinterpreting missing values (<code>.</code>) vs true zero values.</li>
<li>Treating <code>QUAL</code> as equivalent to genotype quality (<code>GQ</code>).</li>
</ul>
<h2 id="common-uses">
  Common Uses
  <a class="heading-anchor" href="#common-uses" data-anchor="common-uses" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>SNP/indel call representation</li>
<li>Variant filtering and annotation</li>
<li>Cohort and population analyses</li>
</ul>
<h2 id="useful-vcf-tools">
  Useful VCF Tools
  <a class="heading-anchor" href="#useful-vcf-tools" data-anchor="useful-vcf-tools" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<h3 id="header-provenance-run-logs">
  Header provenance (run logs)
  <a class="heading-anchor" href="#header-provenance-run-logs" data-anchor="header-provenance-run-logs" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Of the two tools listed below (<code>bcftools</code> and <code>tabix</code>), only one commonly writes command provenance into VCF headers when producing VCF/BCF output:</p>
<ul>
<li><code>bcftools</code> (for example <code>view</code>, <code>norm</code>)</li>
</ul>
<p>It typically adds a pair of lines per subcommand run, such as <code>##bcftools_viewVersion</code> and <code>##bcftools_viewCommand</code>.
Subcommands such as <code>query</code> and <code>stats</code> usually produce tabular/text output (not VCF), so they leave no header lines, and <code>tabix</code> creates indexes without editing header metadata.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># inspect provenance lines in a VCF header</span>
</span></span><span class="line"><span class="cl">bcftools view -h input.vcf.gz <span class="p">|</span> rg <span class="s1">&#39;^##(bcftools|source)&#39;</span>
</span></span></code></pre></div><h3 id="bcftools-view-query-norm-stats">
  bcftools (<code>view</code>, <code>query</code>, <code>norm</code>, <code>stats</code>)
  <a class="heading-anchor" href="#bcftools-view-query-norm-stats" data-anchor="bcftools-view-query-norm-stats" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p><code>bcftools</code> is one tool with multiple subcommands for filtering, querying, normalization, and QC.</p>
<h4 id="bcftools-view">
  bcftools view
  <a class="heading-anchor" href="#bcftools-view" data-anchor="bcftools-view" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h4>
<p>Filter and subset variant records.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># keep PASS variants only</span>
</span></span><span class="line"><span class="cl">bcftools view -f PASS input.vcf.gz -Oz -o pass.vcf.gz
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># keep only biallelic SNPs</span>
</span></span><span class="line"><span class="cl">bcftools view -m2 -M2 -v snps input.vcf.gz -Oz -o snps.biallelic.vcf.gz
</span></span></code></pre></div><h4 id="bcftools-query">
  bcftools query
  <a class="heading-anchor" href="#bcftools-query" data-anchor="bcftools-query" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h4>
<p>Extract tabular fields for reporting.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">bcftools query -f <span class="s1">&#39;%CHROM\t%POS\t%REF\t%ALT\t%QUAL\n&#39;</span> input.vcf.gz &gt; variants.tsv
</span></span></code></pre></div><h4 id="bcftools-norm">
  bcftools norm
  <a class="heading-anchor" href="#bcftools-norm" data-anchor="bcftools-norm" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h4>
<p>Left-align and normalize indels against the reference, and split multiallelic records into biallelic ones.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">bcftools norm <span class="se">\
</span></span></span><span class="line"><span class="cl">  -f reference.fasta <span class="se">\
</span></span></span><span class="line"><span class="cl">  -m -any input.vcf.gz -Oz -o input.norm.vcf.gz
</span></span></code></pre></div><h4 id="bcftools-stats">
  bcftools stats
  <a class="heading-anchor" href="#bcftools-stats" data-anchor="bcftools-stats" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h4>
<p>Generate summary statistics for QC.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">bcftools stats input.vcf.gz &gt; input.vcf.stats.txt
</span></span></code></pre></div><h3 id="tabix">
  tabix
  <a class="heading-anchor" href="#tabix" data-anchor="tabix" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Index a bgzipped VCF for region queries; this writes <code>input.vcf.gz.tbi</code> next to the input.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">tabix -p vcf input.vcf.gz
</span></span></code></pre></div>]]></content>
        </item>
        
        <item>
            <title>GFF3 and GTF File Formats</title>
            <link>/posts/2026/03/gff3-and-gtf-file-formats/</link>
            <pubDate>Tue, 10 Mar 2026 19:00:00 +0000</pubDate>
            
            <guid>/posts/2026/03/gff3-and-gtf-file-formats/</guid>
            <description>&lt;p&gt;GFF3 and GTF are tab-delimited annotation formats used to describe genomic features such as genes, transcripts, and exons.&lt;/p&gt;
&lt;h2 id=&#34;tldr&#34;&gt;
  TL;DR
  &lt;a class=&#34;heading-anchor&#34; href=&#34;#tldr&#34; data-anchor=&#34;tldr&#34; aria-label=&#34;Copy link to this section&#34; title=&#34;Copy section link&#34;&gt;
    &lt;svg class=&#34;heading-anchor-icon&#34; xmlns=&#34;http://www.w3.org/2000/svg&#34; width=&#34;16&#34; height=&#34;16&#34; viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34; stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; aria-hidden=&#34;true&#34; focusable=&#34;false&#34;&gt;
      &lt;path d=&#34;M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23&#34;&gt;&lt;/path&gt;
      &lt;path d=&#34;M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77&#34;&gt;&lt;/path&gt;
    &lt;/svg&gt;
  &lt;/a&gt;
&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Both formats use 1-based, closed genomic coordinates.&lt;/li&gt;
&lt;li&gt;Core records are 9 columns: &lt;code&gt;seqid&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt;, &lt;code&gt;type&lt;/code&gt;, &lt;code&gt;start&lt;/code&gt;, &lt;code&gt;end&lt;/code&gt;, &lt;code&gt;score&lt;/code&gt;, &lt;code&gt;strand&lt;/code&gt;, &lt;code&gt;phase&lt;/code&gt;, &lt;code&gt;attributes&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;GFF3 uses &lt;code&gt;key=value&lt;/code&gt; attributes and explicit parent/child IDs (&lt;code&gt;ID&lt;/code&gt;, &lt;code&gt;Parent&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;GTF commonly uses &lt;code&gt;key &amp;quot;value&amp;quot;;&lt;/code&gt; attributes, especially &lt;code&gt;gene_id&lt;/code&gt; and &lt;code&gt;transcript_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Prefer GFF3 for hierarchical feature models and standards compliance; GTF is common in RNA-seq pipelines.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;structure&#34;&gt;
  Structure
  &lt;a class=&#34;heading-anchor&#34; href=&#34;#structure&#34; data-anchor=&#34;structure&#34; aria-label=&#34;Copy link to this section&#34; title=&#34;Copy section link&#34;&gt;
    &lt;svg class=&#34;heading-anchor-icon&#34; xmlns=&#34;http://www.w3.org/2000/svg&#34; width=&#34;16&#34; height=&#34;16&#34; viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34; stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; aria-hidden=&#34;true&#34; focusable=&#34;false&#34;&gt;
      &lt;path d=&#34;M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23&#34;&gt;&lt;/path&gt;
      &lt;path d=&#34;M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77&#34;&gt;&lt;/path&gt;
    &lt;/svg&gt;
  &lt;/a&gt;
&lt;/h2&gt;
&lt;p&gt;Each non-comment line has 9 tab-delimited fields:&lt;/p&gt;</description>
            <content type="html"><![CDATA[<p>GFF3 and GTF are tab-delimited annotation formats used to describe genomic features such as genes, transcripts, and exons.</p>
<h2 id="tldr">
  TL;DR
  <a class="heading-anchor" href="#tldr" data-anchor="tldr" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Both formats use 1-based, closed genomic coordinates.</li>
<li>Core records are 9 columns: <code>seqid</code>, <code>source</code>, <code>type</code>, <code>start</code>, <code>end</code>, <code>score</code>, <code>strand</code>, <code>phase</code>, <code>attributes</code>.</li>
<li>GFF3 uses <code>key=value</code> attributes and explicit parent/child IDs (<code>ID</code>, <code>Parent</code>).</li>
<li>GTF commonly uses <code>key &quot;value&quot;;</code> attributes, especially <code>gene_id</code> and <code>transcript_id</code>.</li>
<li>Prefer GFF3 for hierarchical feature models and standards compliance; GTF is common in RNA-seq pipelines.</li>
</ul>
<h2 id="structure">
  Structure
  <a class="heading-anchor" href="#structure" data-anchor="structure" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<p>Each non-comment line has 9 tab-delimited fields:</p>
<ol>
<li>seqid (chromosome/contig)</li>
<li>source</li>
<li>feature type (for example <code>gene</code>, <code>transcript</code>, <code>exon</code>, <code>CDS</code>)</li>
<li>start (1-based)</li>
<li>end (1-based, inclusive)</li>
<li>score (<code>.</code> if missing)</li>
<li>strand (<code>+</code>, <code>-</code>, or <code>.</code>)</li>
<li>phase (<code>0</code>, <code>1</code>, <code>2</code> for CDS, else <code>.</code>)</li>
<li>attributes</li>
</ol>
<p>GFF3 example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">##gff-version 3
</span></span><span class="line"><span class="cl">chr1	RefSeq	gene	11869	14409	.	+	.	ID=gene1;Name=DDX11L1
</span></span><span class="line"><span class="cl">chr1	RefSeq	mRNA	11869	14409	.	+	.	ID=tx1;Parent=gene1
</span></span><span class="line"><span class="cl">chr1	RefSeq	exon	11869	12227	.	+	.	ID=exon1;Parent=tx1
</span></span></code></pre></div><p>GTF example:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-text" data-lang="text"><span class="line"><span class="cl">chr1	ENSEMBL	gene	11869	14409	.	+	.	gene_id &#34;GENE1&#34;; gene_name &#34;DDX11L1&#34;;
</span></span><span class="line"><span class="cl">chr1	ENSEMBL	transcript	11869	14409	.	+	.	gene_id &#34;GENE1&#34;; transcript_id &#34;TX1&#34;;
</span></span><span class="line"><span class="cl">chr1	ENSEMBL	exon	11869	12227	.	+	.	gene_id &#34;GENE1&#34;; transcript_id &#34;TX1&#34;; exon_number &#34;1&#34;;
</span></span></code></pre></div><h3 id="coordinate-model-important">
  Coordinate model (important)
  <a class="heading-anchor" href="#coordinate-model-important" data-anchor="coordinate-model-important" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>GFF3/GTF are 1-based and end-inclusive.</p>
<ul>
<li>Interval <code>chr1 11869 12227</code> has length <code>12227 - 11869 + 1 = 359</code> bp.</li>
<li>Converting to BED requires a coordinate shift: <code>BED_start = start - 1</code>, <code>BED_end = end</code>.</li>
</ul>
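<p>The shift above can be sketched in a few lines of Python. This is a minimal illustration rather than a full converter: the function names <code>gff_interval_to_bed</code> and <code>gff_line_to_bed</code> are hypothetical, and only the sequence name, start, and end columns are carried over.</p>

```python
def gff_interval_to_bed(seqid, start, end):
    """Convert a 1-based, end-inclusive interval to BED's 0-based, end-exclusive form."""
    bed_start = start - 1  # shift the start down by one
    bed_end = end          # the inclusive end and the exclusive end are the same number
    return seqid, bed_start, bed_end


def gff_line_to_bed(line):
    """Pull seqid/start/end (columns 1, 4, 5) from a tab-separated GFF3/GTF line."""
    cols = line.rstrip("\n").split("\t")
    return gff_interval_to_bed(cols[0], int(cols[3]), int(cols[4]))


# Exon line taken from the GFF3 example in this post.
exon = "chr1\tRefSeq\texon\t11869\t12227\t.\t+\t.\tID=exon1;Parent=tx1"
bed = gff_line_to_bed(exon)
bed_length = bed[2] - bed[1]  # in BED, length is simply end - start
print(bed, bed_length)
```

<p>The BED length (<code>end - start</code>) matches the GFF length (<code>end - start + 1</code>) for the same feature, which is a quick sanity check after any conversion.</p>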
<h2 id="practical-conventions">
  Practical Conventions
  <a class="heading-anchor" href="#practical-conventions" data-anchor="practical-conventions" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Keep feature hierarchy consistent (<code>gene -&gt; transcript -&gt; exon/CDS</code>).</li>
<li>Use stable feature IDs (especially in GFF3 <code>ID</code> and <code>Parent</code>).</li>
<li>Keep chromosome naming consistent across FASTA/BAM/VCF/BED.</li>
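<li>Separate columns with tabs, never spaces; spaces inside the attributes column (in GTF, between a key and its quoted value) are part of the field and must not be treated as column breaks.</li>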
<li>Sort by chromosome and start when possible for reproducible processing.</li>
</ul>
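<p>As a rough sketch of the sorting convention, the snippet below orders feature lines by sequence name, start, and end. The helper name <code>sort_annotation_lines</code> is hypothetical, and two simplifications are assumed: every line beginning with <code>#</code> is moved to the top, and sequence names are compared as plain strings (so <code>chr10</code> sorts before <code>chr2</code>).</p>

```python
def sort_annotation_lines(lines):
    """Return comment/directive lines first, then features sorted by (seqid, start, end)."""
    comments = [ln for ln in lines if ln.startswith("#")]
    features = [ln for ln in lines if ln.strip() and not ln.startswith("#")]

    def sort_key(ln):
        cols = ln.split("\t")
        # Column 1 is the sequence name; columns 4 and 5 are start and end.
        return cols[0], int(cols[3]), int(cols[4])

    return comments + sorted(features, key=sort_key)


records = [
    "chr2\tRefSeq\tgene\t500\t900\t.\t+\t.\tID=geneB",
    "##gff-version 3",
    "chr1\tRefSeq\tgene\t11869\t14409\t.\t+\t.\tID=gene1",
    "chr1\tRefSeq\tgene\t100\t200\t.\t-\t.\tID=geneA",
]
ordered = sort_annotation_lines(records)
for line in ordered:
    print(line)
```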
<h2 id="common-pitfalls">
  Common Pitfalls
  <a class="heading-anchor" href="#common-pitfalls" data-anchor="common-pitfalls" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Mixing 1-based GFF3/GTF coordinates with 0-based BED coordinates.</li>
<li>Broken parent-child relationships (missing <code>Parent</code>/<code>ID</code>, inconsistent transcript IDs).</li>
<li>Invalid or inconsistent attributes that downstream parsers cannot interpret.</li>
<li>Treating GFF3 and GTF attribute syntax as interchangeable.</li>
<li>Incorrect CDS phase values causing translation/frame issues.</li>
</ul>
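<p>To illustrate why the two attribute syntaxes are not interchangeable, here is a minimal sketch of one parser per format. Both function names are hypothetical, and the sketch assumes simple input: no semicolons inside quoted GTF values, no URL-escaped characters in GFF3 values, and no repeated keys.</p>

```python
def parse_gff3_attributes(field):
    """Parse GFF3 column 9, written as key=value pairs separated by ';'."""
    attrs = {}
    for pair in field.strip().split(";"):
        if pair.strip():
            key, _, value = pair.partition("=")
            attrs[key.strip()] = value.strip()
    return attrs


def parse_gtf_attributes(field):
    """Parse GTF column 9, written as key "value"; pairs (space-separated, quoted)."""
    attrs = {}
    for pair in field.strip().split(";"):
        pair = pair.strip()
        if pair:
            key, _, value = pair.partition(" ")
            attrs[key] = value.strip().strip('"')
    return attrs


# Attribute strings taken from the GFF3 and GTF examples in this post.
gff3_attrs = parse_gff3_attributes("ID=tx1;Parent=gene1")
gtf_attrs = parse_gtf_attributes('gene_id "GENE1"; transcript_id "TX1"; exon_number "1";')
print(gff3_attrs)
print(gtf_attrs)
```

<p>Feeding a GTF attribute string to the GFF3 parser (or the reverse) produces keys with no usable values, which is roughly what happens when a tool is handed the wrong format.</p>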
<h2 id="common-uses">
  Common Uses
  <a class="heading-anchor" href="#common-uses" data-anchor="common-uses" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Gene and transcript annotation</li>
<li>Feature counting and RNA-seq quantification input</li>
<li>Region extraction and annotation conversion (for example to BED)</li>
</ul>
<h2 id="useful-annotation-tools">
  Useful Annotation Tools
  <a class="heading-anchor" href="#useful-annotation-tools" data-anchor="useful-annotation-tools" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<h3 id="gffread">
  gffread
  <a class="heading-anchor" href="#gffread" data-anchor="gffread" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Validate, filter, and convert GFF/GTF annotations.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># parse GFF3 and convert to GTF (-T)</span>
</span></span><span class="line"><span class="cl">gffread annotations.gff3 -T -o annotations.gtf
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># convert GTF to GFF3</span>
</span></span><span class="line"><span class="cl">gffread annotations.gtf -o annotations.gff3
</span></span></code></pre></div><h3 id="gffcompare">
  gffcompare
  <a class="heading-anchor" href="#gffcompare" data-anchor="gffcompare" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Compare transcript annotations against a reference.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">gffcompare -r reference.gtf -o compare_out query.gtf
</span></span></code></pre></div><h3 id="gffutils">
  gffutils
  <a class="heading-anchor" href="#gffutils" data-anchor="gffutils" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Build/query a feature database from GFF/GTF.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl"><span class="c1"># create sqlite DB from GFF3</span>
</span></span><span class="line"><span class="cl">python -m gffutils.cli create annotations.gff3 --db annotations.db
</span></span></code></pre></div><h3 id="gxf2bed">
  gxf2bed
  <a class="heading-anchor" href="#gxf2bed" data-anchor="gxf2bed" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Convert GFF/GTF annotations to BED intervals.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">gxf2bed -i annotations.gff3 -o annotations.bed
</span></span></code></pre></div>]]></content>
        </item>
        
        <item>
            <title>POD5 File Format</title>
            <link>/posts/2026/03/pod5-file-format/</link>
            <pubDate>Tue, 10 Mar 2026 19:00:00 +0000</pubDate>
            
            <guid>/posts/2026/03/pod5-file-format/</guid>
            <description>&lt;p&gt;POD5 is Oxford Nanopore&amp;rsquo;s container format for raw nanopore signal data and read-level run metadata.&lt;/p&gt;
&lt;h2 id=&#34;tldr&#34;&gt;
  TL;DR
  &lt;a class=&#34;heading-anchor&#34; href=&#34;#tldr&#34; data-anchor=&#34;tldr&#34; aria-label=&#34;Copy link to this section&#34; title=&#34;Copy section link&#34;&gt;
    &lt;svg class=&#34;heading-anchor-icon&#34; xmlns=&#34;http://www.w3.org/2000/svg&#34; width=&#34;16&#34; height=&#34;16&#34; viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34; stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; aria-hidden=&#34;true&#34; focusable=&#34;false&#34;&gt;
      &lt;path d=&#34;M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23&#34;&gt;&lt;/path&gt;
      &lt;path d=&#34;M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77&#34;&gt;&lt;/path&gt;
    &lt;/svg&gt;
  &lt;/a&gt;
&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;POD5 stores raw electrical signal traces from ONT sequencing runs.&lt;/li&gt;
&lt;li&gt;It replaces FAST5 in many modern ONT workflows with better performance and simpler access patterns.&lt;/li&gt;
&lt;li&gt;A POD5 file typically includes read IDs, signal chunks, timing/scaling info, and run context metadata.&lt;/li&gt;
&lt;li&gt;Basecalling tools (for example Dorado) use POD5 as direct input.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pod5&lt;/code&gt; CLI tools are used to inspect, subset, and convert POD5 datasets.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;structure&#34;&gt;
  Structure
  &lt;a class=&#34;heading-anchor&#34; href=&#34;#structure&#34; data-anchor=&#34;structure&#34; aria-label=&#34;Copy link to this section&#34; title=&#34;Copy section link&#34;&gt;
    &lt;svg class=&#34;heading-anchor-icon&#34; xmlns=&#34;http://www.w3.org/2000/svg&#34; width=&#34;16&#34; height=&#34;16&#34; viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34; stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; aria-hidden=&#34;true&#34; focusable=&#34;false&#34;&gt;
      &lt;path d=&#34;M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23&#34;&gt;&lt;/path&gt;
      &lt;path d=&#34;M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77&#34;&gt;&lt;/path&gt;
    &lt;/svg&gt;
  &lt;/a&gt;
&lt;/h2&gt;
&lt;p&gt;POD5 is a binary container format (not line-based text like FASTQ/VCF/BED).&lt;/p&gt;</description>
            <content type="html"><![CDATA[<p>POD5 is Oxford Nanopore&rsquo;s container format for raw nanopore signal data and read-level run metadata.</p>
<h2 id="tldr">
  TL;DR
  <a class="heading-anchor" href="#tldr" data-anchor="tldr" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>POD5 stores raw electrical signal traces from ONT sequencing runs.</li>
<li>It replaces FAST5 in many modern ONT workflows with better performance and simpler access patterns.</li>
<li>A POD5 file typically includes read IDs, signal chunks, timing/scaling info, and run context metadata.</li>
<li>Basecalling tools (for example Dorado) use POD5 as direct input.</li>
<li><code>pod5</code> CLI tools are used to inspect, subset, and convert POD5 datasets.</li>
</ul>
<h2 id="structure">
  Structure
  <a class="heading-anchor" href="#structure" data-anchor="structure" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<p>POD5 is a binary container format (not line-based text like FASTQ/VCF/BED).</p>
<p>Conceptually, it stores:</p>
<ol>
<li>Read records (<code>read_id</code> and per-read metadata)</li>
<li>Raw signal arrays (current levels across time)</li>
<li>Calibration/scaling fields (for signal interpretation)</li>
<li>Run/context metadata (flowcell, run identifiers, acquisition details)</li>
</ol>
<p>Unlike FASTQ (basecalled sequence) or BAM (aligned reads), POD5 captures pre-basecalling raw signal data.</p>
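<p>As an illustration of how the calibration/scaling fields are used, the sketch below converts raw integer samples to picoamperes. It assumes the conversion is <code>(raw + offset) * scale</code>, with one offset and one scale per read; the function name and the sample values are made up for the example, and no POD5 file is read here.</p>

```python
def to_picoamps(raw_samples, offset, scale):
    """Apply a per-read calibration to raw integer samples.

    Assumed formula: picoamps = (raw + offset) * scale.
    """
    return [(sample + offset) * scale for sample in raw_samples]


# Made-up values, chosen only so the arithmetic is easy to follow.
raw_signal = [100, 120, 90]
calibrated = to_picoamps(raw_signal, offset=-10.0, scale=0.5)
print(calibrated)
```

<p>In practice a format-aware library should supply both the raw samples and the calibration values; the point here is only that raw samples are not directly comparable across reads until the calibration has been applied.</p>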
<h2 id="practical-conventions">
  Practical Conventions
  <a class="heading-anchor" href="#practical-conventions" data-anchor="practical-conventions" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Keep POD5 files immutable once generated to preserve provenance.</li>
<li>Track software/basecaller version alongside POD5 datasets.</li>
<li>Organize files by run and sample metadata for downstream traceability.</li>
<li>Use checksums when moving POD5 across storage systems.</li>
<li>Convert/subset with official tooling rather than ad-hoc binary manipulation.</li>
</ul>
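<p>The checksum convention can be sketched with Python's standard library. The example below hashes a file in chunks so that large POD5 files do not need to fit in memory; the helper name <code>sha256_of_file</code> is hypothetical, and the temporary file holds stand-in bytes rather than real POD5 content.</p>

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in file for demonstration; a real workflow would point at reads.pod5.
payload = b"stand-in bytes, not real POD5 content"
with tempfile.NamedTemporaryFile(delete=False, suffix=".pod5") as tmp:
    tmp.write(payload)
    demo_path = tmp.name

checksum_before = sha256_of_file(demo_path)  # record before the transfer
checksum_after = sha256_of_file(demo_path)   # recompute on the destination copy
os.remove(demo_path)
print(checksum_before == checksum_after)
```

<p>Storing the digest next to the file (for example in a sidecar manifest) makes it possible to re-verify the data after every move.</p>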
<h2 id="common-pitfalls">
  Common Pitfalls
  <a class="heading-anchor" href="#common-pitfalls" data-anchor="common-pitfalls" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Treating POD5 as if it were sequence-level output (it is signal-level data).</li>
<li>Losing run metadata linkage when splitting files without consistent naming.</li>
<li>Mixing POD5 batches from different chemistry/basecaller expectations without tracking metadata.</li>
<li>Underestimating storage and I/O requirements for raw signal datasets.</li>
<li>Attempting manual parsing without format-aware libraries/tools.</li>
</ul>
<h2 id="common-uses">
  Common Uses
  <a class="heading-anchor" href="#common-uses" data-anchor="common-uses" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Input to ONT basecalling workflows</li>
<li>Modified-base and signal-level analyses</li>
<li>Archival of raw nanopore run data</li>
</ul>
<h2 id="useful-pod5-tools">
  Useful POD5 Tools
  <a class="heading-anchor" href="#useful-pod5-tools" data-anchor="useful-pod5-tools" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<h3 id="pod5-inspect">
  pod5 inspect
  <a class="heading-anchor" href="#pod5-inspect" data-anchor="pod5-inspect" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Inspect summary metadata for POD5 files.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">pod5 inspect summary reads.pod5
</span></span></code></pre></div><h3 id="pod5-view">
  pod5 view
  <a class="heading-anchor" href="#pod5-view" data-anchor="pod5-view" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Print selected per-read fields from POD5 files as a table.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">pod5 view reads.pod5 --include <span class="s2">&#34;read_id, channel, num_samples&#34;</span>
</span></span></code></pre></div><h3 id="pod5-subset">
  pod5 subset
  <a class="heading-anchor" href="#pod5-subset" data-anchor="pod5-subset" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
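<p>Create a smaller POD5 from selected reads. For a plain list of read IDs, the matching subcommand is <code>pod5 filter</code>; <code>pod5 subset</code> instead splits reads into several output files according to a mapping table.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">pod5 filter reads.pod5 --ids read_ids.txt --output subset.pod5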
</span></span></code></pre></div><h3 id="dorado">
  dorado
  <a class="heading-anchor" href="#dorado" data-anchor="dorado" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Use POD5 directly as basecalling input.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">dorado basecaller hac reads.pod5 &gt; basecalls.bam
</span></span></code></pre></div>]]></content>
        </item>
        
        <item>
            <title>SLOW5 and BLOW5 File Formats</title>
            <link>/posts/2026/03/slow5-and-blow5-file-formats/</link>
            <pubDate>Tue, 10 Mar 2026 19:30:00 +0000</pubDate>
            
            <guid>/posts/2026/03/slow5-and-blow5-file-formats/</guid>
            <description>&lt;p&gt;SLOW5/BLOW5 are formats for storing raw Oxford Nanopore signal data with a focus on efficient parallel access and scalable analysis.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Note: SLOW5/BLOW5 are third-party community formats and are not officially supported by Oxford Nanopore Technologies (ONT).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 id=&#34;tldr&#34;&gt;
  TL;DR
  &lt;a class=&#34;heading-anchor&#34; href=&#34;#tldr&#34; data-anchor=&#34;tldr&#34; aria-label=&#34;Copy link to this section&#34; title=&#34;Copy section link&#34;&gt;
    &lt;svg class=&#34;heading-anchor-icon&#34; xmlns=&#34;http://www.w3.org/2000/svg&#34; width=&#34;16&#34; height=&#34;16&#34; viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34; stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; aria-hidden=&#34;true&#34; focusable=&#34;false&#34;&gt;
      &lt;path d=&#34;M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23&#34;&gt;&lt;/path&gt;
      &lt;path d=&#34;M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77&#34;&gt;&lt;/path&gt;
    &lt;/svg&gt;
  &lt;/a&gt;
&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;SLOW5 is a human-readable text format for nanopore raw signal data.&lt;/li&gt;
&lt;li&gt;BLOW5 is the binary compressed form of SLOW5, optimized for performance and storage.&lt;/li&gt;
&lt;li&gt;Both formats are designed to improve I/O efficiency compared with legacy FAST5 workflows.&lt;/li&gt;
&lt;li&gt;Typical records include read identifiers, run metadata, and raw signal vectors.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;slow5tools&lt;/code&gt; is the core toolkit for inspection, conversion, merge/split, and indexing operations.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;structure&#34;&gt;
  Structure
  &lt;a class=&#34;heading-anchor&#34; href=&#34;#structure&#34; data-anchor=&#34;structure&#34; aria-label=&#34;Copy link to this section&#34; title=&#34;Copy section link&#34;&gt;
    &lt;svg class=&#34;heading-anchor-icon&#34; xmlns=&#34;http://www.w3.org/2000/svg&#34; width=&#34;16&#34; height=&#34;16&#34; viewBox=&#34;0 0 24 24&#34; fill=&#34;none&#34; stroke=&#34;currentColor&#34; stroke-width=&#34;2&#34; stroke-linecap=&#34;round&#34; stroke-linejoin=&#34;round&#34; aria-hidden=&#34;true&#34; focusable=&#34;false&#34;&gt;
      &lt;path d=&#34;M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23&#34;&gt;&lt;/path&gt;
      &lt;path d=&#34;M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77&#34;&gt;&lt;/path&gt;
    &lt;/svg&gt;
  &lt;/a&gt;
&lt;/h2&gt;
&lt;p&gt;SLOW5 files are tab-delimited text with:&lt;/p&gt;</description>
            <content type="html"><![CDATA[<p>SLOW5/BLOW5 are formats for storing raw Oxford Nanopore signal data with a focus on efficient parallel access and scalable analysis.</p>
<blockquote>
<p>Note: SLOW5/BLOW5 are third-party community formats and are not officially supported by Oxford Nanopore Technologies (ONT).</p>
</blockquote>
<h2 id="tldr">
  TL;DR
  <a class="heading-anchor" href="#tldr" data-anchor="tldr" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>SLOW5 is a human-readable text format for nanopore raw signal data.</li>
<li>BLOW5 is the binary compressed form of SLOW5, optimized for performance and storage.</li>
<li>Both formats are designed to improve I/O efficiency compared with legacy FAST5 workflows.</li>
<li>Typical records include read identifiers, run metadata, and raw signal vectors.</li>
<li><code>slow5tools</code> is the core toolkit for inspection, conversion, merge/split, and indexing operations.</li>
</ul>
<h2 id="structure">
  Structure
  <a class="heading-anchor" href="#structure" data-anchor="structure" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<p>SLOW5 files are tab-delimited text with:</p>
<ol>
<li>Header/meta lines (global run metadata and field definitions)</li>
<li>One line per read record</li>
<li>Core per-read fields (for example read ID, digitisation/range/offset, sampling rate)</li>
<li>Raw signal values</li>
</ol>
<p>BLOW5 stores the same logical content in binary compressed form.</p>
<p>In practice:</p>
<ul>
<li>Use SLOW5 when readability/interchange in text form is useful.</li>
<li>Use BLOW5 for large-scale processing, lower disk footprint, and faster random access.</li>
</ul>
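<p>To make the text layout concrete, here is a minimal sketch that parses one SLOW5-style read line. It assumes the primary columns are <code>read_id</code>, <code>read_group</code>, <code>digitisation</code>, <code>offset</code>, <code>range</code>, <code>sampling_rate</code>, <code>len_raw_signal</code>, and <code>raw_signal</code>, that the raw signal is comma-separated, and that header lines start with <code>#</code> or <code>@</code>. The function names and the example record are made up; real files should be read with <code>slow5tools</code> or a SLOW5 library.</p>

```python
# Assumed primary column order for a SLOW5 read line.
COLUMNS = [
    "read_id", "read_group", "digitisation", "offset",
    "range", "sampling_rate", "len_raw_signal", "raw_signal",
]


def is_header_line(line):
    """Header/meta lines are assumed to start with '#' or '@'."""
    return line.startswith(("#", "@"))


def parse_slow5_record(line, columns=COLUMNS):
    """Split one tab-delimited read line into a dict with typed core fields."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(columns, values))
    for name in ("digitisation", "offset", "range", "sampling_rate"):
        record[name] = float(record[name])
    record["len_raw_signal"] = int(record["len_raw_signal"])
    record["raw_signal"] = [int(v) for v in record["raw_signal"].split(",")]
    return record


# Made-up record for illustration only.
example = "read-0001\t0\t8192\t6\t1467.61\t4000\t5\t430,472,463,467,454"
record = parse_slow5_record(example)
print(record["read_id"], record["raw_signal"])
```

<p>Comparing <code>len_raw_signal</code> with the number of parsed samples is a cheap consistency check for truncated lines.</p>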
<h2 id="practical-conventions">
  Practical Conventions
  <a class="heading-anchor" href="#practical-conventions" data-anchor="practical-conventions" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Prefer BLOW5 for production workflows and large datasets.</li>
<li>Keep tool/version metadata with files for reproducibility.</li>
<li>Preserve read-group/run identifiers when converting between formats.</li>
<li>Index BLOW5 files when supported/needed by downstream tooling.</li>
<li>Use official converters (<code>slow5tools</code>) instead of custom parsers for transformations.</li>
</ul>
<h2 id="common-pitfalls">
  Common Pitfalls
  <a class="heading-anchor" href="#common-pitfalls" data-anchor="common-pitfalls" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Confusing text SLOW5 and binary BLOW5 in downstream command usage.</li>
<li>Losing metadata consistency when merging runs from different experiments.</li>
<li>Ignoring version compatibility between tools and file schema.</li>
<li>Treating signal-level files as sequence-level outputs.</li>
<li>Skipping integrity checks after large file transfers.</li>
</ul>
<h2 id="common-uses">
  Common Uses
  <a class="heading-anchor" href="#common-uses" data-anchor="common-uses" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<ul>
<li>Raw nanopore signal storage</li>
<li>High-throughput signal processing workflows</li>
<li>Conversion pipelines between nanopore raw-data formats</li>
</ul>
<h2 id="useful-slow5blow5-tools">
  Useful SLOW5/BLOW5 Tools
  <a class="heading-anchor" href="#useful-slow5blow5-tools" data-anchor="useful-slow5blow5-tools" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h2>
<h3 id="slow5tools-view">
  slow5tools view
  <a class="heading-anchor" href="#slow5tools-view" data-anchor="slow5tools-view" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Inspect SLOW5/BLOW5 content.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">slow5tools view reads.blow5 <span class="p">|</span> head
</span></span></code></pre></div><h3 id="slow5tools-stats">
  slow5tools stats
  <a class="heading-anchor" href="#slow5tools-stats" data-anchor="slow5tools-stats" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Generate quick file/read summary statistics.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">slow5tools stats reads.blow5
</span></span></code></pre></div><h3 id="slow5tools-merge">
  slow5tools merge
  <a class="heading-anchor" href="#slow5tools-merge" data-anchor="slow5tools-merge" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Merge multiple SLOW5/BLOW5 files.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">slow5tools merge run1.blow5 run2.blow5 -o merged.blow5
</span></span></code></pre></div><h3 id="slow5tools-split">
  slow5tools split
  <a class="heading-anchor" href="#slow5tools-split" data-anchor="slow5tools-split" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
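<p>Split a large file into smaller files. <code>slow5tools split</code> needs a splitting mode; this example uses <code>-r</code> to cap each output file at 4000 reads (an arbitrary example value).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">slow5tools split reads.blow5 -d split_out/ -r 4000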
</span></span></code></pre></div><h3 id="slow5tools-index">
  slow5tools index
  <a class="heading-anchor" href="#slow5tools-index" data-anchor="slow5tools-index" aria-label="Copy link to this section" title="Copy section link">
    <svg class="heading-anchor-icon" xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" aria-hidden="true" focusable="false">
      <path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07L11.7 5.23"></path>
      <path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 1 0 7.07 7.07l1.77-1.77"></path>
    </svg>
  </a>
</h3>
<p>Create an index file (<code>reads.blow5.idx</code>) so that individual reads can be fetched by random access.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">slow5tools index reads.blow5
</span></span></code></pre></div>]]></content>
        </item>
        
        <item>
            <title>Getting Started with Note-Taking in Obsidian</title>
            <link>/posts/2023/07/getting-started-with-note-taking-in-obsidian/</link>
            <pubDate>Sat, 01 Jul 2023 19:00:00 +0000</pubDate>
            
            <guid>/posts/2023/07/getting-started-with-note-taking-in-obsidian/</guid>
            <description>&lt;p&gt;In an age where information is everywhere, note-taking has become an essential
skill. Obsidian is a powerful note-taking app and knowledge base that works
on top of a local folder of plain text Markdown files.&lt;/p&gt;
&lt;div id=&#34;installing-obsidian&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Installing Obsidian&lt;/h2&gt;
&lt;p&gt;First things first, you need to install Obsidian. Visit the &lt;a href=&#34;https://obsidian.md&#34;&gt;Obsidian website&lt;/a&gt; and download the installer for your operating
system. Run the installer, and once the installation is complete, launch
Obsidian.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;creating-your-first-vault&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Creating Your First Vault&lt;/h2&gt;
&lt;p&gt;When you open Obsidian for the first time, it will prompt you to create a new
Vault. A Vault is essentially a folder that contains all your notes. Choose a
location for your Vault, give it a name, and click “Create”.&lt;/p&gt;</description>
            <content type="html"><![CDATA[


<p>In an age where information is everywhere, note-taking has become an essential
skill. Obsidian is a powerful note-taking app and knowledge base that works
on top of a local folder of plain text Markdown files.</p>
<div id="installing-obsidian" class="section level2">
<h2>Installing Obsidian</h2>
<p>First things first, you need to install Obsidian. Visit the <a href="https://obsidian.md">Obsidian website</a> and download the installer for your operating
system. Run the installer, and once the installation is complete, launch
Obsidian.</p>
</div>
<div id="creating-your-first-vault" class="section level2">
<h2>Creating Your First Vault</h2>
<p>When you open Obsidian for the first time, it will prompt you to create a new
Vault. A Vault is essentially a folder that contains all your notes. Choose a
location for your Vault, give it a name, and click “Create”.</p>
</div>
<div id="exploring-the-interface" class="section level2">
<h2>Exploring the Interface</h2>
<p>Obsidian’s interface is clean and intuitive. On the left, you have the file explorer where you can see all your notes and folders. In the center is the editor where you write your notes. On the right, you’ll find the backlinks panel and various other options.</p>
</div>
<div id="creating-and-editing-notes" class="section level2">
<h2>Creating and Editing Notes</h2>
<p>To create a new note, click the “New note” icon in the upper left corner. This creates a new file in your Vault. You can start typing immediately.</p>
<p>Obsidian uses Markdown for formatting, which is a lightweight markup language with plain text formatting syntax. For example, use <code>#</code> for headers, <code>*</code> for bullet lists, and <code>[link text](URL)</code> for hyperlinks.</p>
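<p>As a rough illustration, a short note that uses those three pieces of syntax might look like the following. The note title and list items are made up for this example.</p>
<pre><code class="language-markdown"># Reading list

* Finish the sequencing notes
* Browse [the Obsidian website](https://obsidian.md)
</code></pre>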
</div>
<div id="linking-your-thoughts" class="section level2">
<h2>Linking Your Thoughts</h2>
<p>One of the powerful features of Obsidian is the ability to link notes together. To link to another note, type <code>[[</code> followed by the name of the note you want to link to. This helps create a network of related notes, building a connected knowledge base.</p>
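<p>For instance, assuming your Vault already contains a note called “Reading list” (a hypothetical name used only for this sketch), a link to it from another note could be written like this, with the note name between double square brackets:</p>
<pre><code class="language-markdown">Ideas from today's meeting are collected in [[Reading list]] for later review.
</code></pre>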
</div>
<div id="exploring-your-knowledge-graph" class="section level2">
<h2>Exploring Your Knowledge Graph</h2>
<p>As you create links between notes, Obsidian builds a visual graph of these connections. Click on the “Open graph view” icon on the left sidebar to see your knowledge graph. This graph can be an excellent way to see the big picture and discover unexpected connections between your thoughts.</p>
</div>
<div id="plugins-and-customization" class="section level2">
<h2>Plugins and Customization</h2>
<p>Obsidian supports plugins, which can add new features or modify existing ones. Go to “Settings” &gt; “Community plugins” to browse and install them. You can also change Obsidian’s appearance with themes: go to “Settings” &gt; “Appearance” to explore and select one.</p>
</div>
<div id="syncing-and-backing-up-your-notes" class="section level2">
<h2>Syncing and Backing Up Your Notes</h2>
<p>It’s crucial to keep your notes safe. Obsidian stores your notes as plain text files in your Vault, so you can easily back them up or sync them using services like Dropbox or Google Drive.</p>
</div>
<div id="conclusion" class="section level2">
<h2>Conclusion</h2>
<p>Obsidian is a flexible tool for note-taking and building a personal knowledge base. Its use of Markdown and its ability to link notes make it especially well suited to connecting ideas and information. As you get more comfortable with Obsidian, you’ll find it invaluable for organizing your thoughts and knowledge. Happy note-taking!</p>
</div>
]]></content>
        </item>
        
    </channel>
</rss>
