MicroSeq design: algorithm and architecture choices for Sanger assembly

This document explains why MicroSeq uses a staged assembly strategy for paired Sanger reads and how that compares with common alternatives.

Current strategy (implemented)

For the canonical trigger-to-status matrix and artifact-level routing fields, see Workflow Resolution Funnel -> “Assemble/validate status routing matrix” in docs/workflow_resolution.md.

Fast overlap merge first (merge_two_reads):
- End-anchored, ungapped overlap search across forward/revcomp orientations.
- Candidate ranking by overlap length, mismatch count, and overlap-quality tie-breakers.
- Explicit ambiguous overlap detection when top candidates are not uniquely best.
Policy-aware merge decision:
- Terminal hard policy in strict mode (quality_low in blocking).
- High-confidence conflict guardrail (high_conflict_q_threshold, high_conflict_action).
- IUPAC ambiguity for low-confidence quality ties.
CAP3 fallback for fast-path outcomes that did not merge (including ambiguous_overlap/high_conflict when routed).
Post-CAP3 validation:
- Accept a CAP3 contig only when both source reads are represented in it.
- Mark failures as cap3_unverified and keep the singlet fallback.
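
To make the fast-path idea concrete, here is a minimal sketch of an end-anchored, ungapped overlap search with lexicographic ranking and a simple ambiguity check. It is not the actual merge_two_reads implementation: the names find_candidates and pick_overlap, the min_overlap and max_mismatch_frac parameters, and the "candidate" and "no_overlap" labels are all assumptions made for illustration, and the overlap-quality tie-breakers are left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq: str) -> str:
    """Reverse-complement an uppercase ACGTN sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass(frozen=True)
class Candidate:
    orientation: str   # orientation of the second read: "fwd" or "revcomp"
    overlap_len: int   # number of overlapping bases
    mismatches: int    # mismatching positions inside the overlap


def find_candidates(a: str, b: str, min_overlap: int = 20,
                    max_mismatch_frac: float = 0.1) -> List[Candidate]:
    """Compare each suffix of `a` with the same-length prefix of `b`, no gaps."""
    found = []
    for orientation, other in (("fwd", b), ("revcomp", revcomp(b))):
        for k in range(min_overlap, min(len(a), len(other)) + 1):
            mismatches = sum(x != y for x, y in zip(a[-k:], other[:k]))
            if mismatches <= max_mismatch_frac * k:
                found.append(Candidate(orientation, k, mismatches))
    # Rank: longest overlap first, then fewest mismatches.
    found.sort(key=lambda c: (-c.overlap_len, c.mismatches))
    return found


def pick_overlap(a: str, b: str, min_overlap: int = 20,
                 max_mismatch_frac: float = 0.1
                 ) -> Tuple[str, Optional[Candidate]]:
    """Return (label, best candidate); a tie at the top is reported as ambiguous."""
    cands = find_candidates(a, b, min_overlap, max_mismatch_frac)
    if not cands:
        return "no_overlap", None          # hypothetical label
    if len(cands) > 1:
        top, runner_up = cands[0], cands[1]
        if (top.overlap_len, top.mismatches) == (runner_up.overlap_len,
                                                 runner_up.mismatches):
            return "ambiguous_overlap", None
    return "candidate", cands[0]           # hypothetical label
```

In this sketch a tie can only arise between the two orientations, because each orientation yields at most one candidate per overlap length. A real ranking that also weighs overlap quality would need a correspondingly richer notion of "not uniquely best".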

Why this architecture

Strengths

- Speed on clean data: most pairs resolve in the fast path.
- Robust rescue on messy tails/indels: CAP3 handles difficult Sanger edge cases.
- Auditability: explicit statuses (merged, ambiguous_overlap, high_conflict, quality_low, cap3_unverified) are easier to reason about than opaque sensitivity knobs.
- Information retention: IUPAC retains more evidence than blanket N at low-confidence ties.
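
The information-retention point can be illustrated with a small sketch of resolving one mismatched overlap position. This is only an assumed rule, not MicroSeq's actual policy: the function name resolve_base and the tie_margin parameter are hypothetical, and the real logic may also consider absolute quality, not just the quality difference.

```python
# Two-base IUPAC ambiguity codes, keyed by the unordered pair of bases.
IUPAC_PAIR = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def resolve_base(b1: str, q1: int, b2: str, q2: int, tie_margin: int = 3) -> str:
    """Pick a consensus base for one overlap position from two (base, Phred) calls."""
    if b1 == b2:
        return b1
    if abs(q1 - q2) <= tie_margin:
        # Near-tie: keep both observations as an IUPAC code; fall back to N
        # when the pair has no two-base code (for example, one call is N).
        return IUPAC_PAIR.get(frozenset((b1, b2)), "N")
    # Clear quality winner: trust the higher-quality call.
    return b1 if q1 > q2 else b2
```

For an A/G disagreement at similar quality, this returns R, which still tells downstream tools the position is a purine. A blanket N would discard that.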

Tradeoffs

- More moving parts than a single monolithic assembler.
- Requires clear telemetry and tests to avoid state drift between fast and fallback paths.

Comparison with common approaches

1) Single-pass greedy overlap assembler

Pros

- Simple operational model (one engine).
- Mature tools exist.

Cons

- Pays the full assembly cost on every pair, even when most pairs would resolve with a simple overlap merge.
- Harder to expose explicit failure classes for automated QC routing.

2) Strict expected-errors gating before assembly

Pros

- Easy global quality filter.

Cons

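- Length-coupled rejection can be too punitive for long Sanger reads: expected errors sum per-base error probabilities, so the total grows with read length even when per-base quality is unchanged.
- May discard salvageable read pairs where overlap-local evidence is strong.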
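
The length coupling is easy to see numerically. The sketch below uses the standard Phred relation (error probability = 10^(-Q/10)) and sums it over a read. The function name expected_errors and the uniform-quality reads are illustrative assumptions, not MicroSeq code or data.

```python
from typing import Iterable


def expected_errors(quals: Iterable[int]) -> float:
    """Sum of per-base error probabilities implied by Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


# Two hypothetical reads with identical per-base quality (Q20 everywhere).
short_read = [20] * 300
long_read = [20] * 900

ee_short = expected_errors(short_read)   # about 3 expected errors
ee_long = expected_errors(long_read)     # about 9 expected errors

# The per-base error rate is the same for both reads, yet any fixed
# whole-read threshold between the two totals rejects only the long read.
rate_short = ee_short / len(short_read)
rate_long = ee_long / len(long_read)
```

Applying the same function to only the overlap region's quality scores would give an overlap-local measure that does not penalize a read for its length.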

3) Reference-guided assembly

Pros

- Can resolve ambiguous de novo overlaps.
- Useful for targeted loci with trusted references.

Cons

- Introduces reference bias.
- Requires curation/versioning of reference sets and may hide novel variation.

When to use reference-guided mode later

Add as an optional branch when:

- overlap ambiguity is frequent,
- a validated locus reference panel exists,
- or downstream analysis tolerates reference bias.

Keep de novo staged assembly as the default for broad, transparent Sanger workflows.