MicroSeq design: algorithm and architecture choices for Sanger assembly
This document explains why MicroSeq uses a staged assembly strategy for paired Sanger reads and how that compares with common alternatives.
Current strategy (implemented)
For the canonical trigger-to-status matrix and artifact-level routing fields, see Workflow Resolution Funnel -> “Assemble/validate status routing matrix” in docs/workflow_resolution.md.
- Fast overlap merge first (`merge_two_reads`):
  - End-anchored, ungapped overlap search across forward/revcomp orientations.
  - Candidate ranking by overlap length, mismatch count, overlap-quality tie-breakers.
  - Explicit ambiguous overlap detection when top candidates are not uniquely best.
- Policy-aware merge decision:
  - Terminal hard policy in strict mode (`quality_low` in `blocking`).
  - High-confidence conflict guardrail (`high_conflict_q_threshold`, `high_conflict_action`).
  - IUPAC ambiguity for low-confidence quality ties.
- CAP3 fallback for non-merged fast-path outcomes (including ambiguous/high_conflict when routed).
- Post-CAP3 validation:
  - Accept CAP3 contig only when both source reads are represented.
  - Mark failures as `cap3_unverified` and keep singlet fallback.
Why this architecture
Strengths
- Speed on clean data: most pairs resolve in the fast path.
- Robust rescue on messy tails/indels: CAP3 handles difficult Sanger edge cases.
- Auditability: explicit statuses (`merged`, `ambiguous_overlap`, `high_conflict`, `quality_low`, `cap3_unverified`) are easier to reason about than opaque sensitivity knobs.
- Information retention: IUPAC retains more evidence than blanket `N` at low-confidence ties.
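
To illustrate the information-retention point, here is a minimal sketch of resolving one mismatched overlap position. The function name `resolve_base` and the `tie_margin` parameter are hypothetical and not taken from MicroSeq; the high-confidence conflict guardrail is deliberately left out. The two-base IUPAC codes themselves are standard.

```python
# Standard IUPAC codes for two-base ambiguities.
IUPAC_PAIR = {
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
}


def resolve_base(base_a: str, qual_a: int, base_b: str, qual_b: int,
                 tie_margin: int = 3) -> str:
    """Pick a consensus base for one overlap position (Phred qualities)."""
    if base_a == base_b:
        return base_a
    # Clear quality winner: trust the better-supported call.
    if abs(qual_a - qual_b) > tie_margin:
        return base_a if qual_a > qual_b else base_b
    # Quality tie: keep both candidates as an IUPAC code instead of N.
    return IUPAC_PAIR.get(frozenset((base_a, base_b)), "N")


print(resolve_base("A", 12, "G", 11))  # tie between A and G
print(resolve_base("A", 40, "G", 10))  # A clearly wins
```

A tie between `A` and `G` is emitted as `R`, which still tells downstream tools that the position is a purine, whereas `N` would discard that evidence.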
Tradeoffs
- More moving parts than a single monolithic assembler.
- Requires clear telemetry and tests to avoid state drift between fast and fallback paths.
Comparison with common approaches
1) Single-pass greedy overlap assembler
Pros
- Simple operational model (one engine).
- Mature tools exist.
Cons
- Pays the full assembly cost on every pair, even when most pairs would resolve with a simple overlap merge.
- Harder to expose explicit failure classes for automated QC routing.
2) Strict expected-errors gating before assembly
Pros
- Easy global quality filter.
Cons
- Expected errors accumulate with read length, so a fixed threshold can be too punitive for long Sanger reads.
- May discard salvageable read pairs where overlap-local evidence is strong.
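
The length coupling can be seen with a small worked example. Expected errors are the sum of per-base error probabilities, `10 ** (-Q / 10)` for Phred score `Q`. The threshold `max_ee = 0.5`, the read lengths, and the 100-base overlap window below are illustrative assumptions, not MicroSeq settings.

```python
def expected_errors(quals: list[int]) -> float:
    """Sum of per-base error probabilities for a list of Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


def passes_gate(quals: list[int], max_ee: float = 0.5) -> bool:
    """Global expected-errors filter (hypothetical threshold)."""
    return expected_errors(quals) <= max_ee


short_read = [30] * 250   # uniform Q30, 250 bases
long_read = [30] * 900    # same per-base quality, 900 bases

print(round(expected_errors(short_read), 3), passes_gate(short_read))
print(round(expected_errors(long_read), 3), passes_gate(long_read))

# Overlap-local view: only the last 100 bases of the long read.
print(round(expected_errors(long_read[-100:]), 3))
```

Both reads have identical per-base quality, yet the longer one accumulates about 0.9 expected errors and fails the global gate, while its 100-base overlap window carries only about 0.1.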
3) Reference-guided assembly
Pros
- Can resolve ambiguous de novo overlaps.
- Useful for targeted loci with trusted references.
Cons
- Introduces reference bias.
- Requires curation/versioning of reference sets and may hide novel variation.
When to use reference-guided mode later
Add as an optional branch when:
- overlap ambiguity is frequent,
- a validated locus reference panel exists,
- or downstream analysis tolerates reference bias.
Keep de novo staged assembly as the default for broad, transparent Sanger workflows.