Workflow Resolution Funnel
Updated MicroSeq workflow mental model
MicroSeq now follows a 4-stage funnel so the default user view stays simple while preserving audit-grade details.
For a consolidated file-by-file output table across QC/assembly/BLAST/postblast, see Output Artifacts Reference.
1) Ingest (Trace QC)
MicroSeq computes AB1 trace-QC metrics (signal/noise, mixed-signal proxies, vendor tags when present).
Threshold flags are optional (--trace-qc-flags or trace_qc.enable_flags).
Mixture escalation is config-driven:
- trace_qc.enable_mixture_inference
- trace_qc.mixture_suspect_threshold
SNR modes (trace_qc.snr_mode)
MicroSeq supports two SNR estimators for AB1 trace QC:
- global_mad
  - Signal: median primary peak height in the QC core window.
  - Noise: baseline MAD from non-peak regions in the same core window.
  - Use when: you want a conservative, waveform-level background-noise indicator that is easy to compare across files.
- basecall_aware
  - For each called base near PLOC, signal is the local max in the called base channel and noise is a statistic over non-called channels.
  - Per-base ratios are aggregated (median) over the QC core window.
  - Use when: you want SNR to track called-base separability/readability more closely (often better aligned with human chromatogram review).
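The two estimators can be sketched in a few lines of Python. This is an illustration only, not MicroSeq's implementation: the function names, the `half_window` parameter, the input layout (a dict of per-channel intensity lists plus `(ploc, base)` calls), and the choice of "largest non-called channel intensity" as the noise statistic are all assumptions made for the example.

```python
from statistics import median


def mad(values):
    """Median absolute deviation around the median."""
    values = list(values)
    center = median(values)
    return median(abs(v - center) for v in values)


def snr_global_mad(peak_heights, baseline_samples):
    """Median primary peak height divided by the MAD of non-peak samples."""
    noise = mad(baseline_samples)
    if noise == 0:
        return float("inf")
    return median(peak_heights) / noise


def snr_basecall_aware(channels, calls, half_window=2):
    """Median of per-base signal/noise ratios.

    channels: dict mapping base letter -> list of intensities (equal lengths)
    calls: list of (ploc, base) pairs inside the QC core window
    half_window: hypothetical search radius around each PLOC
    """
    ratios = []
    for ploc, base in calls:
        lo = max(0, ploc - half_window)
        hi = ploc + half_window + 1
        signal = max(channels[base][lo:hi])
        # Assumption: noise is the largest local intensity in any other channel.
        noise = max(max(trace[lo:hi]) for b, trace in channels.items() if b != base)
        ratios.append(signal / noise if noise > 0 else float("inf"))
    return median(ratios)
```

Because the two functions divide by different noise definitions, their outputs are not on a common scale, which is why the thresholds are configured per mode.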
Practical guidance:
- Use basecall_aware as the default decision mode for PASS/WARN/FAIL in typical Sanger workflows.
- Keep global_mad available for diagnostics/troubleshooting as a conservative background-noise check.
- Only one mode is active per run; compare both by rerunning QC with each mode and reviewing qc/trim_summary.tsv and AB1_TRACE_DIAG logs.
- Use mode-specific thresholds (snr_warn_global_mad / snr_fail_global_mad vs snr_warn_basecall_aware / snr_fail_basecall_aware) because the scales differ.
- AB1 diagnostics log both SNR estimators (SNR_global_mad and SNR_basecall_aware) and the selected mode value used for flagging.
- Low-SNR behavior with missing quality evidence is policy-driven via low_snr_missing_quality_policy (warn by default, optional hard-fail floor with snr_fail_hard).
- Decision policy: canonical mode decides FAIL; disagreement between estimators escalates to non-blocking WARN (SNR_MODE_DISCORDANT_WARN), and secondary mode never downgrades a canonical FAIL to PASS.
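The decision policy in the last bullet can be illustrated with a small sketch. The function names and the numeric thresholds in the usage example are hypothetical, and "disagreement" is assumed here to mean that the two modes land on different PASS/WARN/FAIL labels; the actual discordance test may differ.

```python
def classify(snr, warn, fail):
    """Map an SNR value to PASS/WARN/FAIL using one mode's thresholds."""
    if snr < fail:
        return "FAIL"
    if snr < warn:
        return "WARN"
    return "PASS"


def snr_decision(values, thresholds, canonical="basecall_aware"):
    """Return (status, flags) for one trace.

    values: dict mode -> SNR value
    thresholds: dict mode -> (warn, fail) pair on that mode's own scale
    """
    secondary = "global_mad" if canonical == "basecall_aware" else "basecall_aware"
    status = classify(values[canonical], *thresholds[canonical])
    other = classify(values[secondary], *thresholds[secondary])
    flags = []
    if status != other:
        # Discordance is advisory: it can lift PASS to WARN,
        # but it never changes a canonical FAIL.
        flags.append("SNR_MODE_DISCORDANT_WARN")
        if status == "PASS":
            status = "WARN"
    return status, flags


# Hypothetical thresholds, chosen only to exercise the logic.
example_thresholds = {"basecall_aware": (10.0, 5.0), "global_mad": (30.0, 15.0)}
print(snr_decision({"basecall_aware": 3.0, "global_mad": 50.0}, example_thresholds))
```

In the usage line the canonical mode fails while the secondary mode passes, so the result stays FAIL and only gains the discordance flag.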
2) Assemble (Structural hypotheses)
MicroSeq generates multiple hypotheses only when there is a structural tie (for example, ambiguous overlap outcomes). If there is only one structural path, the sample is structurally unambiguous.
Assemble → Validate handoff: how ambiguous_overlap is decided
Candidate generation in 3 stages
Stage 1 Generate candidate overlaps (the “sliding placements” step)
MicroSeq tries multiple orientations (R as-is vs revcomp(R)) and multiple relative placements (sliding the reverse read along the forward read). Each placement defines an overlap region. For that overlap, MicroSeq computes overlap length, mismatches, identity, and (when QUAL is available) an overlap-quality score. Candidates that do not connect the reads at their ends are rejected by the end-anchoring rule (anchor_tolerance_bases allows small end drift due to trimming).
Definition: a candidate is one proposed end-anchored overlap between F and (R or revcomp(R)), characterized by relative placement (and, for gapped backends, an alignment path), from which overlap_len, mismatches, identity, and overlap_quality are computed.
- It evaluates both orientations for the reverse read (forward and revcomp).
- Ungapped intuition: slide one read across the other (... left, center, right ...), where each placement implies a different overlap slice.
- Gapped backends may represent differences with an indel path, then anchoring is checked using end-anchored placement rules.
Important: this is not “use only a 30 bp overlap.” anchor_tolerance_bases is a forgiveness margin for end placement (how close overlap must be to read ends after trimming), not the overlap length itself. Overlaps can still be much longer (for example 100+ bp).
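A minimal sketch of ungapped candidate generation is shown here, under simplifying assumptions: the oriented reverse read is placed at each non-negative offset in the forward read, the overlap runs until either read ends, and overlap quality and the anchoring check are left out. The function names and the dictionary layout are illustrative, not MicroSeq's internal types.

```python
def revcomp(seq):
    """Reverse complement of an A/C/G/T/N sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[b] for b in reversed(seq.upper()))


def ungapped_candidates(forward, reverse):
    """Enumerate one candidate per orientation and placement."""
    candidates = []
    for orientation, r in (("forward", reverse), ("revcomp", revcomp(reverse))):
        for offset in range(len(forward)):
            f_slice = forward[offset:offset + len(r)]
            r_slice = r[:len(f_slice)]
            overlap_len = len(f_slice)
            if overlap_len == 0:
                continue
            mismatches = sum(a != b for a, b in zip(f_slice, r_slice))
            candidates.append({
                "orientation": orientation,
                "offset": offset,
                "overlap_len": overlap_len,
                "mismatches": mismatches,
                "identity": (overlap_len - mismatches) / overlap_len,
            })
    return candidates


# Toy reads: the reverse read's reverse complement is CCGGTTGG.
for cand in ungapped_candidates("AAAACCGGTT", "CCAACCGG"):
    if cand["mismatches"] == 0:
        print(cand)
```

Each dictionary corresponds to one "candidate" in the sense of the definition above; a fuller version would attach overlap_quality when QUAL values are available and discard placements that fail end-anchoring.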
Stage 2 Feasibility filtering (hard gates)
From all generated candidates, MicroSeq keeps only feasible candidates:
- overlap_len >= min_overlap
- identity >= min_identity
- quality gate only when quality_mode=blocking: overlap_quality must exist and be >= min_quality
- end-anchored requirement (end_anchored=True)
Under current repo defaults (min_overlap=100, min_identity=0.8, min_quality=20.0,
quality_mode=warning), quality is advisory by default, while overlap length/identity and
end-anchoring remain hard feasibility checks.
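The hard gates can be expressed as a single predicate. The sketch below uses the default values quoted above; the function name and the candidate dictionary keys are assumptions made for illustration.

```python
def is_feasible(candidate, min_overlap=100, min_identity=0.8,
                min_quality=20.0, quality_mode="warning"):
    """Apply the Stage 2 hard gates to one candidate dictionary."""
    if not candidate.get("end_anchored", False):
        return False
    if candidate["overlap_len"] < min_overlap:
        return False
    if candidate["identity"] < min_identity:
        return False
    if quality_mode == "blocking":
        # Quality gates only in blocking mode; a missing score also fails.
        quality = candidate.get("overlap_quality")
        if quality is None or quality < min_quality:
            return False
    return True


candidate = {"overlap_len": 120, "identity": 0.98,
             "overlap_quality": None, "end_anchored": True}
print(is_feasible(candidate))                           # warning mode
print(is_feasible(candidate, quality_mode="blocking"))  # blocking mode
```

The same candidate passes in warning mode and is rejected in blocking mode, because only blocking mode requires an overlap-quality score to exist.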
Stage 3 Final decision (unique best vs ambiguous_overlap)
Feasible candidates are ranked deterministically (longer overlap first, then fewer mismatches, then overlap quality, then identity), and ambiguity is evaluated only on the top-2 candidates.
Tie rule for ambiguous_overlap:
- top-1 and top-2 have the same overlap_len
- top-1 and top-2 have the same mismatches
- and either:
  - both have quality and |q1 - q2| <= ambiguity_quality_epsilon, or
  - quality is unavailable and |id1 - id2| <= ambiguity_identity_delta
If all hold, selector returns status=ambiguous_overlap; otherwise top-1 is the unique winner.
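The ranking and tie rule can be sketched as follows. The function names, the "unique" and "no_feasible_candidate" labels, and the handling of a pair where only one candidate has quality (treated as "quality unavailable") are assumptions; the default thresholds are the values quoted in this document's worked example.

```python
def rank_key(c):
    """Longer overlap, then fewer mismatches, then higher quality, then higher identity."""
    quality = c.get("overlap_quality")
    quality_rank = -quality if quality is not None else float("inf")
    return (-c["overlap_len"], c["mismatches"], quality_rank, -c["identity"])


def select(feasible, ambiguity_quality_epsilon=0.10, ambiguity_identity_delta=0.0025):
    """Return (status, top candidate) after comparing only the top two."""
    ranked = sorted(feasible, key=rank_key)
    if not ranked:
        return "no_feasible_candidate", None  # hypothetical label
    if len(ranked) == 1:
        return "unique", ranked[0]            # hypothetical label
    a, b = ranked[0], ranked[1]
    same_shape = (a["overlap_len"] == b["overlap_len"]
                  and a["mismatches"] == b["mismatches"])
    qa, qb = a.get("overlap_quality"), b.get("overlap_quality")
    if qa is not None and qb is not None:
        near_tie = abs(qa - qb) <= ambiguity_quality_epsilon
    else:
        # Assumption: if either score is missing, fall back to identity.
        near_tie = abs(a["identity"] - b["identity"]) <= ambiguity_identity_delta
    if same_shape and near_tie:
        return "ambiguous_overlap", a
    return "unique", a


cand_a = {"overlap_len": 120, "mismatches": 2, "overlap_quality": 34.20, "identity": 0.9833}
cand_b = {"overlap_len": 120, "mismatches": 2, "overlap_quality": 34.16, "identity": 0.9833}
print(select([cand_b, cand_a])[0])
```

Sorting with a fixed tuple key is what makes the outcome independent of input order: the same feasible set always produces the same top-2 pair.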
Working examples: ungapped sliding vs gapped alignment backends
Legend: | = match, x = mismatch, - = gap (indel introduced by aligner).
These diagrams illustrate candidate types; exact mismatch/score values can vary by backend and scoring settings.
A) Ungapped, end-anchored sliding candidates (placement-driven)
For ungapped candidates, the generator effectively slides one sequence relative to the other and scores each valid end-anchored placement. For a given orientation and placement, you get one ungapped candidate.
Toy example (repetitive tail so nearby placements can look similarly good):
- F = ATATATATATGGGG
- Rrc = ATATATATCCCCCC
Candidate A (placement 1: Rrc starts aligned at the beginning of F):
- F_overlap = ATATATATAT
- R_overlap = ATATATATCC
F:   ATATATATATGGGG
Rrc: ATATATATCCCCCC
     ||||||||xx
Candidate B (placement 2: Rrc starts two bases later in F, that is, one full AT repeat unit):
- F_overlap = ATATATATGG
- R_overlap = ATATATATCC
F_overlap: ATATATATGG
R_overlap: ATATATATCC
           ||||||||xx
In repetitive sequence, A and B can be near-tied in identity/mismatches/quality, which is exactly how ambiguous overlap candidates arise in Stage 1 before tie-resolution in Stage 3.
B) Gapped backend candidates (Biopython/edlib): indels are allowed
Gapped backends can explain insertion/deletion differences with gaps, instead of forcing a mismatch cascade.
Example:
- F = ATGCCCTTAG
- Rrc = ATGCCTTAG (one base shorter around the C run)
Ungapped intuition (no indels allowed):
In ungapped mode, length differences manifest as forced mismatches/edge penalties (the trailing -
below is only to keep the visualization aligned):
F:   ATGCCCTTAG
Rrc: ATGCCTTAG-
     |||||x|xx-
Gapped alignment can model this as one indel event:
F:   ATGCCCTTAG
Rrc: ATG-CCTTAG
     ||| ||||||
This can produce a stronger candidate than any ungapped placement for the same read pair.
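The difference between the two views can be checked numerically on the toy pair. The sketch below is plain Python and does not call Biopython or edlib: it compares a position-by-position mismatch count (no gaps) with a textbook Levenshtein edit distance (unit-cost substitutions and indels). The helper names are made up for this example, and real backends use their own scoring schemes.

```python
def ungapped_mismatches(a, b):
    """Mismatches over the shared length when no gaps are allowed."""
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # deletion from a
                           cur[j - 1] + 1,           # insertion into a
                           prev[j - 1] + (x != y)))  # match or substitution
        prev = cur
    return prev[-1]


f = "ATGCCCTTAG"
rrc = "ATGCCTTAG"
print(ungapped_mismatches(f, rrc))  # 3 forced mismatches after the C run
print(edit_distance(f, rrc))        # 1 edit: a single-base indel
```

One indel explains the pair more cheaply than three substitutions, which is the sense in which a gapped backend can yield a stronger candidate here.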
C) Repeats/homopolymers can create near-equivalent gapped interpretations
In low-complexity sequence, an indel can often be represented in nearby positions with similar scores. Conceptually:
Path 1: A AAAAGGGTT
        - AAAAGGGTT
Path 2: AAAA AGGGTT
        AAA- AGGGTT
Both describe the same underlying deletion in a homopolymer context.
Implementation note: in practice MicroSeq typically emits one best alignment per backend per orientation (and sometimes a small bounded set), rather than enumerating a combinatorial set of all optimal paths. So many practical near-ties come from relative-placement/orientation near-ties, while gapped backends still explain why mismatch/indel trade-offs can differ from ungapped sliding.
D) Why end-anchoring still matters
End-anchoring constrains relative placement (overlap must reach the expected stitching ends within tolerance so reads can connect into one contiguous sequence). Within that constraint:
- ungapped mode explores placement shifts,
- gapped mode can change mismatch/indel trade-offs and effective overlap metrics.
So candidate sets can differ by backend even for the same sample and orientation.
Plain-language end-anchoring rule:
- Accept when overlap reaches the expected stitching ends (within anchor_tolerance_bases) so the reads connect into one contiguous sequence.
- Reject when a high-identity block is internal-only and does not connect those stitching ends.
ACCEPT (end-anchored):
F:   AAAAACCCCCGGGGG
          |||||||
Rrc:      CCCCCGGTTTTT
overlap reaches the expected stitching end (or is within tolerance)
REJECT (internal-only match):
F:   AAAAACCCCCGGGGGTTTTT
          |||||||
Rrc:  XXYYCCCCCGGZZWW
good local match, but does not connect stitching ends
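One way to picture the rule is as a drift check on overlap coordinates. The sketch below assumes the forward read stitches at its 3' end and the oriented reverse read at its 5' start; the function name, the coordinate convention (0-based, end-exclusive), and the toy tolerance of 3 are illustrative and not MicroSeq's actual geometry or default.

```python
def is_end_anchored(f_len, f_overlap_end, r_overlap_start, anchor_tolerance_bases):
    """Accept when the overlap reaches both stitching ends within tolerance.

    f_len: length of the forward read
    f_overlap_end: end-exclusive index of the overlap on the forward read
    r_overlap_start: start index of the overlap on the oriented reverse read
    """
    f_drift = f_len - f_overlap_end   # unmatched bases left at the end of F
    r_drift = r_overlap_start         # unmatched bases before the overlap on Rrc
    return f_drift <= anchor_tolerance_bases and r_drift <= anchor_tolerance_bases


# ACCEPT diagram: F has 15 bases, the CCCCCGG block covers F[5:12], Rrc starts at 0.
print(is_end_anchored(15, 12, 0, anchor_tolerance_bases=3))
# REJECT diagram: F has 20 bases, same block at F[5:12], Rrc block starts at 4.
print(is_end_anchored(20, 12, 4, anchor_tolerance_bases=3))
```

The accepted case leaves 3 unmatched bases at the end of F, which fits inside the toy tolerance; the rejected case leaves 8 on F and 4 on Rrc, so the match is internal-only.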
Worked example (ambiguous)
Use the same feasible-candidate context as above (passes Stage 2 gates):
- Candidate A: overlap_len=120, mismatches=2, overlap_quality=34.20, identity=0.9833
- Candidate B: overlap_len=120, mismatches=2, overlap_quality=34.16, identity=0.9833
Configured ambiguity thresholds:
- ambiguity_quality_epsilon = 0.10
- ambiguity_identity_delta = 0.0025
Checks:
- same length? yes (120 == 120)
- same mismatches? yes (2 == 2)
- quality near-tie? yes (|34.20 - 34.16| = 0.04 <= 0.10)
Result: ambiguous_overlap.
Counterexample (not ambiguous)
- A: overlap_len=120, mismatches=2
- B: overlap_len=119, mismatches=2
Different overlap length breaks the ambiguity rule immediately, so A is selected as the unique best candidate.
What ambiguous policies do (merge_two_reads)
After ambiguous_overlap, policy controls output behavior:
- strict: keep ambiguous outcome; no forced merged sequence output.
- singlets: emit singlets (ambiguous_overlap_singlets).
- best_guess: force top-1 candidate and emit one consensus (merged_best_guess).
- topk: emit alt1..altK consensus sequences (ambiguous_topk), where K=ambiguous_top_k (bounded by available feasible candidates).
When topk is used, each emitted alternative is tracked as a structural branch through
hypothesis_map (qseqid -> structural_hypothesis_id). Those branches are then independently
validated by BLAST/taxonomy and collapsed into sample-level resolution state.
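A compact way to read the four policies is as a dispatch from policy name to a status label plus the candidates that become consensus outputs. This is a sketch, not the merge_two_reads source: the function name and return shape are hypothetical, and the label returned for strict is assumed to be ambiguous_overlap because that policy keeps the ambiguous outcome.

```python
def apply_ambiguous_policy(policy, ranked_feasible, ambiguous_top_k):
    """Return (status_label, candidates to emit as consensus sequences)."""
    if policy == "strict":
        return "ambiguous_overlap", []
    if policy == "singlets":
        # The reads are emitted unmerged, so no consensus candidate is chosen.
        return "ambiguous_overlap_singlets", []
    if policy == "best_guess":
        return "merged_best_guess", ranked_feasible[:1]
    if policy == "topk":
        k = min(ambiguous_top_k, len(ranked_feasible))
        return "ambiguous_topk", ranked_feasible[:k]
    raise ValueError(f"unknown ambiguous policy: {policy}")


label, emitted = apply_ambiguous_policy("topk", ["cand_a", "cand_b"], ambiguous_top_k=3)
names = [f"alt{i}" for i, _ in enumerate(emitted, start=1)]
print(label, names)
```

With two feasible candidates and ambiguous_top_k=3, only alt1 and alt2 are produced, which is the "bounded by available feasible candidates" behavior.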
3) Validate (BLAST + taxonomy)
MicroSeq runs BLAST and taxonomy assignment on every sequence submitted as BLAST input, then ranks the best hit per hypothesis deterministically, applying these keys in order:
- bitscore (desc)
- pident (desc)
- qcovhsp (desc)
- evalue (asc)
- qseqid (asc tie-break)
Taxonomy agreement can be evaluated at a configured rank (species by default) using parsed lineage tokens (k__, p__, c__, o__, f__, g__, s__).
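Both steps can be sketched briefly: picking the best hit with a composite sort key, and extracting a label at the configured rank from prefixed lineage tokens. The function names, the hit dictionary layout, and the semicolon separator between lineage tokens are assumptions made for this example.

```python
RANK_PREFIX = {
    "kingdom": "k__", "phylum": "p__", "class": "c__", "order": "o__",
    "family": "f__", "genus": "g__", "species": "s__",
}


def best_hit(hits):
    """Pick the top hit: bitscore, pident, qcovhsp descending; evalue, qseqid ascending."""
    return min(hits, key=lambda h: (-h["bitscore"], -h["pident"], -h["qcovhsp"],
                                    h["evalue"], h["qseqid"]))


def label_at_rank(lineage, rank="species", sep=";"):
    """Return the label for one rank, or None when the token is absent or empty."""
    prefix = RANK_PREFIX[rank]
    for token in lineage.split(sep):
        token = token.strip()
        if token.startswith(prefix):
            return token[len(prefix):].strip() or None
    return None


hits = [
    {"qseqid": "s1_alt2", "bitscore": 1800.0, "pident": 99.1, "qcovhsp": 100, "evalue": 0.0},
    {"qseqid": "s1_alt1", "bitscore": 1800.0, "pident": 99.6, "qcovhsp": 100, "evalue": 0.0},
]
print(best_hit(hits)["qseqid"])
print(label_at_rank("k__Bacteria;g__Escherichia;s__Escherichia coli"))
```

Ending the key with qseqid means two hits that are identical on every score still sort the same way on every run.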
4) Resolve (sample-level state)
At sample level, MicroSeq assigns:
- unambiguous
- resolved_by_evidence
- needs_review
And stores:
- resolution_state
- resolved_hypothesis
- resolution_reason
These fields are emitted to review/summary TSV outputs so the UI can stay quiet by default.
What changed for users (practical impact)
The latest contract updates are mainly about predictability and traceability:
- hypothesis_map now has one meaning everywhere
  - It maps qseqid -> structural_hypothesis_id only.
  - This prevents provenance IDs from being mixed into structural decision logic.
- source_id_map was added for provenance
  - It maps qseqid -> original source sequence id.
  - You can now audit where each BLAST input came from without overloading hypothesis logic.
- Sequence-output size and structural ambiguity are now separated
  - payload_entity_n tracks how many concrete sequence entities were emitted.
  - structural_hypothesis_n tracks decision branches.
  - Result: a sample can have multiple sequence records and still be structurally unambiguous.
- Multi-entity sequence outputs are advisory by default
  - Non-contig_alt multi-entity sequence outputs add multi_payload to warning_flags.
  - This is non-blocking unless other safety/review conditions escalate.
- Rank-aware missing taxonomy is explicit
  - If hits exist but no usable label can be extracted at the configured rank, resolution is rank_missing.
  - This is more informative than folding these cases into generic ambiguity.
- Review vs advisory is explicit
  - review_action + review_reason govern queue inclusion.
  - advisory_reason summarizes non-blocking signals for UI/triage.
Decision table
| Condition | resolution_state | resolution_reason |
|---|---|---|
| Single structural hypothesis and all required evidence present | unambiguous | single_hypothesis |
| Multiple structural hypotheses, all validated, same taxonomy label at configured rank, no blocking safety flags | resolved_by_evidence | hypotheses_agree_<rank> |
| Taxonomy disagreement across validated hypotheses | needs_review | ambiguous_taxonomy |
| Structural hypotheses > hypotheses with hits | needs_review | partial_hits |
| Trace QC FAIL (sticky) | needs_review | trace_fail |
| Safety escalation (e.g. high conflict) | needs_review | safety flag value |
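The table can be read as a small decision function. In the sketch below the order in which rows are checked (trace FAIL first, then safety flags, then hit coverage, then agreement) is an assumption, since the table does not state precedence; the function name and argument layout are also illustrative. The no_hits branch follows the no-hit policy described in the next section.

```python
def resolve(structural_hypothesis_n, labels_by_hypothesis, rank="species",
            trace_status="PASS", safety_flag=None):
    """Return (resolution_state, resolution_reason) for one sample.

    labels_by_hypothesis: dict hypothesis id -> label at the configured rank,
    containing only hypotheses that have taxonomy hits.
    """
    if trace_status == "FAIL":
        return "needs_review", "trace_fail"
    if safety_flag:
        return "needs_review", safety_flag
    hits_n = len(labels_by_hypothesis)
    if hits_n == 0:
        return "needs_review", "no_hits"
    if structural_hypothesis_n > hits_n:
        return "needs_review", "partial_hits"
    if structural_hypothesis_n == 1:
        return "unambiguous", "single_hypothesis"
    if len(set(labels_by_hypothesis.values())) == 1:
        return "resolved_by_evidence", f"hypotheses_agree_{rank}"
    return "needs_review", "ambiguous_taxonomy"


print(resolve(3, {"hyp1": "Escherichia coli",
                  "hyp2": "Escherichia coli",
                  "hyp3": "Escherichia coli"}))
```

Three hypotheses that agree at species rank collapse to resolved_by_evidence, while any row that cannot be decided automatically returns needs_review with a specific reason.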
No-hit and missing-sequence policy
MicroSeq drives review queue population from blast-input contract rows, then joins taxonomy evidence when present.
| Contract/evidence condition | review behavior |
|---|---|
| blast_payload=pair_missing | needs_review, reason pair_missing |
| blast_payload=no_payload | needs_review, reason no_payload |
| structural hypotheses exist but zero taxonomy hits | needs_review, reason no_hits |
| taxonomy file unavailable / not parseable | needs_review, reason taxonomy_missing |
Trace escalation rules (paired samples)
Sample-level trace status is computed from F/R statuses using:
FAIL > WARN > PASS > NA
- trace_fail => sticky safety escalation + needs_review
- trace_warn => non-blocking advisory in warning_flags / advisory_reason
- mixture inference (when enabled) can set review_reason=mixture_suspected
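The severity ordering amounts to taking the worse of the two read statuses. A minimal sketch, with a hypothetical function name:

```python
SEVERITY = {"NA": 0, "PASS": 1, "WARN": 2, "FAIL": 3}


def sample_trace_status(status_f, status_r):
    """Combine forward/reverse statuses using FAIL > WARN > PASS > NA."""
    return max(status_f, status_r, key=SEVERITY.__getitem__)


print(sample_trace_status("PASS", "WARN"))
print(sample_trace_status("FAIL", "NA"))
```

A single FAIL on either read is enough to make the sample-level status FAIL, and NA only survives when both reads are NA.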
UI behavior
- Default view: show unambiguous + resolved_by_evidence
- Expert view: include needs_review and expandable hypotheses
This keeps output Geneious-like for routine samples while preserving full provenance.
Assemble/validate status routing matrix
| Trigger condition | Status label emitted (merged, ambiguous_overlap, high_conflict, quality_low, cap3_unverified) | IUPAC involvement (yes/no, only after a single structural path is selected) | Routing/next action | Primary artifact field(s) to inspect (merge_status, status, hypothesis_map, review_reason, warning_flags) |
|---|---|---|---|---|
| One unique top feasible overlap candidate (passes overlap + anchoring gates) | merged | no (only after a single structural path is selected) | Emit merged sequence output; continue to validate/rank hits on that single path | merge_status; sample status |
| Top-1 and top-2 feasible candidates are tie-equivalent under ambiguity thresholds | ambiguous_overlap | no (only after a single structural path is selected) | Branch hypotheses (best_guess/topk) or hold as strict ambiguity for review routing | merge_status; hypothesis_map; review_reason |
| Overlap exists but disagreement burden exceeds configured high-confidence conflict guardrail | high_conflict | no (only after a single structural path is selected) | Route by configured conflict action (typically CAP3 fallback or explicit review escalation) | merge_status; sample status; warning_flags; review_reason |
| Candidate is structurally feasible but blocked by quality policy (quality_mode=blocking) | quality_low | no (only after a single structural path is selected) | Do not accept fast merge; route to CAP3 fallback or review path per policy | merge_status; sample status; warning_flags |
| CAP3 fallback output fails verification contract (for example, missing source-read representation) | cap3_unverified | no (only after a single structural path is selected) | Keep singlet/no-usable-sequence fallback and escalate to review if required | sample status; review_reason; warning_flags |
Why this method instead of manual IUPAC consensus curation
Older Sanger workflows often resolve ambiguous overlap or mixed-base positions by creating a single IUPAC consensus, then manually curating traces in a GUI tool. That approach is useful for expert review, but it has trade-offs for reproducible, batch-scale pipelines.
MicroSeq intentionally uses a hypothesis + evidence resolution method first, with manual review as a targeted fallback:
- Preserves uncertainty explicitly
  - Instead of collapsing early into one IUPAC sequence, MicroSeq keeps structural alternatives as hypotheses.
  - This avoids hiding meaningful disagreement before validation.
- Deterministic at scale
  - Resolution uses a fixed ranking tuple and contract fields, so repeated runs are explainable and reproducible.
  - Manual consensus editing can vary by operator and session.
- Separates routine from exceptional cases
  - When hypotheses agree taxonomically and safety checks pass, MicroSeq resolves automatically (resolved_by_evidence).
  - Only the small subset with conflicts/no-hit/trace-fail is escalated to needs_review.
- Better auditability for regulated/shared labs
  - Review queue, reasons, warnings, and structural-vs-hit counts are emitted as machine-readable artifacts.
  - Manual curation can still be done, but now it is focused and documented per sample.
- Still compatible with manual trace review
  - This design does not prohibit classic chromatogram curation.
  - It reduces manual burden by routing analysts only to samples where evidence actually disagrees or quality/safety thresholds fail.
In short: IUPAC/manual consensus remains a valuable expert tool, but MicroSeq’s default model is optimized for reproducibility, throughput, and explicit decision provenance.
Schema contracts
asm/blast_inputs.tsv
Required columns:
- sample_id, blast_payload, payload_ids, reason
- payload_kind, payload_n, payload_entity_n, payload_max_len
- ambiguity_flag, safety_flag, decision_source
- review_reason, warning_flags
- structural_hypothesis_n, hypotheses_with_hits_n, missing_hits_n
- hypothesis_map (qseqid=structural_hypothesis pairs)
- source_id_map (qseqid=original_source_id pairs)
Interpretation notes:
- payload_n / payload_entity_n are sequence-record counts, not structural branch counts.
- structural_hypothesis_n is the number of structural alternatives considered by resolution logic.
- warning_flags contains all non-blocking warnings (semicolon-separated, deduplicated, sorted).
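A downstream consumer might handle these fields as sketched below. The warning_flags normalization follows the stated contract (semicolon-separated, deduplicated, sorted); the separator between pairs in hypothesis_map and source_id_map is not specified here, so the semicolon used by `parse_pair_map` is an assumption, as are both function names and the example qseqid values.

```python
def parse_pair_map(field, pair_sep=";"):
    """Parse 'qseqid=value' pairs into a dict (pair_sep is an assumed delimiter)."""
    result = {}
    for item in field.split(pair_sep):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        result[key] = value
    return result


def normalize_warning_flags(flags):
    """Deduplicate, sort, and join flags with semicolons."""
    return ";".join(sorted({f.strip() for f in flags if f.strip()}))


print(parse_pair_map("s1_alt1=alt1;s1_alt2=alt2"))
print(normalize_warning_flags(["trace_warn", "multi_payload", "trace_warn"]))
```

Keeping hypothesis_map and source_id_map as two separate parsed dictionaries mirrors the contract: one answers "which structural branch is this sequence", the other "which source record did it come from".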
asm/assembly_summary.tsv
In addition to assembly/reporting columns, resolution fields are synchronized post-taxonomy:
- resolution_state, resolved_hypothesis, resolution_reason
- review_action, review_reason, advisory_reason, warning_flags
- structural_hypothesis_n, hypotheses_with_hits_n, missing_hits_n
- trace_status, trace_status_f, trace_status_r, trace_flags
qc/review_queue.tsv
Primary output contract:
- sample_id
- review_action (renamed from status to avoid cross-table ambiguity)
- review_reason, advisory_reason
- warning_flags
- structural_hypothesis_n, hypotheses_with_hits_n, missing_hits_n
- top_labels
- resolution_state, resolved_hypothesis, resolution_reason
- trace_status, trace_flags
Behavior contract:
- review_action=queue only for actionable review cases.
- review_reason is populated only when review_action=queue.
- advisory_reason summarizes one prioritized non-blocking signal.
Worked example (happy path)
If sample hypotheses (hyp1, hyp2, hyp3) all hit the same species with near-identical stats:
- Structural ambiguity is retained internally,
- taxonomy agreement collapses the sample to one resolved result,
- resolution_state=resolved_by_evidence,
- no manual review unless safety/trace flags escalate.
This is the intended “hidden complexity, simple default output” behavior.