Paired CAP3 assembly
Use this page to understand what paired mode means biologically and how MicroSeq decides what sequence, if any, should move forward to BLAST.
For command examples and wrapper usage, see CLI Workflows. For button-by-button GUI use, see GUI Walkthrough. For the canonical artifact table, see Output Artifacts Reference. For ambiguity routing and sample-level resolution states, see Workflow Resolution Funnel. For algorithm and design tradeoffs, see MicroSeq design: algorithm and architecture choices for Sanger assembly.
Scope and cross-links
This page is the semantic reference for paired forward/reverse assembly in MicroSeq. It focuses on the biological meaning of pairing, contigs, singlets, overlap failure, and BLAST handoff. It does not try to teach the wrapper flow or the GUI.
In microbiology terms, paired mode starts from one practical question: do these forward and reverse reads from the same sample support one defensible consensus sequence, or do they only support partial evidence?
Paired-mode semantic model
In paired mode, MicroSeq normalizes the input reads, groups files into forward/reverse pairs, runs paired assembly logic, and then chooses the sequence output that will represent that sample downstream.
At the sample level, the main outcomes are:
| Outcome | What it means biologically | What MicroSeq does downstream |
|---|---|---|
| contig | Forward and reverse reads support one merged consensus sequence. | Sends the contig to BLAST. |
| singlet | A valid pair existed, but assembly did not yield a defended merged contig; at least one standalone sequence remains usable. | Sends singlet sequence(s) to BLAST according to handoff policy. |
| no_payload | The sample reached assembly, but no usable sequence output survived for downstream search. | Records the sample in the manifest; sends no sequence to BLAST. |
| pair_missing | Only one orientation survived pairing detection or well enforcement, so no true paired assembly was possible. | Records the sample in the manifest; sends no sequence to BLAST. |
For microbiology workflows, the key practical idea is this: paired mode is not just “run CAP3.” It is a structural decision layer that asks whether two directional reads support a single sample-level sequence interpretation.
Pairing rules: tokens/primer labels, wells, duplicate policy
MicroSeq pairs files deterministically from filenames. It strips known forward/reverse tokens/primer labels from the name, resolves a sample ID, and optionally checks plate well codes. This favors reproducibility over guesswork.
Core pairing rules
| Concept | Rule | Why it matters |
|---|---|---|
| Sample ID | Forward and reverse files pair only if the remaining sample ID matches after token/primer-label stripping. | Prevents accidental cross-sample pairing. |
| Token/primer-label position | Primer labels can appear at the front, middle, or end of the filename. | Lets labs keep existing naming conventions. |
| Well codes | Wells matter only when well enforcement is enabled. | Useful for plate exports where cross-well swaps are risky. |
| Duplicate policy | Controls what happens when multiple forwards or reverses map to the same sample. | Makes duplicate handling explicit instead of silent. |
Filename examples
These pair successfully because the sample ID resolves to KD001 in both files:
- KD001_27F.ab1 ↔ KD001_1492R.ab1
- 27F_KD001_A01.ab1 ↔ 1492R_KD001_A01.ab1
- KD001_A01_27F.ab1 ↔ KD001_A01_1492R.ab1
These do not pair because the resolved sample IDs differ:
- KD001_27F.ab1 ↔ KD004_1492R.ab1
These pair only when well enforcement is off, because the well codes differ (A01 and B01):
- KD001_A01_27F.ab1 ↔ KD001_B01_1492R.ab1
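The pairing rules above can be illustrated with a minimal Python sketch. It is not MicroSeq's implementation: the token lists, the well-code pattern, and the function names `parse_name` and `can_pair` are assumptions made for this example, and it only handles underscore-delimited filenames.

```python
import re

# Hypothetical token lists; the real MicroSeq token set may differ.
FWD_TOKENS = ("27F",)
REV_TOKENS = ("1492R",)
# Assumed 96-well plate codes such as A01 or H12.
WELL_RE = re.compile(r"^[A-H](0[1-9]|1[0-2])$")


def parse_name(filename):
    """Return (sample_id, orientation, well) parsed from a filename."""
    stem = filename.rsplit(".", 1)[0]
    orientation, well, rest = None, None, []
    for part in stem.split("_"):
        if part in FWD_TOKENS:
            orientation = "F"
        elif part in REV_TOKENS:
            orientation = "R"
        elif WELL_RE.match(part):
            well = part
        else:
            rest.append(part)  # whatever is left forms the sample ID
    return "_".join(rest), orientation, well


def can_pair(fwd, rev, enforce_well=False):
    """True when both files resolve to the same sample (and well, if enforced)."""
    f_id, f_or, f_well = parse_name(fwd)
    r_id, r_or, r_well = parse_name(rev)
    if (f_or, r_or) != ("F", "R") or f_id != r_id:
        return False
    return f_well == r_well if enforce_well else True
```

With this sketch, `can_pair("KD001_A01_27F.ab1", "KD001_B01_1492R.ab1")` succeeds, while the same call with `enforce_well=True` does not, mirroring the last example above.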
What duplicate policy means
| Policy | Meaning |
|---|---|
| error | Stop and force the duplicate to be resolved explicitly. |
| keep-first | Keep the first matching file per orientation. |
| keep-last | Keep the last matching file per orientation. |
| merge | Merge duplicate same-orientation inputs before downstream paired assembly. |
| keep-separate | Preserve duplicate branches as separate paired hypotheses. |
For most routine isolate work, error is the safest default because it prevents
silent mixing of repeated reactions or mislabeled files.
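As a rough illustration of how the simpler policies behave, the sketch below applies error, keep-first, and keep-last to the same-orientation files of one sample. The function name `resolve_duplicates` is hypothetical, "first" and "last" are assumed to follow input order, and merge and keep-separate are deliberately left out because they need assembly-level handling.

```python
def resolve_duplicates(files, policy="error"):
    """Apply a duplicate policy to same-orientation files for one sample.

    Assumes `files` is already in the order MicroSeq would consider them;
    that ordering is an assumption of this sketch.
    """
    if len(files) <= 1:
        return list(files)  # nothing to resolve
    if policy == "error":
        raise ValueError(f"duplicate inputs need explicit resolution: {files}")
    if policy == "keep-first":
        return [files[0]]
    if policy == "keep-last":
        return [files[-1]]
    # merge and keep-separate are not covered by this sketch.
    raise NotImplementedError(f"policy not covered by this sketch: {policy}")
```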
Assembly -> BLAST handoff semantics
This is the most important downstream contract in paired mode:
asm/blast_inputs.fasta is what BLAST actually saw, and asm/blast_inputs.tsv
is the audit trail explaining why each sequence was selected.
Main handoff artifacts
| Artifact | Meaning |
|---|---|
| asm/blast_inputs.fasta | Final sequence records sent to BLAST. |
| asm/blast_inputs.tsv | Per-sample manifest linking sequence choice to assembly outcome. |
Contig -> singlet fallback order
For each paired sample, MicroSeq chooses the first available sequence output in this order:
- CAP3 contigs (*.cap.contigs)
- CAP3 singlets (*.cap.singlets)
- No usable sequence output
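The fallback order above amounts to a first-match selection. The sketch below is one way to express it, assuming the contig and singlet records have already been parsed into lists; `choose_payload` is a hypothetical name, and the returned payload and reason strings are the ones documented on this page.

```python
def choose_payload(contigs, singlets):
    """Pick the BLAST payload for one paired sample.

    `contigs` and `singlets` are lists of (record_id, sequence) tuples parsed
    from the sample's *.cap.contigs and *.cap.singlets files.
    Returns (blast_payload, reason, records).
    """
    if contigs:
        return "contig", "contigs_present", contigs
    if singlets:
        return "singlet", "singlets_only", singlets
    return "no_payload", "cap3_no_output", []
```

Contigs always win when present, so singlets left over next to a contig are not sent to BLAST under this reading of the order.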
FASTA header rewrite
BLAST query IDs are rewritten so the sample and payload type remain visible:
sampleA|contig|cap3_c1
sampleA|singlet|cap3_s1
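A header rewrite of this shape could be sketched as follows. The function name `rewrite_headers` and the assumption that the numeric suffix counts records from 1 in file order are illustrative; only the `sample|payload|cap3_cN` and `sample|payload|cap3_sN` patterns come from the examples above.

```python
def rewrite_headers(sample_id, payload, original_ids):
    """Map rewritten BLAST query IDs back to the original CAP3 or read IDs."""
    # Prefixes taken from the documented examples; numbering from 1 is assumed.
    prefix = {"contig": "cap3_c", "singlet": "cap3_s"}[payload]
    return {
        f"{sample_id}|{payload}|{prefix}{i}": original
        for i, original in enumerate(original_ids, start=1)
    }
```

Keeping the mapping as a dictionary makes it straightforward to record the original IDs alongside the rewritten ones in the manifest.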
blast_inputs.tsv columns
| Column | Meaning |
|---|---|
| sample_id | Paired sample key. In keep-separate mode this can include branch suffixes such as _1, _2, and so on. |
| blast_payload | One of contig, singlet, no_payload, or pair_missing. |
| payload_ids | Mapping from rewritten BLAST IDs back to original CAP3 or read IDs. |
| reason | Why that output was chosen. |
blast_payload and reason taxonomy
| blast_payload | reason | Interpretation |
|---|---|---|
| contig | contigs_present | A merged consensus sequence was available and used. |
| singlet | singlets_only | No contig survived, but at least one standalone sequence remained usable. |
| no_payload | cap3_no_output | Assembly produced no usable sequence output for BLAST. |
| pair_missing | pair_missing | No valid forward/reverse pair existed after pairing rules were applied. |
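For a quick run-level summary, the manifest can be tallied by payload type. The sketch below assumes asm/blast_inputs.tsv is tab-delimited with a header row using the column names listed above; `payload_counts` is a hypothetical helper, not part of MicroSeq.

```python
import csv
import io
from collections import Counter


def payload_counts(tsv_text):
    """Count samples per blast_payload value in manifest text.

    Assumes a tab-delimited manifest whose header row contains a
    blast_payload column.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return Counter(row["blast_payload"] for row in reader)
```

In practice the text could come from `Path("asm/blast_inputs.tsv").read_text()`. A run with many pair_missing rows points to a naming or pairing problem rather than an assembly problem.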
How pair_missing is assigned
MicroSeq uses qc/pairing_report.tsv together with the staged paired inputs.
If a sample has only one surviving orientation after token/primer-label detection
and optional well enforcement, it is marked as pair_missing. The sample remains visible in
asm/blast_inputs.tsv, but no sequence is sent to BLAST.
For microbiology interpretation, this matters because pair_missing is not a
taxonomic failure. It is a structural or input-state failure: the sample never had
a valid paired assembly opportunity.
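The assignment rule can be sketched as a check on which orientations survived for each sample. The input structure and the name `find_pair_missing` are assumptions for illustration; MicroSeq derives this state from qc/pairing_report.tsv and the staged inputs.

```python
def find_pair_missing(orientations_by_sample):
    """Return sample IDs that have exactly one surviving orientation.

    `orientations_by_sample` maps a sample ID to the orientations ("F", "R")
    left after token/primer-label detection and optional well enforcement.
    """
    return sorted(
        sample
        for sample, orientations in orientations_by_sample.items()
        if len(set(orientations)) == 1
    )
```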
CAP3 diagnostics and overlap-failure interpretation
When a sample does not produce a contig, the main question is whether the problem came from the reads themselves, from overlap geometry, or from an assembly gate that was too strict. The three most useful places to look are:
- asm/assembly_summary.tsv
- qc/overlap_audit.tsv
- asm/<sample>/*.cap.info
Common paired outcomes
| Status or outcome | Practical meaning | First place to look |
|---|---|---|
| assembled | A contig was produced and selected. | asm/assembly_summary.tsv |
| singlets_only | The sample had usable sequence evidence, but not a defended merged contig. | *.cap.singlets, qc/overlap_audit.tsv |
| pair_missing | One direction was absent after pairing logic. | qc/pairing_report.tsv |
| cap3_no_output | CAP3 did not produce usable contigs or singlets for handoff. | *.cap.info |
| ambiguous_overlap | More than one plausible overlap candidate was near-tied, so MicroSeq refused to force a single merge. | qc/overlap_audit.tsv, workflow_resolution.md |
| quality_low | Overlap existed, but quality evidence did not support a safe merge. | qc/overlap_audit.tsv, QUAL files |
CAP3 overlap-removal gates
CAP3 overlap acceptance is a conjunction of multiple filters, so it helps to know which gate actually failed before relaxing parameters.
- Clipping range failures (-y): CAP3 can miss a real overlap if clipping removed too much useful end sequence.
- Difference score (-b/-d): Too many quality-weighted mismatches can cause CAP3 to reject the overlap.
- Difference cap (-e): CAP3 rejects overlaps with more differences than the allowed mismatch tolerance.
- Similarity score (-s): Low quality-weighted similarity can reject an overlap even when some sequence similarity exists.
- Hard gates (-o, -p): Minimum overlap length and minimum percent identity still must be met.
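When testing a single gate by hand, it helps to pass only the option being changed so that every other gate keeps its CAP3 default. The sketch below builds such an argument list for the -o, -p, and -y flags named above; `build_cap3_command` and its parameter names are hypothetical, and how MicroSeq itself invokes CAP3 is not shown here.

```python
def build_cap3_command(fasta_path, overlap_len=None, identity=None, clip_range=None):
    """Build a CAP3 argument list, adding only the gates that were overridden."""
    cmd = ["cap3", str(fasta_path)]
    overrides = (("-o", overlap_len), ("-p", identity), ("-y", clip_range))
    for flag, value in overrides:
        if value is not None:  # leave untouched gates at their CAP3 defaults
            cmd += [flag, str(value)]
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` on a machine where the `cap3` binary is on the PATH. Any values passed in are illustrative and should be chosen from the audit evidence for the sample in question.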
Why QUAL propagation matters
QUAL is required for correct CAP3 scoring.
Without .qual files, CAP3 treats every base as Q=10, which distorts clipping,
mismatch penalties, and similarity scoring. For Sanger data, that can change
whether two reads are judged mergeable at all.
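A simple pre-flight check can catch missing quality files before assembly. The sketch below relies on the CAP3 convention that the quality file for a FASTA named `reads.fasta` is `reads.fasta.qual` in the same directory; the helper name `missing_qual_files` is an assumption for this example.

```python
from pathlib import Path


def missing_qual_files(fasta_paths):
    """Return the FASTA paths that lack a sibling <fasta name>.qual file."""
    return [
        Path(p)
        for p in fasta_paths
        if not Path(str(p) + ".qual").is_file()
    ]
```

An empty result means every input has quality values available to CAP3.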
Audit-driven relaxation strategy
Use the audit to target the gate that failed instead of relaxing everything at once:
- If overlap length is just below threshold, adjust overlap length first.
- If the overlap looks real but clipping removed too much end sequence, revisit clipping before lowering identity.
- If overlap length and identity look acceptable but CAP3 still rejects the merge, inspect quality-weighted gates before broad relaxation.
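The triage order above can be written down as a small decision function. Everything here is a sketch: the function name, its inputs, and the `margin` that defines "just below threshold" are assumptions for this example, not columns of qc/overlap_audit.tsv.

```python
def suggest_gate(overlap_len, identity, min_overlap, min_identity,
                 end_clipped=False, margin=5):
    """Suggest which gate to inspect first for a rejected overlap.

    `margin` is a hypothetical definition of "just below threshold".
    """
    if min_overlap - margin <= overlap_len < min_overlap:
        return "overlap_length"
    if end_clipped:
        return "clipping"
    if overlap_len >= min_overlap and identity >= min_identity:
        return "quality_weighted_gates"
    return "no_targeted_suggestion"
```

Cases that fall through to the last branch are the ones where no single gate stands out, so they are better reviewed by hand than relaxed automatically.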
Validation criteria for rescued contigs
If you rescue a borderline case by relaxing a gate, the strongest validation pattern is:
- forward and reverse singlet top hits still agree taxonomically,
- the rescued contig BLAST result agrees with both singlets,
- and the overlap region does not show implausible clusters of disagreement.
That keeps the rescue biologically interpretable instead of merely computationally permissive.
Short FAQ on pairing rules
How can I tell whether files were auto-paired correctly?
Use qc/pairing_report.tsv.
MicroSeq pairs files by resolved sample ID after stripping forward/reverse tokens/primer labels,
with wells participating only when well enforcement is enabled.
What does pair_missing mean in practice?
It means one orientation was missing after pairing logic. It does not mean BLAST failed. It means the sample never reached true paired assembly.
What is a singlet?
A singlet is a standalone sequence record that survived assembly processing when a defended merged contig was not produced. In practice, it means the sample still has usable sequence evidence, but the forward/reverse pair did not support one clean consensus contig.
When should I care about well enforcement?
Use it when filenames come from plate-based exports and cross-well mismatches are a real risk. Leave it off when sample IDs are already unique and wells are not biologically meaningful.
Related docs
- CLI Workflows for wrapper commands and headless runs
- GUI Walkthrough for button-by-button GUI usage
- Output Artifacts Reference for the canonical file table
- Workflow Resolution Funnel for ambiguity routing and sample-level resolution states
- MicroSeq design: algorithm and architecture choices for Sanger assembly for algorithm tradeoffs