cfdna ends

Count fragment end motifs and breakpoint motifs in a BAM file.

For each fragment end, the command extracts the --k-outside bases just outside the fragment and the --k-inside bases just inside the fragment. For the right fragment end, these are reverse-complemented together. Finally, they are combined into the reference 5'->3'-oriented "<outside>_<inside>" motif.

Visualization of counting

The following shows the counting for aligned fragment ends:

For --k-inside 2 --k-outside 2:

Reference  5' >>>>>>>>>>>>>>> 3'
              ATCGTTTTTTTCATC
Fragment      --|---------|--
Forward    5'   |>>>>>>>| 3'
Outside       AT
Inside          CG
Reverse    3'    |<<<<<<<<| 5'
Inside                   CA
Outside                    TC

Reverse (CATC) is reverse complemented to GATG

Counts (<outside>_<inside>): AT_CG: 1, GA_TG: 1
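
The counting above can be reproduced with a short Python sketch. It is illustrative only and not the tool's implementation: the function names are hypothetical, coordinates are assumed to be 0-based and half-open, the inside bases are taken from the reference (as with --source-inside reference), and the fragment is assumed not to touch a contig edge.

```python
# Illustrative sketch only; names and coordinate conventions are assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def end_motifs(reference, start, end, k_outside, k_inside):
    """Return the (left, right) "<outside>_<inside>" motifs of one fragment.

    `start` and `end` are 0-based, half-open coordinates on `reference`.
    """
    # Left end: outside bases precede the fragment, inside bases begin it.
    left_out = reference[start - k_outside:start]
    left_in = reference[start:start + k_inside]
    # Right end: take inside + outside in reference order,
    # then reverse-complement them together.
    right = reverse_complement(reference[end - k_inside:end + k_outside])
    right_out, right_in = right[:k_outside], right[k_outside:]
    return f"{left_out}_{left_in}", f"{right_out}_{right_in}"


reference = "ATCGTTTTTTTCATC"
print(end_motifs(reference, start=2, end=13, k_outside=2, k_inside=2))
# -> ('AT_CG', 'GA_TG')
```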

Output files

Writes either:

  • a dense .npy matrix with shape (# windows, # motifs) when --all-motifs is enabled

  • or a sparse .npz matrix otherwise

along with a text file with the matching motif labels.

Motif labels are saved as <outside>_<inside>.
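
To get a feel for how wide the dense matrix becomes with --all-motifs, the sketch below enumerates every possible label. It assumes an ACGT-only alphabet and a lexicographic ordering; both are assumptions for illustration, and the label file written by the tool remains the authoritative column order.

```python
from itertools import product


def all_motif_labels(k_outside, k_inside, alphabet="ACGT"):
    """Enumerate hypothetical "<outside>_<inside>" labels in lexicographic order."""
    labels = []
    for bases in product(alphabet, repeat=k_outside + k_inside):
        motif = "".join(bases)
        labels.append(f"{motif[:k_outside]}_{motif[k_outside:]}")
    return labels


labels = all_motif_labels(k_outside=2, k_inside=2)
print(len(labels))  # 4 ** (2 + 2) = 256 labels under these assumptions
```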

GC correction

Weight the contribution of each fragment based on its GC content, per fragment length.

Genomic smoothing (--scaling-factors)

Weight how genomic regions contribute to the count distribution(s), e.g., to reduce the influence of copy number alterations (if that is meaningful to your analysis). This weights the contribution of each fragment by region-wise precomputed scaling factors.

Can be precomputed with cfdna fragment-count-weights (recommended) or cfdna coverage-weights.

Window assignment

By default, a motif is counted in the window the fragment end falls in with the weight 1.0 (before correction/scaling).

With --clip-strategy include-at-shifted-boundary, that endpoint can move outside the aligned span by the soft-clipped length. GC correction and scaling weights still use the aligned reference span.

With --clip-strategy include-at-aligned-boundary, the inside motif includes soft-clipped read bases, but the endpoint assignment stays at the aligned boundary.

Alternatively, we can weight the motif by how much the fragment overlaps the window or we can count both end motifs of a fragment if the fragment midpoint or a given proportion of positions overlaps the window.

Blacklisting

  1. Skips fragments that overlap blacklisted regions with a given proportion.

  2. Skips motifs overlapping blacklisted regions.

Fragment-level blacklist filtering uses the same assignment coordinates as the selected clip strategy. With --clip-strategy include-at-shifted-boundary, soft-clipped boundary shifts can therefore make a fragment overlap blacklisted regions outside its aligned span.

With --clip-strategy include-at-aligned-boundary, motif-level blacklist validation only checks the part of the inside motif that still overlaps reference coordinates.

Always-on exclusion criteria

The following criteria always exclude a read:

  • The read is secondary, supplementary, or duplicate.

  • The read failed quality check.

Paired-end input only:

  • The read or mate read is unmapped.

  • The read is mapped to a different tid than the mate.

  • The paired reads are not inwardly directed (we require: start(forward) <= start(reverse)).


Usage

cfdna ends [OPTIONS] --bam <BAM> --output-dir <OUTPUT_DIR> --k-inside <K_INSIDE> --k-outside <K_OUTSIDE>
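
As a rough illustration, an invocation could be assembled from Python like this. Only flags documented on this page are used; the file and directory names are hypothetical placeholders. --ref-2bit is included because the outside bases are read from the reference.

```python
import shlex

# Hypothetical paths; replace them with your own files.
command = [
    "cfdna", "ends",
    "--bam", "sample.bam",
    "--output-dir", "results",
    "--k-inside", "2",
    "--k-outside", "2",
    "--ref-2bit", "hg38.2bit",
    "--output-prefix", "sample",
]

# Print a copy-pasteable shell command.
print(shlex.join(command))

# To actually run it (requires cfdna on PATH and the input files):
# import subprocess
# subprocess.run(command, check=True)
```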

Options

  • -h, --help

    Print help (see a summary with '-h')

Core

  • -i, --bam <BAM>

    Indexed, coordinate-sorted BAM input file [path]

    Can be either paired-end or unpaired (set --reads-are-fragments). Unpaired assumes the reads span their fragments exactly (so read size is fragment size).

  • -o, --output-dir <OUTPUT_DIR>

    Output directory for results [path]

  • -t, --n-threads <N_THREADS>

    Number of threads to use (increases RAM usage) [integer]

    Defaults to the number of available CPU cores (-1).

    [default: auto]

  • --reads-are-fragments

    The input has one read per fragment and the read spans the full aligned fragment (e.g. Nanopore) [flag]

    Each aligned read is treated as a fragment spanning its aligned reference interval [pos, reference_end). Some commands allow expanding this to include soft clipped bases.

    Cannot be combined with --require-proper-pair (when available).

  • -x, --output-prefix <OUTPUT_PREFIX>

    Optional prefix for output files (e.g., a sample name) [string]

    Leave empty to write filenames without a leading prefix.

    For example, set a prefix so that multiple runs can write to the same output directory.

    With a prefix, output files are named like <prefix>.end_motifs.npy or <prefix>.end_motifs.sparse.npz.

  • -r, --ref-2bit <REF_2BIT>

    2bit reference genome file [path]

    NOTE: Required when using reference bases, blacklist filtering, or specifying --gc-file.

    E.g., "hg38.2bit" from UCSC (https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.2bit).

  • --tile-size <TILE_SIZE>

    Size of tiles to parallelize over [integer]

    Chromosomes are processed in tiles of this size to reduce memory usage.

    [default: 20000000]

Motifs

  • --k-inside <K_INSIDE>

    Number of bases to use from inside the fragment [integer]

  • --k-outside <K_OUTSIDE>

    Number of bases to use from outside the fragment [integer]

  • --source-inside <SOURCE_INSIDE>

    Whether to get the inside-fragment bases from the read or the reference [string]

    Possible values:

    • "read": Use the read sequence for bases inside the fragment.

    • "reference": Use the reference genome for bases inside the fragment.

    [default: read]

  • --all-motifs

    Include every possible motif in the output, even if its count is zero [flag]

Clipping

  • --clip-strategy <CLIP_STRATEGY>

    How to extract a motif when its fragment end is clipped [string]

    Clipping means the read contains terminal bases that the aligner did not align normally. The choice here is thus what positions to count when that happens.

    For extraction of outside bases, we suggest skipping fragments with soft clipping, as it is difficult to infer where on the reference genome the actual fragment end was. We do provide two include-at-boundary modes for this, but neither is perfect.

    NOTE: Fragments with hard-clipping are always discarded.

    Possible values:

    • "skip": Skip motifs when their fragment end is soft-clipped.

    • "aligned": Use the aligned start and end positions (the usual cfDNAlab fragment definition). This ignores clipped bases in the read sequences.

    NOTE: If the aligner clipped the actual DNA molecule, these motifs may not reflect the actual fragment ends.

    • "include-at-aligned-boundary": Include soft-clipped read bases, but keep the aligned fragment-end genomic boundary for outside-base lookup, window assignment, and motif-level blacklist validation.

    This setting is only supported with --source-inside read.

    • "include-at-shifted-boundary": Include soft-clipped read bases, and move the fragment-end boundary outside the aligned span by the clipped length.

    This shifted boundary is used for outside-base lookup, window assignment, and blacklist filtering.

    File-based GC correction and scaling-factor weighting still use the aligned reference span. If the aligned length falls outside the GC package range, the fragment is considered invalid and is included in the GC correction failure statistics. When --assign-by count-overlap, clipped-only window contributions use the nearest aligned reference base for scaling.

    [default: skip]

  • --max-soft-clips <MAX_SOFT_CLIPS>

    Skip motifs whose relevant end has more soft-clipped bases than this [integer]

    This limit is applied independently to each fragment end.

    Fragment length filtering is applied after soft clip expansion.

    Use --clip-strategy skip to discard all soft-clipped motifs.

    [default: 256]
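
The difference between the aligned and shifted boundaries can be sketched as follows for a single aligned read (as with --reads-are-fragments). This is an illustration under assumptions, not the tool's code: the function name is hypothetical, `pos` is taken to be the 0-based leftmost aligned position, and the CIGAR is given as a string.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance the reference


def shifted_boundaries(pos, cigar):
    """Return (aligned, shifted) half-open reference intervals for one read.

    Returns None for hard-clipped reads, which are always discarded.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if any(op == "H" for _, op in ops):
        return None
    # Soft clips can only sit at the ends of the CIGAR once hard clips are excluded.
    left_clip = ops[0][0] if ops and ops[0][1] == "S" else 0
    right_clip = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    aligned = (pos, pos + ref_len)
    # Move each boundary outward by the number of soft-clipped bases.
    shifted = (pos - left_clip, pos + ref_len + right_clip)
    return aligned, shifted


print(shifted_boundaries(100, "3S47M2S"))
# -> ((100, 147), (97, 149))
```

In this sketch, "aligned" and "include-at-aligned-boundary" would use the first interval, while "include-at-shifted-boundary" would use the second.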

Filtering

  • --indel-filter <INDEL_FILTER>

    When to filter motifs due to indels.

    Deletions: Both 'D' and 'N' in the cigar string are considered deletions.

    Possible values:

    • "auto": Select the option based on the source.

    For read-sequence bases, allow indels in the alignment.

    For reference bases, skip motifs with indels in the alignment.

    • "skip-affected-end": Always skip motifs overlapping indels.

    • "skip-affected-fragment": Skip fragments when either end overlaps indels.

    [default: auto]

  • --min-fragment-length <MIN_FRAGMENT_LENGTH>

    Minimum fragment length to include [integer]

    [default: 30]

  • --max-fragment-length <MAX_FRAGMENT_LENGTH>

    Maximum fragment length to include [integer]

    [default: 1000]

  • --min-mapq <MIN_MAPQ>

    Minimum mapping quality to include [integer]

    [default: 30]

  • --bq-filter <BQ_FILTER>

    Base-quality filter on the inside read bases [string]

    Filter either the whole fragment or individual ends based on the base qualities in the inside read bases of the motifs.

    Repeat --bq-filter to count only ends that pass all end filters and belong to fragments that pass all fragment filters.

    Examples:

    • --bq-filter "min in end >= 30" (for "all bases have decent quality")

    • --bq-filter "mean in fragment >= 30" (for "average bases have decent quality")

    • --bq-filter "max in fragment < 20" (for "no bases have decent quality")

    Each expression must use:

    • <agg> in <scope> <op> <threshold>

    With the following values:

    • with <agg> in min, max, or mean

    • with <scope> in end or fragment

    • with <op> in >=, >, <=, or <

    The keywords are parsed case-insensitively and ASCII whitespace is ignored.

    Scope semantics:

    • end: Score each fragment end independently and drop only the failing end.

    • fragment: Score the fragment from its two end scores and drop the full fragment when it fails.

    NOTE: --bq-filter requires --k-inside > 0 and --source-inside read.
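
The expression grammar can be illustrated with a small parser. This is a sketch of one reading of the rules above, not the tool's parser: it removes all ASCII whitespace before matching, accepts integer or decimal thresholds, and only shows how an aggregate is compared against the threshold. How the two end scores are combined for the fragment scope is not covered here.

```python
import operator
import re
from statistics import mean

# Assumed grammar after whitespace removal: <agg>in<scope><op><threshold>
BQ_FILTER = re.compile(r"(min|max|mean)in(end|fragment)(>=|<=|>|<)(\d+(?:\.\d+)?)")
ASCII_WHITESPACE = re.compile(r"[ \t\n\r\f\v]+")
AGGREGATES = {"min": min, "max": max, "mean": mean}
OPERATORS = {">=": operator.ge, ">": operator.gt, "<=": operator.le, "<": operator.lt}


def parse_bq_filter(expression):
    """Parse '<agg> in <scope> <op> <threshold>' into its four parts."""
    compact = ASCII_WHITESPACE.sub("", expression).lower()
    match = BQ_FILTER.fullmatch(compact)
    if match is None:
        raise ValueError(f"invalid --bq-filter expression: {expression!r}")
    agg, scope, op, threshold = match.groups()
    return agg, scope, op, float(threshold)


def passes(qualities, expression):
    """Aggregate the base qualities and compare them with the threshold."""
    agg, _scope, op, threshold = parse_bq_filter(expression)
    return OPERATORS[op](AGGREGATES[agg](qualities), threshold)


print(parse_bq_filter("min in end >= 30"))           # ('min', 'end', '>=', 30.0)
print(passes([35, 32, 31], "min in end >= 30"))      # True
print(passes([35, 12], "Min IN End>=30"))            # False
```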

  • --require-proper-pair

    Only count properly paired reads [flag]

    This is NOT recommended by default as it trims the tails of the length distribution.

    Note that we only keep inward-directed fragments within the specified length range, so there's no real need for proper-pair filtering.

  • -b, --blacklist <BLACKLIST>...

    Optional BED file(s) with blacklisted regions [path]

  • --blacklist-min-size <BLACKLIST_MIN_SIZE>

    Minimum size of blacklist intervals to load (bp) [integer]

    [default: 1]

  • --blacklist-strategy <BLACKLIST_STRATEGY>

    The fragment positions that should overlap blacklisted regions for it to be excluded [string]

    NOTE: Motifs overlapping blacklisted regions are always skipped. This strategy is for further filtering of the full fragments. This is useful when you generally don't trust the reference sequences in blacklisted regions.

    Possible values: "any", "all", "midpoint", or "proportion=<threshold>"

    Example of proportion: --blacklist-strategy proportion=0.2 (no space around =)

    [default: any]

Windows (select max. one arg.)

  • --by-size <BY_SIZE>

    Window definition: a fixed window size [integer]

    When no windowing is specified, the default is one global window.

  • --by-bed <BY_BED>

    Window definition: a BED file of windows [path]

  • --by-grouped-bed <BY_GROUPED_BED>

    Window definition: a BED file of grouped windows [path]

    Requires a fourth BED column with the group name.

    Windows with the same group name are aggregated together in the final output. The exact per-group output shape depends on the command.

Window Assignment

  • --assign-by <ASSIGN_BY>

    When to assign motifs to windows [string]

    The default "endpoint" option assigns each motif separately by its own fragment-end position.

    The other modes ask which windows the fragment contributes to, and the fragment's motif(s) are then counted in those window(s).

    Possible values: "endpoint", "count-overlap", "any", "all", "midpoint", or "proportion=<threshold>"

    "endpoint": Count each motif in the windows overlapping its fragment-end position. The two fragment ends may be counted in separate windows.

    "count-overlap": Count up the fraction of fragment bases overlapping each window.

    "any", "all", or "proportion=<threshold>": Assign motifs when a proportion of fragment bases overlap a window.

    Example of proportion: --assign-by proportion=0.2 (no space around =)

    "midpoint": Assign motifs when the fragment midpoint overlaps a window.

    Midpoints for even-sized fragments use a deterministic coordinate-derived random seed to select either the left or right base. Duplicate fragments with the same coordinates get the same choice. This avoids fixed rounding bias while keeping repeated runs reproducible.

    NOTE: In the rare case where windows are smaller than fragments, it's still the proportion of the fragment positions that overlap that is considered. If the window size is 30% of the fragment size, that fragment cannot overlap more than 30%.

    NOTE: Ignored when no windows are specified.

    [default: endpoint]
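
The overlap-based modes can be sketched as below. The helper names are hypothetical, intervals are assumed to be 0-based and half-open, and treating the proportion threshold as inclusive (>=) is an assumption rather than documented behavior.

```python
def overlap_fraction(frag_start, frag_end, win_start, win_end):
    """Fraction of fragment bases that fall inside the window."""
    overlap = max(0, min(frag_end, win_end) - max(frag_start, win_start))
    return overlap / (frag_end - frag_start)


def is_assigned(fraction, mode):
    """Decide whether a fragment contributes to a window (hypothetical helper)."""
    if mode == "any":
        return fraction > 0
    if mode == "all":
        return fraction == 1.0
    if mode.startswith("proportion="):
        # Assumption: the threshold is inclusive.
        return fraction >= float(mode.split("=", 1)[1])
    raise ValueError(f"unsupported mode: {mode}")


# A 100 bp fragment [90, 190) straddling two 100 bp windows.
left = overlap_fraction(90, 190, 0, 100)     # 10 of 100 bases
right = overlap_fraction(90, 190, 100, 200)  # 90 of 100 bases
print(left, right)

# With count-overlap, these fractions would be the per-window weights.
print(is_assigned(left, "proportion=0.2"), is_assigned(right, "proportion=0.2"))
```

The denominator is the fragment length, not the window length, which is why a window smaller than the fragment can never reach a fraction of 1.0.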

Chromosome Selection (select max. one arg.)

  • --chromosomes <CHROMOSOMES>...

    Names of chromosomes to process (comma-separated or repeated). E.g. 'chr1,chr2,chr3'.

    When no chromosomes are specified, it defaults to chr1..chr22.

    Specify "all" as the only string to use all chromosomes from the command's configured contig source.

  • --chromosomes-file <CHROMOSOMES_FILE>

    File with chromosome names to process (one per line)

Normalization

  • --scaling-factors <SCALING_FACTORS>

    Optional path to non-negative scaling factors for normalizing/smoothing the genome [path]

    .tsv file as produced by cfdna fragment-count-weights or cfdna coverage-weights containing a scaling factor to multiply by per scaling-bin.

    Files may start with comment metadata lines from cfdna coverage-weights/fragment-count-weights, such as # gc_mode=corrected_tag.

    Each part of a fragment that overlaps a scaling bin is weighted by that bin's scaling factor.

    File Requirements

    The TSV file must have a header. Column names are matched case-insensitively.

    Required columns: chromosome, start, end, scaling_factor.

    Coordinates are 0-based, half-open [start, end).

    Scaling factors must be finite and non-negative.

    Bins are filtered to the provided chromosomes.

    For every chromosome in chromosomes, bins must:

    • start at the 0-coordinate

    • be perfectly contiguous (no gaps, no overlaps)

    • end exactly at that chromosome's length
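
The file requirements can be checked ahead of a run with a sketch like the following. The function name and the `chrom_lengths` mapping are hypothetical, and this is not the tool's own validation; it simply applies the rules listed above.

```python
import csv
import io
import math

REQUIRED = ("chromosome", "start", "end", "scaling_factor")


def read_scaling_bins(handle, chrom_lengths):
    """Parse a scaling-factor TSV and validate it against the listed rules.

    `chrom_lengths` maps each requested chromosome to its length.
    Returns {chromosome: [(start, end, factor), ...]} sorted by start.
    """
    # Skip leading "# key=value" metadata lines.
    lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(lines, delimiter="\t")
    if reader.fieldnames is None:
        raise ValueError("missing header line")
    # Match column names case-insensitively.
    columns = {name.lower(): name for name in reader.fieldnames}
    missing = [name for name in REQUIRED if name not in columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    bins = {chrom: [] for chrom in chrom_lengths}
    for row in reader:
        chrom = row[columns["chromosome"]]
        if chrom not in bins:
            continue  # bins are filtered to the requested chromosomes
        start = int(row[columns["start"]])
        end = int(row[columns["end"]])
        factor = float(row[columns["scaling_factor"]])
        if not math.isfinite(factor) or factor < 0:
            raise ValueError(f"invalid scaling factor on {chrom}:{start}-{end}")
        bins[chrom].append((start, end, factor))

    for chrom, length in chrom_lengths.items():
        bins[chrom].sort()
        expected = 0  # bins must start at coordinate 0
        for start, end, _ in bins[chrom]:
            if start != expected or end <= start:
                raise ValueError(f"gap or overlap on {chrom} at {start}")
            expected = end
        if expected != length:
            raise ValueError(f"bins on {chrom} end at {expected}, expected {length}")
    return bins


example = (
    "# gc_mode=corrected_tag\n"
    "Chromosome\tStart\tEnd\tScaling_Factor\n"
    "chr1\t0\t50\t1.0\n"
    "chr1\t50\t100\t0.5\n"
)
print(read_scaling_bins(io.StringIO(example), {"chr1": 100}))
```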

GC Correction (select max. one source)

  • --gc-file <GC_FILE>

    Optional path to GC correction file made from the same BAM file with cfdna gc-bias [path]

    The file is usually called gc_bias_correction.npz.

    NOTE: Requires specifying the reference genome 2bit file as well.

  • --gc-tag <GC_TAG>

    Optional aux tag to get GC weight from when using external GC correction packages [string]

    The tag name must be exactly two ASCII characters matching the SAM/BAM AUX tag format: first character is a letter, second character is a letter or digit.

    Packages like GCParagon and GCfix allow saving GC weights directly to the reads in a BAM file. They often assign a "GC" aux tag.

    The average per-read weight is used to count the fragment. When any of the reads has a zero weight, the fragment gets a zero weight. If only one mate has a usable tag, that single usable weight is reused for the fragment.

  • --neutralize-invalid-gc

    Keep fragments with unusable GC weights and weight them as 1.0 [flag]

    By default, fragments are skipped when the GC correction is missing, cannot be computed, or resolves to an unusable value. Set this flag to keep them instead and count them with neutral weight 1.0.
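
The --gc-tag weighting rules and the effect of --neutralize-invalid-gc can be summarized in a small sketch. The function names are hypothetical, and treating "usable" as "present, finite, and non-negative" is an assumption made for illustration.

```python
import math


def is_usable(weight):
    """Assumption: a usable weight is present, finite, and non-negative."""
    return weight is not None and math.isfinite(weight) and weight >= 0


def fragment_gc_weight(read_weights, neutralize_invalid=False):
    """Combine per-read GC weights into one fragment weight.

    Returns None when the fragment would be skipped.
    """
    usable = [w for w in read_weights if is_usable(w)]
    if not usable:
        # No usable weight: skip by default, or keep with neutral weight 1.0.
        return 1.0 if neutralize_invalid else None
    if any(w == 0 for w in usable):
        return 0.0  # a zero weight on any read zeroes the fragment
    # Average of the usable weights; a single usable weight is reused as-is.
    return sum(usable) / len(usable)


print(fragment_gc_weight([0.5, 1.5]))    # 1.0
print(fragment_gc_weight([0.0, 1.5]))    # 0.0
print(fragment_gc_weight([None, 1.5]))   # 1.5
print(fragment_gc_weight([None, None]))  # None
print(fragment_gc_weight([None, None], neutralize_invalid=True))  # 1.0
```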

Logging

  • --log <LOG>

    Logging destination [stdout|quiet|file|file=<path>]

    stdout keeps the normal run narrative on standard output.

    quiet suppresses the normal run narrative and progress bars, while warnings and errors still go to stderr.

    file writes the normal run narrative to an auto-generated log file under the command output directory.

    file=<path> writes the normal run narrative to the exact path you provide.

    [default: stdout]