Skip to main content

cfdna lengths

Count fragment lengths in a BAM-file.

Writes a two-dimensional .npy count matrix. The first dimension contains count vectors: one global vector, one vector per window, or one vector per grouped-BED group depending on the selected windowing mode. The second dimension contains fragment length bins.

Fragment length definition

Paired-end: end(reverse) - start(forward).

Unpaired where each read is a fragment: end(read) - start(read).

See also --indel-mode and --clip-mode for adjusting the length to the present indels and soft clips. When enabled, fragment length filtering is based on the adjusted length.

GC correction

Weight the contribution of each fragment based on their GC contents.

The length-dimension of the original correction matrix is averaged out over --gc-length-range with a specifiable weighting scheme (see --gc-length-weighting).

The GC percentage is calculated from the aligned reference span. It does not consider --indel-mode or --clip-mode.

Genomic smoothing (--scaling-factors)

Weight how genomic regions contribute to the length distribution(s), e.g., to reduce the influence of copy number alterations. This weights the contribution of each fragment by region-wise precomputed scaling factors.

Can be precomputed with cfdna fragment-count-weights (recommended) or cfdna coverage-weights.

Window assignment

By default, fragments are counted by their window-overlap fraction. That is, most fragments are counted as 1.0 (before correction/scaling), while fragments overlapping the edge of a window are counted as the fraction it overlaps the window (< 1.0).

For consecutive non-overlapping windows, this conserves the total mass, as an edge-overlapping fragment will count f in one window and 1-f in the other window.

With the default width-1 length bins, you can convert counts to base-weighted counts (i.e., the coverage in the window) by multiplying each column by the fragment length it represents. Remember to account for the minimum fragment length offset.

Other options include counting the full fragment if the fragment midpoint or a given proportion of positions overlaps the window.

Blacklisting

Ignores fragments that overlap blacklisted regions with a given proportion.

Always-on exclusion criteria

The following criteria always exclude a read:

The read is secondary, supplementary or duplicate. The read failed quality check.

Paired-end input only: The read or mate read is unmapped. The read is mapped to a different tid than the mate. The paired reads are not inwardly directed (we require: start(forward) <= start(reverse)).


Usage

cfdna lengths [OPTIONS] --bam <BAM> --output-dir <OUTPUT_DIR>

Options

  • -h, --help

    Print help (see a summary with '-h')

Core

  • -i, --bam <BAM>

    Indexed, coordinate-sorted BAM input file [path]

    Can be either paired-end or unpaired (set --reads-are-fragments). Unpaired assumes the reads span their fragments exactly (so read size is fragment size).

  • -o, --output-dir <OUTPUT_DIR>

    Output directory for results [path]

  • -t, --n-threads <N_THREADS>

    Number of threads to use (increases RAM usage) [integer]

    Defaults to the number of available CPU cores (-1).

    [default: auto]

  • --reads-are-fragments

    The input has one read per fragment and the read spans the full aligned fragment (e.g. Nanopore) [flag]

    Each aligned read is treated as a fragment spanning its aligned reference interval [pos, reference_end). Some commands allow expanding this to include soft clipped bases.

    Cannot be combined with --require-proper-pair (when available).

  • -x, --output-prefix <OUTPUT_PREFIX>

    Optional prefix for output files (e.g., a sample name) [string]

    Leave empty to write filenames without a leading prefix.

    E.g., specify to enable writing to the same output directory from multiple calls to this software.

    Examples produce files like: <prefix>.length_counts.npy

  • --tile-size <TILE_SIZE>

    Size of tiles to parallelize over [integer]

    Chromosomes are processed in tiles of this size to reduce memory usage.

    [default: 20000000]

  • --length-bins <LENGTH_BINS>...

    Edges of fragment length bins to count in [string(s)]

    This also defines the minimum and maximum included fragment lengths.

    Bins are half-open. For example, --length-bins 10 151 221 creates bins [10,151) and [151,221).

    Accepted forms:

    • A single value with start:end:step: Creates contiguous bins from start to end in step increments. Example: The default 30:1001:1 creates one column per length from 30 through 1000.

    • Multiple integer values interpreted as bin edges: Example: --length-bins 30 80 151 221 creates bins [30,80), [80,151), and [151,221).

    NOTE: Memory consumption increases linearly with the number of bins.

    [default: 30:1001:1]

Indels and clipping

  • --indel-mode <INDEL_MODE>

    How to handle insertions and deletions in fragments [string]

    Deletions: Both 'D' and 'N' in the cigar string are considered deletions.

    Possible values:

    • "ignore": Ignore whether indels are present or not.

    Lengths are calculated from the reference coordinates end(reverse) - start(forward).

    • "adjust": Adjust the reference length by the observed insertions and deletions (we cannot adjust in the mate-gap).

    For bases only covered by a single read, all insertions and deletions are adjusted for.

    In the mate-overlap, only adjust when both reads show the indel at the same reference position.

    Deletions: subtract the reference bases deleted in both reads.

    Insertions: add the shortest insertion length per position.

    NOTE: Blacklist exclusion, GC correction, and calculation of scaling weights (--scaling-factors) use the aligned reference span.

    • "skip": Skip fragments with any insertion or deletion present.

    [default: ignore]

  • --max-deletion-bases <MAX_DELETION_BASES>

    Skip fragments with more deleted reference bases than this when using --indel-mode adjust [integer]

    Both D and N CIGAR operations count as deletion bases.

    NOTE: This cap is only used with --indel-mode adjust.

    [default: 100]

  • --clip-mode <CLIP_MODE>

    How to handle soft clipping in fragment ends [string]

    When you believe soft clipping in the fragment ends is mostly due to alignment difficulties instead of technical artefacts, you can include the clipped bases in the fragment length.

    Possible values:

    • "aligned": Ignore clipped bases and use the aligned positions.

    • "adjust": Adjust the fragment length by the observed soft clipped bases in the fragment ends.

    For paired-end data, the clipping is only considered for the 5' ends (start(forward), end(reverse)).

    NOTE: Blacklist exclusion, GC correction, and scaling weights (--scaling-factors) use the aligned reference span. When --assign-by count-overlap, clipped-only window contributions use the nearest aligned reference base for scaling.

    • "skip": Skip fragments with any clipping.

    [default: aligned]

  • --max-soft-clips <MAX_SOFT_CLIPS>

    Skip fragments where one or both ends have more soft-clipped bases than this [integer]

    Use --clip-mode skip to discard all soft-clipped fragments.

    [default: 256]

Windows (select max. one arg.)

  • --by-size <BY_SIZE>

    Window definition: a fixed window size [integer]

    When no windowing is specified, the default is one global window.

  • --by-bed <BY_BED>

    Window definition: a BED file of windows [path]

  • --by-grouped-bed <BY_GROUPED_BED>

    Window definition: a BED file of grouped windows [path]

    Requires a fourth BED column with the group name.

    Windows with the same group name are aggregated together in the final output. The exact per-group output shape depends on the command.

Window Assignment

  • --assign-by <ASSIGN_BY>

    The fragment positions that should overlap a window for it to be counted in that window, OR the option to count the fraction of overlapping bases [string]

    Possible values: "count-overlap", "any", "all", "midpoint", or "proportion=<threshold>"

    'count-overlap': Count up the fraction of overlapping fragment bases.

    Example of proportion: --assign-by proportion=0.2 (no space around =)

    Midpoints for even-sized fragments use a deterministic coordinate-derived random seed to select either the left or right base. Duplicate fragments with the same coordinates get the same choice. This avoids fixed rounding bias while keeping repeated runs reproducible.

    NOTE: In the rare case where windows are smaller than fragments, it's still the proportion of the fragment positions that overlap that is considered. If the window size is 30% of the fragment size, that fragment cannot overlap more than 30%.

    NOTE: Ignored when no windows are specified.

    [default: count-overlap]

Chromosome Selection (select max. one arg.)

  • --chromosomes <CHROMOSOMES>...

    Names of chromosomes to process (comma-separated or repeated). E.g. 'chr1,chr2,chr3'.

    When no chromosomes are specified, it defaults to chr1..chr22.

    Specify "all" as the only string to use all chromosomes from the command's configured contig source.

  • --chromosomes-file <CHROMOSOMES_FILE>

    File with chromosome names to process (one per line)

Normalization

  • --scaling-factors <SCALING_FACTORS>

    Optional path to non-negative scaling factors for normalizing/smoothing the genome [path]

    .tsv file as produced by cfdna fragment-count-weights or cfdna coverage-weights containing a scaling factor to multiply by per scaling-bin.

    Files may start with comment metadata lines from cfdna coverage-weights/fragment-count-weights, such as # gc_mode=corrected_tag.

    The scaling-bin-overlapping parts of the fragments are counted as the scaling factor of the bin.

    File Requirements

    The TSV file must have a header. Column names are matched case-insensitively.

    Required columns: chromosome, start, end, scaling_factor.

    Coordinates are 0-based, half-open [start, end).

    Scaling factors must be finite and non-negative.

    Bins are filtered to the provided chromosomes.

    For every chromosome in chromosomes, bins must:

    • start at the 0-coordinate

    • be perfectly contiguous (no gaps, no overlaps)

    • end exactly at that chromosome's length

Filtering

  • --min-mapq <MIN_MAPQ>

    Minimum mapping quality to include [integer]

    [default: 30]

  • --require-proper-pair

    Only count properly paired reads [flag]

    This is NOT recommended by default as it trims the tails of the length distribution.

    Note, that we only keep inward-directed fragments within the specified length range, so there's no real need for proper-pair filtering.

  • -b, --blacklist <BLACKLIST>...

    Optional BED file(s) with blacklisted regions [path]

  • --blacklist-min-size <BLACKLIST_MIN_SIZE>

    Minimum size of blacklist intervals to load (bp) [integer]

    [default: 1]

  • --blacklist-strategy <BLACKLIST_STRATEGY>

    The fragment positions that should overlap blacklisted regions for it to be excluded [string]

    Possible values: "any", "all", "midpoint", or "proportion=<threshold>"

    Example of proportion: --blacklist-strategy proportion=0.2 (no space around =)

    [default: any]

GC Correction

  • --gc-file <GC_FILE>

    Optional path to GC correction file made from the same BAM file with cfdna gc-bias [path]

    The file is usually called gc_bias_correction.npz.

    NOTE: Requires specifying the reference genome 2bit file as well.

  • --neutralize-invalid-gc

    Keep fragments with unusable GC weights and weight them as 1.0 [flag]

    By default, fragments are skipped when the GC correction cannot be computed or resolves to an unusable value. Set this flag to keep them instead and count them with neutral weight 1.0.

  • --gc-length-weighting <GC_LENGTH_WEIGHTING>

    How to weight the GC-package fragment length bins when estimating the length-agnostic GC correction [string]

    The default GC correction package stores a fragment length x GC matrix with one normalized GC correction curve per fragment length bin. NOTE: These fragment length bins are independent of those specified on --length-bins and are referred to as GC length bins.

    When estimating the fragment length distribution itself, using those normalized correction curves directly would just preserve the original length distribution (when applied to the same fragments seen by cfdna gc-bias).

    We therefore average out the fragment length dimension to get a single, length-agnostic GC correction curve.

    First, --gc-length-range selects which GC length bins to use. Then, --gc-length-trim-rare can exclude a fraction of the selected bins with the lowest frequencies in the length distribution used to build the GC correction package. Finally, --gc-length-weighting controls how the retained correction curves are collapsed.

    We have three weighting options:

    • "equal" weighting (default): Weight every GC length bin the same.

    Keeps the correction independent of the length distribution we are trying to estimate.

    Downside: Rare GC length bins contribute the same as the most common GC length bins.

    For low-coverage BAM files, this could make the correction more volatile to outliers. To reduce this effect, --gc-length-trim-rare allows filtering out a fraction of the rarest bins.

    • "frequency" weighting: Weight GC length bins by their frequencies in the length distribution used to build the GC package.

    This makes the collapsed curve represent the fragments most often seen in the package.

    Downside: Biases the correction based on the length distribution we are trying to estimate. This is circular: the length signal is partly used to correct itself.

    • "max-frequency" weighting: Use the GC correction curve for the GC length bin with highest frequency in the length distribution used to build the GC package.

    [default: equal]

  • --gc-length-range <GC_LENGTH_RANGE>

    Which GC-package fragment length bins to use when averaging out the GC correction matrix [string]

    The GC correction package stores one GC correction curve per fragment length bin (separate from --length-bins and referred to as GC length bins). This option controls which GC length bins --gc-length-weighting collapses to a single length-agnostic GC curve.

    Possible values:

    • "requested": Use GC length bins that overlap the range requested by --length-bins.

    • "package": Use all GC length bins in the GC correction package. Useful for repeated calls with different length ranges that you wish to compare downstream.

    [default: requested]

  • --gc-length-trim-rare <GC_LENGTH_TRIM_RARE>

    Exclude low-frequency selected GC length bins up to this cumulative fraction [float]

    After selecting the GC length bins with --gc-length-range, this excludes the least frequent GC length bins while keeping at least a 1 - fraction total frequency fraction.

    Conservative: GC length bins with practically the same frequency are grouped, and the whole group is excluded only if the retained sum of frequencies is at least 1 - fraction of the selected total.

    Use this with --gc-length-weighting equal when almost-unobserved GC length bins make the length-agnostic GC correction too noisy.

    [default: 0.0]

  • -r, --ref-2bit <REF_2BIT>

    2bit reference genome file [path]

    NOTE: Required when specifying --gc-file.

    E.g., "hg38.2bit" from UCSC (https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.2bit).

Logging

  • --log <LOG>

    Logging destination [stdout|quiet|file|file=<path>]

    stdout keeps the normal run narrative on standard output.

    quiet suppresses the normal run narrative and progress bars, while warnings and errors still go to stderr.

    file writes the normal run narrative to an auto-generated log file under the command output directory.

    file=<path> writes the normal run narrative to the exact path you provide.

    [default: stdout]