cfdna fragment-count-weights

Extract fragment count-based smoothing weights in large genomic bins ("megabins") with a rolling window and calculate normalizing scaling factors for smoothing the genome.

Use this when you want long and short fragments to contribute more equally to the large-scale smoothing profile. In contrast, regular coverage-based weights (see cfdna coverage-weights) count long fragments higher simply because they cover more bases.

Outputs scaling factors per stride to allow other methods to apply the normalization. Files are written as:

<prefix>.fragment_counts.scaling_factors.tsv

Multipliers: After normalization of the non-zero smoothed fragment counts to a global mean of 1.0, the values are inverted to multiplicative scaling factors.

Fragment counts

Internally, this command runs:

fcoverage --normalize-by-length=unit-mass --by-size <stride> --per-window total

and then smooths those stride-bin totals.

The resulting stride-bin values are fractional fragment counts in each stride bin. Each fragment contributes 1.0 in total. If it crosses a stride-bin boundary, that contribution is split between the bins as the fractional overlap of each bin.

Fragment span definition

Paired-end: [forward.pos, reverse.reference_end), the reference span from the first aligned position on the forward read to the last aligned position on the reverse read.

Unpaired where each read is a fragment: [read.pos, read.reference_end), the reference span from the first to the last aligned position on the read.

GC correction

When downstream tools should use both genomic smoothing and GC-bias correction, supply --gc-file or --gc-tag here too. The command then uses corrected fragment counts, which avoids over-correction downstream when the genomic smoothing factors partly reflect large-scale GC bias.

The written TSV records whether GC correction was used so downstream commands can check whether the two transformations are used together consistently or not.

Smoothing

Smoothing is performed as a triangular moving average, calculating a weighted average of fragment counts from all bins overlapping a stride.

Example

Assuming a bin-size of 6 and stride size of 2 (normally defaults to 5Mb and 0.5Mb respectively).

Stride bins (fixed along genome, each with a fragment count):

[A] [B] [C] [D] [E] [F] [G] ...

Overlapping megabins (MB*) (each covers 3 stride-bins). W_D, the number of overlapping megabins, is the (unnormalized) weight of each stride-bin in the weighted-average fragment count for stride-bin D:

MB1: [A][B][C]

MB2:    [B][C][D]

MB3:       [C][D][E]

MB4:          [D][E][F]

MB5:             [E][F][G]

W_D: [0][1][2][3][2][1][0]

At chromosome edges, the weights are truncated (e.g., W_D: [2][3][2][1][0]).

The stride bins are further weighted by their number of eligible bases (non-blacklisted positions). This also handles the often shorter final stride bin per chromosome.

The weights are normalized by their sum (after potential truncation at edges).

Fully blacklisted stride bins are skipped while smoothing neighboring bins. They may still get a finite smoothed value from neighboring support, but their scaling factor is written as 0.

Always-on exclusion criteria

The following criteria always exclude a read:

The read is secondary, supplementary or duplicate. The read failed quality check.

Paired-end input only: The read or mate read is unmapped. The read is mapped to a different tid than the mate. The paired reads are not inwardly directed (we require: start(forward) <= start(reverse)).

Usage

cfdna fragment-count-weights [OPTIONS] --bam <BAM> --output-dir <OUTPUT_DIR>

Options

-h, --help

Print help (see a summary with '-h')

Core

-i, --bam <BAM>

Indexed, coordinate-sorted BAM input file [path]

Can be either paired-end or unpaired (set --reads-are-fragments). Unpaired assumes the reads span their fragments exactly (so read size is fragment size).
-o, --output-dir <OUTPUT_DIR>

Output directory for results [path]
-t, --n-threads <N_THREADS>

Number of threads to use (increases RAM usage) [integer]

Defaults to the number of available CPU cores (-1).

[default: auto]
--reads-are-fragments

The input has one read per fragment and the read spans the full aligned fragment (e.g. Nanopore) [flag]

Each aligned read is treated as a fragment spanning its aligned reference interval [pos, reference_end). Some commands allow expanding this to include soft clipped bases.

Cannot be combined with --require-proper-pair (when available).
-x, --output-prefix <OUTPUT_PREFIX>

Optional prefix for output files (e.g., a sample name) [string]

Leave empty to write filenames without a leading prefix.

E.g., specify to enable writing to the same output directory from multiple calls to this software.
--tile-size <TILE_SIZE>

Size (bp) of tiles to parallelize over [integer]

Chromosomes are processed in tiles of this size to reduce memory usage.

[default: 10000000]
--stride <STRIDE>

Size (bp) of the stride bins [integer]

NOTE: --bin-size must be divisible by stride. I.e., --bin-size % stride == 0`.

A multiplicative scaling factor is calculated per stride bin based on all its overlapping large-scale bins (--bin-size).

Setting smaller values leads to a higher precision in the downstream normalization but also requires saving a larger TSV (one line per stride-bin) and takes longer to compute.

[default: 500000]
--bin-size <BIN_SIZE>

Size (bp) of the large genomic bins used to build the triangularly weighted average [integer]

Each stride bin is smoothed based on all large genomic bins that overlap it. Larger values lead to more smoothing across the genome as each stride bin is overlapped by more large bins from a broader region.

[default: 5000000]

Chromosome Selection (select max. one arg.)

--chromosomes <CHROMOSOMES>...

Names of chromosomes to process (comma-separated or repeated). E.g. 'chr1,chr2,chr3'.

When no chromosomes are specified, it defaults to chr1..chr22.

Specify "all" as the only string to use all chromosomes from the command's configured contig source.
--chromosomes-file <CHROMOSOMES_FILE>

File with chromosome names to process (one per line)

Filtering

--min-fragment-length <MIN_FRAGMENT_LENGTH>

Minimum fragment length to include [integer]

[default: 30]
--max-fragment-length <MAX_FRAGMENT_LENGTH>

Maximum fragment length to include [integer]

[default: 1000]
--min-mapq <MIN_MAPQ>

Minimum mapping quality to include [integer]

[default: 30]
--require-proper-pair

Only count properly paired reads [flag]

This is NOT recommended by default, as it trims the tails of the length distribution.

Note, that we only keep inward-directed fragments within the specified length range, so there's no real need for proper-pair filtering.
-b, --blacklist <BLACKLIST>...

Optional BED file(s) with blacklisted regions [path]

GC Correction (select max. one source)

--gc-file <GC_FILE>

Optional path to GC correction file made from the same BAM file with cfdna gc-bias [path]

The file is usually called gc_bias_correction.npz.

NOTE: Requires specifying the reference genome 2bit file as well.
--gc-tag <GC_TAG>

Optional aux tag to get GC weight from when using external GC correction packages [string]

The tag name must be exactly two ASCII characters matching the SAM/BAM AUX tag format: first character is a letter, second character is a letter or digit.

Packages like GCParagon and GCfix allow saving GC weights directly to the reads in a BAM file. They often assign a "GC" aux tag.

The average per-read weight is used to count the fragment. When any of the reads have a zero-weight, the fragment gets a zero-weight. If only one mate has a usable tag, that single usable weight is reused for the fragment.
--neutralize-invalid-gc

Keep fragments with unusable GC weights and weight them as 1.0 [flag]

By default, fragments are skipped when the GC correction is missing, cannot be computed, or resolves to an unusable value. Set this flag to keep them instead and count them with neutral weight 1.0.

GC Correction

-r, --ref-2bit <REF_2BIT>

Optional 2bit reference genome file [path]

NOTE: Required for --gc-file, otherwise ignored.

E.g., "hg38.2bit" from UCSC ( https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.2bit ).

Logging

--log <LOG>

Logging destination [stdout|quiet|file|file=<path>]

stdout keeps the normal run narrative on standard output.

quiet suppresses the normal run narrative and progress bars, while warnings and errors still go to stderr.

file writes the normal run narrative to an auto-generated log file under the command output directory.

file=<path> writes the normal run narrative to the exact path you provide.

[default: stdout]

Fragment counts​

Fragment span definition​

GC correction​

Smoothing​

Example​

Always-on exclusion criteria​

Usage​

Options​

Core​

Chromosome Selection (select max. one arg.)​

Filtering​

GC Correction (select max. one source)​

GC Correction​

Logging​