cfdna midpoints
Count positional fragment midpoint coverage in groups of genomic windows.
Midpoints: The center of the fragment span, with ties (in even-sized fragments) randomly and reproducibly assigned to the left or right mid-position to avoid bias from always rounding in the same direction.
Groups: The coverage profiles are collapsed (summed per position) across windows in a group. E.g., transcription factors as groups with binding sites as windows, yielding the overall midpoint profile per transcription factor.
Smoothing: Final profiles are smoothed with an order-3 Savitzky-Golay filter unless --smoothing none is used.
Strandedness: When the --intervals carry strand information (+/-/.), reverse-stranded (-) intervals write into group profiles in reverse order.
Fragment span definition
Paired-end: [forward.pos, reverse.reference_end), the reference span from the first aligned position on the forward read to the last aligned position on the reverse read.
Unpaired where each read is a fragment: [read.pos, read.reference_end), the reference span from the first to the last aligned position on the read.
The utilized fragment length range is specified via --length-bins.
Always-on exclusion criteria
The following criteria always exclude a read:
The read is secondary, supplementary or duplicate. The read failed quality check.
Paired-end input only: The read or mate read is unmapped. The read is mapped to a different tid than the mate. The paired reads are not inwardly directed (we require: start(forward) <= start(reverse)).
Usage
cfdna midpoints [OPTIONS] --bam <BAM> --output-dir <OUTPUT_DIR> --intervals <INTERVALS>
Options
-
-h, --helpPrint help (see a summary with '-h')
Core
-
-i, --bam <BAM>Indexed, coordinate-sorted BAM input file
[path]Can be either paired-end or unpaired (set
--reads-are-fragments). Unpaired assumes the reads span their fragments exactly (so read size is fragment size). -
-o, --output-dir <OUTPUT_DIR>Output directory for results
[path] -
-t, --n-threads <N_THREADS>Number of threads to use (increases RAM usage)
[integer]Defaults to the number of available CPU cores (-1).
[default: auto]
-
--reads-are-fragmentsThe input has one read per fragment and the read spans the full aligned fragment (e.g. Nanopore)
[flag]Each aligned read is treated as a fragment spanning its aligned reference interval
[pos, reference_end). Some commands allow expanding this to include soft clipped bases.Cannot be combined with
--require-proper-pair(when available). -
-x, --output-prefix <OUTPUT_PREFIX>Optional prefix for output files (e.g., a sample name)
[string]Leave empty to write filenames without a leading prefix.
E.g., specify to enable writing to the same output directory from multiple calls of the command.
Examples produce files like:
<prefix>.midpoint_profiles.npy. -
-w, --intervals <INTERVALS>The grouped fixed-size intervals to count within
[path]A BED-like file of genomic intervals, their respective group names, and optionally the interval strandedness.
Must be sorted by the
chromosomeandstartcoordinates, and all intervals must have the same size.Sites with the same group name are collapsed to a single profile.
Strand tokens are
+,-, or.. With six or more columns, only column 6 is read as strand. With exactly five columns, column 5 may be strand. If column 5 looks stranded but column 6 exists and does not, the file is rejected as ambiguous.Forward and unknown strandedness use normal genomic order while reverse stranded intervals write into the grouped profiles in reverse order.
Required columns:
chromosome, start, end, group_name. No header. Optional fifth and sixth columns follow the strand rules above.Note: Besides chromosome-filtering and blacklist-filtering (see explanation in
--keep-blacklisted-intervals), no additional interval filtering is performed. It is up to the user to remove duplicate intervals, within-group overlapping intervals, and other potential clean-ups beforehand. -
--length-bins <LENGTH_BINS>...Edges of fragment length bins to count in
[string(s)]This also defines the min and max fragment lengths.
Accepted forms:
-
A single value with
start:end:step: Creates contiguous bins fromstarttoend(end-exclusive) instepincrements. Example:30:1000:10-> bins[30,40), [40,50), ..., [990,1000). -
Multiple integer values interpreted as bin edges: Example:
--length-bins 30 80 150 220 500 1001-> bins[30,80), [80,150), ..., [500,1001).
NOTE: Memory consumption increases linearly with the number of bins.
[default: 30 1001]
-
-
--tile-size <TILE_SIZE>Size of tiles to parallelize over
[integer]Chromosomes are processed in tiles of this size to reduce memory usage.
[default: 20000000]
Post-processing
-
--bin-size <BIN_SIZE>Average adjacent positions in bins after counting and smoothing
[integer]Defaults to 1 for full resolution.
[default: 1]
-
--smoothing <SMOOTHING>Smooth final midpoint profiles with an order-3 Savitzky-Golay filter
[string]By default smoothing is applied with a 165 bp window (nucleosome-scale). This was selected based on the defaults in Griffin.
Use
noneto write unsmoothed profiles.Use
savgol=<odd_bp>to set a different odd window size in base pairs. For example,--smoothing savgol=155uses a 155 bp smoothing window.When smoothing is active, intervals are counted with additional flank positions to avoid edge effects in the smoothed values. Flanked intervals cannot exceed chromosome boundaries. After smoothing final grouped profiles, the command trims the flanks and writes exactly the positions from
--intervals.[default: savgol=165]
Chromosome Selection (select max. one arg.)
-
--chromosomes <CHROMOSOMES>...Names of chromosomes to process (comma-separated or repeated). E.g.
'chr1,chr2,chr3'.When no chromosomes are specified, it defaults to
chr1..chr22.Specify
"all"as the only string to use all chromosomes from the command's configured contig source. -
--chromosomes-file <CHROMOSOMES_FILE>File with chromosome names to process (one per line)
Normalization
-
--scaling-factors <SCALING_FACTORS>Optional path to non-negative scaling factors for normalizing/smoothing the genome
[path].tsvfile as produced bycfdna fragment-count-weightsorcfdna coverage-weightscontaining a scaling factor to multiply by per scaling-bin.Files may start with comment metadata lines from
cfdna coverage-weights/fragment-count-weights, such as# gc_mode=corrected_tag.The scaling-bin-overlapping parts of the fragments are counted as the scaling factor of the bin.
File Requirements
The TSV file must have a header. Column names are matched case-insensitively.
Required columns:
chromosome,start,end,scaling_factor.Coordinates are 0-based, half-open
[start, end).Scaling factors must be finite and non-negative.
Bins are filtered to the provided
chromosomes.For every chromosome in
chromosomes, bins must:-
start at the 0-coordinate
-
be perfectly contiguous (no gaps, no overlaps)
-
end exactly at that chromosome's length
-
Filtering
-
--min-mapq <MIN_MAPQ>Minimum mapping quality to include
[integer][default: 30]
-
--require-proper-pairOnly count properly paired reads
[flag]This is NOT recommended by default as it trims the tails of the length distribution.
Note, that we only keep inward-directed fragments within the specified length range, so there's no real need for proper-pair filtering.
-
-b, --blacklist <BLACKLIST>...Optional BED file(s) with blacklisted regions
[path]NOTE: It may be an advantage to instead remove intervals that lie within half the maximum fragment length of blacklisted regions from the
--intervalsfile. -
--blacklist-min-size <BLACKLIST_MIN_SIZE>Minimum size of blacklist intervals to load (bp)
[integer][default: 1]
-
--blacklist-strategy <BLACKLIST_STRATEGY>The fragment positions that should overlap blacklisted regions for it to be excluded
[string]Possible values:
"any","all","midpoint", or"proportion=<threshold>"midpointchecks the single central base for odd fragments and either central base for even fragments.Example of proportion:
--blacklist-strategy proportion=0.2(no space around=)[default: any]
-
--keep-blacklisted-intervalsDon't filter out intervals that overlap nearby blacklisted regions
[flag]Edge bias: Fragments overlapping blacklisted bases are filtered before they can contribute midpoint counts. This can create artificial dips near profile edges if an interval is close enough to a blacklist for relevant fragments to be removed on one side but not the other.
To avoid that edge bias, intervals within
ceil(max_fragment_length / 2) + smoothing_flankfrom blacklisted regions are removed prior to counting.smoothing_flankis half the Savitzky-Golay window when smoothing is active and0when--smoothing none.Set this flag to keep those intervals in the site set. Fragment-level blacklist filtering still applies.
GC Correction (select max. one source)
-
--gc-file <GC_FILE>Optional path to GC correction file made from the same BAM file with
cfdna gc-bias[path]The file is usually called
gc_bias_correction.npz.NOTE: Requires specifying the reference genome 2bit file as well.
-
--gc-tag <GC_TAG>Optional aux tag to get GC weight from when using external GC correction packages
[string]The tag name must be exactly two ASCII characters matching the SAM/BAM AUX tag format: first character is a letter, second character is a letter or digit.
Packages like
GCParagonandGCfixallow saving GC weights directly to the reads in a BAM file. They often assign a "GC" aux tag.The average per-read weight is used to count the fragment. When any of the reads have a zero-weight, the fragment gets a zero-weight. If only one mate has a usable tag, that single usable weight is reused for the fragment.
-
--neutralize-invalid-gcKeep fragments with unusable GC weights and weight them as
1.0[flag]By default, fragments are skipped when the GC correction is missing, cannot be computed, or resolves to an unusable value. Set this flag to keep them instead and count them with neutral weight
1.0.
GC Correction
-
-r, --ref-2bit <REF_2BIT>Optional 2bit reference genome file [path]
NOTE: Required for GC correction, otherwise ignored.
E.g., "hg38.2bit" from UCSC ( https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.2bit ).
Plotting
-
--plot-groups <PLOT_GROUPS>Group indices to plot as midpoint profiles
[integers]Comma separated list of zero-based group indices to plot after counting.
This plotting step is intended for quick QC of the outputs. It's not optimized for publication etc. (although feel free!)
[default: 0]
Logging
-
--log <LOG>Logging destination
[stdout|quiet|file|file=<path>]stdoutkeeps the normal run narrative on standard output.quietsuppresses the normal run narrative and progress bars, while warnings and errors still go tostderr.filewrites the normal run narrative to an auto-generated log file under the command output directory.file=<path>writes the normal run narrative to the exact path you provide.[default: stdout]