
cfdna bam-to-frag

Write the fragments from a BAM file to a finaleDB-style frag file.

Information in the .frag.tsv file:

  • Chromosome

  • Start: forward.pos

  • End: reverse.end

  • MapQ: Minimum mapping quality for the two reads

  • Strand: The strand alignment of read1

Additionally, when one or more of --gc-file, --coverage-scaling-factors, and --count-scaling-factors are specified:

  • GC Weight: The multiplicative weight needed to correct for GC bias.

  • Coverage-based scaling weight: The multiplicative weight needed to perform fragment coverage-based genomic smoothing.

  • Count-based scaling weight: The multiplicative weight needed to perform fragment count-based genomic smoothing.

The accompanying *.frag.header.tsv file has the matching column names: gc_weight, coverage_scaling_weight, and count_scaling_weight.

Fragments are sorted by (chromosome, start, end), using the chromosome order in --chromosomes.
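The sort order can be illustrated with a minimal Python sketch. It is not the tool's implementation: the tuple layout, the "+"/"-" strand encoding, and the function name sort_fragments are assumptions made for illustration. The point it shows is that chromosomes are ranked by their position in the chromosome list, not alphabetically, so chr2 sorts before chr10.

```python
# Documented default when no chromosomes are specified: chr1..chr22.
DEFAULT_CHROMOSOMES = [f"chr{i}" for i in range(1, 23)]


def sort_fragments(fragments, chromosomes=DEFAULT_CHROMOSOMES):
    """Sort (chromosome, start, end, ...) tuples by chromosome rank, start, end.

    The rank of a chromosome is its index in `chromosomes`, mirroring the
    documented "chromosome order in --chromosomes".
    """
    rank = {name: i for i, name in enumerate(chromosomes)}
    return sorted(fragments, key=lambda f: (rank[f[0]], f[1], f[2]))


# Hypothetical fragments: (chromosome, start, end, mapq, strand).
fragments = [
    ("chr10", 500, 667, 60, "+"),
    ("chr2", 900, 1066, 42, "-"),
    ("chr2", 100, 280, 60, "+"),
    ("chr2", 100, 265, 30, "-"),
]

sorted_fragments = sort_fragments(fragments)
for fragment in sorted_fragments:
    print("\t".join(str(field) for field in fragment))
```

A plain string sort would place chr10 before chr2, which is why the rank lookup is used here.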

Always-on exclusion criteria

The following criteria always exclude a read:

  • The read is secondary, supplementary or duplicate.

  • The read failed quality check.

Paired-end input only:

  • The read or mate read is unmapped.

  • The read is mapped to a different tid than the mate.

  • The paired reads are not inwardly directed (we require: start(forward) <= start(reverse)).
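Most of these criteria map onto standard SAM flag bits. The following Python sketch shows one way such a check could look; it is an illustration under stated assumptions, not the tool's code. The function name is_excluded and its parameters are hypothetical, and the inward-direction check is left out because it needs read positions and strands.

```python
# SAM flag bits, as defined in the SAM specification.
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
QC_FAIL = 0x200
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def is_excluded(flag, tid=None, mate_tid=None, paired_end=True):
    """Return True when a read meets one of the always-on exclusion criteria.

    Assumption: `tid` and `mate_tid` are the reference ids of the read and
    its mate. The inward-direction criterion is not checked here.
    """
    # Applies to all input: secondary, supplementary, duplicate, QC-failed.
    if flag & (SECONDARY | SUPPLEMENTARY | DUPLICATE | QC_FAIL):
        return True
    if paired_end:
        # Paired-end only: the read or its mate is unmapped.
        if flag & (UNMAPPED | MATE_UNMAPPED):
            return True
        # Paired-end only: the read and its mate map to different references.
        if tid != mate_tid:
            return True
    return False


print(is_excluded(99, tid=0, mate_tid=0))          # properly paired read
print(is_excluded(99 | DUPLICATE, tid=0, mate_tid=0))  # marked as duplicate
```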


Usage

cfdna bam-to-frag [OPTIONS] --bam <BAM> --output-dir <OUTPUT_DIR>

Options

  • -h, --help

    Print help (see a summary with '-h')

Core

  • -i, --bam <BAM>

    Indexed, coordinate-sorted BAM input file [path]

    Can be either paired-end or unpaired (set --reads-are-fragments). Unpaired assumes the reads span their fragments exactly (so read size is fragment size).

  • -o, --output-dir <OUTPUT_DIR>

    Output directory for results [path]

  • -t, --n-threads <N_THREADS>

    Number of threads to use (increases RAM usage) [integer]

    Defaults to the number of available CPU cores (-1).

    [default: auto]

  • --reads-are-fragments

    The input has one read per fragment and the read spans the full aligned fragment (e.g. Nanopore) [flag]

    Each aligned read is treated as a fragment spanning its aligned reference interval [pos, reference_end). Some commands allow expanding this to include soft-clipped bases.

    Cannot be combined with --require-proper-pair (when available).

  • -x, --output-prefix <OUTPUT_PREFIX>

    Optional prefix for output file (e.g., a sample name) [string]

    Leave empty to write filenames without a leading prefix.

    E.g., specify a prefix to allow multiple calls to this software to write to the same output directory.

    Examples produce files like: <prefix>.frag.tsv.gz

Windows

  • --by-bed <BY_BED>

    Intervals to keep overlapping fragments from [path]

Chromosome Selection (select max. one arg.)

  • --chromosomes <CHROMOSOMES>...

    Names of chromosomes to process (comma-separated or repeated). E.g. 'chr1,chr2,chr3'.

    When no chromosomes are specified, it defaults to chr1..chr22.

    Specify "all" as the only string to use all chromosomes from the command's configured contig source.

  • --chromosomes-file <CHROMOSOMES_FILE>

    File with chromosome names to process (one per line)

Normalization

  • --coverage-scaling-factors <COVERAGE_SCALING_FACTORS>

    Optional path to coverage-based scaling factors [path]

    .tsv file as produced by cfdna coverage-weights.

  • --count-scaling-factors <COUNT_SCALING_FACTORS>

    Optional path to fragment count-based scaling factors [path]

    .tsv file as produced by cfdna fragment-count-weights.

Filtering

  • --min-fragment-length <MIN_FRAGMENT_LENGTH>

    Minimum fragment length to include [integer]

    [default: 30]

  • --max-fragment-length <MAX_FRAGMENT_LENGTH>

    Maximum fragment length to include [integer]

    [default: 1000]

  • --min-mapq <MIN_MAPQ>

    Minimum mapping quality to include [integer]

    Defaults to 0 to allow making filtering decisions downstream.

    [default: 0]

  • --require-proper-pair

    Only count properly paired reads [flag]

    This is NOT recommended by default, as it trims the tails of the length distribution. It may be useful to match the files in FinaleDB.

    Note that we only keep inward-directed fragments within the specified length range, so there is no real need for proper-pair filtering.

  • -b, --blacklist <BLACKLIST>...

    Optional BED file(s) with blacklisted regions [path]

  • --blacklist-min-size <BLACKLIST_MIN_SIZE>

    Minimum size of blacklist intervals to load (bp) [integer]

    [default: 1]

  • --blacklist-strategy <BLACKLIST_STRATEGY>

    Which fragment positions must overlap blacklisted regions for the fragment to be excluded [string]

    Possible values: "any", "all", "midpoint", or "proportion=<threshold>"

    Example of proportion: --blacklist-strategy proportion=0.2 (no space around =)

    [default: any]
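To make the four strategy values concrete, here is a minimal Python sketch of how they could be evaluated for one fragment. The exact semantics are assumptions, not confirmed by this page: half-open intervals, a midpoint of start + length // 2, and a proportion test of "covered fraction >= threshold". The function names overlap_bp and is_blacklisted are hypothetical.

```python
def overlap_bp(start, end, regions):
    """Count bases of the half-open interval [start, end) covered by `regions`.

    Assumes `regions` is a list of non-overlapping (start, end) pairs.
    """
    return sum(
        max(0, min(end, r_end) - max(start, r_start)) for r_start, r_end in regions
    )


def is_blacklisted(start, end, regions, strategy="any"):
    """Decide whether a fragment is excluded under the given strategy."""
    length = end - start  # assumed to be positive
    covered = overlap_bp(start, end, regions)
    if strategy == "any":
        return covered > 0
    if strategy == "all":
        return covered == length
    if strategy == "midpoint":
        midpoint = start + length // 2
        return any(r_start <= midpoint < r_end for r_start, r_end in regions)
    if strategy.startswith("proportion="):
        threshold = float(strategy.split("=", 1)[1])
        return covered / length >= threshold
    raise ValueError(f"Unknown blacklist strategy: {strategy!r}")


# Hypothetical blacklist region and fragment: 30 of 200 bases overlap.
blacklist = [(100, 150)]
for strategy in ("any", "all", "midpoint", "proportion=0.2", "proportion=0.1"):
    print(strategy, is_blacklisted(120, 320, blacklist, strategy))
```

In this example the fragment overlaps the blacklist by 15% of its length, so "any" excludes it while "all", "midpoint", and "proportion=0.2" keep it.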

GC Correction

  • --gc-file <GC_FILE>

    Optional path to GC correction file made from the same BAM file with cfdna gc-bias [path]

    The file is usually called gc_bias_correction.npz.

    NOTE: Requires specifying the reference genome 2bit file as well.

  • --neutralize-invalid-gc

    Keep fragments with unusable GC weights and weight them as 1.0 [flag]

    By default, fragments are skipped when the GC correction cannot be computed or resolves to an unusable value. Set this flag to keep them instead and count them with neutral weight 1.0.

  • -r, --ref-2bit <REF_2BIT>

    Optional 2bit reference genome file [path]

    NOTE: Required for GC correction, otherwise ignored.

    E.g., "hg38.2bit" from UCSC ( https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.2bit ).
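The dependency between --gc-file and --ref-2bit can be checked before the command is launched. The Python sketch below assembles an argument list from flags documented on this page; the helper name build_bam_to_frag_args and all file paths are hypothetical, and the resulting list would still need to be passed to something like subprocess.run to actually execute.

```python
def build_bam_to_frag_args(
    bam, output_dir, gc_file=None, ref_2bit=None, neutralize_invalid_gc=False
):
    """Build an argument list for `cfdna bam-to-frag` (illustrative helper).

    Raises ValueError when GC correction is requested without a 2bit
    reference, since --gc-file requires --ref-2bit.
    """
    args = ["cfdna", "bam-to-frag", "--bam", bam, "--output-dir", output_dir]
    if gc_file is not None:
        if ref_2bit is None:
            raise ValueError("--gc-file requires --ref-2bit")
        args += ["--gc-file", gc_file, "--ref-2bit", ref_2bit]
    if neutralize_invalid_gc:
        args.append("--neutralize-invalid-gc")
    return args


# Hypothetical paths, for illustration only.
args = build_bam_to_frag_args(
    bam="sample.bam",
    output_dir="frag_out",
    gc_file="gc_bias_correction.npz",
    ref_2bit="hg38.2bit",
)
print(" ".join(args))
```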

Logging

  • --log <LOG>

    Logging destination [stdout|quiet|file|file=<path>]

    stdout keeps the normal run narrative on standard output.

    quiet suppresses the normal run narrative and progress bars, while warnings and errors still go to stderr.

    file writes the normal run narrative to an auto-generated log file under the command output directory.

    file=<path> writes the normal run narrative to the exact path you provide.

    [default: stdout]