Skip to main content

Extract Fragment Midpoint Profiles

Multiple studies have profiled midpoint coverage around e.g. transcription factor binding sites (summed per transcription factor, per position) [REFS]. This can inform about the binding activity of different transcription factors related to cancer.

Base command

cfdna midpoints --help

cfdna midpoints \
--bam <sample>.bam \
--output-dir <sample_directory>/midpoints \
--output-prefix <sample_id> \
--n-threads 12 \
--intervals <fixed_size_intervals>.tsv \
--blacklist <path>/hg38-blacklist.v2.bed \
--blacklist <path>/<another_blacklist>.bed \
--length-bins {30..1000..10}

GC-bias correction example

cfdna midpoints \
--bam <sample>.bam \
--output-dir <sample_directory>/midpoints \
--output-prefix <sample_id> \
--n-threads 12 \
--intervals <fixed_size_intervals>.tsv \
--blacklist <path>/hg38-blacklist.v2.bed \
--blacklist <path>/<another_blacklist>.bed \
--length-bins {30..1000..10} \
--gc-file <sample_directory>/gc_bias/gc_bias_correction.npz \
--ref-2bit <path>/hg38.2bit

Genomic smoothing example

cfdna midpoints \
--bam <sample>.bam \
--output-dir <sample_directory>/midpoints \
--output-prefix <sample_id> \
--n-threads 12 \
--intervals <fixed_size_intervals>.tsv \
--blacklist <path>/hg38-blacklist.v2.bed \
--blacklist <path>/<another_blacklist>.bed \
--length-bins {30..1000..10} \
--scaling-factors <sample_directory>/coverage_weights/<sample_id>.scaling_factors.tsv

GC-bias correction + genomic smoothing

cfdna midpoints \
--bam <sample>.bam \
--output-dir <sample_directory>/midpoints \
--output-prefix <sample_id> \
--n-threads 12 \
--intervals <fixed_size_intervals>.tsv \
--blacklist <path>/hg38-blacklist.v2.bed \
--blacklist <path>/<another_blacklist>.bed \
--length-bins {30..1000..10} \
--gc-file <sample_directory>/gc_bias/gc_bias_correction.npz \
--ref-2bit <path>/hg38.2bit \
--scaling-factors <sample_directory>/coverage_weights/<sample_id>.scaling_factors.tsv

The intervals must have the same fixed size. The expected columns are: chromosome, start, end, group_name (where group_name is the group to collapse profiles by, e.g., the transcription factor ID). The intervals should be sorted by chromosome and start coordinates.