Common Files

The cfDNAlab commands use a common set of reference files specific to your chosen assembly (e.g. hg38). This section describes the most common files and shows where to find them for hg38. The commands should work with any assembly.

The following external files are used by the main commands (some are optional):

File | Format | Arguments | Where to get it
Sample alignment | Coordinate-sorted BAM + index (.bai) | --bam | From your alignment pipeline or preprocessing workflow
Reference genome | .2bit | --ref-2bit | Download the exact assembly that matches your sample alignment. E.g. from: https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.2bit
Blacklist regions | BED | --blacklist | Download one or more assembly-matched blacklist BEDs. E.g. from: https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz
Windows | BED-like file | --by-bed, --by-grouped-bed | From your own analysis design or external source.
Intervals | BED-like file | --intervals | From your own analysis design or external source.

In general, we recommend reusing the same blacklists across all steps in the analyses. Exceptions can be sample-specific blacklists that are only relevant for the last feature extraction steps.
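
The two public downloads in the table above can be scripted. The following is a minimal sketch and not part of cfDNAlab: the helper name download_refs and the target file names are hypothetical, and the blacklist URL is assumed to be the raw-download counterpart of the GitHub page listed in the table. The blacklist is decompressed so that it can be referenced as a plain .bed file.

```shell
# The .2bit URL is the one listed in the table. The blacklist URL is an
# ASSUMED raw-download form of the GitHub page listed in the table.
REF_2BIT_URL="https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.2bit"
BLACKLIST_URL="https://github.com/Boyle-Lab/Blacklist/raw/master/lists/hg38-blacklist.v2.bed.gz"

# Hypothetical helper: download both files into a refs directory.
download_refs() {
    local refs_dir="$1"
    mkdir -p "$refs_dir/blacklist" || return 1

    # -f: fail on HTTP errors, -L: follow redirects
    curl -fL -o "$refs_dir/hg38.2bit" "$REF_2BIT_URL" || return 1
    curl -fL -o "$refs_dir/blacklist/hg38-blacklist.bed.gz" "$BLACKLIST_URL" || return 1

    # Decompress to a plain BED file and drop the compressed copy
    gzip -dc "$refs_dir/blacklist/hg38-blacklist.bed.gz" \
        > "$refs_dir/blacklist/hg38-blacklist.bed" || return 1
    rm -f "$refs_dir/blacklist/hg38-blacklist.bed.gz"
}
```

Call it with the directory that should hold the shared reference files, for example download_refs "$HOME/cfdna_project/refs". Note that the .2bit file is large, so the first call can take a while.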

The following files are created with cfDNAlab and passed to the main feature extraction commands. The exception is the Reference GC file, which is passed to cfdna gc-bias to calculate the sample GC-bias:

File | Format | Argument | How to make it
Sample scaling factors | .scaling_factors.tsv | --scaling-factors | Create with cfdna fragment-count-weights and/or cfdna coverage-weights per sample
Sample GC correction file | gc_bias_correction.npz | --gc-file | Create with cfdna gc-bias per sample
Reference GC file | Package produced by cfdna ref-gc-bias | --ref-gc-file | Create once per assembly, then reuse it across samples when calling cfdna gc-bias

Quick explanation

The BAM file (--bam) contains the sample-specific sequencing data that you wish to extract features from. It has been aligned to a reference genome (--ref-2bit). Blacklists show which regions of that genome are hard to map to. Those regions add noise to our features, so we usually exclude them from the analysis (--blacklist).

Windows are genomic intervals that we are specifically interested in (--by-bed, --by-grouped-bed). Depending on the command and settings, we either get separate features per window/group or include only the specified window positions.

Intervals are fixed-size genomic intervals (--intervals) used by cfdna midpoints to create midpoint coverage profiles.
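
If you start from a list of sites of interest rather than ready-made intervals, fixed-size intervals can be derived with awk. This is a minimal sketch under stated assumptions: the input is a tab-separated file with the chromosome and a 0-based site position in the first two columns, and a plain three-column BED is assumed to be acceptable for --intervals (check the cfdna midpoints documentation for the exact columns it expects). The helper name make_fixed_intervals is hypothetical.

```shell
# Hypothetical helper: turn site positions into fixed-size BED intervals.
# Input:  tab-separated file with chromosome and 0-based site position.
# Output: chrom, start, end (0-based, half-open), each 2 * half_width long.
# Chromosome ends are NOT checked here, so intervals can run past them.
make_fixed_intervals() {
    local sites="$1" half_width="$2"
    awk -v hw="$half_width" 'BEGIN { FS = OFS = "\t" }
        # Skip sites too close to the chromosome start, so all sizes stay equal
        $2 - hw >= 0 { print $1, $2 - hw, $2 + hw }' "$sites"
}
```

For example, make_fixed_intervals sites.tsv 1000 > intervals.bed writes 2000 bp intervals centered on each site.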

GC-bias: Fragmentation patterns in cfDNA are vulnerable to GC-bias (see the GC-bias guide). cfdna gc-bias calculates this bias for a given sample BAM file, and the result can then be passed to the feature extraction commands to correct the bias in the features. Before we can calculate the sample bias, though, we need to know the GC-bias in the reference genome. This should be calculated once per assembly using cfdna ref-gc-bias.

Genomic smoothing: Use genomic smoothing when you care about local changes in fragment counts or coverage. This makes all non-blacklisted genomic regions contribute roughly the same total weight to the features. There are two related modes: coverage, where longer fragments count more because they cover more positions, and fragment counts, where each fragment has the same total weight regardless of length. We calculate local fragment counts or coverage in large genomic windows, then divide each fragment’s contribution to features by the count or coverage of the windows it overlaps. cfdna fragment-count-weights and cfdna coverage-weights calculate these scaling factors once per sample BAM file. They use a running-window, triangular weighting scheme, which gives a smooth effect similar to Gaussian smoothing. Pass the resulting scaling factors into feature extraction commands.

Store paths in bash variables

To avoid retyping file paths, you can assign them to shell variables. Some of the files need to be created first, and all paths below should be adjusted to your setup.

Tip: Use quotes around variable values and around variable expansions in commands. That protects you from path names with spaces.

# Project / assembly level
PROJECT_DIR="$HOME/cfdna_project"
REF_2BIT="$PROJECT_DIR/refs/hg38.2bit"
REF_GC_FILE="$PROJECT_DIR/refs/ref_gc/hg38.ref_gc_package.npz"
WINDOWS="$PROJECT_DIR/regions/windows.bed"
INTERVALS="$PROJECT_DIR/regions/intervals.bed"

BLACKLIST_PRIMARY="$PROJECT_DIR/refs/blacklist/hg38-blacklist.bed"
BLACKLIST_EXTRA="$PROJECT_DIR/refs/blacklist/custom-mask.bed"
BLACKLIST_ARGS=(
    --blacklist "$BLACKLIST_PRIMARY"
    --blacklist "$BLACKLIST_EXTRA"
)

# Sample level
SAMPLE_ID="sample_01"
BAM="$PROJECT_DIR/inputs/$SAMPLE_ID.bam"
GC_FILE="$PROJECT_DIR/outputs/$SAMPLE_ID/gc_bias/gc_bias_correction.npz"
SCALING_FACTORS="$PROJECT_DIR/outputs/$SAMPLE_ID/scaling_factors/$SAMPLE_ID.scaling_factors.tsv"

If you only use one blacklist file, you can skip the array and keep a single variable:

BLACKLIST="$PROJECT_DIR/refs/blacklist/hg38-blacklist.bed"
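
When the number of blacklist files varies between projects, the argument array can also be built in a loop. The following is a sketch in bash (arrays are not available in plain sh). The helper name build_blacklist_args is hypothetical, and it only relies on --blacklist being repeatable, as in the array shown earlier.

```shell
# Hypothetical helper: fill BLACKLIST_ARGS with one --blacklist flag per file.
# Fails early if any of the given files does not exist.
build_blacklist_args() {
    BLACKLIST_ARGS=()
    local bed
    for bed in "$@"; do
        if [ ! -f "$bed" ]; then
            echo "Blacklist file not found: $bed" >&2
            return 1
        fi
        # Quoting keeps paths with spaces as a single array element
        BLACKLIST_ARGS+=(--blacklist "$bed")
    done
}
```

Call it with any number of BED paths, for example build_blacklist_args "$BLACKLIST_PRIMARY" "$BLACKLIST_EXTRA", and then expand the result in a command as "${BLACKLIST_ARGS[@]}".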

Use the variables in commands

Note: The examples below only show the arguments that use the shell variables. They are not complete command lines.
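
Since the commands expect a BAM file together with its index, a missing index or a mistyped path is cheaper to catch before a long run starts. The following is a minimal pre-flight sketch. The helper name check_bam_inputs is hypothetical, and the two index names it looks for (sample.bam.bai and sample.bai) are common conventions, not something cfDNAlab is known to require.

```shell
# Hypothetical helper: confirm that a BAM file and its .bai index exist.
check_bam_inputs() {
    local bam="$1"
    if [ ! -f "$bam" ]; then
        echo "Missing BAM: $bam" >&2
        return 1
    fi
    # Accept either sample.bam.bai or sample.bai next to the BAM file
    if [ ! -f "$bam.bai" ] && [ ! -f "${bam%.bam}.bai" ]; then
        echo "Missing BAM index for: $bam" >&2
        return 1
    fi
}
```

A call such as check_bam_inputs "$BAM" || exit 1 at the top of a script stops the run before any cfdna command is started. The check does not verify that the BAM file is coordinate-sorted.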

Here is a sample-specific gc-bias call that reuses the variables above:

cfdna gc-bias \
    --bam "$BAM" \
    --output-dir "$PROJECT_DIR/outputs/$SAMPLE_ID/gc_bias" \
    --ref-2bit "$REF_2BIT" \
    --ref-gc-file "$REF_GC_FILE" \
    "${BLACKLIST_ARGS[@]}"

Here is a downstream feature extraction call using the same shared variables plus the derived sample files:

cfdna midpoints \
    --bam "$BAM" \
    --output-dir "$PROJECT_DIR/outputs/$SAMPLE_ID/midpoints" \
    --intervals "$INTERVALS" \
    --ref-2bit "$REF_2BIT" \
    --gc-file "$GC_FILE" \
    --scaling-factors "$SCALING_FACTORS" \
    "${BLACKLIST_ARGS[@]}"

Conceptual folder layout

You can use any folder structure you want. The layout below illustrates the different types of files.

Keeping shared reference files separate from sample-specific outputs makes the pipeline easier to understand:

project/
├── refs/
│   ├── hg38.2bit
│   ├── blacklist/
│   │   └── hg38-blacklist.bed
│   └── ref_gc/
├── inputs/
│   └── sample_01.bam
├── regions/
│   └── intervals.bed
└── outputs/
    └── sample_01/
        ├── gc_bias/
        │   └── sample_01.gc_bias_correction.npz
        ├── scaling_factors/
        │   ├── sample_01.coverage.scaling_factors.tsv
        │   └── sample_01.fragment_counts.scaling_factors.tsv
        └── lengths/
            └── sample_01.length_counts.npy
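
The directory skeleton above can be created in one step. This is a minimal sketch: the helper name make_project_layout is hypothetical, and it only creates the folders, not the files inside them.

```shell
# Hypothetical helper: create the directory skeleton for one sample.
# mkdir -p creates missing parents and leaves existing folders untouched.
make_project_layout() {
    local project_dir="$1" sample_id="$2"
    mkdir -p \
        "$project_dir/refs/blacklist" \
        "$project_dir/refs/ref_gc" \
        "$project_dir/inputs" \
        "$project_dir/regions" \
        "$project_dir/outputs/$sample_id/gc_bias" \
        "$project_dir/outputs/$sample_id/scaling_factors" \
        "$project_dir/outputs/$sample_id/lengths"
}
```

For example, make_project_layout "$HOME/cfdna_project" "sample_01" sets up the shared folders plus the output folders for sample_01. Calling it again with another sample ID only adds that sample's output folders.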

Next step

Once you have these files in place, continue with the Guides section to find the right workflow for your analysis.