Predict TF binding regions

The predict_tf_binding_regions command uses ChromBERT’s learned co-association patterns to predict TF binding (the kind of signal ChIP-seq measures) in user-specified regions, for factor–cell pairs that lack experimental data.

Note: The remaining examples show command-line usage only (bash).

For the Python API, see `examples/api/predict_tf_binding_regions.ipynb <../api/predict_tf_binding_regions>`__.

If you need to run inside an Apptainer container, see the `apptainer_use.ipynb <apptainer_use.ipynb>`__ tutorial for detailed instructions on using apptainer exec with chrombert-tools.

For more details, please refer to the `predict_tf_binding_regions <https://chrombert-tools.readthedocs.io/en/latest/commands/predict_tf_binding_regions.html>`__ command documentation.

[1]:
import pandas as pd
import numpy as np
import os
[ ]:
# Show the available options
!chrombert-tools predict_tf_binding_regions -h
Usage: chrombert-tools predict_tf_binding_regions [OPTIONS]

  Predict TF binding on specified regions across cell types.

Options:
  --region FILE                   Region BED file.  [required]
  --cistrome TEXT                 factor:cell e.g.
                                  BCL11A:GM12878;BRD4:MCF7;CTCF:HepG2. Use ';'
                                  to separate multiple cistromes.  [required]
  --odir DIRECTORY                Output directory.  [default: ./output]
  --oname TEXT                    Output name prefix.  [default:
                                  cistrome_impute]
  --genome [hg38|mm10]            Genome.  [default: hg38]
  --resolution [1kb]              Resolution. Only supports 1kb resolution in
                                  imputing cistromes task.  [default: 1kb]
  --batch-size INTEGER            Batch size.  [default: 4]
  --num-workers INTEGER           Dataloader workers.  [default: 8]
  --chrombert-cache-dir DIRECTORY
                                  ChromBERT cache directory (containing
                                  config/ and anno/ subfolders).  [default:
                                  ~/.cache/chrombert/data]
  -h, --help                      Show this message and exit.
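When the factor–cell pairs come from a script or a sample sheet, the --cistrome value can be assembled in Python before the command is called. The sketch below is only an illustration: build_cistrome_arg is a hypothetical helper, not part of chrombert-tools, and it relies only on the factor:cell and ';' conventions described in the help text above.

```python
def build_cistrome_arg(pairs):
    """Join (factor, cell) pairs into the 'factor:cell;factor:cell' form.

    Hypothetical helper for illustration; not part of chrombert-tools.
    """
    parts = []
    for factor, cell in pairs:
        factor, cell = factor.strip(), cell.strip()
        if not factor or not cell:
            raise ValueError("factor and cell must both be non-empty")
        # ':' and ';' are the separators, so they cannot appear inside a name.
        if any(sep in factor + cell for sep in ":;"):
            raise ValueError(f"separator character in {factor!r} or {cell!r}")
        parts.append(f"{factor}:{cell}")
    return ";".join(parts)


pairs = [("BCL11A", "GM12878"), ("BRD4", "MCF7"), ("CTCF", "HepG2")]
cistrome_arg = build_cistrome_arg(pairs)
print(cistrome_arg)  # BCL11A:GM12878;BRD4:MCF7;CTCF:HepG2
```

The resulting string can be passed, quoted, as the value of --cistrome.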
[ ]:
# --cistrome: factor:cell pairs to predict, separated by semicolons
# --region: BED file of the regions of interest
# --odir: output directory
# --genome: genome assembly (hg38 or mm10)
# --resolution: bin resolution (only 1kb is supported)

!chrombert-tools predict_tf_binding_regions \
    --cistrome "BCL11A:GM12878;BRD4:MCF7;CTCF:HepG2;MYC:H1;MYC:h9;SPI1:GSM2702714" \
    --region "../data/CTCF_ENCFF664UGR_sample100.bed" \
    --odir "./output_predict_tf_binding_regions" \
    --genome "hg38" \
    --resolution "1kb"

Region summary - total: 100, overlapping with ChromBERT: 100 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
celltype: h1 has no corresponding wild type dnase data in ChromBERT.
Note: All cistromes names were converted to lowercase for matching.
Cistromes count summary - requested: 6, matched in ChromBERT: 5, not found: 1, not found cistromes: ['myc:h1']
ChromBERT cistromes metas: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_meta.tsv
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_prompt_cistrome.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 112/112 parameters
Imputing cistromes: 100%|███████████████████████| 25/25 [00:05<00:00,  4.79it/s]

Finished imputing cistromes on specific regions.
Focus region summary - total: 100, overlapping with ChromBERT: 100, non-overlapping: 0
Overlapping regions BED file: ./output_predict_tf_binding_regions/overlap_region.bed
Non-overlapping regions BED file: ./output_predict_tf_binding_regions/no_overlap_region.bed
Results saved to: ./output_predict_tf_binding_regions/results_prob_df.csv
Results track files saved to: ./output_predict_tf_binding_regions/*.bw
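The region summary in the log above states that an input region is matched to a ChromBERT bin when the overlap covers at least 50% of either the bin or the input region. The sketch below illustrates that rule in plain Python. It is not the chrombert-tools implementation: overlapping_bins is a hypothetical function, and it assumes bins form a regular 1 kb grid starting at multiples of 1000, whereas the real tool matches against its own fixed list of ChromBERT regions.

```python
BIN_SIZE = 1000  # assumption: 1 kb bins aligned to multiples of 1000


def overlapping_bins(start, end, bin_size=BIN_SIZE, min_fraction=0.5):
    """Return the (bin_start, bin_end) pairs that pass the coverage rule.

    A bin is kept when the overlap covers at least `min_fraction` of the
    bin or of the input region. Hypothetical helper for illustration.
    """
    region_len = end - start
    if region_len <= 0:
        raise ValueError("end must be greater than start")
    kept = []
    first_bin = (start // bin_size) * bin_size
    for bin_start in range(first_bin, end, bin_size):
        bin_end = bin_start + bin_size
        overlap = min(end, bin_end) - max(start, bin_start)
        if overlap <= 0:
            continue
        covers_region = overlap / region_len >= min_fraction
        covers_bin = overlap / bin_size >= min_fraction
        if covers_region or covers_bin:
            kept.append((bin_start, bin_end))
    return kept


# First input region of this tutorial: chr1:37989946-37990368
print(overlapping_bins(37989946, 37990368))  # [(37990000, 37991000)]
```

For this region, 368 of its 422 bp fall in the bin 37990000-37991000, so that bin is kept, while the 54 bp overlap with the preceding bin fails both conditions.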
[6]:
# results_prob_df: predicted binding probability for each region (rows) and cistrome (columns)
results_prob_df = pd.read_csv("./output_predict_tf_binding_regions/results_prob_df.csv")
results_prob_df

[6]:
     input_chrom  input_start  input_end  chrombert_build_region_index  chrombert_start  chrombert_end  bcl11a:gm12878  brd4:mcf7  ctcf:hepg2    myc:h9  spi1:gsm2702714
  0         chr1     37989946   37990368                         32658         37990000       37991000        0.781250   0.660156    0.984375  0.972656         0.632812
  1        chr11      2400199    2400617                        289179          2400000        2401000        0.664062   0.570312    0.972656  0.882812         0.917969
  2        chr12      6778809    6779319                        391108          6779000        6780000        0.527344   0.412109    0.980469  0.871094         0.503906
  3        chr12     52980788   52981316                        424926         52981000       52982000        0.174805   0.601562    0.976562  0.812500         0.345703
  4        chr12     53676021   53676448                        425578         53676000       53677000        0.494141   0.699219    0.968750  0.945312         0.570312
...          ...          ...        ...                           ...              ...            ...             ...        ...         ...       ...              ...
 95         chr6     53171843   53172315                       1660979         53172000       53173000        0.408203   0.474609    0.988281  0.566406         0.617188
 96         chr6    131628105  131628616                       1713078        131628000      131629000        0.632812   0.667969    0.988281  0.894531         0.773438
 97         chr6    158704189  158704642                       1735665        158704000      158705000        0.558594   0.251953    0.972656  0.613281         0.554688
 98         chr9    128117589  128118035                       2049996        128117000      128118000        0.597656   0.468750    0.972656  0.812500         0.468750
 99         chr9    136122853  136123320                       2057396        136123000      136124000        0.167969   0.310547    0.968750  0.394531         0.400391

100 rows × 11 columns
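The table holds one probability column per matched cistrome, which is convenient to read but awkward to filter. As a minimal sketch of one possible follow-up step, the code below reshapes a few rows copied from the table above into long format and flags regions above a cutoff. The 0.5 threshold is an arbitrary illustration, not a value recommended by ChromBERT, and identifying cistrome columns by the ':' in their name is an assumption based on the column names shown here.

```python
import pandas as pd

# Three rows and three cistrome columns copied from the table above.
df = pd.DataFrame({
    "input_chrom": ["chr1", "chr11", "chr12"],
    "input_start": [37989946, 2400199, 6778809],
    "input_end": [37990368, 2400617, 6779319],
    "bcl11a:gm12878": [0.781250, 0.664062, 0.527344],
    "brd4:mcf7": [0.660156, 0.570312, 0.412109],
    "ctcf:hepg2": [0.984375, 0.972656, 0.980469],
})

coord_cols = ["input_chrom", "input_start", "input_end"]
# Assumption: cistrome columns are the ones named factor:cell.
cistrome_cols = [c for c in df.columns if ":" in c]

# One row per (region, cistrome) pair.
long_df = df.melt(
    id_vars=coord_cols,
    value_vars=cistrome_cols,
    var_name="cistrome",
    value_name="probability",
)

THRESHOLD = 0.5  # illustrative cutoff only
long_df["predicted_bound"] = long_df["probability"] >= THRESHOLD

# Number of regions above the cutoff for each cistrome.
n_bound = long_df.groupby("cistrome")["predicted_bound"].sum()
print(n_bound)
```

With the full results file, the same steps apply after replacing the hand-built DataFrame with the one loaded from results_prob_df.csv.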
