Predict TF binding regions

This notebook shows how to predict TF binding regions with the predict_tf_binding_regions function from the ChromBERT-tools Python API.

For the bash command-line usage, see `examples/cli/predict_tf_binding_regions.ipynb <../cli/predict_tf_binding_regions.ipynb>`__.

For more details, please refer to the `predict_tf_binding_regions <https://chrombert-tools.readthedocs.io/en/latest/commands/predict_tf_binding_regions.html>`__ command documentation.

[ ]:
from chrombert_tools import predict_tf_binding_regions
[ ]:
# Returns
# results_pro_df: DataFrame of imputed peak probabilities, one column per matched cistrome.

results_pro_df = predict_tf_binding_regions(
    cistrome="BCL11A:GM12878;BRD4:MCF7;CTCF:HepG2;MYC:H1;MYC:h9;SPI1:GSM2702714", # cistromes to impute, as ';'-separated TF:cell_type pairs
    region="../data/CTCF_ENCFF664UGR_sample100.bed", # input regions (BED file)
    odir="./output_tf_binding_regions", # output directory
    genome="hg38", # genome build; the other option is mouse
    resolution="1kb", # only 1kb is available
)
Region summary - total: 100, overlapping with ChromBERT: 100 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
celltype: h1 has no corresponding wild type dnase data in ChromBERT.
Note: All cistromes names were converted to lowercase for matching.
Cistromes count summary - requested: 6, matched in ChromBERT: 5, not found: 1, not found cistromes: ['myc:h1']
ChromBERT cistromes metas: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_meta.tsv
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_prompt_cistrome.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 112/112 parameters
Imputing cistromes: 100%|██████████| 25/25 [00:05<00:00,  4.77it/s]

Finished imputing cistromes on specific regions.
Focus region summary - total: 100, overlapping with ChromBERT: 100, non-overlapping: 0
Overlapping regions BED file: ./output_tf_binding_regions/overlap_region.bed
Non-overlapping regions BED file: ./output_tf_binding_regions/no_overlap_region.bed
Results saved to: ./output_tf_binding_regions/results_prob_df.csv
Results track files saved to: ./output_tf_binding_regions/*.bw
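
The region summary in the log above states that an input region is matched to a ChromBERT bin when the overlap covers at least 50% of either the bin or the input region. The sketch below is one possible reading of that rule, not the library's actual implementation: the helper names ``overlap_fractions`` and ``keeps_overlap`` are hypothetical, and half-open BED-style coordinates are assumed.

```python
def overlap_fractions(in_start, in_end, bin_start, bin_end):
    """Return (fraction of the input region covered, fraction of the bin covered).

    Coordinates are assumed to be half-open, as in BED files.
    """
    overlap = max(0, min(in_end, bin_end) - max(in_start, bin_start))
    return overlap / (in_end - in_start), overlap / (bin_end - bin_start)


def keeps_overlap(in_start, in_end, bin_start, bin_end, min_frac=0.5):
    """Assumed rule: keep the pair if either side is covered by at least min_frac."""
    frac_input, frac_bin = overlap_fractions(in_start, in_end, bin_start, bin_end)
    return frac_input >= min_frac or frac_bin >= min_frac


# Coordinates of the first row of the results table in this notebook:
# input region chr1:37989946-37990368, ChromBERT bin chr1:37990000-37991000.
print(overlap_fractions(37989946, 37990368, 37990000, 37991000))
print(keeps_overlap(37989946, 37990368, 37990000, 37991000))
```

Under this reading, a short input region that straddles a bin boundary is assigned to whichever bin contains most of it, which matches the input and ChromBERT coordinates shown in the results table.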

[5]:

results_pro_df
[5]:
     input_chrom  input_start  input_end  chrombert_build_region_index  chrombert_start  chrombert_end  bcl11a:gm12878  brd4:mcf7  ctcf:hepg2    myc:h9  spi1:gsm2702714
  0         chr1     37989946   37990368                         32658         37990000       37991000        0.781250   0.660156    0.984375  0.972656         0.632812
  1        chr11      2400199    2400617                        289179          2400000        2401000        0.664062   0.570312    0.972656  0.882812         0.917969
  2        chr12      6778809    6779319                        391108          6779000        6780000        0.527344   0.412109    0.980469  0.871094         0.503906
  3        chr12     52980788   52981316                        424926         52981000       52982000        0.174805   0.601562    0.976562  0.812500         0.345703
  4        chr12     53676021   53676448                        425578         53676000       53677000        0.494141   0.699219    0.968750  0.945312         0.570312
...          ...          ...        ...                           ...              ...            ...             ...        ...         ...       ...              ...
 95         chr6     53171843   53172315                       1660979         53172000       53173000        0.408203   0.474609    0.988281  0.566406         0.617188
 96         chr6    131628105  131628616                       1713078        131628000      131629000        0.632812   0.667969    0.988281  0.894531         0.773438
 97         chr6    158704189  158704642                       1735665        158704000      158705000        0.558594   0.251953    0.972656  0.613281         0.554688
 98         chr9    128117589  128118035                       2049996        128117000      128118000        0.597656   0.468750    0.972656  0.812500         0.468750
 99         chr9    136122853  136123320                       2057396        136123000      136124000        0.167969   0.310547    0.968750  0.394531         0.400391

100 rows × 11 columns
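
Each cistrome column in the table above holds a probability per input region, so a common next step is to turn those probabilities into binary calls. The following is a minimal sketch using pandas on three rows copied from the table (index 0, 3 and 99). The 0.5 cutoff is an assumption chosen for illustration, not a threshold recommended by ChromBERT-tools, and cistrome columns are identified here simply by the ``:`` in their names.

```python
import pandas as pd

# Three rows copied from the results table (index 0, 3 and 99), two cistromes only.
results = pd.DataFrame(
    {
        "input_chrom": ["chr1", "chr12", "chr9"],
        "input_start": [37989946, 52980788, 136122853],
        "input_end": [37990368, 52981316, 136123320],
        "bcl11a:gm12878": [0.781250, 0.174805, 0.167969],
        "ctcf:hepg2": [0.984375, 0.976562, 0.968750],
    }
)

THRESHOLD = 0.5  # assumed cutoff, for illustration only

# Cistrome columns are named tf:cell_type, so they are the ones containing ":".
cistrome_cols = [col for col in results.columns if ":" in col]

# Boolean table: True where the imputed probability reaches the cutoff.
bound = results[cistrome_cols] >= THRESHOLD

# Number of regions called bound for each cistrome.
n_bound = bound.sum()
print(n_bound)

# Regions called bound for one cistrome, with their coordinates.
print(results.loc[bound["bcl11a:gm12878"], ["input_chrom", "input_start", "input_end"]])
```

The same selection should work on the full results_pro_df returned by the call above, since it only relies on the column naming visible in the table.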
