Predict TF binding regions
The predict_tf_binding_regions command uses ChromBERT’s learned co-association patterns to predict TF binding (the signal a ChIP-seq experiment would measure) on user-specified regions, for factor–cell pairs where experimental data is unavailable.
Note: The remaining examples show command-line usage only (bash).
For the Python API, see `examples/api/predict_tf_binding_regions.ipynb <../api/predict_tf_binding_regions>`__.
If you need to run chrombert-tools in an Apptainer container, see the `apptainer_use.ipynb <apptainer_use.ipynb>`__ tutorial for detailed instructions on using apptainer exec with chrombert-tools.
For more details, please refer to the `predict_tf_binding_regions <https://chrombert-tools.readthedocs.io/en/latest/commands/predict_tf_binding_regions.html>`__ command documentation.
[1]:
import pandas as pd
import numpy as np
import os
[ ]:
# Show the available options
!chrombert-tools predict_tf_binding_regions -h
Usage: chrombert-tools predict_tf_binding_regions [OPTIONS]

  Predict TF binding on specified regions across cell types.

Options:
  --region FILE                   Region BED file.  [required]
  --cistrome TEXT                 factor:cell e.g.
                                  BCL11A:GM12878;BRD4:MCF7;CTCF:HepG2. Use ';'
                                  to separate multiple cistromes.  [required]
  --odir DIRECTORY                Output directory.  [default: ./output]
  --oname TEXT                    Output name prefix.  [default:
                                  cistrome_impute]
  --genome [hg38|mm10]            Genome.  [default: hg38]
  --resolution [1kb]              Resolution. Only supports 1kb resolution in
                                  imputing cistromes task.  [default: 1kb]
  --batch-size INTEGER            Batch size.  [default: 4]
  --num-workers INTEGER           Dataloader workers.  [default: 8]
  --chrombert-cache-dir DIRECTORY
                                  ChromBERT cache directory (containing
                                  config/ and anno/ subfolders).  [default:
                                  ~/.cache/chrombert/data]
  -h, --help                      Show this message and exit.
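The --cistrome value is a single string of factor:cell pairs joined by ';'. When the pairs come from a list or a sample sheet, it can be convenient to assemble that string in Python before calling the command. The sketch below is only an illustration: build_cistrome_arg is a hypothetical helper, not part of chrombert-tools, and the separator check is an assumption based on the format described in the help text.

```python
def build_cistrome_arg(pairs):
    """Join (factor, cell) pairs into the 'factor:cell;factor:cell' form.

    Hypothetical helper for illustration; not part of chrombert-tools.
    """
    parts = []
    for factor, cell in pairs:
        factor, cell = factor.strip(), cell.strip()
        if not factor or not cell:
            raise ValueError(f"empty factor or cell in pair: {(factor, cell)!r}")
        # ':' and ';' are used as separators, so they cannot appear inside a name.
        if any(sep in factor + cell for sep in ":;"):
            raise ValueError(f"separator character in pair: {(factor, cell)!r}")
        parts.append(f"{factor}:{cell}")
    return ";".join(parts)


pairs = [("BCL11A", "GM12878"), ("BRD4", "MCF7"), ("CTCF", "HepG2")]
cistrome_arg = build_cistrome_arg(pairs)
print(cistrome_arg)  # BCL11A:GM12878;BRD4:MCF7;CTCF:HepG2
```

The resulting string can be passed, quoted, as the value of --cistrome.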
[ ]:
# --cistrome: cistromes to predict, as factor:cell_type pairs separated by semicolons
# --region: BED file of regions to predict on
# --odir: output directory
# --genome: genome assembly (hg38 or mm10)
# --resolution: bin resolution (only 1kb is supported)
!chrombert-tools predict_tf_binding_regions \
--cistrome "BCL11A:GM12878;BRD4:MCF7;CTCF:HepG2;MYC:H1;MYC:h9;SPI1:GSM2702714" \
--region "../data/CTCF_ENCFF664UGR_sample100.bed" \
--odir "./output_predict_tf_binding_regions" \
--genome "hg38" \
--resolution "1kb"
Region summary - total: 100, overlapping with ChromBERT: 100 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
celltype: h1 has no corresponding wild type dnase data in ChromBERT.
Note: All cistromes names were converted to lowercase for matching.
Cistromes count summary - requested: 6, matched in ChromBERT: 5, not found: 1, not found cistromes: ['myc:h1']
ChromBERT cistromes metas: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_meta.tsv
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_prompt_cistrome.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 112/112 parameters
Imputing cistromes: 100%|███████████████████████| 25/25 [00:05<00:00, 4.79it/s]
Finished imputing cistromes on specific regions.
Focus region summary - total: 100, overlapping with ChromBERT: 100, non-overlapping: 0
Overlapping regions BED file: ./output_predict_tf_binding_regions/overlap_region.bed
Non-overlapping regions BED file: ./output_predict_tf_binding_regions/no_overlap_region.bed
Results saved to: ./output_predict_tf_binding_regions/results_prob_df.csv
Results track files saved to: ./output_predict_tf_binding_regions/*.bw
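The region summary in the log says an overlap is kept when it covers at least 50% of either the ChromBERT bin or the input region. The following sketch illustrates that rule for a single region. It is not the tool's implementation: it assumes half-open coordinates and a regular grid of 1 kb bins starting at multiples of 1000, whereas ChromBERT uses its own predefined set of regions, so a real input region may match no bin at all. The function name overlapping_bins is hypothetical.

```python
BIN_SIZE = 1000  # 1kb resolution, the only one supported by this command


def overlapping_bins(start, end, bin_size=BIN_SIZE, min_frac=0.5):
    """Return (bin_start, bin_end, overlap_bp) for bins that pass the rule.

    Assumes a regular grid of bins; illustration only.
    """
    if end <= start:
        raise ValueError("region end must be greater than start")
    region_len = end - start
    kept = []
    first_bin = (start // bin_size) * bin_size
    for bin_start in range(first_bin, end, bin_size):
        bin_end = bin_start + bin_size
        overlap = min(end, bin_end) - max(start, bin_start)
        if overlap <= 0:
            continue
        # Keep the bin if the overlap covers >= 50% of the region or of the bin.
        if overlap / region_len >= min_frac or overlap / bin_size >= min_frac:
            kept.append((bin_start, bin_end, overlap))
    return kept


# chr1:37989946-37990368, a 422 bp region that straddles two bins.
print(overlapping_bins(37989946, 37990368))  # [(37990000, 37991000, 368)]
```

Here 54 bp fall in the bin starting at 37989000 and 368 bp in the bin starting at 37990000. Only the second overlap covers at least half of the 422 bp region, so only that bin is kept.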
[6]:
# results_pro_df: imputed binding probabilities, one row per input region and one column per matched cistrome.
results_pro_df = pd.read_csv("./output_predict_tf_binding_regions/results_prob_df.csv")
results_pro_df
[6]:
| | input_chrom | input_start | input_end | chrombert_build_region_index | chrombert_start | chrombert_end | bcl11a:gm12878 | brd4:mcf7 | ctcf:hepg2 | myc:h9 | spi1:gsm2702714 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | 37989946 | 37990368 | 32658 | 37990000 | 37991000 | 0.781250 | 0.660156 | 0.984375 | 0.972656 | 0.632812 |
| 1 | chr11 | 2400199 | 2400617 | 289179 | 2400000 | 2401000 | 0.664062 | 0.570312 | 0.972656 | 0.882812 | 0.917969 |
| 2 | chr12 | 6778809 | 6779319 | 391108 | 6779000 | 6780000 | 0.527344 | 0.412109 | 0.980469 | 0.871094 | 0.503906 |
| 3 | chr12 | 52980788 | 52981316 | 424926 | 52981000 | 52982000 | 0.174805 | 0.601562 | 0.976562 | 0.812500 | 0.345703 |
| 4 | chr12 | 53676021 | 53676448 | 425578 | 53676000 | 53677000 | 0.494141 | 0.699219 | 0.968750 | 0.945312 | 0.570312 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | chr6 | 53171843 | 53172315 | 1660979 | 53172000 | 53173000 | 0.408203 | 0.474609 | 0.988281 | 0.566406 | 0.617188 |
| 96 | chr6 | 131628105 | 131628616 | 1713078 | 131628000 | 131629000 | 0.632812 | 0.667969 | 0.988281 | 0.894531 | 0.773438 |
| 97 | chr6 | 158704189 | 158704642 | 1735665 | 158704000 | 158705000 | 0.558594 | 0.251953 | 0.972656 | 0.613281 | 0.554688 |
| 98 | chr9 | 128117589 | 128118035 | 2049996 | 128117000 | 128118000 | 0.597656 | 0.468750 | 0.972656 | 0.812500 | 0.468750 |
| 99 | chr9 | 136122853 | 136123320 | 2057396 | 136123000 | 136124000 | 0.167969 | 0.310547 | 0.968750 | 0.394531 | 0.400391 |
100 rows × 11 columns
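A common next step is to check which requested cistromes actually received a column (names are lowercased, and unmatched ones such as myc:h1 are dropped) and to turn the probabilities into calls. The sketch below is one possible way to do this with pandas. To stay self-contained it rebuilds three rows of the table above by hand; in practice the DataFrame would come from pd.read_csv on results_prob_df.csv. The 0.5 cutoff is an arbitrary assumption for illustration, not a threshold recommended by ChromBERT.

```python
import pandas as pd

requested = "BCL11A:GM12878;BRD4:MCF7;CTCF:HepG2;MYC:H1;MYC:h9;SPI1:GSM2702714"

# Rows 0, 3 and 99 of the table above (subset of columns), for illustration.
df = pd.DataFrame({
    "input_chrom": ["chr1", "chr12", "chr9"],
    "input_start": [37989946, 52980788, 136122853],
    "input_end": [37990368, 52981316, 136123320],
    "bcl11a:gm12878": [0.781250, 0.174805, 0.167969],
    "brd4:mcf7": [0.660156, 0.601562, 0.310547],
    "ctcf:hepg2": [0.984375, 0.976562, 0.968750],
    "myc:h9": [0.972656, 0.812500, 0.394531],
    "spi1:gsm2702714": [0.632812, 0.345703, 0.400391],
})

# Column names are lowercase, so lowercase the requested names before matching.
requested_names = [c.strip().lower() for c in requested.split(";") if c.strip()]
missing = [c for c in requested_names if c not in df.columns]
cistrome_cols = [c for c in requested_names if c in df.columns]
print("not found:", missing)

# Reshape to one row per (region, cistrome) pair.
long_df = df.melt(
    id_vars=["input_chrom", "input_start", "input_end"],
    value_vars=cistrome_cols,
    var_name="cistrome",
    value_name="probability",
)

THRESHOLD = 0.5  # assumed cutoff for illustration only
long_df["predicted_bound"] = long_df["probability"] >= THRESHOLD

# Number of regions called bound per cistrome.
bound_counts = long_df.groupby("cistrome")["predicted_bound"].sum()
print(bound_counts)
```

The long format makes it straightforward to filter, group, or plot by cistrome; the cutoff should be chosen to suit the downstream analysis.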