Interpret key regulators between region groups¶
This notebook demonstrates how to identify key regulators between region groups using the ChromBERT-tools Python API.
The interpret_regulator_effects_between_region_groups API generates regulator embeddings for different region groups and calculates the embedding difference of each regulator between groups as 1 − cosine similarity. Regulators with larger embedding differences are considered more likely to be key regulators.
For the bash command-line usage, see `examples/cli/interpret_regulator_effects_between_region_groups.ipynb <../cli/interpret_regulator_effects_between_region_groups.ipynb>`__.
For more details, please refer to the `interpret_regulator_effects_between_region_groups <https://chrombert-tools.readthedocs.io/en/latest/commands/interpret_regulator_effects_between_region_groups.html>`__ command documentation
[3]:
from chrombert_tools import interpret_regulator_effects_between_region_groups as run_key_regulator
pre-trained¶
[5]:
factor_importance_rank = run_key_regulator(
region1_file="../data/myoblast_ENCFF647RNC_peak_100.bed", # your focus region group 1
region2_file="../data/CTCF_ENCFF664UGR_sample100.bed", # your focus region group 2
odir="./output_regulator_effects_between_region_groups_pretrain", # output directory
genome="hg38", # genome
resolution="1kb", # resolution
batch_size=64, # batch size
)
Region summary - total: 100, overlapping with ChromBERT: 101 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Region summary - total: 100, overlapping with ChromBERT: 100 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
100%|██████████| 2/2 [00:04<00:00, 2.43s/it]
100%|██████████| 2/2 [00:04<00:00, 2.03s/it]
Identify key regulators across regions(top 25)
factors similarity rank embedding_shift
0 ctcf 0.363011 1 0.636989
1 smc3 0.439149 2 0.560851
2 rad21 0.494077 3 0.505923
3 zbtb2 0.495611 4 0.504389
4 znf654 0.496850 5 0.503150
5 mbd4 0.520199 6 0.479801
6 smc1a 0.543333 7 0.456667
7 stag1 0.546467 8 0.453533
8 drap1 0.598942 9 0.401058
9 mbd1 0.600407 10 0.399593
10 trim22 0.607737 11 0.392263
11 brca1 0.623139 12 0.376861
12 sumo1 0.633337 13 0.366663
13 l3mbtl4 0.641373 14 0.358627
14 kdm3b 0.642861 15 0.357139
15 tgif2 0.647627 16 0.352373
16 irf3 0.648460 17 0.351540
17 srebf2 0.649150 18 0.350850
18 ruvbl1 0.657634 19 0.342366
19 fosl2 0.661870 20 0.338130
20 faire 0.662093 21 0.337907
21 foxm1 0.666393 22 0.333607
22 esrra 0.669006 23 0.330994
23 cebpz 0.675203 24 0.324797
24 nipbl 0.682350 25 0.317650
Finished
Used pre-trained ChromBERT
Key regulators across regions saved to: ./output_regulator_effects_between_region_groups_pretrain/results/factor_importance_rank.csv
[6]:
factor_importance_rank.head(n=25)
[6]:
| factors | similarity | rank | embedding_shift | |
|---|---|---|---|---|
| 0 | ctcf | 0.363011 | 1 | 0.636989 |
| 1 | smc3 | 0.439149 | 2 | 0.560851 |
| 2 | rad21 | 0.494077 | 3 | 0.505923 |
| 3 | zbtb2 | 0.495611 | 4 | 0.504389 |
| 4 | znf654 | 0.496850 | 5 | 0.503150 |
| 5 | mbd4 | 0.520199 | 6 | 0.479801 |
| 6 | smc1a | 0.543333 | 7 | 0.456667 |
| 7 | stag1 | 0.546467 | 8 | 0.453533 |
| 8 | drap1 | 0.598942 | 9 | 0.401058 |
| 9 | mbd1 | 0.600407 | 10 | 0.399593 |
| 10 | trim22 | 0.607737 | 11 | 0.392263 |
| 11 | brca1 | 0.623139 | 12 | 0.376861 |
| 12 | sumo1 | 0.633337 | 13 | 0.366663 |
| 13 | l3mbtl4 | 0.641373 | 14 | 0.358627 |
| 14 | kdm3b | 0.642861 | 15 | 0.357139 |
| 15 | tgif2 | 0.647627 | 16 | 0.352373 |
| 16 | irf3 | 0.648460 | 17 | 0.351540 |
| 17 | srebf2 | 0.649150 | 18 | 0.350850 |
| 18 | ruvbl1 | 0.657634 | 19 | 0.342366 |
| 19 | fosl2 | 0.661870 | 20 | 0.338130 |
| 20 | faire | 0.662093 | 21 | 0.337907 |
| 21 | foxm1 | 0.666393 | 22 | 0.333607 |
| 22 | esrra | 0.669006 | 23 | 0.330994 |
| 23 | cebpz | 0.675203 | 24 | 0.324797 |
| 24 | nipbl | 0.682350 | 25 | 0.317650 |
cell-type-specific fine-tuned model¶
[ ]:
# # Download example data
# # Myoblast and fibroblast data: ATAC-seq bigWig and peak files
# import subprocess
# import os
# if not os.path.exists('../data/myoblast_ENCFF647RNC_peak.bed'):
# cmd = f'wget https://www.encodeproject.org/files/ENCFF647RNC/@@download/ENCFF647RNC.bed.gz -O ../data/myoblast_ENCFF647RNC_peak.bed.gz'
# subprocess.run(cmd, shell=True)
# cmd = f"gzip -d ../data/myoblast_ENCFF647RNC_peak.bed.gz"
# subprocess.run(cmd, shell=True)
# if not os.path.exists('../data/myoblast_ENCFF149ERN_signal.bigwig'):
# cmd = f'wget https://www.encodeproject.org/files/ENCFF149ERN/@@download/ENCFF149ERN.bigWig -O ../data/myoblast_ENCFF149ERN_signal.bigwig'
# subprocess.run(cmd, shell=True)
### fine-tuned a cell-type-specific model
# from chrombert_tools import region_activity_regression
# results_myoblast_specific = region_activity_regression(
# odir = "./output_cell_specific_emb_train", # output directory
# cell_type_bw = "../data/myoblast_ENCFF149ERN_signal.bigwig", # your focus cell-type accessibility data
# cell_type_peak = "../data/myoblast_ENCFF647RNC_peak.bed", # your focus cell-type peak data
# genome = "hg38", # genome
# resolution = "1kb", # resolution
# )
[ ]:
import glob
ft_ckpt_dir = "./output_cell_specific_emb_train/train/**/*.ckpt" # Use checkpoints from embed_region.ipynb if available; otherwise, run the code above first
ft_ckpt = glob.glob(ft_ckpt_dir, recursive=True)[0]
ft_ckpt
'./output_cell_specific_emb_train/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt'
[ ]:
factor_importance_rank_ft = run_key_regulator(
region1_file="./output_cell_specific_emb_train/dataset/highly_accessible_region.csv", # your focus region group 1, generated from embed_region.ipynb
region2_file="./output_cell_specific_emb_train/dataset/background_region.csv", # your focus region group 2, generated from embed_region.ipynb
odir="./output_regulator_effects_between_region_groups_finetuned", # output directory
genome="hg38", # genome
resolution="1kb", # resolution
batch_size=64, # batch size
ft_ckpt = ft_ckpt # fine-tuned checkpoint
)
Region summary - total: 1000, overlapping with ChromBERT: 1000 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Region summary - total: 1000, overlapping with ChromBERT: 1000 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from ./output_cell_specific_emb_train/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
100%|██████████| 16/16 [00:17<00:00, 1.11s/it]
100%|██████████| 16/16 [00:18<00:00, 1.15s/it]
Identify key regulators across regions(top 25)
factors similarity rank embedding_shift
0 myod1 0.180347 1 0.819653
1 myf5 0.193729 2 0.806271
2 myog 0.254071 3 0.745929
3 tead1 0.288621 4 0.711379
4 yap1 0.289305 5 0.710695
5 tcf21 0.291978 6 0.708022
6 cbx6 0.356214 7 0.643786
7 ring1 0.380347 8 0.619653
8 foxo1 0.385972 9 0.614028
9 pax3-foxo1a 0.415852 10 0.584148
10 neurog2 0.423216 11 0.576784
11 chd4 0.440935 12 0.559065
12 dnase 0.447499 13 0.552501
13 nr3c1 0.448599 14 0.551401
14 kdm6b 0.461739 15 0.538261
15 rb1 0.471630 16 0.528370
16 rnf2 0.473779 17 0.526221
17 cbx7 0.486125 18 0.513875
18 pgbd3 0.492760 19 0.507240
19 pax7 0.513177 20 0.486823
20 cbx8 0.532263 21 0.467737
21 wwtr1 0.538589 22 0.461411
22 esco2 0.552305 23 0.447695
23 ep300 0.554074 24 0.445926
24 klf11 0.558791 25 0.441209
Finished
Used fine-tuned ChromBERT checkpoint: ./output_cell_specific_emb_train/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt
Key regulators across regions saved to: ./output_regulator_effects_between_region_groups_finetuned/results/factor_importance_rank.csv
[13]:
factor_importance_rank_ft
[13]:
| factors | similarity | rank | embedding_shift | |
|---|---|---|---|---|
| 0 | myod1 | 0.180347 | 1 | 0.819653 |
| 1 | myf5 | 0.193729 | 2 | 0.806271 |
| 2 | myog | 0.254071 | 3 | 0.745929 |
| 3 | tead1 | 0.288621 | 4 | 0.711379 |
| 4 | yap1 | 0.289305 | 5 | 0.710695 |
| ... | ... | ... | ... | ... |
| 1067 | lhx3 | 0.961406 | 1068 | 0.038594 |
| 1068 | nono | 0.961549 | 1069 | 0.038451 |
| 1069 | znf26 | 0.962108 | 1070 | 0.037892 |
| 1070 | fev | 0.962952 | 1071 | 0.037048 |
| 1071 | snapc4 | 0.963752 | 1072 | 0.036248 |
1072 rows × 4 columns