Interpret key regulators between region groups

This notebook demonstrates how to identify key regulators between region groups using the ChromBERT-tools Python API.

The interpret_regulator_effects_between_region_groups API generates regulator embeddings for different region groups and calculates the embedding difference of each regulator between groups as 1 − cosine similarity. Regulators with larger embedding differences are considered more likely to be key regulators.

For the bash command-line usage, see `examples/cli/interpret_regulator_effects_between_region_groups.ipynb <../cli/interpret_regulator_effects_between_region_groups.ipynb>`__.

For more details, please refer to the `interpret_regulator_effects_between_region_groups <https://chrombert-tools.readthedocs.io/en/latest/commands/interpret_regulator_effects_between_region_groups.html>`__ command documentation

[3]:
from chrombert_tools import interpret_regulator_effects_between_region_groups as run_key_regulator

pre-trained

[5]:
factor_importance_rank = run_key_regulator(
    region1_file="../data/myoblast_ENCFF647RNC_peak_100.bed", # your focus region group 1
    region2_file="../data/CTCF_ENCFF664UGR_sample100.bed", # your focus region group 2
    odir="./output_regulator_effects_between_region_groups_pretrain", # output directory
    genome="hg38", # genome
    resolution="1kb", # resolution
    batch_size=64, # batch size
)
Region summary - total: 100, overlapping with ChromBERT: 101 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Region summary - total: 100, overlapping with ChromBERT: 100 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
100%|██████████| 2/2 [00:04<00:00,  2.43s/it]
100%|██████████| 2/2 [00:04<00:00,  2.03s/it]
Identify key regulators across regions(top 25)
    factors  similarity  rank  embedding_shift
0      ctcf    0.363011     1         0.636989
1      smc3    0.439149     2         0.560851
2     rad21    0.494077     3         0.505923
3     zbtb2    0.495611     4         0.504389
4    znf654    0.496850     5         0.503150
5      mbd4    0.520199     6         0.479801
6     smc1a    0.543333     7         0.456667
7     stag1    0.546467     8         0.453533
8     drap1    0.598942     9         0.401058
9      mbd1    0.600407    10         0.399593
10   trim22    0.607737    11         0.392263
11    brca1    0.623139    12         0.376861
12    sumo1    0.633337    13         0.366663
13  l3mbtl4    0.641373    14         0.358627
14    kdm3b    0.642861    15         0.357139
15    tgif2    0.647627    16         0.352373
16     irf3    0.648460    17         0.351540
17   srebf2    0.649150    18         0.350850
18   ruvbl1    0.657634    19         0.342366
19    fosl2    0.661870    20         0.338130
20    faire    0.662093    21         0.337907
21    foxm1    0.666393    22         0.333607
22    esrra    0.669006    23         0.330994
23    cebpz    0.675203    24         0.324797
24    nipbl    0.682350    25         0.317650
Finished
Used pre-trained ChromBERT
Key regulators across regions saved to: ./output_regulator_effects_between_region_groups_pretrain/results/factor_importance_rank.csv
[6]:
factor_importance_rank.head(n=25)
[6]:
factors similarity rank embedding_shift
0 ctcf 0.363011 1 0.636989
1 smc3 0.439149 2 0.560851
2 rad21 0.494077 3 0.505923
3 zbtb2 0.495611 4 0.504389
4 znf654 0.496850 5 0.503150
5 mbd4 0.520199 6 0.479801
6 smc1a 0.543333 7 0.456667
7 stag1 0.546467 8 0.453533
8 drap1 0.598942 9 0.401058
9 mbd1 0.600407 10 0.399593
10 trim22 0.607737 11 0.392263
11 brca1 0.623139 12 0.376861
12 sumo1 0.633337 13 0.366663
13 l3mbtl4 0.641373 14 0.358627
14 kdm3b 0.642861 15 0.357139
15 tgif2 0.647627 16 0.352373
16 irf3 0.648460 17 0.351540
17 srebf2 0.649150 18 0.350850
18 ruvbl1 0.657634 19 0.342366
19 fosl2 0.661870 20 0.338130
20 faire 0.662093 21 0.337907
21 foxm1 0.666393 22 0.333607
22 esrra 0.669006 23 0.330994
23 cebpz 0.675203 24 0.324797
24 nipbl 0.682350 25 0.317650

cell-type-specific fine-tuned model

[ ]:
# # Download example data
# # Myoblast and fibroblast data: ATAC-seq bigWig and peak files
# import subprocess
# import os
# if not os.path.exists('../data/myoblast_ENCFF647RNC_peak.bed'):
#     cmd = f'wget https://www.encodeproject.org/files/ENCFF647RNC/@@download/ENCFF647RNC.bed.gz -O ../data/myoblast_ENCFF647RNC_peak.bed.gz'
#     subprocess.run(cmd, shell=True)
#     cmd = f"gzip -d ../data/myoblast_ENCFF647RNC_peak.bed.gz"
#     subprocess.run(cmd, shell=True)

# if not os.path.exists('../data/myoblast_ENCFF149ERN_signal.bigwig'):
#     cmd = f'wget https://www.encodeproject.org/files/ENCFF149ERN/@@download/ENCFF149ERN.bigWig -O ../data/myoblast_ENCFF149ERN_signal.bigwig'
#     subprocess.run(cmd, shell=True)

### fine-tuned a cell-type-specific model
# from chrombert_tools import region_activity_regression
# results_myoblast_specific = region_activity_regression(
#     odir = "./output_cell_specific_emb_train", # output directory
#     cell_type_bw = "../data/myoblast_ENCFF149ERN_signal.bigwig", # your focus cell-type accessibility data
#     cell_type_peak = "../data/myoblast_ENCFF647RNC_peak.bed", # your focus cell-type peak data
#     genome = "hg38", # genome
#     resolution = "1kb", # resolution
# )

[ ]:
import glob
ft_ckpt_dir = "./output_cell_specific_emb_train/train/**/*.ckpt" # Use checkpoints from embed_region.ipynb if available; otherwise, run the code above first

ft_ckpt = glob.glob(ft_ckpt_dir, recursive=True)[0]
ft_ckpt
'./output_cell_specific_emb_train/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt'
[ ]:
factor_importance_rank_ft = run_key_regulator(
    region1_file="./output_cell_specific_emb_train/dataset/highly_accessible_region.csv", # your focus region group 1, generated from embed_region.ipynb
    region2_file="./output_cell_specific_emb_train/dataset/background_region.csv", # your focus region group 2, generated from embed_region.ipynb
    odir="./output_regulator_effects_between_region_groups_finetuned", # output directory
    genome="hg38", # genome
    resolution="1kb", # resolution
    batch_size=64, # batch size
    ft_ckpt = ft_ckpt # fine-tuned checkpoint
)
Region summary - total: 1000, overlapping with ChromBERT: 1000 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Region summary - total: 1000, overlapping with ChromBERT: 1000 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from ./output_cell_specific_emb_train/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
100%|██████████| 16/16 [00:17<00:00,  1.11s/it]
100%|██████████| 16/16 [00:18<00:00,  1.15s/it]
Identify key regulators across regions(top 25)
        factors  similarity  rank  embedding_shift
0         myod1    0.180347     1         0.819653
1          myf5    0.193729     2         0.806271
2          myog    0.254071     3         0.745929
3         tead1    0.288621     4         0.711379
4          yap1    0.289305     5         0.710695
5         tcf21    0.291978     6         0.708022
6          cbx6    0.356214     7         0.643786
7         ring1    0.380347     8         0.619653
8         foxo1    0.385972     9         0.614028
9   pax3-foxo1a    0.415852    10         0.584148
10      neurog2    0.423216    11         0.576784
11         chd4    0.440935    12         0.559065
12        dnase    0.447499    13         0.552501
13        nr3c1    0.448599    14         0.551401
14        kdm6b    0.461739    15         0.538261
15          rb1    0.471630    16         0.528370
16         rnf2    0.473779    17         0.526221
17         cbx7    0.486125    18         0.513875
18        pgbd3    0.492760    19         0.507240
19         pax7    0.513177    20         0.486823
20         cbx8    0.532263    21         0.467737
21        wwtr1    0.538589    22         0.461411
22        esco2    0.552305    23         0.447695
23        ep300    0.554074    24         0.445926
24        klf11    0.558791    25         0.441209
Finished
Used fine-tuned ChromBERT checkpoint: ./output_cell_specific_emb_train/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt
Key regulators across regions saved to: ./output_regulator_effects_between_region_groups_finetuned/results/factor_importance_rank.csv
[13]:
factor_importance_rank_ft
[13]:
factors similarity rank embedding_shift
0 myod1 0.180347 1 0.819653
1 myf5 0.193729 2 0.806271
2 myog 0.254071 3 0.745929
3 tead1 0.288621 4 0.711379
4 yap1 0.289305 5 0.710695
... ... ... ... ...
1067 lhx3 0.961406 1068 0.038594
1068 nono 0.961549 1069 0.038451
1069 znf26 0.962108 1070 0.037892
1070 fev 0.962952 1071 0.037048
1071 snapc4 0.963752 1072 0.036248

1072 rows × 4 columns