Interpret key regulators between region groups¶

This notebook demonstrates how to identify key regulators between region groups using the ChromBERT-tools Python API.

The interpret_regulator_effects_between_region_groups API generates regulator embeddings for different region groups and calculates the embedding difference of each regulator between groups as 1 − cosine similarity. Regulators with larger embedding differences are considered more likely to be key regulators.

For the bash command-line usage, see `examples/cli/interpret_regulator_effects_between_region_groups.ipynb <../cli/interpret_regulator_effects_between_region_groups.ipynb>`__.

For more details, please refer to the `interpret_regulator_effects_between_region_groups <https://chrombert-tools.readthedocs.io/en/latest/commands/interpret_regulator_effects_between_region_groups.html>`__ command documentation

[3]:

from chrombert_tools import interpret_regulator_effects_between_region_groups as run_key_regulator

pre-trained¶

[5]:

factor_importance_rank = run_key_regulator(
    region1_file="../data/myoblast_ENCFF647RNC_peak_100.bed", # your focus region group 1
    region2_file="../data/CTCF_ENCFF664UGR_sample100.bed", # your focus region group 2
    odir="./output_regulator_effects_between_region_groups_pretrain", # output directory
    genome="hg38", # genome
    resolution="1kb", # resolution
    batch_size=64, # batch size
)

Region summary - total: 100, overlapping with ChromBERT: 101 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Region summary - total: 100, overlapping with ChromBERT: 100 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.

100%|██████████| 2/2 [00:04<00:00,  2.43s/it]
100%|██████████| 2/2 [00:04<00:00,  2.03s/it]

Identify key regulators across regions(top 25)
    factors  similarity  rank  embedding_shift
0      ctcf    0.363011     1         0.636989
1      smc3    0.439149     2         0.560851
2     rad21    0.494077     3         0.505923
3     zbtb2    0.495611     4         0.504389
4    znf654    0.496850     5         0.503150
5      mbd4    0.520199     6         0.479801
6     smc1a    0.543333     7         0.456667
7     stag1    0.546467     8         0.453533
8     drap1    0.598942     9         0.401058
9      mbd1    0.600407    10         0.399593
10   trim22    0.607737    11         0.392263
11    brca1    0.623139    12         0.376861
12    sumo1    0.633337    13         0.366663
13  l3mbtl4    0.641373    14         0.358627
14    kdm3b    0.642861    15         0.357139
15    tgif2    0.647627    16         0.352373
16     irf3    0.648460    17         0.351540
17   srebf2    0.649150    18         0.350850
18   ruvbl1    0.657634    19         0.342366
19    fosl2    0.661870    20         0.338130
20    faire    0.662093    21         0.337907
21    foxm1    0.666393    22         0.333607
22    esrra    0.669006    23         0.330994
23    cebpz    0.675203    24         0.324797
24    nipbl    0.682350    25         0.317650
Finished
Used pre-trained ChromBERT
Key regulators across regions saved to: ./output_regulator_effects_between_region_groups_pretrain/results/factor_importance_rank.csv

[6]:

factor_importance_rank.head(n=25)

[6]:

	factors	similarity	rank	embedding_shift
0	ctcf	0.363011	1	0.636989
1	smc3	0.439149	2	0.560851
2	rad21	0.494077	3	0.505923
3	zbtb2	0.495611	4	0.504389
4	znf654	0.496850	5	0.503150
5	mbd4	0.520199	6	0.479801
6	smc1a	0.543333	7	0.456667
7	stag1	0.546467	8	0.453533
8	drap1	0.598942	9	0.401058
9	mbd1	0.600407	10	0.399593
10	trim22	0.607737	11	0.392263
11	brca1	0.623139	12	0.376861
12	sumo1	0.633337	13	0.366663
13	l3mbtl4	0.641373	14	0.358627
14	kdm3b	0.642861	15	0.357139
15	tgif2	0.647627	16	0.352373
16	irf3	0.648460	17	0.351540
17	srebf2	0.649150	18	0.350850
18	ruvbl1	0.657634	19	0.342366
19	fosl2	0.661870	20	0.338130
20	faire	0.662093	21	0.337907
21	foxm1	0.666393	22	0.333607
22	esrra	0.669006	23	0.330994
23	cebpz	0.675203	24	0.324797
24	nipbl	0.682350	25	0.317650

cell-type-specific fine-tuned model¶

[ ]:

# # Download example data
# # Myoblast and fibroblast data: ATAC-seq bigWig and peak files
# import subprocess
# import os
# if not os.path.exists('../data/myoblast_ENCFF647RNC_peak.bed'):
#     cmd = f'wget https://www.encodeproject.org/files/ENCFF647RNC/@@download/ENCFF647RNC.bed.gz -O ../data/myoblast_ENCFF647RNC_peak.bed.gz'
#     subprocess.run(cmd, shell=True)
#     cmd = f"gzip -d ../data/myoblast_ENCFF647RNC_peak.bed.gz"
#     subprocess.run(cmd, shell=True)

# if not os.path.exists('../data/myoblast_ENCFF149ERN_signal.bigwig'):
#     cmd = f'wget https://www.encodeproject.org/files/ENCFF149ERN/@@download/ENCFF149ERN.bigWig -O ../data/myoblast_ENCFF149ERN_signal.bigwig'
#     subprocess.run(cmd, shell=True)

### fine-tuned a cell-type-specific model
# from chrombert_tools import region_activity_regression
# results_myoblast_specific = region_activity_regression(
#     odir = "./output_cell_specific_emb_train", # output directory
#     cell_type_bw = "../data/myoblast_ENCFF149ERN_signal.bigwig", # your focus cell-type accessibility data
#     cell_type_peak = "../data/myoblast_ENCFF647RNC_peak.bed", # your focus cell-type peak data
#     genome = "hg38", # genome
#     resolution = "1kb", # resolution
# )

[ ]:

import glob
ft_ckpt_dir = "./output_cell_specific_emb_train/train/**/*.ckpt" # Use checkpoints from embed_region.ipynb if available; otherwise, run the code above first

ft_ckpt = glob.glob(ft_ckpt_dir, recursive=True)[0]
ft_ckpt

'./output_cell_specific_emb_train/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt'

[ ]:

factor_importance_rank_ft = run_key_regulator(
    region1_file="./output_cell_specific_emb_train/dataset/highly_accessible_region.csv", # your focus region group 1, generated from embed_region.ipynb
    region2_file="./output_cell_specific_emb_train/dataset/background_region.csv", # your focus region group 2, generated from embed_region.ipynb
    odir="./output_regulator_effects_between_region_groups_finetuned", # output directory
    genome="hg38", # genome
    resolution="1kb", # resolution
    batch_size=64, # batch size
    ft_ckpt = ft_ckpt # fine-tuned checkpoint
)

Region summary - total: 1000, overlapping with ChromBERT: 1000 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Region summary - total: 1000, overlapping with ChromBERT: 1000 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from ./output_cell_specific_emb_train/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters

100%|██████████| 16/16 [00:17<00:00,  1.11s/it]
100%|██████████| 16/16 [00:18<00:00,  1.15s/it]

Identify key regulators across regions(top 25)
        factors  similarity  rank  embedding_shift
0         myod1    0.180347     1         0.819653
1          myf5    0.193729     2         0.806271
2          myog    0.254071     3         0.745929
3         tead1    0.288621     4         0.711379
4          yap1    0.289305     5         0.710695
5         tcf21    0.291978     6         0.708022
6          cbx6    0.356214     7         0.643786
7         ring1    0.380347     8         0.619653
8         foxo1    0.385972     9         0.614028
9   pax3-foxo1a    0.415852    10         0.584148
10      neurog2    0.423216    11         0.576784
11         chd4    0.440935    12         0.559065
12        dnase    0.447499    13         0.552501
13        nr3c1    0.448599    14         0.551401
14        kdm6b    0.461739    15         0.538261
15          rb1    0.471630    16         0.528370
16         rnf2    0.473779    17         0.526221
17         cbx7    0.486125    18         0.513875
18        pgbd3    0.492760    19         0.507240
19         pax7    0.513177    20         0.486823
20         cbx8    0.532263    21         0.467737
21        wwtr1    0.538589    22         0.461411
22        esco2    0.552305    23         0.447695
23        ep300    0.554074    24         0.445926
24        klf11    0.558791    25         0.441209
Finished
Used fine-tuned ChromBERT checkpoint: ./output_cell_specific_emb_train/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt
Key regulators across regions saved to: ./output_regulator_effects_between_region_groups_finetuned/results/factor_importance_rank.csv

[13]:

factor_importance_rank_ft

[13]:

	factors	similarity	rank	embedding_shift
0	myod1	0.180347	1	0.819653
1	myf5	0.193729	2	0.806271
2	myog	0.254071	3	0.745929
3	tead1	0.288621	4	0.711379
4	yap1	0.289305	5	0.710695
...	...	...	...	...
1067	lhx3	0.961406	1068	0.038594
1068	nono	0.961549	1069	0.038451
1069	znf26	0.962108	1070	0.037892
1070	fev	0.962952	1071	0.037048
1071	snapc4	0.963752	1072	0.036248

1072 rows × 4 columns