Interpret regulator important effects between region groups¶
The interpret_regulator_effects_between_region_groups command generates regulator embeddings for different region groups and calculates the embedding difference of each regulator between groups as 1 − cosine similarity. Regulators with larger embedding differences are considered more likely to be key regulators.
Note: The remaining examples show command-line usage only (bash).
For the Python API, see `examples/api/interpret_regulator_effects_between_region_groups.ipynb <../api/interpret_regulator_effects_between_region_groups.ipynb>`__.
If you need to use Apptainer container, please refer to the `apptainer_use.ipynb <apptainer_use.ipynb>`__ tutorial for detailed instructions on using apptainer exec with chrombert-tools.
For more details, please refer to the `interpret_regulator_effects_between_region_groups <https://chrombert-tools.readthedocs.io/en/latest/commands/interpret_regulator_effects_between_region_groups.html>`__ command documentation
[ ]:
### options parameter
!chrombert-tools interpret_regulator_effects_between_region_groups -h
Usage: chrombert-tools interpret_regulator_effects_between_region_groups
[OPTIONS]
Identify regulators that differ between two region sets via embedding shift.
Options:
--region1-file FILE Region 1 file. [required]
--region2-file FILE Region 2 file. [required]
--ft-ckpt FILE Fine-tuned ChromBERT checkpoint. If
provided, using this ckpt to generate
embeddings.
--ignore-regulator TEXT Ignore regulator. Use ';' to separate
multiple regulators.
--gep Use GEP model (multi-flank-window). Default:
False.
--flank-window INTEGER Flank window size for gep model. [default:
4]
--genome [hg38|mm10] Reference genome (hg38 or mm10). [default:
hg38]
--resolution [200bp|1kb|2kb|4kb]
ChromBERT resolution. [default: 1kb]
--odir DIRECTORY Output directory. [default: ./output]
--batch-size INTEGER Batch size. [default: 4]
--model-config FILE Model configuration file.
--data-config FILE Data configuration file.
--chrombert-cache-dir DIRECTORY
ChromBERT cache directory (contains config/
anno/ checkpoint/ etc). [default:
~/.cache/chrombert/data]
-h, --help Show this message and exit.
pre-trained model¶
[ ]:
%%bash
# --region1-file: focus region group 1
# --region2-file: focus region group 2
# --odir: output directory
# --genome: genome
# --resolution: resolution
# --batch-size: batch size
chrombert-tools interpret_regulator_effects_between_region_groups \
--region1-file "../data/myoblast_ENCFF647RNC_peak_100.bed" \
--region2-file "../data/CTCF_ENCFF664UGR_sample100.bed" \
--odir "./output_regulator_effects_between_region_groups_pretrain" \
--genome "hg38" \
--resolution "1kb" \
--batch-size 64
Region summary - total: 100, overlapping with ChromBERT: 101 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Region summary - total: 100, overlapping with ChromBERT: 100 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
100%|██████████| 2/2 [00:05<00:00, 2.67s/it]
100%|██████████| 2/2 [00:03<00:00, 1.98s/it]
Identify key regulators across regions(top 25)
factors similarity rank embedding_shift
0 ctcf 0.363011 1 0.636989
1 smc3 0.439149 2 0.560851
2 rad21 0.494077 3 0.505923
3 zbtb2 0.495611 4 0.504389
4 znf654 0.496850 5 0.503150
5 mbd4 0.520199 6 0.479801
6 smc1a 0.543333 7 0.456667
7 stag1 0.546467 8 0.453533
8 drap1 0.598942 9 0.401058
9 mbd1 0.600407 10 0.399593
10 trim22 0.607737 11 0.392263
11 brca1 0.623139 12 0.376861
12 sumo1 0.633337 13 0.366663
13 l3mbtl4 0.641373 14 0.358627
14 kdm3b 0.642861 15 0.357139
15 tgif2 0.647627 16 0.352373
16 irf3 0.648460 17 0.351540
17 srebf2 0.649150 18 0.350850
18 ruvbl1 0.657634 19 0.342366
19 fosl2 0.661870 20 0.338130
20 faire 0.662093 21 0.337907
21 foxm1 0.666393 22 0.333607
22 esrra 0.669006 23 0.330994
23 cebpz 0.675203 24 0.324797
24 nipbl 0.682350 25 0.317650
Finished
Used pre-trained ChromBERT
Key regulators across regions saved to: ./output_regulator_effects_between_region_groups_pretrain/results/factor_importance_rank.csv
[3]:
import pandas as pd
factor_importance_rank = pd.read_csv("./output_regulator_effects_between_region_groups_pretrain/results/factor_importance_rank.csv")
factor_importance_rank
[3]:
| factors | similarity | rank | embedding_shift | |
|---|---|---|---|---|
| 0 | ctcf | 0.363011 | 1 | 0.636989 |
| 1 | smc3 | 0.439149 | 2 | 0.560851 |
| 2 | rad21 | 0.494077 | 3 | 0.505923 |
| 3 | zbtb2 | 0.495611 | 4 | 0.504389 |
| 4 | znf654 | 0.496850 | 5 | 0.503150 |
| ... | ... | ... | ... | ... |
| 1067 | egr3 | 0.963733 | 1068 | 0.036267 |
| 1068 | irf5 | 0.964050 | 1069 | 0.035950 |
| 1069 | stat4 | 0.965261 | 1070 | 0.034739 |
| 1070 | znf600 | 0.966632 | 1071 | 0.033368 |
| 1071 | ikzf2 | 0.970161 | 1072 | 0.029839 |
1072 rows × 4 columns
cell-type-specific fine-tuned model¶
[ ]:
# # Download example data
# # Myoblast and fibroblast data: ATAC-seq bigWig and peak files
# import subprocess
# import os
# if not os.path.exists('../data/myoblast_ENCFF647RNC_peak.bed'):
# cmd = f'wget https://www.encodeproject.org/files/ENCFF647RNC/@@download/ENCFF647RNC.bed.gz -O ../data/myoblast_ENCFF647RNC_peak.bed.gz'
# subprocess.run(cmd, shell=True)
# cmd = f"gzip -d ../data/myoblast_ENCFF647RNC_peak.bed.gz"
# subprocess.run(cmd, shell=True)
# if not os.path.exists('../data/myoblast_ENCFF149ERN_signal.bigwig'):
# cmd = f'wget https://www.encodeproject.org/files/ENCFF149ERN/@@download/ENCFF149ERN.bigWig -O ../data/myoblast_ENCFF149ERN_signal.bigwig'
# subprocess.run(cmd, shell=True)
## fine-tuned a cell-type-specific model
# '''
# --odir: output directory
# --acc_signal1: cell-type-specific accessibility signal
# --acc_peak1: cell-type-specific peak
# --genome: genome
# --resolution: resolution
# '''
# !chrombert-tools region_activity_regression \
# --odir "./output_cell_specific_emb_train" \
# --acc_signal1 "../data/myoblast_ENCFF149ERN_signal.bigwig" \
# --acc_peak1 "../data/myoblast_ENCFF647RNC_peak.bed" \
# --genome "hg38" \
# --resolution "1kb"
[ ]:
import glob
ft_ckpt_dir = "../api/output_cell_specific_emb_train/train/**/*.ckpt" # Use checkpoints from embed_region.ipynb if available; otherwise, run the code above first
ft_ckpt = glob.glob(ft_ckpt_dir, recursive=True)[0]
ft_ckpt
'../api/output_cell_specific_emb_train/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt'
[ ]:
# --region1-file: focus region group 1
# --region2-file: focus region group 2
# --odir: output directory
# --genome: genome
# --resolution: resolution
# --batch-size: batch size
# --ft-ckpt: fine-tuned model checkpoint
!chrombert-tools interpret_regulator_effects_between_region_groups \
--region1-file "./output_cell_specific_emb_train/dataset/highly_accessible_region.csv" \
--region2-file "./output_cell_specific_emb_train/dataset/background_region.csv" \
--odir "./output_regulator_effects_between_region_groups_finetuned" \
--genome "hg38" \
--resolution "1kb" \
--batch-size 64 \
--ft-ckpt {ft_ckpt}
Region summary - total: 1000, overlapping with ChromBERT: 1000 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Region summary - total: 1000, overlapping with ChromBERT: 1000 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from ../api/output_cell_specific_emb_train/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
100%|███████████████████████████████████████████| 16/16 [00:38<00:00, 2.39s/it]
100%|███████████████████████████████████████████| 16/16 [00:37<00:00, 2.33s/it]
Identify key regulators across regions(top 25)
factors similarity rank embedding_shift
0 myod1 0.180347 1 0.819653
1 myf5 0.193729 2 0.806271
2 myog 0.254071 3 0.745929
3 tead1 0.288621 4 0.711379
4 yap1 0.289305 5 0.710695
5 tcf21 0.291978 6 0.708022
6 cbx6 0.356214 7 0.643786
7 ring1 0.380347 8 0.619653
8 foxo1 0.385972 9 0.614028
9 pax3-foxo1a 0.415852 10 0.584148
10 neurog2 0.423216 11 0.576784
11 chd4 0.440935 12 0.559065
12 dnase 0.447499 13 0.552501
13 nr3c1 0.448599 14 0.551401
14 kdm6b 0.461739 15 0.538261
15 rb1 0.471630 16 0.528370
16 rnf2 0.473779 17 0.526221
17 cbx7 0.486125 18 0.513875
18 pgbd3 0.492760 19 0.507240
19 pax7 0.513177 20 0.486823
20 cbx8 0.532263 21 0.467737
21 wwtr1 0.538589 22 0.461411
22 esco2 0.552305 23 0.447695
23 ep300 0.554074 24 0.445926
24 klf11 0.558791 25 0.441209
Finished
Used fine-tuned ChromBERT checkpoint: ../api/output_cell_specific_emb_train/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt
Key regulators across regions saved to: ./output_regulator_effects_between_region_groups_finetuned/results/factor_importance_rank.csv
[7]:
factor_importance_rank_ft = pd.read_csv("./output_regulator_effects_between_region_groups_finetuned/results/factor_importance_rank.csv")
factor_importance_rank_ft
[7]:
| factors | similarity | rank | embedding_shift | |
|---|---|---|---|---|
| 0 | myod1 | 0.180347 | 1 | 0.819653 |
| 1 | myf5 | 0.193729 | 2 | 0.806271 |
| 2 | myog | 0.254071 | 3 | 0.745929 |
| 3 | tead1 | 0.288621 | 4 | 0.711379 |
| 4 | yap1 | 0.289305 | 5 | 0.710695 |
| ... | ... | ... | ... | ... |
| 1067 | lhx3 | 0.961406 | 1068 | 0.038594 |
| 1068 | nono | 0.961549 | 1069 | 0.038451 |
| 1069 | znf26 | 0.962108 | 1070 | 0.037892 |
| 1070 | fev | 0.962952 | 1071 | 0.037048 |
| 1071 | snapc4 | 0.963752 | 1072 | 0.036248 |
1072 rows × 4 columns