Interpret regulator important effects between region groups

The interpret_regulator_effects_between_region_groups command generates regulator embeddings for different region groups and calculates the embedding difference of each regulator between groups as 1 − cosine similarity. Regulators with larger embedding differences are considered more likely to be key regulators.

Note: The remaining examples show command-line usage only (bash).

For the Python API, see `examples/api/interpret_regulator_effects_between_region_groups.ipynb <../api/interpret_regulator_effects_between_region_groups.ipynb>`__.

If you need to use Apptainer container, please refer to the `apptainer_use.ipynb <apptainer_use.ipynb>`__ tutorial for detailed instructions on using apptainer exec with chrombert-tools.

For more details, please refer to the `interpret_regulator_effects_between_region_groups <https://chrombert-tools.readthedocs.io/en/latest/commands/interpret_regulator_effects_between_region_groups.html>`__ command documentation

[ ]:
### options parameter
!chrombert-tools interpret_regulator_effects_between_region_groups -h
Usage: chrombert-tools interpret_regulator_effects_between_region_groups
           [OPTIONS]

  Identify regulators that differ between two region sets via embedding shift.

Options:
  --region1-file FILE             Region 1 file.  [required]
  --region2-file FILE             Region 2 file.  [required]
  --ft-ckpt FILE                  Fine-tuned ChromBERT checkpoint. If
                                  provided, using this ckpt to generate
                                  embeddings.
  --ignore-regulator TEXT         Ignore regulator. Use ';' to separate
                                  multiple regulators.
  --gep                           Use GEP model (multi-flank-window). Default:
                                  False.
  --flank-window INTEGER          Flank window size for gep model.  [default:
                                  4]
  --genome [hg38|mm10]            Reference genome (hg38 or mm10).  [default:
                                  hg38]
  --resolution [200bp|1kb|2kb|4kb]
                                  ChromBERT resolution.  [default: 1kb]
  --odir DIRECTORY                Output directory.  [default: ./output]
  --batch-size INTEGER            Batch size.  [default: 4]
  --model-config FILE             Model configuration file.
  --data-config FILE              Data configuration file.
  --chrombert-cache-dir DIRECTORY
                                  ChromBERT cache directory (contains config/
                                  anno/ checkpoint/ etc).  [default:
                                  ~/.cache/chrombert/data]
  -h, --help                      Show this message and exit.

pre-trained model

[ ]:
%%bash
# --region1-file: focus region group 1
# --region2-file: focus region group 2
# --odir: output directory
# --genome: genome
# --resolution: resolution
# --batch-size: batch size
chrombert-tools interpret_regulator_effects_between_region_groups \
    --region1-file "../data/myoblast_ENCFF647RNC_peak_100.bed" \
    --region2-file "../data/CTCF_ENCFF664UGR_sample100.bed" \
    --odir "./output_regulator_effects_between_region_groups_pretrain" \
    --genome "hg38" \
    --resolution "1kb" \
    --batch-size 64
Region summary - total: 100, overlapping with ChromBERT: 101 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Region summary - total: 100, overlapping with ChromBERT: 100 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
100%|██████████| 2/2 [00:05<00:00,  2.67s/it]
100%|██████████| 2/2 [00:03<00:00,  1.98s/it]
Identify key regulators across regions(top 25)
    factors  similarity  rank  embedding_shift
0      ctcf    0.363011     1         0.636989
1      smc3    0.439149     2         0.560851
2     rad21    0.494077     3         0.505923
3     zbtb2    0.495611     4         0.504389
4    znf654    0.496850     5         0.503150
5      mbd4    0.520199     6         0.479801
6     smc1a    0.543333     7         0.456667
7     stag1    0.546467     8         0.453533
8     drap1    0.598942     9         0.401058
9      mbd1    0.600407    10         0.399593
10   trim22    0.607737    11         0.392263
11    brca1    0.623139    12         0.376861
12    sumo1    0.633337    13         0.366663
13  l3mbtl4    0.641373    14         0.358627
14    kdm3b    0.642861    15         0.357139
15    tgif2    0.647627    16         0.352373
16     irf3    0.648460    17         0.351540
17   srebf2    0.649150    18         0.350850
18   ruvbl1    0.657634    19         0.342366
19    fosl2    0.661870    20         0.338130
20    faire    0.662093    21         0.337907
21    foxm1    0.666393    22         0.333607
22    esrra    0.669006    23         0.330994
23    cebpz    0.675203    24         0.324797
24    nipbl    0.682350    25         0.317650
Finished
Used pre-trained ChromBERT
Key regulators across regions saved to: ./output_regulator_effects_between_region_groups_pretrain/results/factor_importance_rank.csv
[3]:
import pandas as pd
factor_importance_rank = pd.read_csv("./output_regulator_effects_between_region_groups_pretrain/results/factor_importance_rank.csv")
factor_importance_rank
[3]:
factors similarity rank embedding_shift
0 ctcf 0.363011 1 0.636989
1 smc3 0.439149 2 0.560851
2 rad21 0.494077 3 0.505923
3 zbtb2 0.495611 4 0.504389
4 znf654 0.496850 5 0.503150
... ... ... ... ...
1067 egr3 0.963733 1068 0.036267
1068 irf5 0.964050 1069 0.035950
1069 stat4 0.965261 1070 0.034739
1070 znf600 0.966632 1071 0.033368
1071 ikzf2 0.970161 1072 0.029839

1072 rows × 4 columns

cell-type-specific fine-tuned model

[ ]:
# # Download example data
# # Myoblast and fibroblast data: ATAC-seq bigWig and peak files
# import subprocess
# import os
# if not os.path.exists('../data/myoblast_ENCFF647RNC_peak.bed'):
#     cmd = f'wget https://www.encodeproject.org/files/ENCFF647RNC/@@download/ENCFF647RNC.bed.gz -O ../data/myoblast_ENCFF647RNC_peak.bed.gz'
#     subprocess.run(cmd, shell=True)
#     cmd = f"gzip -d ../data/myoblast_ENCFF647RNC_peak.bed.gz"
#     subprocess.run(cmd, shell=True)

# if not os.path.exists('../data/myoblast_ENCFF149ERN_signal.bigwig'):
#     cmd = f'wget https://www.encodeproject.org/files/ENCFF149ERN/@@download/ENCFF149ERN.bigWig -O ../data/myoblast_ENCFF149ERN_signal.bigwig'
#     subprocess.run(cmd, shell=True)

## fine-tuned a cell-type-specific model
# '''
# --odir: output directory
# --acc_signal1: cell-type-specific accessibility signal
# --acc_peak1: cell-type-specific peak
# --genome: genome
# --resolution: resolution
# '''
# !chrombert-tools region_activity_regression \
# --odir "./output_cell_specific_emb_train" \
# --acc_signal1 "../data/myoblast_ENCFF149ERN_signal.bigwig" \
# --acc_peak1 "../data/myoblast_ENCFF647RNC_peak.bed" \
# --genome "hg38" \
# --resolution "1kb"
[ ]:
import glob
ft_ckpt_dir = "../api/output_cell_specific_emb_train/train/**/*.ckpt" # Use checkpoints from embed_region.ipynb if available; otherwise, run the code above first

ft_ckpt = glob.glob(ft_ckpt_dir, recursive=True)[0]
ft_ckpt
'../api/output_cell_specific_emb_train/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt'
[ ]:
# --region1-file: focus region group 1
# --region2-file: focus region group 2
# --odir: output directory
# --genome: genome
# --resolution: resolution
# --batch-size: batch size
# --ft-ckpt: fine-tuned model checkpoint
!chrombert-tools interpret_regulator_effects_between_region_groups \
    --region1-file "./output_cell_specific_emb_train/dataset/highly_accessible_region.csv" \
    --region2-file "./output_cell_specific_emb_train/dataset/background_region.csv" \
    --odir "./output_regulator_effects_between_region_groups_finetuned" \
    --genome "hg38" \
    --resolution "1kb" \
    --batch-size 64 \
    --ft-ckpt {ft_ckpt}


Region summary - total: 1000, overlapping with ChromBERT: 1000 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Region summary - total: 1000, overlapping with ChromBERT: 1000 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from ../api/output_cell_specific_emb_train/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
100%|███████████████████████████████████████████| 16/16 [00:38<00:00,  2.39s/it]
100%|███████████████████████████████████████████| 16/16 [00:37<00:00,  2.33s/it]
Identify key regulators across regions(top 25)
        factors  similarity  rank  embedding_shift
0         myod1    0.180347     1         0.819653
1          myf5    0.193729     2         0.806271
2          myog    0.254071     3         0.745929
3         tead1    0.288621     4         0.711379
4          yap1    0.289305     5         0.710695
5         tcf21    0.291978     6         0.708022
6          cbx6    0.356214     7         0.643786
7         ring1    0.380347     8         0.619653
8         foxo1    0.385972     9         0.614028
9   pax3-foxo1a    0.415852    10         0.584148
10      neurog2    0.423216    11         0.576784
11         chd4    0.440935    12         0.559065
12        dnase    0.447499    13         0.552501
13        nr3c1    0.448599    14         0.551401
14        kdm6b    0.461739    15         0.538261
15          rb1    0.471630    16         0.528370
16         rnf2    0.473779    17         0.526221
17         cbx7    0.486125    18         0.513875
18        pgbd3    0.492760    19         0.507240
19         pax7    0.513177    20         0.486823
20         cbx8    0.532263    21         0.467737
21        wwtr1    0.538589    22         0.461411
22        esco2    0.552305    23         0.447695
23        ep300    0.554074    24         0.445926
24        klf11    0.558791    25         0.441209
Finished
Used fine-tuned ChromBERT checkpoint: ../api/output_cell_specific_emb_train/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt
Key regulators across regions saved to: ./output_regulator_effects_between_region_groups_finetuned/results/factor_importance_rank.csv
[7]:
factor_importance_rank_ft = pd.read_csv("./output_regulator_effects_between_region_groups_finetuned/results/factor_importance_rank.csv")
factor_importance_rank_ft
[7]:
factors similarity rank embedding_shift
0 myod1 0.180347 1 0.819653
1 myf5 0.193729 2 0.806271
2 myog 0.254071 3 0.745929
3 tead1 0.288621 4 0.711379
4 yap1 0.289305 5 0.710695
... ... ... ... ...
1067 lhx3 0.961406 1068 0.038594
1068 nono 0.961549 1069 0.038451
1069 znf26 0.962108 1070 0.037892
1070 fev 0.962952 1071 0.037048
1071 snapc4 0.963752 1072 0.036248

1072 rows × 4 columns