Interpret regulator-regulator interactions

The interpret_regulator_regulator_interactions subcommand generates context-aware regulator embeddings and calculates the cosine similarity of each regulator pair. Regulators with higher cosine similarity are considered more likely to interact.

Note: The remaining examples show command-line usage only (bash).

For the Python API, see `examples/api/interpret_regulator_regulator_interactions.ipynb <../api/interpret_regulator_regulator_interactions.ipynb>`__.

If you need to use Apptainer container, please refer to the `apptainer_use.ipynb <apptainer_use.ipynb>`__ tutorial for detailed instructions on using apptainer exec with chrombert-tools.

For more details, please refer to the `interpret_regulator_regulator_interactions <https://chrombert-tools.readthedocs.io/en/latest/commands/interpret_regulator_regulator_interactions.html>`__ command documentation

[ ]:
### options parameter
!chrombert-tools interpret_regulator_regulator_interactions -h
Usage: chrombert-tools interpret_regulator_regulator_interactions
           [OPTIONS]

  Interpret regulator-regulator interactions

Options:
  --region FILE                   Region BED file (focus regions).  [required]
  --regulator TEXT                Optional. Regulators to plot subnetworks,
                                  e.g. EZH2;BRD4;CTCF. Use ';' to separate.
  --odir DIRECTORY                Output directory.  [default: ./output]
  --ft-ckpt FILE                  Fine-tuned ChromBERT checkpoint. If
                                  provided, using this ckpt to generate
                                  embeddings.
  --ignore-regulator TEXT         Ignore regulator(s) Use ';' to separate
                                  multiple names. If provided, will ignore
                                  these regulators.
  --gep                           Use GEP model (multi-flank-window). Default:
                                  False.
  --flank-window INTEGER          Flank window size for GEP model.  [default:
                                  4]
  --genome [hg38|mm10]            Genome.  [default: hg38]
  --resolution [1kb|200bp|2kb|4kb]
                                  Resolution.  [default: 1kb]
  --chrombert-cache-dir DIRECTORY
                                  ChromBERT cache dir (contains config/ and
                                  checkpoint/ etc).  [default:
                                  ~/.cache/chrombert/data]
  --batch-size INTEGER            Batch size for region dataloader.  [default:
                                  64]
  --quantile FLOAT                Quantile threshold for cosine similarity
                                  edges.  [default: 0.98]
  --k-hop INTEGER                 k-hop radius for subnetwork plotting.
                                  [default: 1]
  --model-config FILE             Model configuration file.
  --data-config FILE              Data configuration file.
  -h, --help                      Show this message and exit.

Pre-trained model

[ ]:
# Infer regulator-regulator interactions across focus regions
# --region: focus regions
# --regulator: focus regulators
# --odir: output directory
# --genome: genome
# --resolution: resolution
!chrombert-tools interpret_regulator_regulator_interactions \
    --region ../data/CTCF_ENCFF664UGR_sample100.bed \
    --regulator "ctcf;nanog;ezh2" \
    --odir ./output_regulator_network \
    --genome "hg38" \
    --resolution "1kb"
Region summary - total: 100, overlapping with ChromBERT: 100 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Note: All regulator names were converted to lowercase for matching.
Regulator count summary - requested: 3, matched in ChromBERT: 3, not found: 0, not found regulator: []
ChromBERT regulators: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_regulators_list.txt
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
100%|█████████████████████████████████████████████| 2/2 [00:03<00:00,  1.59s/it]
Total graph nodes: 951
Total graph edges (threshold=0.636): 11503
Regulator subnetwork saved to: ./output_regulator_network/subnetwork_ezh2_k1_q0.980_thr0.636.pdf
Regulator subnetwork saved to: ./output_regulator_network/subnetwork_ctcf_k1_q0.980_thr0.636.pdf
Regulator subnetwork saved to: ./output_regulator_network/subnetwork_nanog_k1_q0.980_thr0.636.pdf
Finished!
Saved outputs to: ./output_regulator_network
Regulator cosine similarity saved to: ./output_regulator_network/regulator_cosine_similarity.tsv
Total graph edges saved to: ./output_regulator_network/total_graph_edge_threshold0.636_quantile0.980.tsv
[4]:
# regulator_cosine_similarity.tsv: cosine similarity matrix of regulators on this region
# total_graph_edge_*.tsv: edges in the regulatory network where cosine similarity >= threshold
# subnetwork_*.pdf: subnetworks for specified regulators
import pandas as pd
regulator_cosine_similarity = pd.read_csv("./output_regulator_network/regulator_cosine_similarity.tsv", sep="\t",index_col=0)
total_graph_edge = pd.read_csv("./output_regulator_network/total_graph_edge_threshold0.636_quantile0.980.tsv", sep="\t")


[ ]:
# cosine similarity matrix of regulators on this region
regulator_cosine_similarity
5hmc adnp aebp2 aff1 aff4 ago1 ago2 ahr ahrr alkbh3 ... zscan20 zscan22 zscan23 zscan29 zscan31 zscan5a zta zxdb zxdc zzz3
5hmc 1.000000 0.161553 0.285241 0.158628 0.117248 0.127353 0.164635 0.140008 0.140390 0.256362 ... 0.343546 0.136590 0.344879 0.193269 0.168963 0.255532 0.340011 0.150076 0.061059 0.330447
adnp 0.161553 1.000000 0.587140 0.387827 0.471895 0.130505 0.207243 0.277108 0.308542 0.250292 ... 0.399306 0.333286 0.455049 0.514076 0.365677 0.465939 0.225964 0.436089 0.300675 0.241342
aebp2 0.285241 0.587140 1.000000 0.308597 0.402976 0.124346 0.206790 0.248920 0.429926 0.295569 ... 0.407240 0.224415 0.319738 0.286058 0.308937 0.247846 0.316289 0.215994 0.166821 0.273573
aff1 0.158628 0.387827 0.308597 1.000000 0.681266 0.235524 0.285841 0.336590 0.390974 0.265273 ... 0.386461 0.306672 0.318689 0.370916 0.413583 0.343913 0.262005 0.297290 0.231193 0.262453
aff4 0.117248 0.471895 0.402976 0.681266 1.000000 0.253977 0.326415 0.329043 0.368464 0.319714 ... 0.380179 0.447794 0.403113 0.396646 0.423483 0.385116 0.332274 0.390634 0.394011 0.287089
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
zscan5a 0.255532 0.465939 0.247846 0.343913 0.385116 0.259383 0.276472 0.326592 0.212140 0.211508 ... 0.757539 0.434079 0.870642 0.482686 0.472391 1.000000 0.179671 0.647394 0.424395 0.084673
zta 0.340011 0.225964 0.316289 0.262005 0.332274 0.036716 0.130013 0.184514 0.272651 0.305709 ... 0.270659 0.260485 0.195705 0.184984 0.326114 0.179671 1.000000 0.087392 0.076405 0.419953
zxdb 0.150076 0.436089 0.215994 0.297290 0.390634 0.309995 0.294769 0.320665 0.209873 0.151475 ... 0.573041 0.541771 0.619564 0.468207 0.393382 0.647394 0.087392 1.000000 0.497639 0.033969
zxdc 0.061059 0.300675 0.166821 0.231193 0.394011 0.333499 0.287851 0.590078 0.343713 0.153941 ... 0.304457 0.302000 0.406280 0.343347 0.191912 0.424395 0.076405 0.497639 1.000000 0.017482
zzz3 0.330447 0.241342 0.273573 0.262453 0.287089 -0.040804 0.054807 0.071938 0.219917 0.250210 ... 0.204509 0.221356 0.077926 0.130979 0.319954 0.084673 0.419953 0.033969 0.017482 1.000000

1073 rows × 1073 columns

[ ]:
# edges in the regulatory network where cosine similarity >= threshold
total_graph_edge
node1 node2 cosine_similarity
0 5hmc brdu 0.701982
1 5hmc rloop 0.756476
2 5hmc sirt1 0.664322
3 5hmc znf823 0.641759
4 adnp atf5 0.710570
... ... ... ...
11498 zscan20 zscan23 0.739037
11499 zscan20 zscan5a 0.757539
11500 zscan22 zscan31 0.712420
11501 zscan23 zscan5a 0.870642
11502 zscan5a zxdb 0.647394

11503 rows × 3 columns

[ ]:
# subnetworks for ctcf
df_edges_ctcf = total_graph_edge.query("node1 == 'ctcf'")
df_edges_ctcf

node1 node2 cosine_similarity
1682 ctcf dnase 0.704624
1683 ctcf kdm5b 0.658326
1684 ctcf rad21 0.851742
1685 ctcf smc1a 0.844970
1686 ctcf smc3 0.856302
1687 ctcf srf 0.656536
1688 ctcf stag1 0.865400
1689 ctcf sumo1 0.663740
1690 ctcf trim22 0.642802
1691 ctcf zbtb2 0.898986
1692 ctcf znf654 0.731698
[ ]:
# subnetworks for ctcf
df_edges_nanog = total_graph_edge.query("node1 == 'nanog'")
df_edges_nanog

node1 node2 cosine_similarity
5293 nanog pou5f1 0.755285
5294 nanog smad2 0.660624
5295 nanog sox2 0.749701
5296 nanog tal1 0.636584
5297 nanog tbxt 0.686592
[ ]:
# subnetworks for ctcf
df_edges_ezh2 = total_graph_edge.query("node1 == 'ezh2'")
df_edges_ezh2

node1 node2 cosine_similarity
2715 ezh2 h3k27me3 0.751440
2716 ezh2 hinfp 0.694675
2717 ezh2 hsf1 0.656183
2718 ezh2 junb 0.649472
2719 ezh2 med1 0.638378
2720 ezh2 npat 0.655375
2721 ezh2 pcgf2 0.677134
2722 ezh2 polr3g 0.669339
2723 ezh2 rnf2 0.716335
2724 ezh2 stat3 0.652700
2725 ezh2 suz12 0.809129
2726 ezh2 tp53 0.662590
2727 ezh2 ubtf 0.673707
[ ]:
# subnetworks for myf5
df_edges_myf5= total_graph_edge.query("node1 == 'myod1' or node2 == 'myod1'")
df_edges_myf5

node1 node2 cosine_similarity
294 ascl1 myod1 0.713746
5248 myod1 myog 0.643061
5249 myod1 neurog2 0.704578
5250 myod1 tcf21 0.664360
5251 myod1 zbtb42 0.637871

cell-type-specific (myoblast)

[ ]:
# # Download example data
# # Myoblast and fibroblast data: ATAC-seq bigWig and peak files
# import subprocess
# import os
# if not os.path.exists('../data/myoblast_ENCFF647RNC_peak.bed'):
#     cmd = f'wget https://www.encodeproject.org/files/ENCFF647RNC/@@download/ENCFF647RNC.bed.gz -O ../data/myoblast_ENCFF647RNC_peak.bed.gz'
#     subprocess.run(cmd, shell=True)
#     cmd = f"gzip -d ../data/myoblast_ENCFF647RNC_peak.bed.gz"
#     subprocess.run(cmd, shell=True)

# if not os.path.exists('../data/myoblast_ENCFF149ERN_signal.bigwig'):
#     cmd = f'wget https://www.encodeproject.org/files/ENCFF149ERN/@@download/ENCFF149ERN.bigWig -O ../data/myoblast_ENCFF149ERN_signal.bigwig'
#     subprocess.run(cmd, shell=True)

## fine-tuned a cell-type-specific model
# '''
# --odir: output directory
# --acc_signal1: cell-type-specific accessibility signal
# --acc_peak1: cell-type-specific peak
# --genome: genome
# --resolution: resolution
# '''
# !chrombert-tools region_activity_regression \
# --odir "./output_cell_specific_emb_train" \
# --acc_signal1 "../data/myoblast_ENCFF149ERN_signal.bigwig" \
# --acc_peak1 "../data/myoblast_ENCFF647RNC_peak.bed" \
# --genome "hg38" \
# --resolution "1kb"
[ ]:
import glob
ft_ckpt_dir = "../api/output_cell_specific_emb_train/train/**/*.ckpt" # Use checkpoints from embed_region.ipynb if available; otherwise, run the code above first

ft_ckpt = glob.glob(ft_ckpt_dir, recursive=True)[0]
ft_ckpt
'../api/output_cell_specific_emb_train/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt'
[ ]:
# Infer regulator-regulator interactions across focus regions
# --region: focus regions
# --regulator: focus regulators
# --odir: output directory
# --ft-ckpt: fine-tuned checkpoint
# --genome: genome
# --resolution: resolution

!chrombert-tools interpret_regulator_regulator_interactions \
    --region ../data/myoblast_ENCFF647RNC_peak_100.bed \
    --regulator "myod1;ctcf;nanog;ezh2" \
    --odir ./output_regulator_network_ft \
    --ft-ckpt {ft_ckpt} \
    --genome "hg38" \
    --resolution "1kb"

Region summary - total: 100, overlapping with ChromBERT: 101 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Note: All regulator names were converted to lowercase for matching.
Regulator count summary - requested: 4, matched in ChromBERT: 4, not found: 0, not found regulator: []
ChromBERT regulators: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_regulators_list.txt
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from ../api/output_cell_specific_emb_train/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
100%|█████████████████████████████████████████████| 2/2 [00:02<00:00,  1.39s/it]
Total graph nodes: 924
Total graph edges (threshold=0.648): 11503
Regulator subnetwork saved to: ./output_regulator_network_ft/subnetwork_nanog_k1_q0.980_thr0.648.pdf
Regulator subnetwork saved to: ./output_regulator_network_ft/subnetwork_ezh2_k1_q0.980_thr0.648.pdf
Regulator subnetwork saved to: ./output_regulator_network_ft/subnetwork_myod1_k1_q0.980_thr0.648.pdf
Regulator subnetwork saved to: ./output_regulator_network_ft/subnetwork_ctcf_k1_q0.980_thr0.648.pdf
Finished!
Saved outputs to: ./output_regulator_network_ft
Regulator cosine similarity saved to: ./output_regulator_network_ft/regulator_cosine_similarity.tsv
Total graph edges saved to: ./output_regulator_network_ft/total_graph_edge_threshold0.648_quantile0.980.tsv
[18]:
# regulator_cosine_similarity.tsv: cosine similarity matrix of regulators on this region
# total_graph_edge_*.tsv: edges in the regulatory network where cosine similarity >= threshold
# subnetwork_*.pdf: subnetworks for specified regulators
import pandas as pd
regulator_cosine_similarity = pd.read_csv("./output_regulator_network_ft/regulator_cosine_similarity.tsv", sep="\t",index_col=0)
total_graph_edge = pd.read_csv("./output_regulator_network_ft/total_graph_edge_threshold0.648_quantile0.980.tsv", sep="\t")


[20]:
total_graph_edge.query("node1 == 'myod1' or node2 == 'myod1'")
[20]:
node1 node2 cosine_similarity
311 ascl1 myod1 0.714556
5225 myod1 myog 0.666915
5226 myod1 neurog2 0.678363
5227 myod1 tcf21 0.671968
[ ]: