Interpret region–region interactions

This notebook demonstrates how to infer region–region interactions using the ChromBERT-tools Python API.

interpret_region_region_interactions API: generates region embeddings, computes the cosine similarity between the embeddings of each region pair, and uses that similarity to infer region–region interactions.

For the bash command-line usage, see `examples/cli/interpret_region_region_interactions.ipynb <../cli/interpret_region_region_interactions.ipynb>`__.

For more details, see the `interpret_region_region_interactions <https://chrombert-tools.readthedocs.io/en/latest/commands/interpret_region_region_interactions.html>`__ command documentation.
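As a rough illustration of the similarity step described above, the sketch below computes cosine similarity between one "promoter" vector and several "distal region" vectors with NumPy. The vectors are hypothetical toy values and the helper `cosine_similarity` is an illustrative name, not part of ChromBERT-tools; the library's own implementation may differ.

```python
import numpy as np

def cosine_similarity(query, candidates):
    """Cosine similarity between one vector and each row of a matrix."""
    query = np.asarray(query, dtype=float)
    candidates = np.asarray(candidates, dtype=float)
    dots = candidates @ query
    norms = np.linalg.norm(candidates, axis=1) * np.linalg.norm(query)
    return dots / norms

# Toy 3-dimensional "embeddings" (hypothetical values, not ChromBERT output).
promoter_emb = np.array([1.0, 0.0, 0.0])
distal_embs = np.array([
    [2.0, 0.0, 0.0],   # same direction as the promoter vector
    [0.0, 3.0, 0.0],   # orthogonal to the promoter vector
    [-1.0, 0.0, 0.0],  # opposite direction
])
sims = cosine_similarity(promoter_emb, distal_embs)
print(sims)
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why the first candidate scores 1 even though it is twice as long as the query.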

[10]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # select the GPU before importing CUDA-dependent packages; adjust the index for your machine

from chrombert_tools import interpret_region_region_interactions

Infer region–region interactions (enhancer–promoter loops; pretrained ChromBERT only)

[2]:
# Returns:
# tss_region_pairs_cos: table of TSS–distal-region pairs with the cosine similarity (cos_sim) of their region representations; a higher cos_sim indicates a stronger inferred enhancer–promoter interaction.


tss_region_pairs_cos = interpret_region_region_interactions(
    region='../data/hESC_GSM2386582_ATAC.bed', # your focus enhancer region
    odir="./output_infer_ep", # output directory
    genome="hg38",                      # Options: "hg38", "mm10"
    resolution="1kb",                   # Options: "1kb", "2kb", "4kb", "200bp"
    filter_gene_name="RNVU1-15", # Focus on the RNVU1-15 promoter; omit this argument to consider all genes.
)
Region summary - total: 54408, overlapping with ChromBERT: 56692 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 496
  Gene filter: kept 1/55240 TSS rows (gene_name in [1 names], gene_id in [0 ids])
  Gene filter: kept 5490/56692 region1 (BED) rows on 1 chromosome(s) matching the selected gene(s)
Finished!
Enhancer-promoter style pairs saved to: ./output_infer_ep/tss_region_pairs_cos.tsv
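The "Region summary" line above reports more overlapping rows than input regions because one input region can be assigned to several ChromBERT bins. The sketch below is one possible reading of the "≥50% coverage of either the ChromBERT bin or the input region" rule, written from the log message alone; the function names and the example coordinates are hypothetical, and the library's actual overlap code may differ.

```python
def overlap_fractions(region, bin_):
    """Return (fraction of region covered, fraction of bin covered)."""
    r_start, r_end = region
    b_start, b_end = bin_
    overlap = max(0, min(r_end, b_end) - max(r_start, b_start))
    return overlap / (r_end - r_start), overlap / (b_end - b_start)

def keeps_overlap(region, bin_, min_cov=0.5):
    """Keep the pair if the overlap covers >= min_cov of either interval."""
    frac_region, frac_bin = overlap_fractions(region, bin_)
    return frac_region >= min_cov or frac_bin >= min_cov

# Hypothetical 1.3 kb peak spanning two 1 kb bins.
peak = (10400, 11700)
bins = [(10000, 11000), (11000, 12000), (12000, 13000)]
kept = [b for b in bins if keeps_overlap(peak, b)]
print(kept)
```

Here the peak covers 60% of the first bin and 70% of the second, so under this reading both bins are kept and a single input region contributes two rows.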
[3]:
tss_region_pairs_cos
[3]:
chrom gene_id gene_name tss tss_build_region_index distal_region_start distal_region_end distal_region_build_region_index dist dist_bin cos_sim
0 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144546000 144547000 99004 133424 79 0.966797
1 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144551000 144552000 99008 138424 83 0.910645
2 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144560000 144561000 99015 147424 90 0.794922
3 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144461000 144462000 98949 48424 24 0.699219
4 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144419000 144420000 98930 6424 5 0.688477
5 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144524000 144525000 98985 111424 60 0.651367
6 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144567000 144568000 99021 154424 96 0.431885
7 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144490000 144491000 98966 77424 41 0.307373
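A short pandas sketch of how the returned table might be inspected. It copies four rows from the output above, checks that in these rows `dist` equals `distal_region_start - tss` and `dist_bin` equals the difference of the two region indices, and then keeps pairs above a cutoff. The 0.6 cutoff is an arbitrary illustrative value, not a recommendation from ChromBERT-tools.

```python
import pandas as pd

# Four rows copied from the table above (subset of columns).
pairs = pd.DataFrame({
    "tss": [144412576] * 4,
    "tss_build_region_index": [98925] * 4,
    "distal_region_start": [144546000, 144551000, 144419000, 144490000],
    "distal_region_build_region_index": [99004, 99008, 98930, 98966],
    "dist": [133424, 138424, 6424, 77424],
    "dist_bin": [79, 83, 5, 41],
    "cos_sim": [0.966797, 0.910645, 0.688477, 0.307373],
})

# Relationships that hold for the rows shown (all downstream of the TSS).
dist_check = pairs["distal_region_start"] - pairs["tss"]
bin_check = (pairs["distal_region_build_region_index"]
             - pairs["tss_build_region_index"])

# Keep candidate pairs above an illustrative similarity cutoff.
min_cos_sim = 0.6  # hypothetical threshold
strong = pairs[pairs["cos_sim"] >= min_cos_sim].sort_values(
    "cos_sim", ascending=False
)
print(strong[["distal_region_start", "dist", "cos_sim"]])
```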

Infer enhancer–promoter interactions (cell-type-specific fine-tuned model)

[4]:
import glob
ft_ckpt_pattern = "./output_cell_specific_emb_train/train/**/*.ckpt" # Glob pattern for fine-tuned model checkpoints produced by embed_region.ipynb

ft_ckpt = glob.glob(ft_ckpt_pattern, recursive=True)[0] # use the first matching checkpoint
ft_ckpt
[4]:
'./output_cell_specific_emb_train/train_old/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt'
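`glob.glob` returns matches in arbitrary order, so taking element `[0]` may not pick the checkpoint you expect when several runs exist. A minimal sketch of a more explicit choice, assuming the most recently modified file is the one wanted; `latest_checkpoint` is a hypothetical helper, and the temporary files below exist only to demonstrate it.

```python
import glob
import os
import tempfile

def latest_checkpoint(pattern):
    """Return the most recently modified file matching a glob pattern."""
    paths = glob.glob(pattern, recursive=True)
    if not paths:
        raise FileNotFoundError(f"No checkpoint matches {pattern!r}")
    return max(paths, key=os.path.getmtime)

# Demonstration with two empty stand-in files and fixed modification times.
with tempfile.TemporaryDirectory() as tmp:
    old = os.path.join(tmp, "run_a", "epoch=1.ckpt")
    new = os.path.join(tmp, "run_b", "epoch=2.ckpt")
    for path, mtime in [(old, 1_000), (new, 2_000)]:
        os.makedirs(os.path.dirname(path))
        open(path, "w").close()
        os.utime(path, (mtime, mtime))
    picked = latest_checkpoint(os.path.join(tmp, "**", "*.ckpt"))
    picked_name = os.path.basename(picked)

print(picked_name)
```

Raising an error on an empty match also avoids the `IndexError` that `[0]` would produce if the pattern finds nothing.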
[5]:
# Returns:
# tss_region_pairs_cos: table of TSS–distal-region pairs with the cosine similarity (cos_sim) of their region representations; a higher cos_sim indicates a stronger inferred enhancer–promoter interaction.


tss_region_pairs_cos = interpret_region_region_interactions(
    region='../data/myoblast_ENCFF647RNC_peak.bed', # your focus enhancer region
    odir="./output_infer_ep_myoblast_specific", # output directory
    genome="hg38",                      # Options: "hg38", "mm10"
    resolution="1kb",                   # Options: "1kb", "2kb", "4kb", "200bp"
    ft_ckpt=ft_ckpt,
    batch_size=64,
    filter_gene_name="RNVU1-15", # Focus on the RNVU1-15 promoter; omit this argument to consider all genes.
)
Region summary - total: 373422, overlapping with ChromBERT: 368260 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 7920
  Gene filter: kept 1/55240 TSS rows (gene_name in [1 names], gene_id in [0 ids])
  Gene filter: kept 33479/368260 region1 (BED) rows on 1 chromosome(s) matching the selected gene(s)
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from ./output_cell_specific_emb_train/train_old/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
100%|██████████| 461/461 [07:37<00:00,  1.01it/s]
Finished!
Enhancer-promoter style pairs saved to: ./output_infer_ep_myoblast_specific/tss_region_pairs_cos.tsv
[6]:
tss_region_pairs_cos
[6]:
chrom gene_id gene_name tss tss_build_region_index distal_region_start distal_region_end distal_region_build_region_index dist dist_bin cos_sim
0 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144546000 144547000 99004 133424 79 0.974380
1 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144546000 144547000 99004 133424 79 0.974380
2 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144552000 144553000 99009 139424 84 0.949564
3 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144551000 144552000 99008 138424 83 0.934053
4 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144560000 144561000 99015 147424 90 0.799996
5 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144524000 144525000 98985 111424 60 0.676938
6 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144413000 144414000 98926 424 1 0.670687
7 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144550000 144551000 99007 137424 82 0.632238
8 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144419000 144420000 98930 6424 5 0.625515
9 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144419000 144420000 98930 6424 5 0.625515
10 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144547000 144548000 99005 134424 80 0.613987
11 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144523000 144524000 98984 110424 59 0.595567
12 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144461000 144462000 98949 48424 24 0.580753
13 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144461000 144462000 98949 48424 24 0.580753
14 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144545000 144546000 99003 132424 78 0.529481
15 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144519000 144520000 98980 106424 55 0.287048
16 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144518000 144519000 98979 105424 54 0.285937
17 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144518000 144519000 98979 105424 54 0.285937
18 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144518000 144519000 98979 105424 54 0.285937
19 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144518000 144519000 98979 105424 54 0.285937
20 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144494000 144495000 98968 81424 43 0.273922
21 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144522000 144523000 98983 109424 58 0.258144
22 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144520000 144521000 98981 107424 56 0.255417
23 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144490000 144491000 98966 77424 41 0.246384
24 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144521000 144522000 98982 108424 57 0.245028
25 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144489000 144490000 98965 76424 40 0.245018
26 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144517000 144518000 98978 104424 53 0.239847
27 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144470000 144471000 98953 57424 28 0.198687
28 chr1 ENSG00000207205 RNVU1-15 144412576 98925 144502000 144503000 98971 89424 46 0.189062
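Several rows in the table above are exact duplicates (for example rows 0 and 1). A plausible cause, stated here as an assumption, is that multiple input peaks map to the same ChromBERT bin. A small pandas sketch for collapsing them, using five rows copied from the output:

```python
import pandas as pd

# Five rows copied from the table above (subset of columns).
pairs = pd.DataFrame({
    "distal_region_build_region_index": [99004, 99004, 99009, 98930, 98930],
    "dist": [133424, 133424, 139424, 6424, 6424],
    "cos_sim": [0.974380, 0.974380, 0.949564, 0.625515, 0.625515],
})

# Keep one row per identical record and renumber the index.
unique_pairs = pairs.drop_duplicates().reset_index(drop=True)
print(unique_pairs)
```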

Infer region–region interactions (two region groups)

[2]:
from chrombert_tools import resolve_paths
chrombert_anno_files = resolve_paths(genome="hg38", resolution="1kb", chrombert_cache_dir="~/.cache/chrombert/data")
[3]:
chrombert_anno_files
[3]:
{'chrombert_cache_dir': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data',
 'genome': 'hg38',
 'resolution': '1kb',
 'chrombert_region_file': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_1kb_region.bed',
 'chrombert_regulator_file': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_regulators_list.txt',
 'chrombert_factor_file': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_factors_list.txt',
 'hdf5_file': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/hg38_6k_1kb.hdf5',
 'pretrain_ckpt': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt',
 'mtx_mask': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_mask_matrix.tsv',
 'region_emb_npy': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/anno/hg38_1kb_region_emb.npy',
 'gene_meta_tsv': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/anno/hg38_1kb_gene_meta.tsv',
 'base_ca_signal': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/anno/hg38_1kb_accessibility_signal_mean.npy',
 'meta_file': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_meta.json',
 'prompt_ckpt': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_prompt_cistrome.ckpt'}
[4]:
import pandas as pd
gene_meta = pd.read_csv(chrombert_anno_files["gene_meta_tsv"], sep="\t")
gene_meta.head()
[4]:
chrom loc1 loc2 strand tss gene_id gene_name gene_biotype start end build_region_index
0 chr1 182696 184174 + 182696 ENSG00000279928 DDX11L17 unprocessed_pseudogene 182000 183000 40
1 chr1 2581560 2584533 + 2581560 ENSG00000228037 NaN lncRNA 2581000 2582000 1762
2 chr1 3069168 3438621 + 3069168 ENSG00000142611 PRDM16 protein_coding 3069000 3070000 2125
3 chr1 5301928 5307394 - 5307394 ENSG00000284616 NaN lncRNA 5307000 5308000 3910
4 chr1 2403964 2413797 - 2413797 ENSG00000157911 PEX10 protein_coding 2413000 2414000 1600
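The columns of this table appear to be related in a simple way: in the five rows shown, `tss` is `loc1` for `+` strand genes and `loc2` for `-` strand genes, and `start`/`end` are the 1 kb bin containing the TSS. The sketch below checks this on the rows above; it is an observation about this preview, not a documented guarantee of the file format.

```python
import pandas as pd

# The five preview rows from gene_meta.head() (subset of columns).
head = pd.DataFrame({
    "loc1": [182696, 2581560, 3069168, 5301928, 2403964],
    "loc2": [184174, 2584533, 3438621, 5307394, 2413797],
    "strand": ["+", "+", "+", "-", "-"],
    "tss": [182696, 2581560, 3069168, 5307394, 2413797],
    "start": [182000, 2581000, 3069000, 5307000, 2413000],
    "end": [183000, 2582000, 3070000, 5308000, 2414000],
})

# TSS is the gene start on "+" and the gene end on "-".
tss_from_strand = head["loc1"].where(head["strand"] == "+", head["loc2"])

# Snap each TSS to the 1 kb bin that contains it.
bin_size = 1000
bin_start = head["tss"] // bin_size * bin_size
bin_end = bin_start + bin_size

print((tss_from_strand == head["tss"]).all())
print((bin_start == head["start"]).all(), (bin_end == head["end"]).all())
```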
[ ]:
# Subset the ATAC peaks to chr1 and save them as a new headerless BED file
all_chr_data = pd.read_csv("../data/hESC_GSM2386582_ATAC.bed", sep="\t", header=None)
all_chr_data[all_chr_data[0] == "chr1"].to_csv("../data/hESC_GSM2386582_ATAC_chr1.bed", sep="\t", header=False, index=False)

[ ]:
# Returns:
# region_region_pairs_cos: table of region pairs (one region from group 1, one from group 2) with the cosine similarity (cos_sim) of their region representations; a higher cos_sim indicates a stronger inferred interaction.


region_region_pairs_cos = interpret_region_region_interactions(
    region='../data/hESC_GSM2386582_ATAC_chr1.bed', # your focus region group 1
    region2=chrombert_anno_files["gene_meta_tsv"], # your focus region group 2
    odir="./output_region_region_interactions", # output directory
    genome="hg38",                      # Options: "hg38", "mm10"
    resolution="1kb",                   # Options: "1kb", "2kb", "4kb", "200bp"
)
Region summary - total: 5262, overlapping with ChromBERT: 5490 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 33
Region summary - total: 55240, overlapping with ChromBERT: 55240 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Finished!
Set1 x set2 region-pair cosines (same chrom, 0 <= genomic_dist_bp <= 250000) saved to: ./output_region_region_interactions/region_set_pairs_cos.tsv
[7]:
region_region_pairs_cos
[7]:
set1_chrom set1_start set1_end set1_build_region_index set2_chrom set2_start set2_end set2_build_region_index genomic_dist_bp cos_sim
0 chr1 10000 11000 0 chr1 17000 18000 2 6000 0.609604
1 chr1 10000 11000 0 chr1 29000 30000 3 18000 0.599131
2 chr1 10000 11000 0 chr1 29000 30000 3 18000 0.599131
3 chr1 10000 11000 0 chr1 30000 31000 4 19000 0.604889
4 chr1 10000 11000 0 chr1 91000 92000 15 80000 0.667273
... ... ... ... ... ... ... ... ... ... ...
86022 chr1 248946000 248947000 183982 chr1 248838000 248839000 183892 107000 0.168999
86023 chr1 248946000 248947000 183982 chr1 248859000 248860000 183909 86000 0.176858
86024 chr1 248946000 248947000 183982 chr1 248859000 248860000 183909 86000 0.176858
86025 chr1 248946000 248947000 183982 chr1 248906000 248907000 183953 39000 0.206660
86026 chr1 248946000 248947000 183982 chr1 248912000 248913000 183959 33000 0.489720

86027 rows × 10 columns
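One way to summarize a pair table like this is to keep, for each group 1 region, the group 2 partner with the highest cosine similarity. The sketch below does that with pandas on seven rows copied from the output above; choosing the single best partner is an illustrative analysis choice, not something ChromBERT-tools prescribes.

```python
import pandas as pd

# Seven rows copied from the table above (subset of columns).
pairs = pd.DataFrame({
    "set1_build_region_index": [0, 0, 0, 0, 183982, 183982, 183982],
    "set2_build_region_index": [2, 3, 4, 15, 183892, 183953, 183959],
    "genomic_dist_bp": [6000, 18000, 19000, 80000, 107000, 39000, 33000],
    "cos_sim": [0.609604, 0.599131, 0.604889, 0.667273,
                0.168999, 0.206660, 0.489720],
})

# Row label of the highest cos_sim within each group 1 region.
best_idx = pairs.groupby("set1_build_region_index")["cos_sim"].idxmax()
best = pairs.loc[best_idx].reset_index(drop=True)
print(best)
```

The same pattern works on the full table loaded from `region_set_pairs_cos.tsv`, optionally after filtering on `genomic_dist_bp`.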
