Interpret region–region interactions
This notebook demonstrates how to infer region–region interactions using the ChromBERT-tools Python API.
interpret_region_region_interactions API: Generate region embeddings, calculate region–region embedding similarities, and infer region–region interactions.
For the bash command-line usage, see `examples/cli/interpret_region_region_interactions.ipynb <../cli/interpret_region_region_interactions.ipynb>`__.
For more details, see the `interpret_region_region_interactions <https://chrombert-tools.readthedocs.io/en/latest/commands/interpret_region_region_interactions.html>`__ command documentation.
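As a rough illustration of the similarity step described above, the sketch below computes pairwise cosine similarities between two sets of embedding vectors with NumPy. The function name `cosine_similarity_matrix` and the toy 3-dimensional vectors are assumptions made for illustration; they are not part of the ChromBERT-tools API, and the library's internal implementation may differ.

```python
import numpy as np


def cosine_similarity_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a (n, d) and b (m, d)."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


# Toy vectors standing in for region embeddings (hypothetical values).
promoter_emb = np.array([[1.0, 0.0, 0.0]])
distal_emb = np.array([
    [2.0, 0.0, 0.0],  # same direction as the promoter vector
    [0.0, 1.0, 0.0],  # orthogonal to the promoter vector
    [1.0, 1.0, 0.0],  # 45 degrees away from the promoter vector
])

sims = cosine_similarity_matrix(promoter_emb, distal_emb)
print(sims.round(3))
```

Because the vectors are normalized first, the score depends only on the direction of each embedding, not on its magnitude.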
[10]:
from chrombert_tools import interpret_region_region_interactions
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1" # GPU device
Infer region–region interactions (enhancer–promoter loops; pretrained ChromBERT only)
[2]:
# Returns:
# tss_region_pairs_cos: table of TSS–distal region pairs with the cosine similarity (cos_sim) of their embeddings; a higher cos_sim suggests a stronger enhancer–promoter interaction.
tss_region_pairs_cos = interpret_region_region_interactions(
    region="../data/hESC_GSM2386582_ATAC.bed",  # BED file of candidate enhancer regions
    odir="./output_infer_ep",  # output directory
    genome="hg38",  # Options: "hg38", "mm10"
    resolution="1kb",  # Options: "1kb", "2kb", "4kb", "200bp"
    filter_gene_name="RNVU1-15",  # Focus on the RNVU1-15 promoter; if omitted, all genes are considered.
)
Region summary - total: 54408, overlapping with ChromBERT: 56692 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 496
Gene filter: kept 1/55240 TSS rows (gene_name in [1 names], gene_id in [0 ids])
Gene filter: kept 5490/56692 region1 (BED) rows on 1 chromosome(s) matching the selected gene(s)
Finished!
Enhancer-promoter style pairs saved to: ./output_infer_ep/tss_region_pairs_cos.tsv
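The region summary above reports more overlapping rows than input regions because one input region can be assigned to several ChromBERT bins. The sketch below shows one way to read the "≥50% coverage of either the ChromBERT bin or the input region" rule. The function `passes_overlap_rule` and the example coordinates are hypothetical, and the library's actual overlap logic may differ in details such as coordinate conventions.

```python
def passes_overlap_rule(region, chrombert_bin, min_fraction=0.5):
    """Return True if the overlap covers >= min_fraction of either interval.

    Both arguments are (start, end) tuples on the same chromosome,
    in half-open BED-style coordinates.
    """
    region_start, region_end = region
    bin_start, bin_end = chrombert_bin
    overlap = max(0, min(region_end, bin_end) - max(region_start, bin_start))
    if overlap == 0:
        return False
    region_len = region_end - region_start
    bin_len = bin_end - bin_start
    return overlap >= min_fraction * region_len or overlap >= min_fraction * bin_len


# A hypothetical 1.9 kb peak that touches three 1 kb bins.
peak = (1700, 3600)
bins = [(1000, 2000), (2000, 3000), (3000, 4000)]

# The first bin shares only 300 bp with the peak, which is under 50% of
# both the bin and the peak, so it is dropped; the other two are kept.
kept = [b for b in bins if passes_overlap_rule(peak, b)]
print(kept)
```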
[3]:
tss_region_pairs_cos
[3]:
| | chrom | gene_id | gene_name | tss | tss_build_region_index | distal_region_start | distal_region_end | distal_region_build_region_index | dist | dist_bin | cos_sim |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144546000 | 144547000 | 99004 | 133424 | 79 | 0.966797 |
| 1 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144551000 | 144552000 | 99008 | 138424 | 83 | 0.910645 |
| 2 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144560000 | 144561000 | 99015 | 147424 | 90 | 0.794922 |
| 3 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144461000 | 144462000 | 98949 | 48424 | 24 | 0.699219 |
| 4 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144419000 | 144420000 | 98930 | 6424 | 5 | 0.688477 |
| 5 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144524000 | 144525000 | 98985 | 111424 | 60 | 0.651367 |
| 6 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144567000 | 144568000 | 99021 | 154424 | 96 | 0.431885 |
| 7 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144490000 | 144491000 | 98966 | 77424 | 41 | 0.307373 |
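A minimal sketch of how the returned table might be explored with pandas. It rebuilds three rows from the output above, checks how `dist` and `dist_bin` relate to the other columns in those rows, and keeps pairs above a similarity cutoff. The cutoff of 0.6 is an arbitrary assumption for illustration, not a recommended threshold, and the relationships are only verified for the rows shown here.

```python
import pandas as pd

# Three rows copied from the table above (subset of columns).
pairs = pd.DataFrame({
    "tss": [144412576, 144412576, 144412576],
    "tss_build_region_index": [98925, 98925, 98925],
    "distal_region_start": [144546000, 144419000, 144490000],
    "distal_region_build_region_index": [99004, 98930, 98966],
    "dist": [133424, 6424, 77424],
    "dist_bin": [79, 5, 41],
    "cos_sim": [0.966797, 0.688477, 0.307373],
})

# In these rows, dist is the distal region start minus the TSS position,
# and dist_bin is the difference between the two bin indices.
dist_check = pairs["distal_region_start"] - pairs["tss"]
bin_check = (
    pairs["distal_region_build_region_index"] - pairs["tss_build_region_index"]
)

# Keep candidate interactions above an arbitrary, illustrative cutoff.
strong = pairs[pairs["cos_sim"] >= 0.6].sort_values("cos_sim", ascending=False)
print(strong[["distal_region_start", "dist", "cos_sim"]])
```

The same filtering can be applied to the full table after loading `tss_region_pairs_cos.tsv` from the output directory with `pd.read_csv(..., sep="\t")`.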
Infer enhancer–promoter interactions (cell-type-specific fine-tuned model)
[4]:
import glob
ft_ckpt_dir = "./output_cell_specific_emb_train/train/**/*.ckpt" # Path pattern for fine-tuned model checkpoints from embed_region.ipynb
ft_ckpt = glob.glob(ft_ckpt_dir, recursive=True)[0]
ft_ckpt
[4]:
'./output_cell_specific_emb_train/train_old/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt'
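Taking the first element of `glob.glob(...)` returns whichever match happens to come first and raises an `IndexError` when nothing matches. The sketch below is one possible alternative that picks the most recently modified checkpoint and fails with a clearer message. The helper `pick_latest_checkpoint` is hypothetical, and the temporary files only simulate a training directory.

```python
import glob
import os
import tempfile


def pick_latest_checkpoint(pattern):
    """Return the most recently modified file matching a recursive glob pattern."""
    candidates = glob.glob(pattern, recursive=True)
    if not candidates:
        raise FileNotFoundError(f"No checkpoint matches {pattern!r}")
    return max(candidates, key=os.path.getmtime)


# Simulate a training directory with two placeholder checkpoint files.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, mtime in [("a.ckpt", 1_000_000_000), ("b.ckpt", 2_000_000_000)]:
        path = os.path.join(tmp_dir, "run", name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as handle:
            handle.write("placeholder")
        os.utime(path, (mtime, mtime))  # set explicit modification times

    pattern = os.path.join(glob.escape(tmp_dir), "**", "*.ckpt")
    latest_name = os.path.basename(pick_latest_checkpoint(pattern))

print(latest_name)
```

Whether the newest checkpoint is the right one depends on the training setup; if several runs share the directory, selecting by run name or validation metric may be more appropriate.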
[5]:
# Returns:
# tss_region_pairs_cos: table of TSS–distal region pairs with the cosine similarity (cos_sim) of their embeddings; a higher cos_sim suggests a stronger enhancer–promoter interaction.
tss_region_pairs_cos = interpret_region_region_interactions(
    region="../data/myoblast_ENCFF647RNC_peak.bed",  # BED file of candidate enhancer regions
    odir="./output_infer_ep_myoblast_specific",  # output directory
    genome="hg38",  # Options: "hg38", "mm10"
    resolution="1kb",  # Options: "1kb", "2kb", "4kb", "200bp"
    ft_ckpt=ft_ckpt,  # fine-tuned checkpoint selected in the previous cell
    batch_size=64,
    filter_gene_name="RNVU1-15",  # Focus on the RNVU1-15 promoter; if omitted, all genes are considered.
)
Region summary - total: 373422, overlapping with ChromBERT: 368260 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 7920
Gene filter: kept 1/55240 TSS rows (gene_name in [1 names], gene_id in [0 ids])
Gene filter: kept 33479/368260 region1 (BED) rows on 1 chromosome(s) matching the selected gene(s)
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from ./output_cell_specific_emb_train/train_old/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=176.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
100%|██████████| 461/461 [07:37<00:00, 1.01it/s]
Finished!
Enhancer-promoter style pairs saved to: ./output_infer_ep_myoblast_specific/tss_region_pairs_cos.tsv
[6]:
tss_region_pairs_cos
[6]:
| | chrom | gene_id | gene_name | tss | tss_build_region_index | distal_region_start | distal_region_end | distal_region_build_region_index | dist | dist_bin | cos_sim |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144546000 | 144547000 | 99004 | 133424 | 79 | 0.974380 |
| 1 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144546000 | 144547000 | 99004 | 133424 | 79 | 0.974380 |
| 2 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144552000 | 144553000 | 99009 | 139424 | 84 | 0.949564 |
| 3 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144551000 | 144552000 | 99008 | 138424 | 83 | 0.934053 |
| 4 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144560000 | 144561000 | 99015 | 147424 | 90 | 0.799996 |
| 5 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144524000 | 144525000 | 98985 | 111424 | 60 | 0.676938 |
| 6 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144413000 | 144414000 | 98926 | 424 | 1 | 0.670687 |
| 7 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144550000 | 144551000 | 99007 | 137424 | 82 | 0.632238 |
| 8 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144419000 | 144420000 | 98930 | 6424 | 5 | 0.625515 |
| 9 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144419000 | 144420000 | 98930 | 6424 | 5 | 0.625515 |
| 10 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144547000 | 144548000 | 99005 | 134424 | 80 | 0.613987 |
| 11 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144523000 | 144524000 | 98984 | 110424 | 59 | 0.595567 |
| 12 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144461000 | 144462000 | 98949 | 48424 | 24 | 0.580753 |
| 13 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144461000 | 144462000 | 98949 | 48424 | 24 | 0.580753 |
| 14 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144545000 | 144546000 | 99003 | 132424 | 78 | 0.529481 |
| 15 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144519000 | 144520000 | 98980 | 106424 | 55 | 0.287048 |
| 16 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144518000 | 144519000 | 98979 | 105424 | 54 | 0.285937 |
| 17 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144518000 | 144519000 | 98979 | 105424 | 54 | 0.285937 |
| 18 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144518000 | 144519000 | 98979 | 105424 | 54 | 0.285937 |
| 19 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144518000 | 144519000 | 98979 | 105424 | 54 | 0.285937 |
| 20 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144494000 | 144495000 | 98968 | 81424 | 43 | 0.273922 |
| 21 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144522000 | 144523000 | 98983 | 109424 | 58 | 0.258144 |
| 22 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144520000 | 144521000 | 98981 | 107424 | 56 | 0.255417 |
| 23 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144490000 | 144491000 | 98966 | 77424 | 41 | 0.246384 |
| 24 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144521000 | 144522000 | 98982 | 108424 | 57 | 0.245028 |
| 25 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144489000 | 144490000 | 98965 | 76424 | 40 | 0.245018 |
| 26 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144517000 | 144518000 | 98978 | 104424 | 53 | 0.239847 |
| 27 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144470000 | 144471000 | 98953 | 57424 | 28 | 0.198687 |
| 28 | chr1 | ENSG00000207205 | RNVU1-15 | 144412576 | 98925 | 144502000 | 144503000 | 98971 | 89424 | 46 | 0.189062 |
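The table above contains repeated rows (for example, rows 0 and 1 are identical). One plausible cause, stated here as an assumption rather than confirmed behavior, is that several input peaks map to the same ChromBERT bin. A minimal sketch of collapsing such rows with pandas, using four rows copied from the output:

```python
import pandas as pd

# Four rows copied from the table above (subset of columns);
# the first two are exact duplicates.
pairs = pd.DataFrame({
    "distal_region_start": [144546000, 144546000, 144552000, 144551000],
    "distal_region_end": [144547000, 144547000, 144553000, 144552000],
    "cos_sim": [0.974380, 0.974380, 0.949564, 0.934053],
})

# Drop rows that are identical in every column and renumber the index.
unique_pairs = pairs.drop_duplicates().reset_index(drop=True)
print(unique_pairs)
```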
Infer region–region interactions (two region groups)
[2]:
from chrombert_tools import resolve_paths
chrombert_anno_files = resolve_paths(genome="hg38", resolution="1kb", chrombert_cache_dir="~/.cache/chrombert/data")
[3]:
chrombert_anno_files
[3]:
{'chrombert_cache_dir': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data',
'genome': 'hg38',
'resolution': '1kb',
'chrombert_region_file': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_1kb_region.bed',
'chrombert_regulator_file': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_regulators_list.txt',
'chrombert_factor_file': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_factors_list.txt',
'hdf5_file': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/hg38_6k_1kb.hdf5',
'pretrain_ckpt': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt',
'mtx_mask': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_mask_matrix.tsv',
'region_emb_npy': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/anno/hg38_1kb_region_emb.npy',
'gene_meta_tsv': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/anno/hg38_1kb_gene_meta.tsv',
'base_ca_signal': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/anno/hg38_1kb_accessibility_signal_mean.npy',
'meta_file': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_meta.json',
'prompt_ckpt': '/mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_prompt_cistrome.ckpt'}
[4]:
import pandas as pd
gene_meta = pd.read_csv(chrombert_anno_files["gene_meta_tsv"], sep="\t")
gene_meta.head()
[4]:
| | chrom | loc1 | loc2 | strand | tss | gene_id | gene_name | gene_biotype | start | end | build_region_index |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | 182696 | 184174 | + | 182696 | ENSG00000279928 | DDX11L17 | unprocessed_pseudogene | 182000 | 183000 | 40 |
| 1 | chr1 | 2581560 | 2584533 | + | 2581560 | ENSG00000228037 | NaN | lncRNA | 2581000 | 2582000 | 1762 |
| 2 | chr1 | 3069168 | 3438621 | + | 3069168 | ENSG00000142611 | PRDM16 | protein_coding | 3069000 | 3070000 | 2125 |
| 3 | chr1 | 5301928 | 5307394 | - | 5307394 | ENSG00000284616 | NaN | lncRNA | 5307000 | 5308000 | 3910 |
| 4 | chr1 | 2403964 | 2413797 | - | 2413797 | ENSG00000157911 | PEX10 | protein_coding | 2413000 | 2414000 | 1600 |
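A short sketch of two observations about the gene metadata, using the five rows shown above. In these rows, `start` and `end` describe the 1 kb bin that contains the TSS, and some genes have no `gene_name`, so they can only be identified by `gene_id`. The "Gene filter" log lines earlier in this notebook mention both `gene_name` and `gene_id`, which suggests that filtering by gene ID is also supported; the parameter for that is not shown here, so check the command documentation.

```python
import pandas as pd

# The five rows displayed by gene_meta.head() above (subset of columns).
gene_meta_head = pd.DataFrame({
    "tss": [182696, 2581560, 3069168, 5307394, 2413797],
    "gene_id": [
        "ENSG00000279928", "ENSG00000228037", "ENSG00000142611",
        "ENSG00000284616", "ENSG00000157911",
    ],
    "gene_name": ["DDX11L17", None, "PRDM16", None, "PEX10"],
    "start": [182000, 2581000, 3069000, 5307000, 2413000],
    "end": [183000, 2582000, 3070000, 5308000, 2414000],
})

# At 1 kb resolution, the bin start is the TSS rounded down to a multiple of 1000.
bin_size = 1000
expected_start = gene_meta_head["tss"] // bin_size * bin_size
bins_match = bool((expected_start == gene_meta_head["start"]).all())

# Genes without a name can only be referred to by their Ensembl gene ID.
unnamed_ids = gene_meta_head.loc[
    gene_meta_head["gene_name"].isna(), "gene_id"
].tolist()
print(bins_match, unnamed_ids)
```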
[ ]:
# Subset the ATAC peaks to chr1 and save them as a separate BED file.
all_chr_data = pd.read_csv("../data/hESC_GSM2386582_ATAC.bed", sep="\t", header=None)
all_chr_data[all_chr_data[0] == "chr1"].to_csv("../data/hESC_GSM2386582_ATAC_chr1.bed", sep="\t", header=False, index=False)
[ ]:
# Returns:
# region_region_pairs_cos: table of region pairs (one region from each group) with the cosine similarity (cos_sim) of their embeddings; a higher cos_sim suggests a stronger interaction between region group 1 and region group 2.
region_region_pairs_cos = interpret_region_region_interactions(
    region="../data/hESC_GSM2386582_ATAC_chr1.bed",  # region group 1
    region2=chrombert_anno_files["gene_meta_tsv"],  # region group 2
    odir="./output_region_region_interactions",  # output directory
    genome="hg38",  # Options: "hg38", "mm10"
    resolution="1kb",  # Options: "1kb", "2kb", "4kb", "200bp"
)
Region summary - total: 5262, overlapping with ChromBERT: 5490 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 33
Region summary - total: 55240, overlapping with ChromBERT: 55240 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Finished!
Set1 x set2 region-pair cosines (same chrom, 0 <= genomic_dist_bp <= 250000) saved to: ./output_region_region_interactions/region_set_pairs_cos.tsv
[7]:
region_region_pairs_cos
[7]:
| | set1_chrom | set1_start | set1_end | set1_build_region_index | set2_chrom | set2_start | set2_end | set2_build_region_index | genomic_dist_bp | cos_sim |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | 10000 | 11000 | 0 | chr1 | 17000 | 18000 | 2 | 6000 | 0.609604 |
| 1 | chr1 | 10000 | 11000 | 0 | chr1 | 29000 | 30000 | 3 | 18000 | 0.599131 |
| 2 | chr1 | 10000 | 11000 | 0 | chr1 | 29000 | 30000 | 3 | 18000 | 0.599131 |
| 3 | chr1 | 10000 | 11000 | 0 | chr1 | 30000 | 31000 | 4 | 19000 | 0.604889 |
| 4 | chr1 | 10000 | 11000 | 0 | chr1 | 91000 | 92000 | 15 | 80000 | 0.667273 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 86022 | chr1 | 248946000 | 248947000 | 183982 | chr1 | 248838000 | 248839000 | 183892 | 107000 | 0.168999 |
| 86023 | chr1 | 248946000 | 248947000 | 183982 | chr1 | 248859000 | 248860000 | 183909 | 86000 | 0.176858 |
| 86024 | chr1 | 248946000 | 248947000 | 183982 | chr1 | 248859000 | 248860000 | 183909 | 86000 | 0.176858 |
| 86025 | chr1 | 248946000 | 248947000 | 183982 | chr1 | 248906000 | 248907000 | 183953 | 39000 | 0.206660 |
| 86026 | chr1 | 248946000 | 248947000 | 183982 | chr1 | 248912000 | 248913000 | 183959 | 33000 | 0.489720 |
86027 rows × 10 columns
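A minimal sketch of summarizing the pair table, using five rows copied from the output above. It checks that, in these rows, `genomic_dist_bp` matches the gap between the two intervals, and then selects the highest-scoring group-2 partner for each group-1 region. Choosing a single best partner per region is an illustrative assumption, not a step the tool performs.

```python
import pandas as pd

# Five rows copied from the table above (subset of columns).
pairs = pd.DataFrame({
    "set1_start": [10000, 10000, 10000, 248946000, 248946000],
    "set1_end": [11000, 11000, 11000, 248947000, 248947000],
    "set2_start": [17000, 30000, 91000, 248906000, 248912000],
    "set2_end": [18000, 31000, 92000, 248907000, 248913000],
    "genomic_dist_bp": [6000, 19000, 80000, 39000, 33000],
    "cos_sim": [0.609604, 0.604889, 0.667273, 0.206660, 0.489720],
})

# Gap between two intervals: distance from the end of the upstream interval
# to the start of the downstream one, or 0 if they touch or overlap.
gap = pd.concat(
    [
        pairs["set2_start"] - pairs["set1_end"],
        pairs["set1_start"] - pairs["set2_end"],
    ],
    axis=1,
).max(axis=1).clip(lower=0)

# For each group-1 region, keep the partner with the highest cosine similarity.
best_idx = pairs.groupby(["set1_start", "set1_end"])["cos_sim"].idxmax()
best = pairs.loc[best_idx]
print(best[["set1_start", "set2_start", "genomic_dist_bp", "cos_sim"]])
```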