Interpret region-region interactions

Note: The remaining examples show command-line usage only (bash).

interpret_region_region_interactions subcommand: uses the pre-trained ChromBERT model or a fine-tuned ChromBERT model to infer region-region interactions for user-specified enhancer regions.

For the Python API, see `examples/api/interpret_region_region_interactions.ipynb <../api/interpret_region_region_interactions.ipynb>`__.

If you need to use an Apptainer container, see the `apptainer_use.ipynb <apptainer_use.ipynb>`__ tutorial for detailed instructions on running chrombert-tools with apptainer exec.

For more details, see the `interpret_region_region_interactions <https://chrombert-tools.readthedocs.io/en/latest/commands/interpret_region_region_interactions.html>`__ command documentation.

Infer region-region interactions (enhancer-promoter loops; pre-trained ChromBERT only)

[ ]:
%%bash
# --region: BED file of enhancer regions of interest
# --odir: output directory
# --genome: genome assembly (here hg38)
# --resolution: ChromBERT bin resolution (here 1kb)
# --gene: restrict to this gene's promoter; if omitted, all genes are considered
chrombert-tools interpret_region_region_interactions \
    --region '../data/hESC_GSM2386582_ATAC.bed' \
    --odir "./output_infer_ep" \
    --genome "hg38" \
    --resolution "1kb" \
    --gene "RNVU1-15"

Region summary - total: 5262, overlapping with ChromBERT: 5490 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 33
  Gene filter: kept 1/55240 TSS rows (gene_name in [1 names], gene_id in [0 ids])
Finished!
Enhancer-promoter style pairs saved to: ./output_infer_ep/tss_region_pairs_cos.tsv
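The region summary above says that an overlap is kept when it covers at least 50% of either the ChromBERT bin or the input region. A minimal sketch of that rule, assuming half-open BED-style (start, end) intervals; the function name overlap_fraction_ok and the example coordinates are hypothetical, and this is not taken from the chrombert-tools source:

```python
def overlap_fraction_ok(region, bin_, min_frac=0.5):
    """Return True if the overlap covers >= min_frac of either interval.

    Both arguments are (start, end) tuples in half-open coordinates.
    """
    r_start, r_end = region
    b_start, b_end = bin_
    overlap = min(r_end, b_end) - max(r_start, b_start)
    if overlap <= 0:
        return False
    return (overlap / (r_end - r_start) >= min_frac
            or overlap / (b_end - b_start) >= min_frac)


# A 300 bp peak fully inside a 1 kb bin covers 100% of the peak: kept.
print(overlap_fraction_ok((144546200, 144546500), (144546000, 144547000)))

# A 1.2 kb peak clipping 100 bp of a bin covers 10% of the bin and
# about 8% of the peak: dropped.
print(overlap_fraction_ok((144546900, 144548100), (144546000, 144547000)))

# A 1.5 kb peak can pass the rule for two adjacent bins.
bins = [(144546000, 144547000), (144547000, 144548000)]
long_region = (144546500, 144548000)
matched = [b for b in bins if overlap_fraction_ok(long_region, b)]
print(matched)
```

Because one input region can pass the rule for more than one bin, the number of overlapping rows can exceed the number of input regions, as in the summary above (5490 overlaps for 5262 regions).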
[5]:
# infer enhancer-promoter loop
# cos_sim: cosine similarity between the enhancer region embedding and the gene promoter (TSS) region embedding; higher values indicate a more likely enhancer–promoter loop.
import pandas as pd
tss_region_pairs_cos = pd.read_csv("output_infer_ep/tss_region_pairs_cos.tsv",sep='\t')
tss_region_pairs_cos

[5]:
    chrom          gene_id  gene_name        tss  tss_build_region_index  distal_region_start  distal_region_end  distal_region_build_region_index    dist  dist_bin   cos_sim
 0   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144546000          144547000                             99004  133424        79  0.966797
 1   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144551000          144552000                             99008  138424        83  0.910645
 2   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144560000          144561000                             99015  147424        90  0.794922
 3   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144461000          144462000                             98949   48424        24  0.699219
 4   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144419000          144420000                             98930    6424         5  0.688477
 5   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144524000          144525000                             98985  111424        60  0.651367
 6   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144567000          144568000                             99021  154424        96  0.431885
 7   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144490000          144491000                             98966   77424        41  0.307373
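The cos_sim column is described above as the cosine similarity between two region embeddings. A small sketch of that quantity with NumPy, using made-up three-dimensional vectors in place of real ChromBERT embeddings (the vectors and the helper name cosine_similarity are illustrative assumptions). The last lines check two relationships that hold in the rows shown: dist equals distal_region_start minus tss, and dist_bin equals the difference of the two build region indices.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors standing in for region embeddings (hypothetical values).
tss_emb = np.array([1.0, 2.0, 3.0])
distal_emb = np.array([2.0, 4.0, 6.0])   # same direction as tss_emb
other_emb = np.array([3.0, 0.0, -1.0])   # orthogonal to tss_emb

print(cosine_similarity(tss_emb, distal_emb))  # close to 1
print(cosine_similarity(tss_emb, other_emb))   # close to 0

# First row of the table above.
row = {
    "tss": 144412576,
    "tss_build_region_index": 98925,
    "distal_region_start": 144546000,
    "distal_region_build_region_index": 99004,
    "dist": 133424,
    "dist_bin": 79,
}
dist_ok = row["distal_region_start"] - row["tss"] == row["dist"]
bin_ok = (row["distal_region_build_region_index"]
          - row["tss_build_region_index"]) == row["dist_bin"]
print(dist_ok, bin_ok)
```

Cosine similarity depends only on the direction of the vectors, not their length, so it compares embedding patterns rather than magnitudes.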

Infer enhancer-promoter interactions (cell-type-specific fine-tuned model)

[ ]:
# Download example data
# Myoblast ATAC-seq peak file (ENCODE ENCFF647RNC)
import os
import subprocess
if not os.path.exists('../data/myoblast_ENCFF647RNC_peak.bed'):
    cmd = 'wget https://www.encodeproject.org/files/ENCFF647RNC/@@download/ENCFF647RNC.bed.gz -O ../data/myoblast_ENCFF647RNC_peak.bed.gz'
    subprocess.run(cmd, shell=True, check=True)  # raise if the download fails
    cmd = 'gzip -d ../data/myoblast_ENCFF647RNC_peak.bed.gz'
    subprocess.run(cmd, shell=True, check=True)

[ ]:
# Optional: fine-tune a cell-type-specific model (uncomment to run).
# The myoblast peak file is downloaded in the previous cell; fine-tuning also needs the ATAC-seq bigWig signal.
# import os
# import subprocess
# if not os.path.exists('../data/myoblast_ENCFF149ERN_signal.bigwig'):
#     cmd = 'wget https://www.encodeproject.org/files/ENCFF149ERN/@@download/ENCFF149ERN.bigWig -O ../data/myoblast_ENCFF149ERN_signal.bigwig'
#     subprocess.run(cmd, shell=True, check=True)

# Fine-tune a cell-type-specific model
# --odir: output directory
# --acc_signal1: cell-type-specific accessibility signal (bigWig)
# --acc_peak1: cell-type-specific accessibility peaks (BED)
# --genome: genome assembly (here hg38)
# --resolution: ChromBERT bin resolution (here 1kb)
# !chrombert-tools region_activity_regression \
# --odir "./output_cell_specific_emb_train" \
# --acc_signal1 "../data/myoblast_ENCFF149ERN_signal.bigwig" \
# --acc_peak1 "../data/myoblast_ENCFF647RNC_peak.bed" \
# --genome "hg38" \
# --resolution "1kb"

[7]:
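import glob
# Use checkpoints from embed_region.ipynb if available; otherwise, run the fine-tuning code above first.
ft_ckpt_pattern = "./output_cell_specific_emb_train/train/**/*.ckpt"

# Sort so the choice is reproducible, and fail early if nothing matches.
ft_ckpts = sorted(glob.glob(ft_ckpt_pattern, recursive=True))
assert ft_ckpts, f"No checkpoint found matching {ft_ckpt_pattern}"
ft_ckpt = ft_ckpts[0]
ft_ckpt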
[7]:
'./output_cell_specific_emb_train/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=1-step=126.ckpt'
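The checkpoint name above follows the epoch=...-step=....ckpt pattern. If a training run leaves several checkpoints, one option is to pick the one with the highest (epoch, step) instead of whichever the glob returns first. A sketch under that assumption; the helper checkpoint_step and the first path in the list are hypothetical, and whether the latest checkpoint is the best one depends on how the run saved them:

```python
import re


def checkpoint_step(path):
    """Return (epoch, step) parsed from a checkpoint file name, or (-1, -1)."""
    m = re.search(r"epoch=(\d+)-step=(\d+)\.ckpt$", path)
    return (int(m.group(1)), int(m.group(2))) if m else (-1, -1)


paths = [
    "./train/try_00_seed_55/checkpoints/epoch=0-step=63.ckpt",   # hypothetical
    "./train/try_00_seed_55/checkpoints/epoch=1-step=126.ckpt",
]

# max() with a key compares the parsed (epoch, step) tuples.
latest = max(paths, key=checkpoint_step)
print(latest)
```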
[11]:
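# --region: BED file of enhancer regions of interest
# --odir: output directory
# --genome: genome assembly (here hg38)
# --resolution: ChromBERT bin resolution (here 1kb)
# --ft-ckpt: fine-tuned model checkpoint
# --batch-size: batch size
# --gene: restrict to this gene's promoter; if omitted, all genes are considered
# Use %env rather than "!export": each "!" line runs in its own subshell, so an exported variable would not reach the next command.
%env CUDA_VISIBLE_DEVICES=1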
!chrombert-tools interpret_region_region_interactions \
    --region '../data/myoblast_ENCFF647RNC_peak.bed' \
    --odir "./output_infer_ep_myoblast_specific" \
    --genome "hg38" \
    --resolution "1kb" \
    --ft-ckpt {ft_ckpt} \
    --batch-size 64 \
    --gene "RNVU1-15"

Region summary - total: 373422, overlapping with ChromBERT: 368260 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 7920
  Gene filter: kept 1/55240 TSS rows (gene_name in [1 names], gene_id in [0 ids])
  Gene filter: kept 33479/368260 region1 (BED) rows on 1 chromosome(s) matching the selected gene(s)
Your supervised_file does not contain the 'label' column. Please verify whether ground truth column ('label') is required. If it is not needed, you may disregard this message.
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from ./output_cell_specific_emb_train/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=1-step=126.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
100%|█████████████████████████████████████████| 461/461 [07:41<00:00,  1.00s/it]
Finished!
Enhancer-promoter style pairs saved to: ./output_infer_ep_myoblast_specific/tss_region_pairs_cos.tsv
[12]:
# infer enhancer-promoter loop
# cos_sim: cosine similarity between the enhancer region embedding and the gene promoter (TSS) region embedding; higher values indicate a more likely enhancer–promoter loop.
tss_region_pairs_cos_myoblast = pd.read_csv("output_infer_ep_myoblast_specific/tss_region_pairs_cos.tsv",sep='\t')
tss_region_pairs_cos_myoblast

[12]:
    chrom          gene_id  gene_name        tss  tss_build_region_index  distal_region_start  distal_region_end  distal_region_build_region_index    dist  dist_bin   cos_sim
 0   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144546000          144547000                             99004  133424        79  0.969920
 1   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144546000          144547000                             99004  133424        79  0.969920
 2   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144552000          144553000                             99009  139424        84  0.936230
 3   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144551000          144552000                             99008  138424        83  0.921858
 4   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144560000          144561000                             99015  147424        90  0.744101
 5   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144413000          144414000                             98926     424         1  0.613024
 6   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144461000          144462000                             98949   48424        24  0.592507
 7   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144461000          144462000                             98949   48424        24  0.592507
 8   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144524000          144525000                             98985  111424        60  0.569692
 9   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144419000          144420000                             98930    6424         5  0.567447
10   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144419000          144420000                             98930    6424         5  0.567447
11   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144547000          144548000                             99005  134424        80  0.550278
12   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144523000          144524000                             98984  110424        59  0.495028
13   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144550000          144551000                             99007  137424        82  0.486923
14   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144545000          144546000                             99003  132424        78  0.414352
15   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144519000          144520000                             98980  106424        55  0.222717
16   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144522000          144523000                             98983  109424        58  0.199014
17   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144518000          144519000                             98979  105424        54  0.195656
18   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144518000          144519000                             98979  105424        54  0.195656
19   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144518000          144519000                             98979  105424        54  0.195656
20   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144518000          144519000                             98979  105424        54  0.195656
21   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144489000          144490000                             98965   76424        40  0.194427
22   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144490000          144491000                             98966   77424        41  0.192645
23   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144494000          144495000                             98968   81424        43  0.182398
24   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144521000          144522000                             98982  108424        57  0.179958
25   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144520000          144521000                             98981  107424        56  0.174450
26   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144517000          144518000                             98978  104424        53  0.173654
27   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144470000          144471000                             98953   57424        28  0.126555
28   chr1  ENSG00000207205   RNVU1-15  144412576                   98925            144502000          144503000                             98971   89424        46  0.114761
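The fine-tuned output contains repeated rows (for example rows 0 and 1), and several distal bins appear in both the pre-trained and the fine-tuned tables. One way to compare the two runs is to collapse repeats and join on the distal bin. The sketch below uses a few rows copied from the two tables above instead of reading the TSV files; the reason for the repeats is not stated in the output, so treating them as redundant is an assumption.

```python
import pandas as pd

key = ["gene_name", "distal_region_start", "distal_region_end"]

# A few rows copied from the pre-trained table.
pretrained = pd.DataFrame({
    "gene_name": ["RNVU1-15"] * 3,
    "distal_region_start": [144546000, 144551000, 144560000],
    "distal_region_end": [144547000, 144552000, 144561000],
    "cos_sim": [0.966797, 0.910645, 0.794922],
})

# A few rows copied from the myoblast fine-tuned table (first bin repeated).
myoblast = pd.DataFrame({
    "gene_name": ["RNVU1-15"] * 4,
    "distal_region_start": [144546000, 144546000, 144551000, 144560000],
    "distal_region_end": [144547000, 144547000, 144552000, 144561000],
    "cos_sim": [0.969920, 0.969920, 0.921858, 0.744101],
})

# Assumption: repeated rows carry no extra information, so keep one per bin.
myoblast = myoblast.drop_duplicates(subset=key)

# Inner join keeps only bins scored in both runs.
merged = pretrained.merge(myoblast, on=key,
                          suffixes=("_pretrained", "_myoblast"))
merged["delta"] = merged["cos_sim_myoblast"] - merged["cos_sim_pretrained"]
print(merged.sort_values("delta", ascending=False))
```

A positive delta means the fine-tuned model scores the pair higher than the pre-trained model. The two runs here also use different input peak files, so bins present in only one table are dropped by the inner join.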