Build a model to classify region functions

This notebook shows how to use the ChromBERT-tools Python API region_function_classification to build a model for classifying region functions.

For the bash command-line usage, see `examples/api/region_function_classification.ipynb <../api/region_function_classification.ipynb>`__.

For more details, please refer to the `region_function_classification <https://chrombert-tools.readthedocs.io/en/latest/commands/region_function_classification.html>`__ command documentation

1. build a model to classify the function different regions

[1]:
from chrombert_tools import region_function_classification
[ ]:
### build a model to classify the function different regions
    # group1: ezh2 and h3k27me3 co-binding region
    # group2: ezh2 binding region without h3k27me3

# takes approximately 20-40 minutes to run
results = region_function_classification(
    function_beds=["../data/hESC_GSM1003524_EZH2.bed;../data/hESC_GSM1498900_H3K27me3.bed","../data/hESC_GSM1003524_EZH2.bed"], # your focus region groups
    function_names=["ezh2_h3k27me3","ezh2"], # the names of the region groups
    odir="./output_region_function_classification", # output directory
    ignore_regulator="H3K27me3;H3K27me3/H3K4me3", # the regulators to ignore, because these regulator are already to identify this regions, so we want to ignore them and to find the other important regulators
    genome="hg38", # genome
    resolution="1kb" # resolution
)
Note: All regulator names were converted to lowercase for matching.
Regulator count summary - requested: 2, matched in ChromBERT: 2, not found: 0, not found regulator: []
ChromBERT regulators: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_regulators_list.txt
Stage 1: Preparing dataset
[ezh2_h3k27me3 | hESC_GSM1003524_EZH2.bed] Region summary - total: 11896, overlapping with ChromBERT: 12101 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 325
[ezh2_h3k27me3 | hESC_GSM1498900_H3K27me3.bed] Region summary - total: 8875, overlapping with ChromBERT: 14135 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 2878
  ezh2_h3k27me3 (class 0): 5736 regions
[ezh2] Region summary - total: 11896, overlapping with ChromBERT: 12101 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 325
  ezh2 (class 1): 11008 regions
  ezh2: removed 5736 overlapping regions
  ezh2_h3k27me3: 5736 (final)
  ezh2: 5272 (final)
  Total: 11008
  Fast mode: ~10000 regions per class
Finished stage 1
Stage 2: Fine-tuning (binary classification)

[Attempt 0/2] seed=55
/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/chrombert_tools/cli/utils_classfication.py:149: FutureWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
  combined.groupby("label", group_keys=False)
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Trainer will use only 1 of 4 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=4)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA A100-PCIE-40GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Loading `train_dataloader` to estimate number of stepping batches.
/mnt/Storage/home/chenqianqian/miniconda3/envs/chrombert/lib/python3.9/site-packages/lightning/pytorch/utilities/model_summary/model_summary.py:231: Precision bf16-mixed is not supported by the model summary.  Estimated model size in MB will not be accurate. Using 32 bits instead.

  | Name  | Type             | Params | Mode
---------------------------------------------------
0 | model | ChromBERTGeneral | 62.8 M | train
---------------------------------------------------
18.9 M    Trainable params
43.9 M    Non-trainable params
62.8 M    Total params
251.079   Total estimated model params size (MB)
153       Modules in train mode
0         Modules in eval mode
/mnt/Storage/home/chenqianqian/miniconda3/envs/chrombert/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:484: Your `val_dataloader`'s sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test dataloaders.
Metric default_validation/auprc improved. New best score: 0.762
Metric default_validation/auprc improved by 0.059 >= min_delta = 0.01. New best score: 0.821
Metric default_validation/auprc improved by 0.023 >= min_delta = 0.01. New best score: 0.844
Metric default_validation/auprc improved by 0.021 >= min_delta = 0.01. New best score: 0.865
Metric default_validation/auprc improved by 0.030 >= min_delta = 0.01. New best score: 0.895
Metric default_validation/auprc improved by 0.011 >= min_delta = 0.01. New best score: 0.906
Monitored metric default_validation/auprc did not improve in the last 5 records. Best score: 0.906. Signaling Trainer to stop.
Evaluating the finetuned model performance
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_function_classification/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=90.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
ft_ckpt: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_function_classification/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=90.ckpt, test_metrics: {'auroc': 0.88494473695755, 'auprc': 0.8885455131530762, 'mcc': 0.5954238772392273, 'f1': 0.7598719596862793, 'precision': 0.8496419787406921, 'recall': 0.6872586607933044}
Attempt metrics: auprc=0.8885455131530762
Accepted run (auprc=0.8885 >= 0.2).

Finished stage 2: obtained a fine-tuned ChromBERT
Best auprc=0.8885455131530762, metrics={'auroc': 0.88494473695755, 'auprc': 0.8885455131530762, 'mcc': 0.5954238772392273, 'f1': 0.7598719596862793, 'precision': 0.8496419787406921, 'recall': 0.6872586607933044, 'ft_ckpt': '/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_function_classification/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=90.ckpt'}
Finished stage 2
Stage 3: Predicting
Region summary - total: 1101, overlapping with ChromBERT: 1101 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
  Predict input: ./output_region_function_classification/predict/model_input.tsv
Predicting: 100%|██████████| 276/276 [00:54<00:00,  5.05it/s]
  Predictions saved: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_function_classification/predict/predictions.csv  (1101 regions)
Finished stage 3

============================================================
All stages completed!
============================================================
Fine-tuned model saved in: ./output_region_function_classification/train/try_00_seed_55
Classes (2): ezh2_h3k27me3, ezh2
Predictions: ./output_region_function_classification/predict/predictions.csv
Fine-tuned checkpoint: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_function_classification/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=90.ckpt
Predictions: ./output_region_function_classification/predict/predictions.csv

[ ]:
results.model_ckpt # fine-tuned model checkpoint

'/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_function_classification/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=90.ckpt'
[ ]:
results.predictions_df ### test prediction, these predicted_label by threshold 0.5, you can change the threshold to get the different results
chrom start end build_region_index true_label prob_ezh2_h3k27me3 predicted_label predicted_name
0 chr1 960000 961000 255 0 0.589054 1 ezh2
1 chr1 1274000 1275000 553 1 0.179563 0 ezh2_h3k27me3
2 chr1 3072000 3073000 2128 0 0.193262 0 ezh2_h3k27me3
3 chr1 3392000 3393000 2425 0 0.566183 1 ezh2
4 chr1 3393000 3394000 2426 0 0.194490 0 ezh2_h3k27me3
... ... ... ... ... ... ... ... ...
1096 chrX 118976000 118977000 2116903 0 0.231761 0 ezh2_h3k27me3
1097 chrX 119999000 120000000 2117687 0 0.514951 1 ezh2
1098 chrX 131073000 131074000 2123007 1 0.426829 0 ezh2_h3k27me3
1099 chrX 134545000 134546000 2125186 1 0.592574 1 ezh2
1100 chrX 150693000 150694000 2132345 1 0.970415 1 ezh2

1101 rows × 8 columns

[ ]:
# calculate the AUC and AUPRC
from sklearn.metrics import roc_auc_score, average_precision_score

roc_auc_score(results.predictions_df.true_label,results.predictions_df.prob_ezh2_h3k27me3),average_precision_score(results.predictions_df.true_label,results.predictions_df.prob_ezh2_h3k27me3)

(0.8849447340013379, 0.8885454977484899)
[ ]:
results.model_config # model configuration
'./output_region_function_classification/model_config.json'
[ ]:
results.data_config # data configuration
'./output_region_function_classification/dataset_config.json'
[19]:
with open(results.model_config,"r") as f:
    print(f.read())
{"genome": "hg38", "task": "general", "dim_output": 1, "pretrained_model_name_or_path": "/mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt", "trust_remote_code": true, "pretrained_model_kwargs": {}, "pretrain_ckpt": "/mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt", "mtx_mask": "/mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_mask_matrix.tsv", "dropout": 0, "finetune_ckpt": "/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_function_classification/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=90.ckpt", "ignore": true, "ignore_index": [[972, 518, 1296, 1065, 730, 931, 806, 2601, 1376, 559, 1019, 1760, 1041, 1152, 1787, 2313, 1792, 745, 749, 1026, 2599, 2815, 938, 2707, 750, 2299, 571, 669, 798, 1070, 2962, 499, 2731, 2899, 2981, 2741, 1371, 2435, 1036, 904, 889, 751, 1052, 1902, 2157, 2436, 2414, 1014, 1413, 1357, 2776, 1761, 2727, 1478, 515, 955, 1377, 526, 1048, 1031, 1797, 663, 2494, 1209, 553, 1127, 2585, 2893, 1424, 1022, 1120, 2306, 550, 2432, 1765, 2553, 2483, 2728, 2841, 866, 1187, 1150, 1333, 1456, 2846, 1318, 2740, 2945, 1221, 1780, 881, 2623, 1454, 641, 2639, 1027, 2655, 581, 2819, 2486, 1781, 1708, 1010, 1338, 1093, 2212, 2692, 2213, 799, 1135, 2805, 1455, 522, 770, 697, 1908, 2441, 2904, 2394, 2190, 1453, 654, 786, 2914, 690, 1268, 1442, 1476, 2818, 2502, 2650, 1077, 2762, 1082, 2362, 1247, 1332, 818, 1261, 1265, 1055, 1791, 2196, 2594, 1784, 1213, 2712, 2890, 879, 2437, 1408, 2334, 706, 2916, 1439, 1299, 1139, 1440, 988, 1295, 1335, 2552, 1155, 2294, 2849, 2816, 869, 880, 1290, 2285, 1421, 2364, 2368, 2606, 2549, 1795, 2490, 1005, 2284, 1254, 1301, 1941, 2315, 682, 1044, 1060, 2286, 2647, 1294, 1111, 2489, 2838, 1107, 1121, 1790, 1764, 775, 2706, 978, 1374, 984, 705, 908, 2702, 1360, 1206], [312, 313]], "gep_flank_window": 4, "gep_parallel_embedding": false, "gep_gradient_checkpoint": false, "gep_zero_inflation": false, "prompt_kind": "cistrome", "prompt_dim_external": 512, "dnabert2_ckpt": null}
[21]:
with open(results.data_config,"r") as f:
    print(f.read())
{
    "hdf5_file": "/mnt/Storage/home/chenqianqian/.cache/chrombert/data/hg38_6k_1kb.hdf5",
    "supervised_file": "./output_region_function_classification/predict/model_input.tsv",
    "kind": "GeneralDataset",
    "meta_file": "/mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_meta.json",
    "ignore": true,
    "ignore_object": "h3k27me3;h3k27me3/h3k4me3",
    "batch_size": 4,
    "num_workers": 8,
    "shuffle": false,
    "pin_memory": true,
    "perturbation": false,
    "perturbation_object": null,
    "perturbation_value": 0,
    "prompt_kind": null,
    "prompt_regulator": null,
    "prompt_regulator_cache_file": null,
    "prompt_celltype": null,
    "prompt_celltype_cache_file": null,
    "prompt_regulator_cache_pin_memory": false,
    "prompt_regulator_cache_limit": 3,
    "fasta_file": null,
    "flank_window": 0
}
[20]:
results.model_ckpt
[20]:
'/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_function_classification/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=90.ckpt'

2.Interpretation

After building a model, we can use the interpretation layer of ChromBERT-tools to derive biological insights.

We display the important factors between different region groups using the fine-tuned checkpoint.

[ ]:
ft_ckpt = results.model_ckpt # fine-tuned checkpoint generated in Step 1
# ft_ckpt = '/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_function_classification/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=90.ckpt'
ft_ckpt
'/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_function_classification/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=90.ckpt'
[4]:
from chrombert_tools import interpret_regulator_effects_between_region_groups as run_key_regulator
[ ]:
import pandas as pd
test_data = pd.read_csv("./output_region_function_classification/dataset/test_sampled.csv") # test data generated in Step 1
odir="interpretation_important_factor_effect_ezh2"
!mkdir -p "interpretation_important_factor_effect_ezh2"
[ ]:
ezh2_k27me3_region_sample_file = f"{odir}/ezh2_h3k27me3_region_sample.csv"
ezh2_only_region_sample_file = f"{odir}/ezh2_only_region_sample.csv"
test_data.query("label==0").to_csv(ezh2_k27me3_region_sample_file,index=False) # Class 0 regions in the test data
test_data.query("label==1").to_csv(ezh2_only_region_sample_file,index=False) # Class 1 regions in the test data
[ ]:
# identify the important factors for the different regions
factor_importance_rank = run_key_regulator(
    region1_file=ezh2_k27me3_region_sample_file, # your focus region group 1, Class 0 regions in the test data
    region2_file=ezh2_only_region_sample_file, # your focus region group 2, Class 1 regions in the test data
    odir=odir, # output directory
    genome="hg38", # genome
    resolution="1kb", # resolution
    ft_ckpt=ft_ckpt, # fine-tuned model checkpoint
    batch_size=64, # batch size
    # model_config='./output_region_function_classification/model_config.json',
    # data_config='./output_region_function_classification/dataset_config.json',
    model_config=results.model_config, # model configuration generated in Step 1
    data_config=results.data_config, # data configuration generated in Step 1
)
Region summary - total: 583, overlapping with ChromBERT: 583 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Region summary - total: 518, overlapping with ChromBERT: 518 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_function_classification/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=90.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
100%|██████████| 146/146 [00:11<00:00, 12.51it/s]
100%|██████████| 130/130 [00:09<00:00, 13.25it/s]
Identify key regulators across regions(top 25)
      factors  similarity  rank  embedding_shift
0       suz12    0.828531     1         0.171469
1       dnase    0.841583     2         0.158417
2        ezh2    0.871878     3         0.128122
3        ctcf    0.889448     4         0.110552
4       ssu72    0.890448     5         0.109552
5      supt5h    0.892739     6         0.107261
6       fgfr1    0.894543     7         0.105457
7        cbx3    0.894630     8         0.105370
8         sp4    0.895236     9         0.104764
9         tbp    0.896751    10         0.103249
10       ezh1    0.896976    11         0.103024
11       chd8    0.897261    12         0.102739
12       taf7    0.897315    13         0.102685
13    h3k4me3    0.897389    14         0.102611
14      ep300    0.898191    15         0.101809
15    h3k27ac    0.899479    16         0.100521
16       taf1    0.899601    17         0.100399
17   atac-seq    0.899675    18         0.100325
18     polr2a    0.899913    19         0.100087
19      sin3a    0.901205    20         0.098795
20  h2ak119ub    0.901446    21         0.098554
21       brd4    0.901535    22         0.098465
22      creb1    0.903816    23         0.096184
23     h3k9ac    0.904093    24         0.095907
24       cdk9    0.904323    25         0.095677
Finished
Used fine-tuned ChromBERT checkpoint: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_function_classification/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=90.ckpt
Key regulators across regions saved to: interpretation_important_factor_effect_ezh2/results/factor_importance_rank.csv
[8]:
factor_importance_rank.head(n=25)
[8]:
factors similarity rank embedding_shift
0 suz12 0.828531 1 0.171469
1 dnase 0.841583 2 0.158417
2 ezh2 0.871878 3 0.128122
3 ctcf 0.889448 4 0.110552
4 ssu72 0.890448 5 0.109552
5 supt5h 0.892739 6 0.107261
6 fgfr1 0.894543 7 0.105457
7 cbx3 0.894630 8 0.105370
8 sp4 0.895236 9 0.104764
9 tbp 0.896751 10 0.103249
10 ezh1 0.896976 11 0.103024
11 chd8 0.897261 12 0.102739
12 taf7 0.897315 13 0.102685
13 h3k4me3 0.897389 14 0.102611
14 ep300 0.898191 15 0.101809
15 h3k27ac 0.899479 16 0.100521
16 taf1 0.899601 17 0.100399
17 atac-seq 0.899675 18 0.100325
18 polr2a 0.899913 19 0.100087
19 sin3a 0.901205 20 0.098795
20 h2ak119ub 0.901446 21 0.098554
21 brd4 0.901535 22 0.098465
22 creb1 0.903816 23 0.096184
23 h3k9ac 0.904093 24 0.095907
24 cdk9 0.904323 25 0.095677

2.Interpretation (cofactor)

After building a model, we can use the interpretation layer of ChromBERT-tools to derive biological insights.

Here, we identify important EZH2 cofactors across different region groups using the fine-tuned checkpoint.

[ ]:
from chrombert_tools import interpret_regulator_regulator_interactions
[ ]:
# identify the important cofactors of EZH2 for ezh2_h3k27me3 region (class 0)
df_cos_sim_1, df_edges_1 = interpret_regulator_regulator_interactions(
    region=ezh2_k27me3_region_sample_file, # your focus region group 1, Class 0 regions in the test data
    regulator="ezh2",      # focus regulator
    odir="./output_regulator_network_ezh2_1", # output directory
    ft_ckpt=ft_ckpt, # fine-tuned checkpoint generated in Step 1
    genome="hg38",                      # Options: "hg38", "mm10"
    resolution="1kb",                   # Options: "1kb", "2kb", "4kb", "200bp"
    quantile=0.95,
    # model_config='./output_region_function_classification/model_config.json',
    # data_config='./output_region_function_classification/dataset_config.json',
    model_config=results.model_config, # model configuration generated in Step 1
    data_config=results.data_config, # data configuration generated in Step 1
)
Region summary - total: 583, overlapping with ChromBERT: 583 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Note: All regulator names were converted to lowercase for matching.
Regulator count summary - requested: 1, matched in ChromBERT: 1, not found: 0, not found regulator: []
ChromBERT regulators: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_regulators_list.txt
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_function_classification/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=90.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
100%|██████████| 146/146 [00:20<00:00,  7.23it/s]
Total graph nodes: 1015
Total graph edges (threshold=0.625): 28650
Regulator subnetwork saved to: ./output_regulator_network_ezh2_1/subnetwork_ezh2_k1_q0.950_thr0.625.pdf
Finished!
Saved outputs to: ./output_regulator_network_ezh2_1
Regulator cosine similarity saved to: ./output_regulator_network_ezh2_1/regulator_cosine_similarity.tsv
Total graph edges saved to: ./output_regulator_network_ezh2_1/total_graph_edge_threshold0.625_quantile0.950.tsv
../../_images/examples_api_region_function_classification_22_3.png
[ ]:
# identify the important cofactors of EZH2 for ezh2 binding region (class 1)
df_cos_sim_2, df_edges_2 = interpret_regulator_regulator_interactions(
    region=ezh2_only_region_sample_file, # your focus region group 2, Class 1 regions in the test data
    regulator="ezh2",      # Plot subnetworks for these regulators
    odir="./output_regulator_network_ezh2_2", # output directory
    ft_ckpt=ft_ckpt, # fine-tuned model checkpoint generated in Step 1
    genome="hg38",                      # Options: "hg38", "mm10"
    resolution="1kb",                   # Options: "1kb", "2kb", "4kb", "200bp"
    quantile=0.95,
    # model_config='./output_region_function_classification/model_config.json',
    # data_config='./output_region_function_classification/dataset_config.json',
    model_config=results.model_config, # model configuration generated in Step 1
    data_config=results.data_config, # data configuration generated in Step 1
)
Region summary - total: 518, overlapping with ChromBERT: 518 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Note: All regulator names were converted to lowercase for matching.
Regulator count summary - requested: 1, matched in ChromBERT: 1, not found: 0, not found regulator: []
ChromBERT regulators: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_regulators_list.txt
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_function_classification/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=90.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
100%|██████████| 130/130 [00:20<00:00,  6.32it/s]
Total graph nodes: 1033
Total graph edges (threshold=0.566): 28650
Regulator subnetwork saved to: ./output_regulator_network_ezh2_2/subnetwork_ezh2_k1_q0.950_thr0.566.pdf
Finished!
Saved outputs to: ./output_regulator_network_ezh2_2
Regulator cosine similarity saved to: ./output_regulator_network_ezh2_2/regulator_cosine_similarity.tsv
Total graph edges saved to: ./output_regulator_network_ezh2_2/total_graph_edge_threshold0.566_quantile0.950.tsv
../../_images/examples_api_region_function_classification_23_3.png
[ ]:
# ezh2 subnetwork on different regions
df_edges_2_ezh2 = df_edges_2.query("node1=='ezh2'")
df_edges_1_ezh2 = df_edges_1.query("node1=='ezh2'")
[ ]:
# merge the ezh2 subnetwork on different regions
merge_df = pd.merge(df_edges_1_ezh2, df_edges_2_ezh2, on="node2", how="outer",suffixes=("_ezh2_1", "_ezh2_2"))
merge_df.query("node1_ezh2_1 != node1_ezh2_1") # specific cofactor on function region2
node1_ezh2_1 node2 cosine_similarity_ezh2_1 node1_ezh2_2 cosine_similarity_ezh2_2
0 NaN foxm1 NaN ezh2 0.621829
2 NaN h3 NaN ezh2 0.581819
3 NaN hdac1 NaN ezh2 0.577564
4 NaN hinfp NaN ezh2 0.601859
6 NaN junb NaN ezh2 0.571182
8 NaN med1 NaN ezh2 0.582780
11 NaN polr3d NaN ezh2 0.578753
12 NaN rela NaN ezh2 0.589152
13 NaN rest NaN ezh2 0.584213
16 NaN smad4 NaN ezh2 0.570978
17 NaN smarca4 NaN ezh2 0.567565
18 NaN stat1 NaN ezh2 0.609123
19 NaN stat3 NaN ezh2 0.637027
21 NaN tcf7l2 NaN ezh2 0.572582
22 NaN tfdp1 NaN ezh2 0.566373
23 NaN tp53 NaN ezh2 0.580335
24 NaN ubtf NaN ezh2 0.636960
25 NaN zbtb33 NaN ezh2 0.588106
26 NaN znf274 NaN ezh2 0.581549
[21]:
merge_df.query("node1_ezh2_2 != node1_ezh2_2") # specific cofactor on function region1
[21]:
node1_ezh2_1 node2 cosine_similarity_ezh2_1 node1_ezh2_2 cosine_similarity_ezh2_2
5 ezh2 jarid2 0.633189 NaN NaN
7 ezh2 kdm2b 0.625543 NaN NaN
9 ezh2 pcgf1 0.648713 NaN NaN
[23]:
merge_df.query("node1_ezh2_2 == node1_ezh2_2 and node1_ezh2_1 == node1_ezh2_1 and (cosine_similarity_ezh2_1 - cosine_similarity_ezh2_2 > 0.1 or cosine_similarity_ezh2_1 - cosine_similarity_ezh2_2 < -0.1)") # specific cofactor on function region1 and region2
[23]:
node1_ezh2_1 node2 cosine_similarity_ezh2_1 node1_ezh2_2 cosine_similarity_ezh2_2
1 ezh2 h2ak119ub 0.717315 ezh2 0.60734
[ ]:

Or focus on another interpretation of your choice

See also the other interpretation tutorials (e.g. in examples/api/)

[ ]: