Identify context-specific cofactors in different regions.

The predict_regulator_context_cofactors command trains a classifier to distinguish two sets of genomic regions, identifies regulatory factors that contribute most to the classification, and prioritizes context-specific cofactors for dual-functional target factors across the two contexts.

Note: The remaining examples will only show the direct command usage.

If you need to use Apptainer container, please refer to the `apptainer_use.ipynb <apptainer_use.ipynb>`__ tutorial for detailed instructions on using apptainer exec with chrombert-tools.

For more details, please refer to the `predict_regulator_context_cofactors <https://chrombert-tools.readthedocs.io/en/latest/commands/predict_regulator_context_cofactors.html>`__ command documentation

Example:

EZH2, the catalytic subunit of Polycomb Repressive Complex 2 (PRC2), operates in two distinct modes: a classical H3K27me3-dependent repressive role and a non-classical H3K27me3-independent role.

We fine-tuned ChromBERT to classify EZH2 ChIP-seq peaks in human embryonic stem cells into classical and non-classical categories.

Using this fine-tuned model, we identify regulators that most strongly distinguish the two region sets and visualize context-specific EZH2 cofactor subnetworks.

[7]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]='0'
[1]:
# options parameter
!chrombert-tools predict_regulator_context_cofactors -h
Usage: chrombert-tools predict_regulator_context_cofactors
           [OPTIONS]

  Identify context-specific cofactors across two or more region-function
  classes for user-specified dual-functional regulators.

  Typical usage

  Provide two or more functional region classes with --function-bed to define
  the region groups to be compared.

  Provide --dual-regulator to specify the dual-functional regulator of
  interest. The workflow then compares cofactor patterns across the defined
  region classes for that regulator.

  Output

  For every pair of classes, the workflow writes results to:
  {odir}/results/<label_pair_subdir>/

Options:
  --function-bed TEXT             BED file(s) defining one functional region
                                  class. Repeat this option to provide
                                  multiple classes. Use ';' to combine
                                  multiple BED files into the same class. If
                                  only one class is provided, a background
                                  class will be added automatically.
  --function-mode [and|or]        How to combine multiple BED files within one
                                  class: 'and' for intersection, 'or' for
                                  union. Provide one value to reuse for all
                                  classes, or repeat to match each --function-
                                  bed. Default: and.
  --function-name TEXT            Optional display name for each function
                                  class. Repeat to match each --function-bed.
                                  If not provided, default names such as
                                  function_0, function_1, ... are used.
  --dual-regulator TEXT           Regulator(s) of special interest for dual-
                                  function analysis. Use ';' to separate
                                  multiple regulators.  [required]
  --ignore-regulator TEXT         Regulator(s) to exclude from analysis. Use
                                  ';' to separate multiple regulators.
  --odir DIRECTORY                Output directory for datasets, embeddings,
                                  trained models, and pairwise comparison
                                  results.  [default: ./output]
  --genome [hg38|mm10]            Reference genome used by ChromBERT
                                  resources.  [default: hg38]
  --resolution [200bp|1kb|2kb|4kb]
                                  Resolution of the ChromBERT representation.
                                  [default: 1kb]
  --ft-ckpt FILE                  Path to a fine-tuned ChromBERT checkpoint.
                                  If provided, the workflow uses this
                                  checkpoint instead of training a new model.
  --threshold FLOAT               Threshold for the embedding similarity
                                  difference between two regions used to
                                  identify a regulator as a context-dependent
                                  cofactor.  [default: 0.1]
  --quantile FLOAT                Quantile threshold for cosine similarity
                                  edges. Default: 0.95.  [default: 0.95]
  --batch-size INTEGER            Batch size used for embedding generation and
                                  model inference.  [default: 4]
  --mode [fast|full]              Run mode. 'fast' uses a sampled subset of
                                  regions for quicker execution, while 'full'
                                  uses all eligible regions.  [default: fast]
  --chrombert-cache-dir DIRECTORY
                                  Directory containing ChromBERT cached
                                  resources, such as config files,
                                  annotations, checkpoints, and metadata.
                                  [default: /mnt/Storage/home/chenqianqian/.ca
                                  che/chrombert/data]
  -h, --help                      Show this message and exit.

Run

[4]:
!mkdir -p ./tmp
[ ]:
# takes approximately 20-40 minutes to run
# --function-bed: bed files of focus regions, Repeat this option to provide multiple classes.
# --function-name: Name of each functional region class.
# --dual-regulator: dual-functional regulator
# --ignore-regulator: Regulators to mask during fine-tuning
!chrombert-tools predict_regulator_context_cofactors \
    --function-bed "../data/hESC_GSM1003524_EZH2.bed;../data/hESC_GSM1498900_H3K27me3.bed" \
    --function-bed "../data/hESC_GSM1003524_EZH2.bed" \
    --function-name "EZH2-H3K27me3" \
    --function-name "EZH2" \
    --dual-regulator "EZH2" \
    --ignore-regulator "H3K27me3;H3K27me3/H3K4me3" \
    --odir "./output_find_context_specific_cofactor" \
    --genome "hg38" \
    --resolution "1kb"  2> "./tmp/predict_regulator_context_cofactors.log" # redirect stderr to log file
Note: All regulator names were converted to lowercase for matching.
Regulator count summary - requested: 2, matched in ChromBERT: 2, not found: 0, not found regulator: []
ChromBERT regulators: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_regulators_list.txt
Note: All regulator names were converted to lowercase for matching.
Regulator count summary - requested: 1, matched in ChromBERT: 1, not found: 0, not found regulator: []
ChromBERT regulators: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_regulators_list.txt
Step 1/3: Preparing labeled region dataset...
[EZH2-H3K27me3 | hESC_GSM1003524_EZH2.bed] Region summary - total: 11896, overlapping with ChromBERT: 12101 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 325
[EZH2-H3K27me3 | hESC_GSM1498900_H3K27me3.bed] Region summary - total: 8875, overlapping with ChromBERT: 14135 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 2878
  EZH2-H3K27me3 (class 0): 5736 regions
[EZH2] Region summary - total: 11896, overlapping with ChromBERT: 12101 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 325
  EZH2 (class 1): 11008 regions
  EZH2: removed 5736 overlapping regions
  EZH2-H3K27me3: 5736 (final)
  EZH2: 5272 (final)
  Total: 11008
  Fast mode: ~10000 regions per class
Finished step 1: labeled dataset prepared.
Step 2/3: Loading or training the model...
Stage 2: Fine-tuning (binary classification)

[Attempt 0/2] seed=55
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Epoch 0:  20%|████▍                 | 440/2202 [01:14<05:00,  5.87it/s, v_num=0]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.60it/s]
Epoch 0:  40%|▍| 880/2202 [02:42<04:04,  5.40it/s, v_num=0, default_validation/b
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.51it/s]
Epoch 0:  60%|▌| 1320/2202 [04:11<02:47,  5.26it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.48it/s]
Epoch 0:  80%|▊| 1760/2202 [05:39<01:25,  5.18it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.53it/s]
Epoch 0: 100%|▉| 2200/2202 [07:07<00:00,  5.14it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.52it/s]
Epoch 1:  20%|▏| 440/2202 [01:13<04:55,  5.96it/s, v_num=0, default_validation/b
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.49it/s]
Epoch 1:  40%|▍| 880/2202 [02:42<04:03,  5.43it/s, v_num=0, default_validation/b
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.55it/s]
Epoch 1:  60%|▌| 1320/2202 [04:09<02:46,  5.30it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.42it/s]
Epoch 1:  80%|▊| 1760/2202 [05:37<01:24,  5.21it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.46it/s]
Epoch 1: 100%|▉| 2200/2202 [07:06<00:00,  5.16it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.49it/s]
Epoch 2:  20%|▏| 440/2202 [01:13<04:55,  5.97it/s, v_num=0, default_validation/b
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.52it/s]
Epoch 2:  40%|▍| 880/2202 [02:40<04:01,  5.48it/s, v_num=0, default_validation/b
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.45it/s]
Epoch 2:  60%|▌| 1320/2202 [04:07<02:45,  5.33it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.53it/s]
Epoch 2:  80%|▊| 1760/2202 [05:35<01:24,  5.24it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.52it/s]
Epoch 2:  80%|▊| 1760/2202 [05:50<01:28,  5.02it/s, v_num=0, default_validation/
Evaluating the finetuned model performance
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_find_context_specific_cofactor/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=97.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
ft_ckpt: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_find_context_specific_cofactor/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=97.ckpt, test_metrics: {'auroc': 0.8930243849754333, 'auprc': 0.8924592733383179, 'mcc': 0.5851843357086182, 'f1': 0.7249712347984314, 'precision': 0.8974359035491943, 'recall': 0.6081081032752991}
Attempt metrics: auprc=0.8924592733383179
Accepted run (auprc=0.8925 >= 0.2).

Finished stage 2: obtained a fine-tuned ChromBERT
Best auprc=0.8924592733383179, metrics={'auroc': 0.8930243849754333, 'auprc': 0.8924592733383179, 'mcc': 0.5851843357086182, 'f1': 0.7249712347984314, 'precision': 0.8974359035491943, 'recall': 0.6081081032752991, 'ft_ckpt': '/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_find_context_specific_cofactor/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=97.ckpt'}
Finished step 2: model ready for embedding generation.
Step 3/3: Generating embeddings and running pairwise cofactor comparisons...
Comparing 'EZH2-H3K27me3' vs 'EZH2' ...
Results will be written to: results/0_1_EZH2-H3K27me3_vs_EZH2/
Total graph nodes: 1017
Total graph edges (threshold=0.619): 28650
Total graph nodes: 1035
Total graph edges (threshold=0.568): 28650
Dual regulator subnetwork plot saved: ./output_find_context_specific_cofactor/results/0_1_EZH2-H3K27me3_vs_EZH2/dual_regulator_ezh2_subnetwork.pdf
Yellow color represents function1 subnetwork; blue color represents function2 subnetwork.
Finished step: infer dual-functional regulator subnetworks
Finished step 3: all pairwise comparisons completed.
Compared 2 classes and generated 1 pairwise results.
Final results are available under: ./output_find_context_specific_cofactor/results/<label_pair_subdir>/
[17]:
# regulator_cosine_similarity_on_function1_regions.csv: cosine similarity of regulator-regulator pairs on function1 regions:
#   - node1: regulator name
#   - node2: regulator name
#   - similarity: cosine similarity of regulator embeddings between function1 regions


# regulator_cosine_similarity_on_function2_regions.csv: cosine similarity of regulator-regulator pairs on function2 regions:
#   - node1: regulator name
#   - node2: regulator name
#   - similarity: cosine similarity of regulator embeddings between function2 regions
import pandas as pd
reg_cos_sim_func1 = pd.read_csv("./output_find_context_specific_cofactor/results/0_1_EZH2-H3K27me3_vs_EZH2/func1/regulator_cosine_similarity.tsv",sep="\t",index_col=0)
reg_cos_sim_func2 = pd.read_csv("./output_find_context_specific_cofactor/results/0_1_EZH2-H3K27me3_vs_EZH2/func2/regulator_cosine_similarity.tsv",sep="\t",index_col=0)


[18]:
reg_cos_sim_func2
[18]:
5hmc adnp aebp2 aff1 aff4 ago1 ago2 ahr ahrr alkbh3 ... zscan20 zscan22 zscan23 zscan29 zscan31 zscan5a zta zxdb zxdc zzz3
5hmc 1.000000 0.263275 0.312532 0.201742 0.192320 0.279180 0.241895 0.216915 0.186114 0.227855 ... 0.383006 0.135186 0.437668 0.285964 0.172695 0.350433 0.288472 0.293085 0.213696 0.296846
adnp 0.263275 1.000000 0.656210 0.365337 0.493897 0.160843 0.232765 0.294296 0.314429 0.291633 ... 0.393143 0.353003 0.448869 0.544043 0.380984 0.447373 0.335509 0.425058 0.357726 0.430030
aebp2 0.312532 0.656210 1.000000 0.280973 0.423670 0.248505 0.309748 0.337397 0.399152 0.321023 ... 0.413266 0.235316 0.409336 0.375150 0.309375 0.331573 0.342340 0.296498 0.371933 0.367972
aff1 0.201742 0.365337 0.280973 1.000000 0.666506 0.286647 0.303003 0.340900 0.345766 0.246487 ... 0.348551 0.305927 0.284400 0.351595 0.395420 0.321056 0.273604 0.273012 0.240874 0.334227
aff4 0.192320 0.493897 0.423670 0.666506 1.000000 0.270260 0.341720 0.328680 0.373705 0.330691 ... 0.371609 0.457508 0.395476 0.405121 0.454660 0.382793 0.412013 0.357532 0.387004 0.416138
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
zscan5a 0.350433 0.447373 0.331573 0.321056 0.382793 0.197964 0.222432 0.256884 0.258526 0.240125 ... 0.800244 0.448363 0.864130 0.472762 0.524527 1.000000 0.263888 0.631232 0.378192 0.234377
zta 0.288472 0.335509 0.342340 0.273604 0.412013 0.152200 0.206198 0.247921 0.304893 0.282786 ... 0.301469 0.245819 0.289613 0.258759 0.282314 0.263888 1.000000 0.166201 0.248931 0.462602
zxdb 0.293085 0.425058 0.296498 0.273012 0.357532 0.254524 0.258319 0.234817 0.229969 0.192176 ... 0.605518 0.551410 0.622243 0.461396 0.460363 0.631232 0.166201 1.000000 0.412470 0.193181
zxdc 0.213696 0.357726 0.371933 0.240874 0.387004 0.256106 0.270268 0.615022 0.561938 0.310942 ... 0.365331 0.326439 0.403753 0.361857 0.304385 0.378192 0.248931 0.412470 1.000000 0.234095
zzz3 0.296846 0.430030 0.367972 0.334227 0.416138 0.066830 0.145569 0.190414 0.255840 0.262778 ... 0.288327 0.254510 0.233906 0.311202 0.300828 0.234377 0.462602 0.193181 0.234095 1.000000

1071 rows × 1071 columns

[19]:
# Infer dual-functional regulator subnetwork (If --dual-regulator was provided, saved in {odir}/results/dual_regulator_subnetwork.pdf)
import numpy as np
thre_func1 = np.percentile(reg_cos_sim_func1.values.flatten(), 95)
thre_func2 = np.percentile(reg_cos_sim_func2.values.flatten(), 95)

assert (reg_cos_sim_func1.index == reg_cos_sim_func2.index).all()
df_cos_reg = pd.DataFrame(
                index=reg_cos_sim_func1.index,
                data={
                    "function1": reg_cos_sim_func1.loc['ezh2', :], # ezh2 (dual-functional regulator)
                    "function2": reg_cos_sim_func2.loc['ezh2', :], # ezh2 (dual-functional regulator)
                },
            )
df_cos_reg["diff"] = df_cos_reg["function1"] - df_cos_reg["function2"]
df_candidate = df_cos_reg[df_cos_reg["diff"].abs() > 0.1]
topN_pos = df_candidate.query("function1 > @thre_func1").index.values
topN_neg = df_candidate.query("function2 > @thre_func2").index.values
top_pairs = np.union1d(topN_pos, topN_neg)


[20]:
df_candidate
[20]:
function1 function2 diff
arnt2 0.257709 0.368453 -0.110744
arntl 0.343475 0.462813 -0.119337
atf3 0.430633 0.533162 -0.102530
barhl1 0.295013 0.425979 -0.130966
batf 0.340393 0.458829 -0.118436
... ... ... ...
znf433 0.220247 0.330745 -0.110498
znf571 0.252793 0.367852 -0.115058
znf706 0.262874 0.376015 -0.113141
znf860 0.201699 0.338688 -0.136989
zscan2 0.322358 0.452522 -0.130163

125 rows × 3 columns

[21]:
thre_func1, thre_func2
[21]:
(0.6205606986763784, 0.5701609889810152)
[22]:
# function1_sunbetwork
df_candidate.loc[topN_pos]
[22]:
function1 function2 diff
h2ak119ub 0.724501 0.621556 0.102945
pcgf1 0.653966 0.550944 0.103022
[23]:
# function2_sunbetwork
df_candidate.loc[topN_neg]
[23]:
function1 function2 diff
brca1 0.482059 0.608059 -0.125999
ep300 0.496610 0.612843 -0.116234
foxm1 0.469633 0.625345 -0.155713
h2ak119ub 0.724501 0.621556 0.102945
hinfp 0.491919 0.600521 -0.108602
junb 0.445172 0.579780 -0.134608
kat5 0.469790 0.572971 -0.103180
med1 0.467571 0.589410 -0.121839
rela 0.474658 0.591733 -0.117075
smad4 0.447686 0.578598 -0.130911
stat3 0.529353 0.642772 -0.113420
[ ]: