Identify context-specific cofactors in different regions.¶
The predict_regulator_context_cofactors command trains a classifier to distinguish two sets of genomic regions, identifies regulatory factors that contribute most to the classification, and prioritizes context-specific cofactors for dual-functional target factors across the two contexts.
Note: The remaining examples will only show the direct command usage.
If you need to use Apptainer container, please refer to the `apptainer_use.ipynb <apptainer_use.ipynb>`__ tutorial for detailed instructions on using apptainer exec with chrombert-tools.
For more details, please refer to the `predict_regulator_context_cofactors <https://chrombert-tools.readthedocs.io/en/latest/commands/predict_regulator_context_cofactors.html>`__ command documentation
Example:¶
EZH2, the catalytic subunit of Polycomb Repressive Complex 2 (PRC2), operates in two distinct modes: a classical H3K27me3-dependent repressive role and a non-classical H3K27me3-independent role.
We fine-tuned ChromBERT to classify EZH2 ChIP-seq peaks in human embryonic stem cells into classical and non-classical categories.
Using this fine-tuned model, we identify regulators that most strongly distinguish the two region sets and visualize context-specific EZH2 cofactor subnetworks.
[7]:
import os
os.environ["CUDA_VISIBLE_DEVICES"]='0'
[1]:
# options parameter
!chrombert-tools predict_regulator_context_cofactors -h
Usage: chrombert-tools predict_regulator_context_cofactors
[OPTIONS]
Identify context-specific cofactors across two or more region-function
classes for user-specified dual-functional regulators.
Typical usage
Provide two or more functional region classes with --function-bed to define
the region groups to be compared.
Provide --dual-regulator to specify the dual-functional regulator of
interest. The workflow then compares cofactor patterns across the defined
region classes for that regulator.
Output
For every pair of classes, the workflow writes results to:
{odir}/results/<label_pair_subdir>/
Options:
--function-bed TEXT BED file(s) defining one functional region
class. Repeat this option to provide
multiple classes. Use ';' to combine
multiple BED files into the same class. If
only one class is provided, a background
class will be added automatically.
--function-mode [and|or] How to combine multiple BED files within one
class: 'and' for intersection, 'or' for
union. Provide one value to reuse for all
classes, or repeat to match each --function-
bed. Default: and.
--function-name TEXT Optional display name for each function
class. Repeat to match each --function-bed.
If not provided, default names such as
function_0, function_1, ... are used.
--dual-regulator TEXT Regulator(s) of special interest for dual-
function analysis. Use ';' to separate
multiple regulators. [required]
--ignore-regulator TEXT Regulator(s) to exclude from analysis. Use
';' to separate multiple regulators.
--odir DIRECTORY Output directory for datasets, embeddings,
trained models, and pairwise comparison
results. [default: ./output]
--genome [hg38|mm10] Reference genome used by ChromBERT
resources. [default: hg38]
--resolution [200bp|1kb|2kb|4kb]
Resolution of the ChromBERT representation.
[default: 1kb]
--ft-ckpt FILE Path to a fine-tuned ChromBERT checkpoint.
If provided, the workflow uses this
checkpoint instead of training a new model.
--threshold FLOAT Threshold for the embedding similarity
difference between two regions used to
identify a regulator as a context-dependent
cofactor. [default: 0.1]
--quantile FLOAT Quantile threshold for cosine similarity
edges. Default: 0.95. [default: 0.95]
--batch-size INTEGER Batch size used for embedding generation and
model inference. [default: 4]
--mode [fast|full] Run mode. 'fast' uses a sampled subset of
regions for quicker execution, while 'full'
uses all eligible regions. [default: fast]
--chrombert-cache-dir DIRECTORY
Directory containing ChromBERT cached
resources, such as config files,
annotations, checkpoints, and metadata.
[default: /mnt/Storage/home/chenqianqian/.ca
che/chrombert/data]
-h, --help Show this message and exit.
Run¶
[4]:
!mkdir -p ./tmp
[ ]:
# takes approximately 20-40 minutes to run
# --function-bed: bed files of focus regions, Repeat this option to provide multiple classes.
# --function-name: Name of each functional region class.
# --dual-regulator: dual-functional regulator
# --ignore-regulator: Regulators to mask during fine-tuning
!chrombert-tools predict_regulator_context_cofactors \
--function-bed "../data/hESC_GSM1003524_EZH2.bed;../data/hESC_GSM1498900_H3K27me3.bed" \
--function-bed "../data/hESC_GSM1003524_EZH2.bed" \
--function-name "EZH2-H3K27me3" \
--function-name "EZH2" \
--dual-regulator "EZH2" \
--ignore-regulator "H3K27me3;H3K27me3/H3K4me3" \
--odir "./output_find_context_specific_cofactor" \
--genome "hg38" \
--resolution "1kb" 2> "./tmp/predict_regulator_context_cofactors.log" # redirect stderr to log file
Note: All regulator names were converted to lowercase for matching.
Regulator count summary - requested: 2, matched in ChromBERT: 2, not found: 0, not found regulator: []
ChromBERT regulators: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_regulators_list.txt
Note: All regulator names were converted to lowercase for matching.
Regulator count summary - requested: 1, matched in ChromBERT: 1, not found: 0, not found regulator: []
ChromBERT regulators: /mnt/Storage/home/chenqianqian/.cache/chrombert/data/config/hg38_6k_regulators_list.txt
Step 1/3: Preparing labeled region dataset...
[EZH2-H3K27me3 | hESC_GSM1003524_EZH2.bed] Region summary - total: 11896, overlapping with ChromBERT: 12101 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 325
[EZH2-H3K27me3 | hESC_GSM1498900_H3K27me3.bed] Region summary - total: 8875, overlapping with ChromBERT: 14135 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 2878
EZH2-H3K27me3 (class 0): 5736 regions
[EZH2] Region summary - total: 11896, overlapping with ChromBERT: 12101 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 325
EZH2 (class 1): 11008 regions
EZH2: removed 5736 overlapping regions
EZH2-H3K27me3: 5736 (final)
EZH2: 5272 (final)
Total: 11008
Fast mode: ~10000 regions per class
Finished step 1: labeled dataset prepared.
Step 2/3: Loading or training the model...
Stage 2: Fine-tuning (binary classification)
[Attempt 0/2] seed=55
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Epoch 0: 20%|████▍ | 440/2202 [01:14<05:00, 5.87it/s, v_num=0]
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.60it/s]
Epoch 0: 40%|▍| 880/2202 [02:42<04:04, 5.40it/s, v_num=0, default_validation/b
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.51it/s]
Epoch 0: 60%|▌| 1320/2202 [04:11<02:47, 5.26it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.48it/s]
Epoch 0: 80%|▊| 1760/2202 [05:39<01:25, 5.18it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.53it/s]
Epoch 0: 100%|▉| 2200/2202 [07:07<00:00, 5.14it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.52it/s]
Epoch 1: 20%|▏| 440/2202 [01:13<04:55, 5.96it/s, v_num=0, default_validation/b
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.49it/s]
Epoch 1: 40%|▍| 880/2202 [02:42<04:03, 5.43it/s, v_num=0, default_validation/b
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.55it/s]
Epoch 1: 60%|▌| 1320/2202 [04:09<02:46, 5.30it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.42it/s]
Epoch 1: 80%|▊| 1760/2202 [05:37<01:24, 5.21it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.46it/s]
Epoch 1: 100%|▉| 2200/2202 [07:06<00:00, 5.16it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.49it/s]
Epoch 2: 20%|▏| 440/2202 [01:13<04:55, 5.97it/s, v_num=0, default_validation/b
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.52it/s]
Epoch 2: 40%|▍| 880/2202 [02:40<04:01, 5.48it/s, v_num=0, default_validation/b
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.45it/s]
Epoch 2: 60%|▌| 1320/2202 [04:07<02:45, 5.33it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.53it/s]
Epoch 2: 80%|▊| 1760/2202 [05:35<01:24, 5.24it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/138 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 138/138 [00:13<00:00, 10.52it/s]
Epoch 2: 80%|▊| 1760/2202 [05:50<01:28, 5.02it/s, v_num=0, default_validation/
Evaluating the finetuned model performance
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_find_context_specific_cofactor/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=97.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
ft_ckpt: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_find_context_specific_cofactor/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=97.ckpt, test_metrics: {'auroc': 0.8930243849754333, 'auprc': 0.8924592733383179, 'mcc': 0.5851843357086182, 'f1': 0.7249712347984314, 'precision': 0.8974359035491943, 'recall': 0.6081081032752991}
Attempt metrics: auprc=0.8924592733383179
Accepted run (auprc=0.8925 >= 0.2).
Finished stage 2: obtained a fine-tuned ChromBERT
Best auprc=0.8924592733383179, metrics={'auroc': 0.8930243849754333, 'auprc': 0.8924592733383179, 'mcc': 0.5851843357086182, 'f1': 0.7249712347984314, 'precision': 0.8974359035491943, 'recall': 0.6081081032752991, 'ft_ckpt': '/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_find_context_specific_cofactor/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=97.ckpt'}
Finished step 2: model ready for embedding generation.
Step 3/3: Generating embeddings and running pairwise cofactor comparisons...
Comparing 'EZH2-H3K27me3' vs 'EZH2' ...
Results will be written to: results/0_1_EZH2-H3K27me3_vs_EZH2/
Total graph nodes: 1017
Total graph edges (threshold=0.619): 28650
Total graph nodes: 1035
Total graph edges (threshold=0.568): 28650
Dual regulator subnetwork plot saved: ./output_find_context_specific_cofactor/results/0_1_EZH2-H3K27me3_vs_EZH2/dual_regulator_ezh2_subnetwork.pdf
Yellow color represents function1 subnetwork; blue color represents function2 subnetwork.
Finished step: infer dual-functional regulator subnetworks
Finished step 3: all pairwise comparisons completed.
Compared 2 classes and generated 1 pairwise results.
Final results are available under: ./output_find_context_specific_cofactor/results/<label_pair_subdir>/
[17]:
# regulator_cosine_similarity_on_function1_regions.csv: cosine similarity of regulator-regulator pairs on function1 regions:
# - node1: regulator name
# - node2: regulator name
# - similarity: cosine similarity of regulator embeddings between function1 regions
# regulator_cosine_similarity_on_function2_regions.csv: cosine similarity of regulator-regulator pairs on function2 regions:
# - node1: regulator name
# - node2: regulator name
# - similarity: cosine similarity of regulator embeddings between function2 regions
import pandas as pd
reg_cos_sim_func1 = pd.read_csv("./output_find_context_specific_cofactor/results/0_1_EZH2-H3K27me3_vs_EZH2/func1/regulator_cosine_similarity.tsv",sep="\t",index_col=0)
reg_cos_sim_func2 = pd.read_csv("./output_find_context_specific_cofactor/results/0_1_EZH2-H3K27me3_vs_EZH2/func2/regulator_cosine_similarity.tsv",sep="\t",index_col=0)
[18]:
reg_cos_sim_func2
[18]:
| 5hmc | adnp | aebp2 | aff1 | aff4 | ago1 | ago2 | ahr | ahrr | alkbh3 | ... | zscan20 | zscan22 | zscan23 | zscan29 | zscan31 | zscan5a | zta | zxdb | zxdc | zzz3 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5hmc | 1.000000 | 0.263275 | 0.312532 | 0.201742 | 0.192320 | 0.279180 | 0.241895 | 0.216915 | 0.186114 | 0.227855 | ... | 0.383006 | 0.135186 | 0.437668 | 0.285964 | 0.172695 | 0.350433 | 0.288472 | 0.293085 | 0.213696 | 0.296846 |
| adnp | 0.263275 | 1.000000 | 0.656210 | 0.365337 | 0.493897 | 0.160843 | 0.232765 | 0.294296 | 0.314429 | 0.291633 | ... | 0.393143 | 0.353003 | 0.448869 | 0.544043 | 0.380984 | 0.447373 | 0.335509 | 0.425058 | 0.357726 | 0.430030 |
| aebp2 | 0.312532 | 0.656210 | 1.000000 | 0.280973 | 0.423670 | 0.248505 | 0.309748 | 0.337397 | 0.399152 | 0.321023 | ... | 0.413266 | 0.235316 | 0.409336 | 0.375150 | 0.309375 | 0.331573 | 0.342340 | 0.296498 | 0.371933 | 0.367972 |
| aff1 | 0.201742 | 0.365337 | 0.280973 | 1.000000 | 0.666506 | 0.286647 | 0.303003 | 0.340900 | 0.345766 | 0.246487 | ... | 0.348551 | 0.305927 | 0.284400 | 0.351595 | 0.395420 | 0.321056 | 0.273604 | 0.273012 | 0.240874 | 0.334227 |
| aff4 | 0.192320 | 0.493897 | 0.423670 | 0.666506 | 1.000000 | 0.270260 | 0.341720 | 0.328680 | 0.373705 | 0.330691 | ... | 0.371609 | 0.457508 | 0.395476 | 0.405121 | 0.454660 | 0.382793 | 0.412013 | 0.357532 | 0.387004 | 0.416138 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| zscan5a | 0.350433 | 0.447373 | 0.331573 | 0.321056 | 0.382793 | 0.197964 | 0.222432 | 0.256884 | 0.258526 | 0.240125 | ... | 0.800244 | 0.448363 | 0.864130 | 0.472762 | 0.524527 | 1.000000 | 0.263888 | 0.631232 | 0.378192 | 0.234377 |
| zta | 0.288472 | 0.335509 | 0.342340 | 0.273604 | 0.412013 | 0.152200 | 0.206198 | 0.247921 | 0.304893 | 0.282786 | ... | 0.301469 | 0.245819 | 0.289613 | 0.258759 | 0.282314 | 0.263888 | 1.000000 | 0.166201 | 0.248931 | 0.462602 |
| zxdb | 0.293085 | 0.425058 | 0.296498 | 0.273012 | 0.357532 | 0.254524 | 0.258319 | 0.234817 | 0.229969 | 0.192176 | ... | 0.605518 | 0.551410 | 0.622243 | 0.461396 | 0.460363 | 0.631232 | 0.166201 | 1.000000 | 0.412470 | 0.193181 |
| zxdc | 0.213696 | 0.357726 | 0.371933 | 0.240874 | 0.387004 | 0.256106 | 0.270268 | 0.615022 | 0.561938 | 0.310942 | ... | 0.365331 | 0.326439 | 0.403753 | 0.361857 | 0.304385 | 0.378192 | 0.248931 | 0.412470 | 1.000000 | 0.234095 |
| zzz3 | 0.296846 | 0.430030 | 0.367972 | 0.334227 | 0.416138 | 0.066830 | 0.145569 | 0.190414 | 0.255840 | 0.262778 | ... | 0.288327 | 0.254510 | 0.233906 | 0.311202 | 0.300828 | 0.234377 | 0.462602 | 0.193181 | 0.234095 | 1.000000 |
1071 rows × 1071 columns
[19]:
# Infer dual-functional regulator subnetwork (If --dual-regulator was provided, saved in {odir}/results/dual_regulator_subnetwork.pdf)
import numpy as np
thre_func1 = np.percentile(reg_cos_sim_func1.values.flatten(), 95)
thre_func2 = np.percentile(reg_cos_sim_func2.values.flatten(), 95)
assert (reg_cos_sim_func1.index == reg_cos_sim_func2.index).all()
df_cos_reg = pd.DataFrame(
index=reg_cos_sim_func1.index,
data={
"function1": reg_cos_sim_func1.loc['ezh2', :], # ezh2 (dual-functional regulator)
"function2": reg_cos_sim_func2.loc['ezh2', :], # ezh2 (dual-functional regulator)
},
)
df_cos_reg["diff"] = df_cos_reg["function1"] - df_cos_reg["function2"]
df_candidate = df_cos_reg[df_cos_reg["diff"].abs() > 0.1]
topN_pos = df_candidate.query("function1 > @thre_func1").index.values
topN_neg = df_candidate.query("function2 > @thre_func2").index.values
top_pairs = np.union1d(topN_pos, topN_neg)
[20]:
df_candidate
[20]:
| function1 | function2 | diff | |
|---|---|---|---|
| arnt2 | 0.257709 | 0.368453 | -0.110744 |
| arntl | 0.343475 | 0.462813 | -0.119337 |
| atf3 | 0.430633 | 0.533162 | -0.102530 |
| barhl1 | 0.295013 | 0.425979 | -0.130966 |
| batf | 0.340393 | 0.458829 | -0.118436 |
| ... | ... | ... | ... |
| znf433 | 0.220247 | 0.330745 | -0.110498 |
| znf571 | 0.252793 | 0.367852 | -0.115058 |
| znf706 | 0.262874 | 0.376015 | -0.113141 |
| znf860 | 0.201699 | 0.338688 | -0.136989 |
| zscan2 | 0.322358 | 0.452522 | -0.130163 |
125 rows × 3 columns
[21]:
thre_func1, thre_func2
[21]:
(0.6205606986763784, 0.5701609889810152)
[22]:
# function1_sunbetwork
df_candidate.loc[topN_pos]
[22]:
| function1 | function2 | diff | |
|---|---|---|---|
| h2ak119ub | 0.724501 | 0.621556 | 0.102945 |
| pcgf1 | 0.653966 | 0.550944 | 0.103022 |
[23]:
# function2_sunbetwork
df_candidate.loc[topN_neg]
[23]:
| function1 | function2 | diff | |
|---|---|---|---|
| brca1 | 0.482059 | 0.608059 | -0.125999 |
| ep300 | 0.496610 | 0.612843 | -0.116234 |
| foxm1 | 0.469633 | 0.625345 | -0.155713 |
| h2ak119ub | 0.724501 | 0.621556 | 0.102945 |
| hinfp | 0.491919 | 0.600521 | -0.108602 |
| junb | 0.445172 | 0.579780 | -0.134608 |
| kat5 | 0.469790 | 0.572971 | -0.103180 |
| med1 | 0.467571 | 0.589410 | -0.121839 |
| rela | 0.474658 | 0.591733 | -0.117075 |
| smad4 | 0.447686 | 0.578598 | -0.130911 |
| stat3 | 0.529353 | 0.642772 | -0.113420 |
[ ]: