Build a model to predict region activity¶
This notebook shows how to use the ChromBERT-tools Python API region_activity_regression to build a model to predict the region activity.
For the bash command-line usage, see `examples/cli/region_activity_regression.ipynb <../cli/region_activity_regression.ipynb>`__.
For more details, please refer to the `region_activity_regression <https://chrombert-tools.readthedocs.io/en/latest/commands/region_activity_regression.html>`__ command documentation
1. Build model¶
[1]:
from chrombert_tools import region_activity_regression
Predict the foldchange of region activity between cell-state-transitions¶
[ ]:
# Download example data
# Myoblast and fibroblast data: ATAC-seq bigWig and peak files
import subprocess
import os
if not os.path.exists('../data/myoblast_ENCFF647RNC_peak.bed'):
cmd = f'wget https://www.encodeproject.org/files/ENCFF647RNC/@@download/ENCFF647RNC.bed.gz -O ../data/myoblast_ENCFF647RNC_peak.bed.gz'
subprocess.run(cmd, shell=True)
cmd = f"gzip -d ../data/myoblast_ENCFF647RNC_peak.bed.gz"
subprocess.run(cmd, shell=True)
if not os.path.exists('../data/myoblast_ENCFF149ERN_signal.bigwig'):
cmd = f'wget https://www.encodeproject.org/files/ENCFF149ERN/@@download/ENCFF149ERN.bigWig -O ../data/myoblast_ENCFF149ERN_signal.bigwig'
subprocess.run(cmd, shell=True)
if not os.path.exists('../data/fibroblast_ENCFF184KAM_peak.bed'):
cmd = f'wget https://www.encodeproject.org/files/ENCFF184KAM/@@download/ENCFF184KAM.bed.gz -O ../data/fibroblast_ENCFF184KAM_peak.bed.gz'
subprocess.run(cmd, shell=True)
cmd = f"gzip -d ../data/fibroblast_ENCFF184KAM_peak.bed.gz"
subprocess.run(cmd, shell=True)
if not os.path.exists('../data/fibroblast_ENCFF361BTT_signal.bigwig'):
cmd = 'wget https://www.encodeproject.org/files/ENCFF361BTT/@@download/ENCFF361BTT.bigWig -O ../data/fibroblast_ENCFF361BTT_signal.bigwig'
subprocess.run(cmd, shell=True)
[ ]:
results = region_activity_regression(
acc_peak1="../data/fibroblast_ENCFF184KAM_peak.bed", # cell-state 1 peak file: fibroblast peak file
acc_peak2="../data/myoblast_ENCFF647RNC_peak.bed", # cell-state 2 peak file: myoblast peak file
acc_signal1="../data/fibroblast_ENCFF361BTT_signal.bigwig", # cell-state 1 signal file: fibroblast signal file
acc_signal2="../data/myoblast_ENCFF149ERN_signal.bigwig", # cell-state 2 signal file: myoblast signal file
genome="hg38", # genome
resolution="1kb", # resolution
odir="output_region_activity_regression", # output directory
include_tss_background = True, # include the background regions from TSS, we use the same background regions for both cell-states
)
Stage 1: prepare chromatin accessibility dataset
Processing stage 1: prepare chromatin accessibility dataset
Mode: two states (fold-change label)
TSS ± flank background regions: enabled, flank distance: 10000 bp
[state1_0] Region summary - total: 287892, overlapping with ChromBERT: 284160 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 5967
[state1] merged 1 peak file(s) → 250143 unique ChromBERT regions (after overlap + dedup)
[state2_0] Region summary - total: 373422, overlapping with ChromBERT: 368260 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 7920
[state2] merged 1 peak file(s) → 324690 unique ChromBERT regions (after overlap + dedup)
[tss_bg] Region summary - total: 19260, overlapping with ChromBERT: 333463 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 19260
total region: 671022
Fast mode: downsampled to ~20000 regions
Finished stage 1
Stage 2: train ChromBERT (accessibility fold change, two states)
[Attempt 0/2] seed=55
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
/mnt/Storage/home/chenqianqian/miniconda3/envs/chrombert/lib/python3.9/site-packages/torchmetrics/utilities/prints.py:43: UserWarning: Metric `SpearmanCorrcoef` will save all targets and predictions in the buffer. For large datasets, this may lead to large memory footprint.
warnings.warn(*args, **kwargs)
Trainer will use only 1 of 4 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=4)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA A100-PCIE-40GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Loading `train_dataloader` to estimate number of stepping batches.
/mnt/Storage/home/chenqianqian/miniconda3/envs/chrombert/lib/python3.9/site-packages/lightning/pytorch/utilities/model_summary/model_summary.py:231: Precision bf16-mixed is not supported by the model summary. Estimated model size in MB will not be accurate. Using 32 bits instead.
| Name | Type | Params | Mode
---------------------------------------------------
0 | model | ChromBERTGeneral | 62.8 M | train
---------------------------------------------------
18.9 M Trainable params
43.9 M Non-trainable params
62.8 M Total params
251.095 Total estimated model params size (MB)
153 Modules in train mode
0 Modules in eval mode
/mnt/Storage/home/chenqianqian/miniconda3/envs/chrombert/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:484: Your `val_dataloader`'s sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test dataloaders.
Metric default_validation/pcc improved. New best score: 0.142
Metric default_validation/pcc improved by 0.068 >= min_delta = 0.01. New best score: 0.210
Metric default_validation/pcc improved by 0.441 >= min_delta = 0.01. New best score: 0.650
Metric default_validation/pcc improved by 0.186 >= min_delta = 0.01. New best score: 0.837
Metric default_validation/pcc improved by 0.048 >= min_delta = 0.01. New best score: 0.885
Metric default_validation/pcc improved by 0.015 >= min_delta = 0.01. New best score: 0.900
Metric default_validation/pcc improved by 0.016 >= min_delta = 0.01. New best score: 0.917
Monitored metric default_validation/pcc did not improve in the last 5 records. Best score: 0.917. Signaling Trainer to stop.
Evaluating the finetuned model performance
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_activity_regression/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
/mnt/Storage/home/chenqianqian/miniconda3/envs/chrombert/lib/python3.9/site-packages/torchmetrics/utilities/prints.py:43: UserWarning: Metric `SpearmanCorrcoef` will save all targets and predictions in the buffer. For large datasets, this may lead to large memory footprint.
warnings.warn(*args, **kwargs)
ft_ckpt: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_activity_regression/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt, test_metrics: {'pearsonr': 0.883388340473175, 'spearmanr': 0.8222081065177917, 'mse': 0.3158803880214691, 'mae': 0.39593231678009033, 'r2': 0.7368538771517714}
Attempt metrics: pearsonr=0.883388340473175
Accepted run (pearsonr=0.8834 >= 0.2).
Finished stage 2: obtained a fine-tuned ChromBERT
Best pearsonr=0.883388340473175, metrics={'pearsonr': 0.883388340473175, 'spearmanr': 0.8222081065177917, 'mse': 0.3158803880214691, 'mae': 0.39593231678009033, 'r2': 0.7368538771517714, 'ft_ckpt': '/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_activity_regression/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt'}
Finished stage 2 (trained)
Stage 3: Predicting
Region summary - total: 2000, overlapping with ChromBERT: 2000 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Predict input: output_region_activity_regression/predict/model_input.tsv
Predicting: 100%|██████████| 501/501 [01:41<00:00, 4.92it/s]
Predictions saved: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_activity_regression/predict/predictions.csv (2002 regions)
Finished stage 3
============================================================
All stages completed!
============================================================
Fine-tuned checkpoint: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_activity_regression/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Predictions: output_region_activity_regression/predict/predictions.csv
[ ]:
predict_df = results.predictions_df # test prediction
predict_df.head()
| chrom | start | end | build_region_index | start_input | end_input | predicted_value | true_label | |
|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | 1012000 | 1013000 | 304 | 1012000 | 1013000 | -0.082217 | -0.254314 |
| 1 | chr1 | 1065000 | 1066000 | 357 | 1065000 | 1066000 | -0.416241 | -0.024423 |
| 2 | chr1 | 1318000 | 1319000 | 597 | 1318000 | 1319000 | -0.882017 | -1.585025 |
| 3 | chr1 | 2035000 | 2036000 | 1247 | 2035000 | 2036000 | 2.072878 | 2.632590 |
| 4 | chr1 | 2556000 | 2557000 | 1737 | 2556000 | 2557000 | -1.565579 | -0.975630 |
[ ]:
# calculate the pearson and spearman correlation
from scipy.stats import pearsonr,spearmanr
pearsonr(predict_df.true_label,predict_df.predicted_value),spearmanr(predict_df.true_label,predict_df.predicted_value)
(PearsonRResult(statistic=0.8833282059899981, pvalue=0.0),
SignificanceResult(statistic=0.8223618088793297, pvalue=0.0))
[ ]:
model_config = results.model_config # model configuration
data_config = results.data_config # data configuration
model_config,data_config
('output_region_activity_regression/model_config.json',
'output_region_activity_regression/dataset_config.json')
[ ]:
ft_ckpt = results.model_ckpt # fine-tuned model checkpoint
# ft_ckpt = '/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_activity_regression/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt'
ft_ckpt
'/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_activity_regression/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt'
2.Interpretation¶
After building a model, we can use the interpretation layer of ChromBERT-tools to derive biological insights.
Here, we identify important factors between different region groups as an example.
[ ]:
region1_file = "./output_region_activity_regression/dataset/up.csv" # Regions with increased activity generated by the command above
region2_file = "./output_region_activity_regression/dataset/nochange.csv" # Unchanged regions generated iby the command above
[ ]:
from chrombert_tools import interpret_regulator_effects_between_region_groups as run_key_regulator
factor_importance_rank = run_key_regulator(
region1_file=region1_file, # your focus region group 1
region2_file=region2_file, # your focus region group 2
odir="output_interpret_regulator_transition_myoblast", # output directory
genome="hg38", # genome
resolution="1kb", # resolution
ft_ckpt=ft_ckpt, # fine-tuned checkpoint in step 1
batch_size=64, # batch size
# model_config='./output_region_activity_regression/model_config.json',
# data_config='./output_region_activity_regression/dataset_config.json',
model_config=results.model_config, # model configuration in Step 1
data_config=results.data_config, # data configuration in Step 1
)
Region summary - total: 1000, overlapping with ChromBERT: 1000 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Region summary - total: 1000, overlapping with ChromBERT: 1000 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_activity_regression/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
100%|██████████| 300/300 [00:22<00:00, 13.28it/s]
100%|██████████| 290/290 [00:20<00:00, 13.92it/s]
Identify key regulators across regions(top 25)
factors similarity rank embedding_shift
0 pax3-foxo1a 0.168074 1 0.831926
1 myog 0.179956 2 0.820044
2 myf5 0.186609 3 0.813391
3 myod1 0.272119 4 0.727881
4 pax7 0.296411 5 0.703589
5 dux4 0.373782 6 0.626218
6 tead1 0.414811 7 0.585189
7 neurod1 0.440464 8 0.559536
8 ascl1 0.449937 9 0.550063
9 six2 0.459578 10 0.540422
10 neurog2 0.463017 11 0.536983
11 pgbd3 0.531903 12 0.468097
12 yap1 0.556353 13 0.443647
13 mycn 0.567710 14 0.432290
14 tcf21 0.605918 15 0.394082
15 pou3f2 0.639499 16 0.360501
16 six1 0.648377 17 0.351623
17 pax6 0.679352 18 0.320648
18 chd4 0.692183 19 0.307817
19 tbx5 0.697478 20 0.302522
20 ss18 0.703591 21 0.296409
21 znf92 0.709506 22 0.290494
22 ep400 0.716686 23 0.283314
23 tead 0.728783 24 0.271217
24 mef2b 0.731181 25 0.268819
Finished
Used fine-tuned ChromBERT checkpoint: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_activity_regression/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Key regulators across regions saved to: output_interpret_regulator_transition_myoblast/results/factor_importance_rank.csv
[8]:
factor_importance_rank.head(n=25)
[8]:
| factors | similarity | rank | embedding_shift | |
|---|---|---|---|---|
| 0 | pax3-foxo1a | 0.168074 | 1 | 0.831926 |
| 1 | myog | 0.179956 | 2 | 0.820044 |
| 2 | myf5 | 0.186609 | 3 | 0.813391 |
| 3 | myod1 | 0.272119 | 4 | 0.727881 |
| 4 | pax7 | 0.296411 | 5 | 0.703589 |
| 5 | dux4 | 0.373782 | 6 | 0.626218 |
| 6 | tead1 | 0.414811 | 7 | 0.585189 |
| 7 | neurod1 | 0.440464 | 8 | 0.559536 |
| 8 | ascl1 | 0.449937 | 9 | 0.550063 |
| 9 | six2 | 0.459578 | 10 | 0.540422 |
| 10 | neurog2 | 0.463017 | 11 | 0.536983 |
| 11 | pgbd3 | 0.531903 | 12 | 0.468097 |
| 12 | yap1 | 0.556353 | 13 | 0.443647 |
| 13 | mycn | 0.567710 | 14 | 0.432290 |
| 14 | tcf21 | 0.605918 | 15 | 0.394082 |
| 15 | pou3f2 | 0.639499 | 16 | 0.360501 |
| 16 | six1 | 0.648377 | 17 | 0.351623 |
| 17 | pax6 | 0.679352 | 18 | 0.320648 |
| 18 | chd4 | 0.692183 | 19 | 0.307817 |
| 19 | tbx5 | 0.697478 | 20 | 0.302522 |
| 20 | ss18 | 0.703591 | 21 | 0.296409 |
| 21 | znf92 | 0.709506 | 22 | 0.290494 |
| 22 | ep400 | 0.716686 | 23 | 0.283314 |
| 23 | tead | 0.728783 | 24 | 0.271217 |
| 24 | mef2b | 0.731181 | 25 | 0.268819 |
[ ]: