Build a model to predict region activity

This notebook shows how to use the ChromBERT-tools Python API region_activity_regression to build a model to predict the region activity.

For the bash command-line usage, see `examples/cli/region_activity_regression.ipynb <../cli/region_activity_regression.ipynb>`__.

For more details, please refer to the `region_activity_regression <https://chrombert-tools.readthedocs.io/en/latest/commands/region_activity_regression.html>`__ command documentation

1. Build model

[1]:
from chrombert_tools import region_activity_regression

Predict the foldchange of region activity between cell-state-transitions

[ ]:
# Download example data
# Myoblast and fibroblast data: ATAC-seq bigWig and peak files
import subprocess
import os
if not os.path.exists('../data/myoblast_ENCFF647RNC_peak.bed'):
    cmd = f'wget https://www.encodeproject.org/files/ENCFF647RNC/@@download/ENCFF647RNC.bed.gz -O ../data/myoblast_ENCFF647RNC_peak.bed.gz'
    subprocess.run(cmd, shell=True)
    cmd = f"gzip -d ../data/myoblast_ENCFF647RNC_peak.bed.gz"
    subprocess.run(cmd, shell=True)

if not os.path.exists('../data/myoblast_ENCFF149ERN_signal.bigwig'):
    cmd = f'wget https://www.encodeproject.org/files/ENCFF149ERN/@@download/ENCFF149ERN.bigWig -O ../data/myoblast_ENCFF149ERN_signal.bigwig'
    subprocess.run(cmd, shell=True)


if not os.path.exists('../data/fibroblast_ENCFF184KAM_peak.bed'):
    cmd = f'wget https://www.encodeproject.org/files/ENCFF184KAM/@@download/ENCFF184KAM.bed.gz -O ../data/fibroblast_ENCFF184KAM_peak.bed.gz'
    subprocess.run(cmd, shell=True)
    cmd = f"gzip -d ../data/fibroblast_ENCFF184KAM_peak.bed.gz"
    subprocess.run(cmd, shell=True)


if not os.path.exists('../data/fibroblast_ENCFF361BTT_signal.bigwig'):
    cmd = 'wget https://www.encodeproject.org/files/ENCFF361BTT/@@download/ENCFF361BTT.bigWig -O ../data/fibroblast_ENCFF361BTT_signal.bigwig'
    subprocess.run(cmd, shell=True)
[ ]:
results = region_activity_regression(
    acc_peak1="../data/fibroblast_ENCFF184KAM_peak.bed", # cell-state 1 peak file: fibroblast peak file
    acc_peak2="../data/myoblast_ENCFF647RNC_peak.bed", # cell-state 2 peak file: myoblast peak file
    acc_signal1="../data/fibroblast_ENCFF361BTT_signal.bigwig", # cell-state 1 signal file: fibroblast signal file
    acc_signal2="../data/myoblast_ENCFF149ERN_signal.bigwig", # cell-state 2 signal file: myoblast signal file
    genome="hg38", # genome
    resolution="1kb", # resolution
    odir="output_region_activity_regression", # output directory
    include_tss_background = True, # include the background regions from TSS, we use the same background regions for both cell-states
)
Stage 1: prepare chromatin accessibility dataset
Processing stage 1: prepare chromatin accessibility dataset
  Mode: two states (fold-change label)
  TSS ± flank background regions: enabled, flank distance: 10000 bp
[state1_0] Region summary - total: 287892, overlapping with ChromBERT: 284160 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 5967
  [state1] merged 1 peak file(s) → 250143 unique ChromBERT regions (after overlap + dedup)
[state2_0] Region summary - total: 373422, overlapping with ChromBERT: 368260 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 7920
  [state2] merged 1 peak file(s) → 324690 unique ChromBERT regions (after overlap + dedup)
[tss_bg] Region summary - total: 19260, overlapping with ChromBERT: 333463 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 19260
 total region: 671022
  Fast mode: downsampled to ~20000 regions
Finished stage 1
Stage 2: train ChromBERT (accessibility fold change, two states)

[Attempt 0/2] seed=55
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
/mnt/Storage/home/chenqianqian/miniconda3/envs/chrombert/lib/python3.9/site-packages/torchmetrics/utilities/prints.py:43: UserWarning: Metric `SpearmanCorrcoef` will save all targets and predictions in the buffer. For large datasets, this may lead to large memory footprint.
  warnings.warn(*args, **kwargs)
Trainer will use only 1 of 4 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=4)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA A100-PCIE-40GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
Loading `train_dataloader` to estimate number of stepping batches.
/mnt/Storage/home/chenqianqian/miniconda3/envs/chrombert/lib/python3.9/site-packages/lightning/pytorch/utilities/model_summary/model_summary.py:231: Precision bf16-mixed is not supported by the model summary.  Estimated model size in MB will not be accurate. Using 32 bits instead.

  | Name  | Type             | Params | Mode
---------------------------------------------------
0 | model | ChromBERTGeneral | 62.8 M | train
---------------------------------------------------
18.9 M    Trainable params
43.9 M    Non-trainable params
62.8 M    Total params
251.095   Total estimated model params size (MB)
153       Modules in train mode
0         Modules in eval mode
/mnt/Storage/home/chenqianqian/miniconda3/envs/chrombert/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:484: Your `val_dataloader`'s sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test dataloaders.
Metric default_validation/pcc improved. New best score: 0.142
Metric default_validation/pcc improved by 0.068 >= min_delta = 0.01. New best score: 0.210
Metric default_validation/pcc improved by 0.441 >= min_delta = 0.01. New best score: 0.650
Metric default_validation/pcc improved by 0.186 >= min_delta = 0.01. New best score: 0.837
Metric default_validation/pcc improved by 0.048 >= min_delta = 0.01. New best score: 0.885
Metric default_validation/pcc improved by 0.015 >= min_delta = 0.01. New best score: 0.900
Metric default_validation/pcc improved by 0.016 >= min_delta = 0.01. New best score: 0.917
Monitored metric default_validation/pcc did not improve in the last 5 records. Best score: 0.917. Signaling Trainer to stop.
Evaluating the finetuned model performance
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_activity_regression/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
/mnt/Storage/home/chenqianqian/miniconda3/envs/chrombert/lib/python3.9/site-packages/torchmetrics/utilities/prints.py:43: UserWarning: Metric `SpearmanCorrcoef` will save all targets and predictions in the buffer. For large datasets, this may lead to large memory footprint.
  warnings.warn(*args, **kwargs)
ft_ckpt: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_activity_regression/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt, test_metrics: {'pearsonr': 0.883388340473175, 'spearmanr': 0.8222081065177917, 'mse': 0.3158803880214691, 'mae': 0.39593231678009033, 'r2': 0.7368538771517714}
Attempt metrics: pearsonr=0.883388340473175
Accepted run (pearsonr=0.8834 >= 0.2).

Finished stage 2: obtained a fine-tuned ChromBERT
Best pearsonr=0.883388340473175, metrics={'pearsonr': 0.883388340473175, 'spearmanr': 0.8222081065177917, 'mse': 0.3158803880214691, 'mae': 0.39593231678009033, 'r2': 0.7368538771517714, 'ft_ckpt': '/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_activity_regression/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt'}
Finished stage 2 (trained)
Stage 3: Predicting
Region summary - total: 2000, overlapping with ChromBERT: 2000 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
  Predict input: output_region_activity_regression/predict/model_input.tsv
Predicting: 100%|██████████| 501/501 [01:41<00:00,  4.92it/s]
  Predictions saved: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_activity_regression/predict/predictions.csv  (2002 regions)
Finished stage 3

============================================================
All stages completed!
============================================================
Fine-tuned checkpoint: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_activity_regression/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Predictions: output_region_activity_regression/predict/predictions.csv
[ ]:
predict_df = results.predictions_df # test prediction
predict_df.head()
chrom start end build_region_index start_input end_input predicted_value true_label
0 chr1 1012000 1013000 304 1012000 1013000 -0.082217 -0.254314
1 chr1 1065000 1066000 357 1065000 1066000 -0.416241 -0.024423
2 chr1 1318000 1319000 597 1318000 1319000 -0.882017 -1.585025
3 chr1 2035000 2036000 1247 2035000 2036000 2.072878 2.632590
4 chr1 2556000 2557000 1737 2556000 2557000 -1.565579 -0.975630
[ ]:
# calculate the pearson and spearman correlation
from scipy.stats import pearsonr,spearmanr
pearsonr(predict_df.true_label,predict_df.predicted_value),spearmanr(predict_df.true_label,predict_df.predicted_value)

(PearsonRResult(statistic=0.8833282059899981, pvalue=0.0),
 SignificanceResult(statistic=0.8223618088793297, pvalue=0.0))
[ ]:
model_config = results.model_config # model configuration
data_config = results.data_config # data configuration
model_config,data_config
('output_region_activity_regression/model_config.json',
 'output_region_activity_regression/dataset_config.json')
[ ]:
ft_ckpt = results.model_ckpt # fine-tuned model checkpoint
# ft_ckpt = '/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_activity_regression/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt'
ft_ckpt
'/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_activity_regression/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt'

2.Interpretation

After building a model, we can use the interpretation layer of ChromBERT-tools to derive biological insights.

Here, we identify important factors between different region groups as an example.

[ ]:
region1_file = "./output_region_activity_regression/dataset/up.csv"  # Regions with increased activity generated by the command above
region2_file = "./output_region_activity_regression/dataset/nochange.csv"  # Unchanged regions generated iby the command above

[ ]:
from chrombert_tools import interpret_regulator_effects_between_region_groups as run_key_regulator

factor_importance_rank = run_key_regulator(
    region1_file=region1_file, # your focus region group 1
    region2_file=region2_file, # your focus region group 2
    odir="output_interpret_regulator_transition_myoblast", # output directory
    genome="hg38", # genome
    resolution="1kb", # resolution
    ft_ckpt=ft_ckpt, # fine-tuned checkpoint in step 1
    batch_size=64, # batch size
    # model_config='./output_region_activity_regression/model_config.json',
    # data_config='./output_region_activity_regression/dataset_config.json',
    model_config=results.model_config, # model configuration in Step 1
    data_config=results.data_config, # data configuration in Step 1
)
Region summary - total: 1000, overlapping with ChromBERT: 1000 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Region summary - total: 1000, overlapping with ChromBERT: 1000 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 0
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_activity_regression/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
100%|██████████| 300/300 [00:22<00:00, 13.28it/s]
100%|██████████| 290/290 [00:20<00:00, 13.92it/s]
Identify key regulators across regions(top 25)
        factors  similarity  rank  embedding_shift
0   pax3-foxo1a    0.168074     1         0.831926
1          myog    0.179956     2         0.820044
2          myf5    0.186609     3         0.813391
3         myod1    0.272119     4         0.727881
4          pax7    0.296411     5         0.703589
5          dux4    0.373782     6         0.626218
6         tead1    0.414811     7         0.585189
7       neurod1    0.440464     8         0.559536
8         ascl1    0.449937     9         0.550063
9          six2    0.459578    10         0.540422
10      neurog2    0.463017    11         0.536983
11        pgbd3    0.531903    12         0.468097
12         yap1    0.556353    13         0.443647
13         mycn    0.567710    14         0.432290
14        tcf21    0.605918    15         0.394082
15       pou3f2    0.639499    16         0.360501
16         six1    0.648377    17         0.351623
17         pax6    0.679352    18         0.320648
18         chd4    0.692183    19         0.307817
19         tbx5    0.697478    20         0.302522
20         ss18    0.703591    21         0.296409
21        znf92    0.709506    22         0.290494
22        ep400    0.716686    23         0.283314
23         tead    0.728783    24         0.271217
24        mef2b    0.731181    25         0.268819
Finished
Used fine-tuned ChromBERT checkpoint: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/api/output_region_activity_regression/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Key regulators across regions saved to: output_interpret_regulator_transition_myoblast/results/factor_importance_rank.csv
[8]:
factor_importance_rank.head(n=25)
[8]:
factors similarity rank embedding_shift
0 pax3-foxo1a 0.168074 1 0.831926
1 myog 0.179956 2 0.820044
2 myf5 0.186609 3 0.813391
3 myod1 0.272119 4 0.727881
4 pax7 0.296411 5 0.703589
5 dux4 0.373782 6 0.626218
6 tead1 0.414811 7 0.585189
7 neurod1 0.440464 8 0.559536
8 ascl1 0.449937 9 0.550063
9 six2 0.459578 10 0.540422
10 neurog2 0.463017 11 0.536983
11 pgbd3 0.531903 12 0.468097
12 yap1 0.556353 13 0.443647
13 mycn 0.567710 14 0.432290
14 tcf21 0.605918 15 0.394082
15 pou3f2 0.639499 16 0.360501
16 six1 0.648377 17 0.351623
17 pax6 0.679352 18 0.320648
18 chd4 0.692183 19 0.307817
19 tbx5 0.697478 20 0.302522
20 ss18 0.703591 21 0.296409
21 znf92 0.709506 22 0.290494
22 ep400 0.716686 23 0.283314
23 tead 0.728783 24 0.271217
24 mef2b 0.731181 25 0.268819
[ ]: