Identify driver factors in cell state transitions

The predict_transition_driver_regulators command identifies key transcription factors that drive changes in gene expression and/or chromatin accessibility during cell state transitions (e.g., differentiation or reprogramming).

You can run this command with:

  • expression only,

  • accessibility only, or

  • both expression and accessibility.

Provide the corresponding input files for the analyses you want to perform.

Note: The remaining examples will only show the direct command usage.

If you need to use Apptainer container, please refer to the `apptainer_use.ipynb <apptainer_use.ipynb>`__ tutorial for detailed instructions on using apptainer exec with chrombert-tools.

For more details, please refer to the `predict_transition_driver_regulators <https://chrombert-tools.readthedocs.io/en/latest/commands/predict_transition_driver_regulators.html>`__ command documentation.

Example:

Find driver regulators in fibroblast-to-myoblast transition using both expression and accessibility

[1]:
import pandas as pd
import numpy as np
import os

[2]:
os.environ["CUDA_VISIBLE_DEVICES"]='1'
[4]:
!chrombert-tools predict_transition_driver_regulators -h
Usage: chrombert-tools predict_transition_driver_regulators
           [OPTIONS]

  Find driver factors in cell state transitions.

  Provide at least one of: - Expression: --exp-tpm1 and --exp-tpm2 -
  Accessibility: --acc-peak1 and --acc-signal1 (optional --acc-peak2; add
  --acc-signal2 for fold-change / acc stage 3).

  Merged rankings require both expression and two-state accessibility
  rankings.

Options:
  --exp-tpm1 FILE                 Expression (TPM) file for cell state 1. CSV
                                  format with 'gene_id' and 'tpm' columns.
  --exp-tpm2 FILE                 Expression (TPM) file for cell state 2. CSV
                                  format with 'gene_id' and 'tpm' columns.
  --acc-peak1 TEXT                BED peak file(s) for state 1; use ';' for
                                  multiple.
  --acc-peak2 TEXT                Optional BED peak file(s) for state 2; use
                                  ';' for multiple.
  --acc-signal1 TEXT              BigWig(s) for state 1; use ';' for
                                  replicates.
  --acc-signal2 TEXT              Optional. BigWig(s) for state 2; use ';' for
                                  replicates.
  --direction [2-1|1-2]           Direction of cell state transition: '2-1'
                                  means from state 1 to state 2; '1-2' means
                                  from state 2 to state 1.  [default: 2-1]
  --odir DIRECTORY                Output directory.  [default: ./output]
  --genome [hg38|mm10]            Reference genome.  [default: hg38]
  --resolution [200bp|1kb|2kb|4kb]
                                  ChromBERT resolution.  [default: 1kb]
  --mode [fast|full]              Training mode: 'fast' downsamples to 20k
                                  regions for quicker training; 'full' uses
                                  all regions.  [default: fast]
  --flank-window INTEGER          Flank window for GEP (expression) model when
                                  loading or training.  [default: 4]
  --tss-flank INTEGER             Flanking distance (bp) around TSS when TSS
                                  background is enabled (see --include-tss-
                                  background).  [default: 10000]
  --include-tss-background BOOLEAN
                                  Include protein-coding TSS±flank bins as
                                  extra background in the accessibility
                                  dataset (needs gene_meta). Default: true;
                                  pass e.g. --include-tss-background false or
                                  no to disable.  [default: True]
  --ft-ckpt-exp FILE              Fine-tuned ChromBERT checkpoint file for
                                  expression changes.
  --ft-ckpt-acc FILE              Fine-tuned ChromBERT checkpoint file for
                                  chromatin accessibility changes.
  --chrombert-cache-dir DIRECTORY
                                  ChromBERT cache directory (containing
                                  config/ and anno/ subfolders).  [default:
                                  ~/.cache/chrombert/data]
  --batch-size INTEGER            Batch size. Increase this value if you have
                                  sufficient GPU memory.  [default: 4]
  -h, --help                      Show this message and exit.

Run

[5]:
# Download data
import subprocess
if not os.path.exists('../data/myoblast_ENCFF647RNC_peak.bed'):
    cmd = f'wget https://www.encodeproject.org/files/ENCFF647RNC/@@download/ENCFF647RNC.bed.gz -O ../data/myoblast_ENCFF647RNC_peak.bed.gz'
    subprocess.run(cmd, shell=True)
    cmd = f"gzip -d ../data/myoblast_ENCFF647RNC_peak.bed.gz"
    subprocess.run(cmd, shell=True)

if not os.path.exists('../data/myoblast_ENCFF149ERN_signal.bigwig'):
    cmd = f'wget https://www.encodeproject.org/files/ENCFF149ERN/@@download/ENCFF149ERN.bigWig -O ../data/myoblast_ENCFF149ERN_signal.bigwig'
    subprocess.run(cmd, shell=True)


if not os.path.exists('../data/fibroblast_ENCFF184KAM_peak.bed'):
    cmd = f'wget https://www.encodeproject.org/files/ENCFF184KAM/@@download/ENCFF184KAM.bed.gz -O ../data/fibroblast_ENCFF184KAM_peak.bed.gz'
    subprocess.run(cmd, shell=True)
    cmd = f"gzip -d ../data/fibroblast_ENCFF184KAM_peak.bed.gz"
    subprocess.run(cmd, shell=True)


if not os.path.exists('../data/fibroblast_ENCFF361BTT_signal.bigwig'):
    cmd = 'wget https://www.encodeproject.org/files/ENCFF361BTT/@@download/ENCFF361BTT.bigWig -O ../data/fibroblast_ENCFF361BTT_signal.bigwig'
    subprocess.run(cmd, shell=True)
[ ]:
# Runtime estimates:
#   - fast mode: ~3-5 hours
#     (uses all ~19,620 genes for expression analysis, but downsamples
#      chromatin accessibility regions to 20k for faster training)
#
# Note: Both modes (fast and full) use the complete gene expression dataset. The 'fast' mode
# only downsamples chromatin accessibility regions, not gene data.

# So this downsampled 5000 genes for expression analysis for test (40-100 minutes)

# --exp-tpm1: expression tpm file of cell type 1
# --exp-tpm2: expression tpm file of cell type 2
# --acc-peak1: accessibility peak file of cell type 1
# --acc-peak2: accessibility peak file of cell type 2
# --acc-signal1: accessibility signal file of cell type 1
# --acc-signal2: accessibility signal file of cell type 2
# --genome: genome
# --resolution: resolution
# --odir: output directory
# --direction: direction of transition

!chrombert-tools predict_transition_driver_regulators \
  --exp-tpm1 "../data/fibroblast_expression_sample5000.csv" \
  --exp-tpm2 "../data/myoblast_expression_sample5000.csv" \
  --acc-peak1 "../data/fibroblast_ENCFF184KAM_peak.bed" \
  --acc-peak2 "../data/myoblast_ENCFF647RNC_peak.bed" \
  --acc-signal1 "../data/fibroblast_ENCFF361BTT_signal.bigwig" \
  --acc-signal2 "../data/myoblast_ENCFF149ERN_signal.bigwig" \
  --genome 'hg38' \
  --resolution '1kb' \
  --odir output_find_driver_in_transition \
  --direction "2-1" 2> "./tmp/hg38_1kb.stderr.log"
Whether to train ChromBERT to predict chromatin accessibility changes in cell state transition: True
Stage 1 (acc): prepare dataset
Processing stage 1: prepare chromatin accessibility dataset
  Mode: two states (fold-change label)
  TSS ± flank background regions: enabled, flank distance: 10000 bp
[state1_0] Region summary - total: 287892, overlapping with ChromBERT: 284160 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 5967
  [state1] merged 1 peak file(s) → 250143 unique ChromBERT regions (after overlap + dedup)
[state2_0] Region summary - total: 373422, overlapping with ChromBERT: 368260 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 7920
  [state2] merged 1 peak file(s) → 324690 unique ChromBERT regions (after overlap + dedup)
[tss_bg] Region summary - total: 19260, overlapping with ChromBERT: 333463 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 19260
 total region: 671022
  Fast mode: downsampled to ~20000 regions
Finished Stage 1 (acc)
Stage 2 (acc): train ChromBERT to predict chromatin accessibility changes in cell state transition

[Attempt 0/2] seed=55
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Epoch 0:  20%|████▍                 | 800/4000 [02:17<09:09,  5.82it/s, v_num=0]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.34it/s]
Epoch 0:  40%|▍| 1600/4000 [04:59<07:29,  5.34it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.30it/s]
Epoch 0:  60%|▌| 2400/4000 [07:42<05:08,  5.19it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.30it/s]
Epoch 0:  80%|▊| 3200/4000 [10:25<02:36,  5.12it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.27it/s]
Epoch 0: 100%|█| 4000/4000 [13:07<00:00,  5.08it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.29it/s]
Epoch 1:  20%|▏| 800/4000 [02:16<09:07,  5.85it/s, v_num=0, default_validation/r
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.30it/s]
Epoch 1:  40%|▍| 1600/4000 [04:59<07:28,  5.35it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.29it/s]
Epoch 1:  60%|▌| 2400/4000 [07:41<05:07,  5.20it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.28it/s]
Epoch 1:  80%|▊| 3200/4000 [10:23<02:35,  5.13it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.29it/s]
Epoch 1: 100%|█| 4000/4000 [13:05<00:00,  5.10it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.32it/s]
Epoch 2:  20%|▏| 800/4000 [02:16<09:06,  5.85it/s, v_num=0, default_validation/r
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.32it/s]
Epoch 2:  40%|▍| 1600/4000 [04:57<07:26,  5.37it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.32it/s]
Epoch 2:  60%|▌| 2400/4000 [07:40<05:07,  5.21it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.24it/s]
Epoch 2:  80%|▊| 3200/4000 [10:23<02:35,  5.13it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.27it/s]
Epoch 2: 100%|█| 4000/4000 [13:05<00:00,  5.09it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.30it/s]
Epoch 3:  20%|▏| 800/4000 [02:17<09:08,  5.83it/s, v_num=0, default_validation/r
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.29it/s]
Epoch 3:  40%|▍| 1600/4000 [04:59<07:28,  5.35it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.28it/s]
Epoch 3:  60%|▌| 2400/4000 [07:40<05:07,  5.21it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.17it/s]
Epoch 3:  60%|▌| 2400/4000 [08:05<05:23,  4.94it/s, v_num=0, default_validation/
Evaluating the finetuned model performance
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
ft_ckpt: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt, test_metrics: {'pearsonr': 0.8857296109199524, 'spearmanr': 0.8241026997566223, 'mse': 0.32735225558280945, 'mae': 0.40540555119514465, 'r2': 0.7272971742515542}
Attempt metrics: pearsonr=0.8857296109199524
Accepted run (pearsonr=0.8857 >= 0.2).

Finished stage 2: obtained a fine-tuned ChromBERT
Best pearsonr=0.8857296109199524, metrics={'pearsonr': 0.8857296109199524, 'spearmanr': 0.8241026997566223, 'mse': 0.32735225558280945, 'mae': 0.40540555119514465, 'r2': 0.7272971742515542, 'ft_ckpt': '/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt'}
Finished stage 2 (trained)
[ ]:
### Factor importance ranking using the transcriptome modality

odir = "./output_find_driver_in_transition"
exp_odir = f'{odir}/exp'
exp_results_odir = f"{exp_odir}/results"
exp_rank_df = pd.read_csv(os.path.join(exp_results_odir, "factor_importance_rank.csv"))
exp_rank_df
factors similarity rank embedding_shift
0 histone lysine acetylation 0.921118 1 0.078882
1 histone lysine crotonylation 0.923516 2 0.076484
2 cbx6 0.929447 3 0.070553
3 cbx7 0.933499 4 0.066501
4 h2ak5ac 0.943805 5 0.056195
... ... ... ... ...
1067 snapc4 0.996417 1068 0.003583
1068 foxo4 0.996558 1069 0.003442
1069 id2 0.996619 1070 0.003381
1070 mafg 0.996837 1071 0.003163
1071 tfcp2 0.996966 1072 0.003034

1072 rows × 4 columns

[ ]:
### Factor importance ranking using the chromatin accessibility modality

acc_odir = f'{odir}/acc'
acc_results_odir = f"{acc_odir}/results"
acc_rank_df = pd.read_csv(os.path.join(acc_results_odir, "factor_importance_rank.csv"))
acc_rank_df
factors similarity rank embedding_shift
0 pax3-foxo1a 0.140829 1 0.859171
1 myog 0.150968 2 0.849032
2 myf5 0.181639 3 0.818361
3 myod1 0.267095 4 0.732905
4 pax7 0.304148 5 0.695852
... ... ... ... ...
1067 znf26 0.983500 1068 0.016500
1068 tox4 0.983787 1069 0.016213
1069 znf266 0.984463 1070 0.015537
1070 zbed4 0.984526 1071 0.015474
1071 pitx1 0.985094 1072 0.014906

1072 rows × 4 columns

[ ]:
### Factor importance ranking using the transcriptome and chromatin accessibility modalities

import pandas as pd
merge_odir=f'{odir}/merge'
merge_df = pd.read_csv(os.path.join(merge_odir, "factor_importance_rank.csv"))
merge_df.head(n=10)
factors similarity_exp rank_exp embedding_shift_exp similarity_acc rank_acc embedding_shift_acc total_rank
0 chd4 0.975052 25 0.024948 0.470624 15 0.529376 1
1 yap1 0.978078 34 0.021922 0.420340 13 0.579660 2
2 neurog2 0.980249 37 0.019751 0.402066 10 0.597934 2
3 myf5 0.983809 54 0.016191 0.181639 3 0.818361 4
4 h3k36me1 0.957156 8 0.042844 0.802063 57 0.197937 5
5 esco2 0.977604 33 0.022396 0.703099 34 0.296901 6
6 tcf21 0.982792 48 0.017208 0.538058 20 0.461942 7
7 pgbd3 0.984118 56 0.015882 0.418980 12 0.581020 7
8 tead1 0.985233 67 0.014767 0.357693 8 0.642307 9
9 myod1 0.986058 72 0.013942 0.267095 4 0.732905 10
[ ]:
### merge the two modality
merge_df = pd.merge(exp_rank_df,acc_rank_df,on='factors',how='inner',suffixes=['_exp','_acc'])
merge_df
factors similarity_exp rank_exp embedding_shift_exp similarity_acc rank_acc embedding_shift_acc
0 histone lysine acetylation 0.921118 1 0.078882 0.937713 195 0.062287
1 histone lysine crotonylation 0.923516 2 0.076484 0.941648 217 0.058352
2 cbx6 0.929447 3 0.070553 0.894013 98 0.105987
3 cbx7 0.933499 4 0.066501 0.901252 108 0.098748
4 h2ak5ac 0.943805 5 0.056195 0.913511 130 0.086489
... ... ... ... ... ... ... ...
1067 snapc4 0.996417 1068 0.003583 0.971879 823 0.028121
1068 foxo4 0.996558 1069 0.003442 0.972842 844 0.027158
1069 id2 0.996619 1070 0.003381 0.972055 828 0.027945
1070 mafg 0.996837 1071 0.003163 0.978791 972 0.021209
1071 tfcp2 0.996966 1072 0.003034 0.977366 936 0.022634

1072 rows × 7 columns

[18]:
merge_df['total_rank']=((merge_df['rank_exp']+merge_df['rank_acc'])/2).rank().astype(int)
merge_df = merge_df.sort_values('total_rank').reset_index(drop=True)
merge_df
[18]:
factors similarity_exp rank_exp embedding_shift_exp similarity_acc rank_acc embedding_shift_acc total_rank
0 chd4 0.975052 25 0.024948 0.470624 15 0.529376 1
1 yap1 0.978078 34 0.021922 0.420340 13 0.579660 2
2 neurog2 0.980249 37 0.019751 0.402066 10 0.597934 2
3 myf5 0.983809 54 0.016191 0.181639 3 0.818361 4
4 h3k36me1 0.957156 8 0.042844 0.802063 57 0.197937 5
... ... ... ... ... ... ... ... ...
1067 hes4 0.996293 1064 0.003707 0.982166 1049 0.017834 1068
1068 znf26 0.996089 1047 0.003911 0.983500 1068 0.016500 1069
1069 zbtb10 0.996412 1067 0.003588 0.982231 1052 0.017769 1070
1070 zbed4 0.996220 1057 0.003780 0.984526 1071 0.015474 1071
1071 pitx1 0.996342 1065 0.003658 0.985094 1072 0.014906 1072

1072 rows × 8 columns

Load the fine-tuned checkpoint to infer key regulators and TRN for myoblast (skip fine-tuning)

[ ]:
# Path pattern for fine-tuned checkpoints generated by the command above
ft_ckpt_exp_dir = "./output_find_driver_in_transition/exp/train/**/*.ckpt"
ft_ckpt_acc_dir = "./output_find_driver_in_transition/acc/train/**/*.ckpt"

import glob
ft_ckpt_exp = glob.glob(ft_ckpt_exp_dir, recursive=True)[0]
ft_ckpt_acc = glob.glob(ft_ckpt_acc_dir, recursive=True)[0]
[20]:
ft_ckpt_exp,ft_ckpt_acc
[20]:
('./output_find_driver_in_transition/exp/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=4-step=129.ckpt',
 './output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt')
[ ]:
# --ft-ckpt-exp: fine-tuned checkpoint for expression modality
# --ft-ckpt-acc: fine-tuned checkpoint for chromatin accessibility modality
# --genome: genome
# --exp-tpm1: expression tpm file of cell type 1
# --exp-tpm2: expression tpm file of cell type 2
# --acc-peak1: accessibility peak file of cell type 1
# --acc-peak2: accessibility peak file of cell type 2
# --acc-signal1: accessibility signal file of cell type 1
# --acc-signal2: accessibility signal file of cell type 2
# --resolution: resolution
# --odir: output directory
# --direction: direction of transition

!chrombert-tools predict_transition_driver_regulators \
  --ft-ckpt-exp {ft_ckpt_exp} \
  --ft-ckpt-acc {ft_ckpt_acc} \
  --genome 'hg38' \
  --exp-tpm1 "../data/fibroblast_expression_sample5000.csv" \
  --exp-tpm2 "../data/myoblast_expression_sample5000.csv" \
  --acc-peak1 "../data/fibroblast_ENCFF184KAM_peak.bed" \
  --acc-peak2 "../data/myoblast_ENCFF647RNC_peak.bed" \
  --acc-signal1 "../data/fibroblast_ENCFF361BTT_signal.bigwig" \
  --acc-signal2 "../data/myoblast_ENCFF149ERN_signal.bigwig" \
  --resolution '1kb' \
  --odir output_find_driver_in_transition_load_ckpt \
  --direction "2-1" 2> "./tmp/hg38_1kb.stderr.log"
Whether to train ChromBERT to predict chromatin accessibility changes in cell state transition: True
Stage 1 (acc): prepare dataset
Processing stage 1: prepare chromatin accessibility dataset
  Mode: two states (fold-change label)
  TSS ± flank background regions: enabled, flank distance: 10000 bp
[state1_0] Region summary - total: 287892, overlapping with ChromBERT: 284160 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 5967
  [state1] merged 1 peak file(s) → 250143 unique ChromBERT regions (after overlap + dedup)
[state2_0] Region summary - total: 373422, overlapping with ChromBERT: 368260 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 7920
  [state2] merged 1 peak file(s) → 324690 unique ChromBERT regions (after overlap + dedup)
[tss_bg] Region summary - total: 19260, overlapping with ChromBERT: 333463 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 19260
 total region: 671022
  Fast mode: downsampled to ~20000 regions
Finished Stage 1 (acc)
Stage 2 (acc): train ChromBERT to predict chromatin accessibility changes in cell state transition
Using fine-tuned checkpoint: ./output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from ./output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
Finished stage 2 (loaded checkpoint)
Finished Stage 2 (acc)
Stage 3 (acc): infer driver factors in regions with strong vs weak accessibility change
Finished stage 3 (acc)
Whether to train ChromBERT to predict expression changes in cell state transition: True
Stage 1 (exp): prepare dataset
Processing stage 1: prepare expression dataset
  Mode: two states (log1p TPM fold change; per state, genes inner-merged across ';' files then TPM mean)
Finished Stage 1 (exp)
Stage 2 (exp): train ChromBERT to predict expression changes in cell state transition
Use fine-tuned ChromBERT checkpoint file: ./output_find_driver_in_transition/exp/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=4-step=129.ckpt
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from ./output_find_driver_in_transition/exp/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=4-step=129.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
Finished stage 2 (loaded checkpoint)
Finished Stage 2 (exp)
Stage 3 (exp): infer driver factors in different expression activity genes
Finished stage 3 (exp)
Merging factor ranks from expression and chromatin accessibility
Finished merging factor ranks from expression and chromatin accessibility
Finished all stages!
Used fine-tuned ChromBERT checkpoint: ./output_find_driver_in_transition/exp/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=4-step=129.ckpt
Driver factors for expression changes: output_find_driver_in_transition_load_ckpt/exp/results/factor_importance_rank.csv
Used fine-tuned ChromBERT checkpoint: ./output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Driver factors for chromatin accessibility changes: output_find_driver_in_transition_load_ckpt/acc/results/factor_importance_rank.csv
Integrated driver factor rankings: output_find_driver_in_transition_load_ckpt/merge/factor_importance_rank.csv
[ ]:
### Factor importance ranking using the transcriptome modality
odir = "./output_find_driver_in_transition_load_ckpt"
exp_odir = f'{odir}/exp'
exp_results_odir = f"{exp_odir}/results"
exp_rank_df = pd.read_csv(os.path.join(exp_results_odir, "factor_importance_rank.csv"))
exp_rank_df.head(n=10)
factors similarity rank embedding_shift
0 histone lysine acetylation 0.921118 1 0.078882
1 histone lysine crotonylation 0.923516 2 0.076484
2 cbx6 0.929447 3 0.070553
3 cbx7 0.933499 4 0.066501
4 h2ak5ac 0.943805 5 0.056195
5 h2ax 0.947999 6 0.052001
6 h3k79me1 0.955875 7 0.044125
7 h3k36me1 0.957156 8 0.042844
8 cbx8 0.957597 9 0.042403
9 h2bk20ac 0.961643 10 0.038357
[ ]:
### Factor importance ranking using the chromatin accessibility modality
acc_odir = f'{odir}/acc'
acc_results_odir = f"{acc_odir}/results"
acc_rank_df = pd.read_csv(os.path.join(acc_results_odir, "factor_importance_rank.csv"))
acc_rank_df.head(n=10)
factors similarity rank embedding_shift
0 pax3-foxo1a 0.140829 1 0.859171
1 myog 0.150968 2 0.849032
2 myf5 0.181639 3 0.818361
3 myod1 0.267095 4 0.732905
4 pax7 0.304148 5 0.695852
5 ascl1 0.345684 6 0.654316
6 dux4 0.352310 7 0.647690
7 tead1 0.357693 8 0.642307
8 neurod1 0.387828 9 0.612172
9 neurog2 0.402066 10 0.597934
[ ]:
### Factor importance ranking using the transcriptome and chromatin accessibility modalities
import pandas as pd
merge_odir=f'{odir}/merge'
merge_df = pd.read_csv(os.path.join(merge_odir, "factor_importance_rank.csv"))
merge_df.head(n=10)
factors similarity_exp rank_exp embedding_shift_exp similarity_acc rank_acc embedding_shift_acc total_rank
0 chd4 0.975052 25 0.024948 0.470624 15 0.529376 1
1 yap1 0.978078 34 0.021922 0.420340 13 0.579660 2
2 neurog2 0.980249 37 0.019751 0.402066 10 0.597934 2
3 myf5 0.983809 54 0.016191 0.181639 3 0.818361 4
4 h3k36me1 0.957156 8 0.042844 0.802063 57 0.197937 5
5 esco2 0.977604 33 0.022396 0.703099 34 0.296901 6
6 tcf21 0.982792 48 0.017208 0.538058 20 0.461942 7
7 pgbd3 0.984118 56 0.015882 0.418980 12 0.581020 7
8 tead1 0.985233 67 0.014767 0.357693 8 0.642307 9
9 myod1 0.986058 72 0.013942 0.267095 4 0.732905 10
[27]:
merge_df = pd.merge(exp_rank_df,acc_rank_df,on='factors',how='inner',suffixes=['_exp','_acc'])
merge_df
[27]:
factors similarity_exp rank_exp embedding_shift_exp similarity_acc rank_acc embedding_shift_acc
0 histone lysine acetylation 0.921118 1 0.078882 0.937713 195 0.062287
1 histone lysine crotonylation 0.923516 2 0.076484 0.941648 217 0.058352
2 cbx6 0.929447 3 0.070553 0.894013 98 0.105987
3 cbx7 0.933499 4 0.066501 0.901252 108 0.098748
4 h2ak5ac 0.943805 5 0.056195 0.913511 130 0.086489
... ... ... ... ... ... ... ...
1067 snapc4 0.996417 1068 0.003583 0.971879 823 0.028121
1068 foxo4 0.996558 1069 0.003442 0.972842 844 0.027158
1069 id2 0.996619 1070 0.003381 0.972055 828 0.027945
1070 mafg 0.996837 1071 0.003163 0.978791 972 0.021209
1071 tfcp2 0.996966 1072 0.003034 0.977366 936 0.022634

1072 rows × 7 columns

[28]:
merge_df['total_rank']=((merge_df['rank_exp']+merge_df['rank_acc'])/2).rank().astype(int)
merge_df = merge_df.sort_values('total_rank').reset_index(drop=True)
merge_df
[28]:
factors similarity_exp rank_exp embedding_shift_exp similarity_acc rank_acc embedding_shift_acc total_rank
0 chd4 0.975052 25 0.024948 0.470624 15 0.529376 1
1 yap1 0.978078 34 0.021922 0.420340 13 0.579660 2
2 neurog2 0.980249 37 0.019751 0.402066 10 0.597934 2
3 myf5 0.983809 54 0.016191 0.181639 3 0.818361 4
4 h3k36me1 0.957156 8 0.042844 0.802063 57 0.197937 5
... ... ... ... ... ... ... ... ...
1067 hes4 0.996293 1064 0.003707 0.982166 1049 0.017834 1068
1068 znf26 0.996089 1047 0.003911 0.983500 1068 0.016500 1069
1069 zbtb10 0.996412 1067 0.003588 0.982231 1052 0.017769 1070
1070 zbed4 0.996220 1057 0.003780 0.984526 1071 0.015474 1071
1071 pitx1 0.996342 1065 0.003658 0.985094 1072 0.014906 1072

1072 rows × 8 columns

[ ]: