Identify driver factors in cell state transitions¶
The predict_transition_driver_regulators command identifies key transcription factors that drive changes in gene expression and/or chromatin accessibility during cell state transitions (e.g., differentiation or reprogramming).
You can run this command with:
expression only,
accessibility only, or
both expression and accessibility.
Provide the corresponding input files for the analyses you want to perform.
Note: The remaining examples will only show the direct command usage.
If you need to use Apptainer container, please refer to the `apptainer_use.ipynb <apptainer_use.ipynb>`__ tutorial for detailed instructions on using apptainer exec with chrombert-tools.
For more details, please refer to the `predict_transition_driver_regulators <https://chrombert-tools.readthedocs.io/en/latest/commands/predict_transition_driver_regulators.html>`__ command documentation.
Example:¶
Find driver regulators in fibroblast-to-myoblast transition using both expression and accessibility
[1]:
import pandas as pd
import numpy as np
import os
[2]:
os.environ["CUDA_VISIBLE_DEVICES"]='1'
[4]:
!chrombert-tools predict_transition_driver_regulators -h
Usage: chrombert-tools predict_transition_driver_regulators
[OPTIONS]
Find driver factors in cell state transitions.
Provide at least one of: - Expression: --exp-tpm1 and --exp-tpm2 -
Accessibility: --acc-peak1 and --acc-signal1 (optional --acc-peak2; add
--acc-signal2 for fold-change / acc stage 3).
Merged rankings require both expression and two-state accessibility
rankings.
Options:
--exp-tpm1 FILE Expression (TPM) file for cell state 1. CSV
format with 'gene_id' and 'tpm' columns.
--exp-tpm2 FILE Expression (TPM) file for cell state 2. CSV
format with 'gene_id' and 'tpm' columns.
--acc-peak1 TEXT BED peak file(s) for state 1; use ';' for
multiple.
--acc-peak2 TEXT Optional BED peak file(s) for state 2; use
';' for multiple.
--acc-signal1 TEXT BigWig(s) for state 1; use ';' for
replicates.
--acc-signal2 TEXT Optional. BigWig(s) for state 2; use ';' for
replicates.
--direction [2-1|1-2] Direction of cell state transition: '2-1'
means from state 1 to state 2; '1-2' means
from state 2 to state 1. [default: 2-1]
--odir DIRECTORY Output directory. [default: ./output]
--genome [hg38|mm10] Reference genome. [default: hg38]
--resolution [200bp|1kb|2kb|4kb]
ChromBERT resolution. [default: 1kb]
--mode [fast|full] Training mode: 'fast' downsamples to 20k
regions for quicker training; 'full' uses
all regions. [default: fast]
--flank-window INTEGER Flank window for GEP (expression) model when
loading or training. [default: 4]
--tss-flank INTEGER Flanking distance (bp) around TSS when TSS
background is enabled (see --include-tss-
background). [default: 10000]
--include-tss-background BOOLEAN
Include protein-coding TSS±flank bins as
extra background in the accessibility
dataset (needs gene_meta). Default: true;
pass e.g. --include-tss-background false or
no to disable. [default: True]
--ft-ckpt-exp FILE Fine-tuned ChromBERT checkpoint file for
expression changes.
--ft-ckpt-acc FILE Fine-tuned ChromBERT checkpoint file for
chromatin accessibility changes.
--chrombert-cache-dir DIRECTORY
ChromBERT cache directory (containing
config/ and anno/ subfolders). [default:
~/.cache/chrombert/data]
--batch-size INTEGER Batch size. Increase this value if you have
sufficient GPU memory. [default: 4]
-h, --help Show this message and exit.
Run¶
[5]:
# Download data
import subprocess
if not os.path.exists('../data/myoblast_ENCFF647RNC_peak.bed'):
cmd = f'wget https://www.encodeproject.org/files/ENCFF647RNC/@@download/ENCFF647RNC.bed.gz -O ../data/myoblast_ENCFF647RNC_peak.bed.gz'
subprocess.run(cmd, shell=True)
cmd = f"gzip -d ../data/myoblast_ENCFF647RNC_peak.bed.gz"
subprocess.run(cmd, shell=True)
if not os.path.exists('../data/myoblast_ENCFF149ERN_signal.bigwig'):
cmd = f'wget https://www.encodeproject.org/files/ENCFF149ERN/@@download/ENCFF149ERN.bigWig -O ../data/myoblast_ENCFF149ERN_signal.bigwig'
subprocess.run(cmd, shell=True)
if not os.path.exists('../data/fibroblast_ENCFF184KAM_peak.bed'):
cmd = f'wget https://www.encodeproject.org/files/ENCFF184KAM/@@download/ENCFF184KAM.bed.gz -O ../data/fibroblast_ENCFF184KAM_peak.bed.gz'
subprocess.run(cmd, shell=True)
cmd = f"gzip -d ../data/fibroblast_ENCFF184KAM_peak.bed.gz"
subprocess.run(cmd, shell=True)
if not os.path.exists('../data/fibroblast_ENCFF361BTT_signal.bigwig'):
cmd = 'wget https://www.encodeproject.org/files/ENCFF361BTT/@@download/ENCFF361BTT.bigWig -O ../data/fibroblast_ENCFF361BTT_signal.bigwig'
subprocess.run(cmd, shell=True)
[ ]:
# Runtime estimates:
# - fast mode: ~3-5 hours
# (uses all ~19,620 genes for expression analysis, but downsamples
# chromatin accessibility regions to 20k for faster training)
#
# Note: Both modes (fast and full) use the complete gene expression dataset. The 'fast' mode
# only downsamples chromatin accessibility regions, not gene data.
# So this downsampled 5000 genes for expression analysis for test (40-100 minutes)
# --exp-tpm1: expression tpm file of cell type 1
# --exp-tpm2: expression tpm file of cell type 2
# --acc-peak1: accessibility peak file of cell type 1
# --acc-peak2: accessibility peak file of cell type 2
# --acc-signal1: accessibility signal file of cell type 1
# --acc-signal2: accessibility signal file of cell type 2
# --genome: genome
# --resolution: resolution
# --odir: output directory
# --direction: direction of transition
!chrombert-tools predict_transition_driver_regulators \
--exp-tpm1 "../data/fibroblast_expression_sample5000.csv" \
--exp-tpm2 "../data/myoblast_expression_sample5000.csv" \
--acc-peak1 "../data/fibroblast_ENCFF184KAM_peak.bed" \
--acc-peak2 "../data/myoblast_ENCFF647RNC_peak.bed" \
--acc-signal1 "../data/fibroblast_ENCFF361BTT_signal.bigwig" \
--acc-signal2 "../data/myoblast_ENCFF149ERN_signal.bigwig" \
--genome 'hg38' \
--resolution '1kb' \
--odir output_find_driver_in_transition \
--direction "2-1" 2> "./tmp/hg38_1kb.stderr.log"
Whether to train ChromBERT to predict chromatin accessibility changes in cell state transition: True
Stage 1 (acc): prepare dataset
Processing stage 1: prepare chromatin accessibility dataset
Mode: two states (fold-change label)
TSS ± flank background regions: enabled, flank distance: 10000 bp
[state1_0] Region summary - total: 287892, overlapping with ChromBERT: 284160 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 5967
[state1] merged 1 peak file(s) → 250143 unique ChromBERT regions (after overlap + dedup)
[state2_0] Region summary - total: 373422, overlapping with ChromBERT: 368260 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 7920
[state2] merged 1 peak file(s) → 324690 unique ChromBERT regions (after overlap + dedup)
[tss_bg] Region summary - total: 19260, overlapping with ChromBERT: 333463 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 19260
total region: 671022
Fast mode: downsampled to ~20000 regions
Finished Stage 1 (acc)
Stage 2 (acc): train ChromBERT to predict chromatin accessibility changes in cell state transition
[Attempt 0/2] seed=55
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Epoch 0: 20%|████▍ | 800/4000 [02:17<09:09, 5.82it/s, v_num=0]
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.34it/s]
Epoch 0: 40%|▍| 1600/4000 [04:59<07:29, 5.34it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.30it/s]
Epoch 0: 60%|▌| 2400/4000 [07:42<05:08, 5.19it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.30it/s]
Epoch 0: 80%|▊| 3200/4000 [10:25<02:36, 5.12it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.27it/s]
Epoch 0: 100%|█| 4000/4000 [13:07<00:00, 5.08it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.29it/s]
Epoch 1: 20%|▏| 800/4000 [02:16<09:07, 5.85it/s, v_num=0, default_validation/r
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.30it/s]
Epoch 1: 40%|▍| 1600/4000 [04:59<07:28, 5.35it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.29it/s]
Epoch 1: 60%|▌| 2400/4000 [07:41<05:07, 5.20it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.28it/s]
Epoch 1: 80%|▊| 3200/4000 [10:23<02:35, 5.13it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.29it/s]
Epoch 1: 100%|█| 4000/4000 [13:05<00:00, 5.10it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.32it/s]
Epoch 2: 20%|▏| 800/4000 [02:16<09:06, 5.85it/s, v_num=0, default_validation/r
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.32it/s]
Epoch 2: 40%|▍| 1600/4000 [04:57<07:26, 5.37it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.32it/s]
Epoch 2: 60%|▌| 2400/4000 [07:40<05:07, 5.21it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.24it/s]
Epoch 2: 80%|▊| 3200/4000 [10:23<02:35, 5.13it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.27it/s]
Epoch 2: 100%|█| 4000/4000 [13:05<00:00, 5.09it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.30it/s]
Epoch 3: 20%|▏| 800/4000 [02:17<09:08, 5.83it/s, v_num=0, default_validation/r
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.29it/s]
Epoch 3: 40%|▍| 1600/4000 [04:59<07:28, 5.35it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.28it/s]
Epoch 3: 60%|▌| 2400/4000 [07:40<05:07, 5.21it/s, v_num=0, default_validation/
Validation: | | 0/? [00:00<?, ?it/s]
Validation: | | 0/? [00:00<?, ?it/s]
Validation DataLoader 0: 0%| | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.17it/s]
Epoch 3: 60%|▌| 2400/4000 [08:05<05:23, 4.94it/s, v_num=0, default_validation/
Evaluating the finetuned model performance
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
ft_ckpt: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt, test_metrics: {'pearsonr': 0.8857296109199524, 'spearmanr': 0.8241026997566223, 'mse': 0.32735225558280945, 'mae': 0.40540555119514465, 'r2': 0.7272971742515542}
Attempt metrics: pearsonr=0.8857296109199524
Accepted run (pearsonr=0.8857 >= 0.2).
Finished stage 2: obtained a fine-tuned ChromBERT
Best pearsonr=0.8857296109199524, metrics={'pearsonr': 0.8857296109199524, 'spearmanr': 0.8241026997566223, 'mse': 0.32735225558280945, 'mae': 0.40540555119514465, 'r2': 0.7272971742515542, 'ft_ckpt': '/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt'}
Finished stage 2 (trained)
[ ]:
### Factor importance ranking using the transcriptome modality
odir = "./output_find_driver_in_transition"
exp_odir = f'{odir}/exp'
exp_results_odir = f"{exp_odir}/results"
exp_rank_df = pd.read_csv(os.path.join(exp_results_odir, "factor_importance_rank.csv"))
exp_rank_df
| factors | similarity | rank | embedding_shift | |
|---|---|---|---|---|
| 0 | histone lysine acetylation | 0.921118 | 1 | 0.078882 |
| 1 | histone lysine crotonylation | 0.923516 | 2 | 0.076484 |
| 2 | cbx6 | 0.929447 | 3 | 0.070553 |
| 3 | cbx7 | 0.933499 | 4 | 0.066501 |
| 4 | h2ak5ac | 0.943805 | 5 | 0.056195 |
| ... | ... | ... | ... | ... |
| 1067 | snapc4 | 0.996417 | 1068 | 0.003583 |
| 1068 | foxo4 | 0.996558 | 1069 | 0.003442 |
| 1069 | id2 | 0.996619 | 1070 | 0.003381 |
| 1070 | mafg | 0.996837 | 1071 | 0.003163 |
| 1071 | tfcp2 | 0.996966 | 1072 | 0.003034 |
1072 rows × 4 columns
[ ]:
### Factor importance ranking using the chromatin accessibility modality
acc_odir = f'{odir}/acc'
acc_results_odir = f"{acc_odir}/results"
acc_rank_df = pd.read_csv(os.path.join(acc_results_odir, "factor_importance_rank.csv"))
acc_rank_df
| factors | similarity | rank | embedding_shift | |
|---|---|---|---|---|
| 0 | pax3-foxo1a | 0.140829 | 1 | 0.859171 |
| 1 | myog | 0.150968 | 2 | 0.849032 |
| 2 | myf5 | 0.181639 | 3 | 0.818361 |
| 3 | myod1 | 0.267095 | 4 | 0.732905 |
| 4 | pax7 | 0.304148 | 5 | 0.695852 |
| ... | ... | ... | ... | ... |
| 1067 | znf26 | 0.983500 | 1068 | 0.016500 |
| 1068 | tox4 | 0.983787 | 1069 | 0.016213 |
| 1069 | znf266 | 0.984463 | 1070 | 0.015537 |
| 1070 | zbed4 | 0.984526 | 1071 | 0.015474 |
| 1071 | pitx1 | 0.985094 | 1072 | 0.014906 |
1072 rows × 4 columns
[ ]:
### Factor importance ranking using the transcriptome and chromatin accessibility modalities
import pandas as pd
merge_odir=f'{odir}/merge'
merge_df = pd.read_csv(os.path.join(merge_odir, "factor_importance_rank.csv"))
merge_df.head(n=10)
| factors | similarity_exp | rank_exp | embedding_shift_exp | similarity_acc | rank_acc | embedding_shift_acc | total_rank | |
|---|---|---|---|---|---|---|---|---|
| 0 | chd4 | 0.975052 | 25 | 0.024948 | 0.470624 | 15 | 0.529376 | 1 |
| 1 | yap1 | 0.978078 | 34 | 0.021922 | 0.420340 | 13 | 0.579660 | 2 |
| 2 | neurog2 | 0.980249 | 37 | 0.019751 | 0.402066 | 10 | 0.597934 | 2 |
| 3 | myf5 | 0.983809 | 54 | 0.016191 | 0.181639 | 3 | 0.818361 | 4 |
| 4 | h3k36me1 | 0.957156 | 8 | 0.042844 | 0.802063 | 57 | 0.197937 | 5 |
| 5 | esco2 | 0.977604 | 33 | 0.022396 | 0.703099 | 34 | 0.296901 | 6 |
| 6 | tcf21 | 0.982792 | 48 | 0.017208 | 0.538058 | 20 | 0.461942 | 7 |
| 7 | pgbd3 | 0.984118 | 56 | 0.015882 | 0.418980 | 12 | 0.581020 | 7 |
| 8 | tead1 | 0.985233 | 67 | 0.014767 | 0.357693 | 8 | 0.642307 | 9 |
| 9 | myod1 | 0.986058 | 72 | 0.013942 | 0.267095 | 4 | 0.732905 | 10 |
[ ]:
### merge the two modality
merge_df = pd.merge(exp_rank_df,acc_rank_df,on='factors',how='inner',suffixes=['_exp','_acc'])
merge_df
| factors | similarity_exp | rank_exp | embedding_shift_exp | similarity_acc | rank_acc | embedding_shift_acc | |
|---|---|---|---|---|---|---|---|
| 0 | histone lysine acetylation | 0.921118 | 1 | 0.078882 | 0.937713 | 195 | 0.062287 |
| 1 | histone lysine crotonylation | 0.923516 | 2 | 0.076484 | 0.941648 | 217 | 0.058352 |
| 2 | cbx6 | 0.929447 | 3 | 0.070553 | 0.894013 | 98 | 0.105987 |
| 3 | cbx7 | 0.933499 | 4 | 0.066501 | 0.901252 | 108 | 0.098748 |
| 4 | h2ak5ac | 0.943805 | 5 | 0.056195 | 0.913511 | 130 | 0.086489 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1067 | snapc4 | 0.996417 | 1068 | 0.003583 | 0.971879 | 823 | 0.028121 |
| 1068 | foxo4 | 0.996558 | 1069 | 0.003442 | 0.972842 | 844 | 0.027158 |
| 1069 | id2 | 0.996619 | 1070 | 0.003381 | 0.972055 | 828 | 0.027945 |
| 1070 | mafg | 0.996837 | 1071 | 0.003163 | 0.978791 | 972 | 0.021209 |
| 1071 | tfcp2 | 0.996966 | 1072 | 0.003034 | 0.977366 | 936 | 0.022634 |
1072 rows × 7 columns
[18]:
merge_df['total_rank']=((merge_df['rank_exp']+merge_df['rank_acc'])/2).rank().astype(int)
merge_df = merge_df.sort_values('total_rank').reset_index(drop=True)
merge_df
[18]:
| factors | similarity_exp | rank_exp | embedding_shift_exp | similarity_acc | rank_acc | embedding_shift_acc | total_rank | |
|---|---|---|---|---|---|---|---|---|
| 0 | chd4 | 0.975052 | 25 | 0.024948 | 0.470624 | 15 | 0.529376 | 1 |
| 1 | yap1 | 0.978078 | 34 | 0.021922 | 0.420340 | 13 | 0.579660 | 2 |
| 2 | neurog2 | 0.980249 | 37 | 0.019751 | 0.402066 | 10 | 0.597934 | 2 |
| 3 | myf5 | 0.983809 | 54 | 0.016191 | 0.181639 | 3 | 0.818361 | 4 |
| 4 | h3k36me1 | 0.957156 | 8 | 0.042844 | 0.802063 | 57 | 0.197937 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1067 | hes4 | 0.996293 | 1064 | 0.003707 | 0.982166 | 1049 | 0.017834 | 1068 |
| 1068 | znf26 | 0.996089 | 1047 | 0.003911 | 0.983500 | 1068 | 0.016500 | 1069 |
| 1069 | zbtb10 | 0.996412 | 1067 | 0.003588 | 0.982231 | 1052 | 0.017769 | 1070 |
| 1070 | zbed4 | 0.996220 | 1057 | 0.003780 | 0.984526 | 1071 | 0.015474 | 1071 |
| 1071 | pitx1 | 0.996342 | 1065 | 0.003658 | 0.985094 | 1072 | 0.014906 | 1072 |
1072 rows × 8 columns
Load the fine-tuned checkpoint to infer key regulators and TRN for myoblast (skip fine-tuning)¶
[ ]:
# Path pattern for fine-tuned checkpoints generated by the command above
ft_ckpt_exp_dir = "./output_find_driver_in_transition/exp/train/**/*.ckpt"
ft_ckpt_acc_dir = "./output_find_driver_in_transition/acc/train/**/*.ckpt"
import glob
ft_ckpt_exp = glob.glob(ft_ckpt_exp_dir, recursive=True)[0]
ft_ckpt_acc = glob.glob(ft_ckpt_acc_dir, recursive=True)[0]
[20]:
ft_ckpt_exp,ft_ckpt_acc
[20]:
('./output_find_driver_in_transition/exp/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=4-step=129.ckpt',
'./output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt')
[ ]:
# --ft-ckpt-exp: fine-tuned checkpoint for expression modality
# --ft-ckpt-acc: fine-tuned checkpoint for chromatin accessibility modality
# --genome: genome
# --exp-tpm1: expression tpm file of cell type 1
# --exp-tpm2: expression tpm file of cell type 2
# --acc-peak1: accessibility peak file of cell type 1
# --acc-peak2: accessibility peak file of cell type 2
# --acc-signal1: accessibility signal file of cell type 1
# --acc-signal2: accessibility signal file of cell type 2
# --resolution: resolution
# --odir: output directory
# --direction: direction of transition
!chrombert-tools predict_transition_driver_regulators \
--ft-ckpt-exp {ft_ckpt_exp} \
--ft-ckpt-acc {ft_ckpt_acc} \
--genome 'hg38' \
--exp-tpm1 "../data/fibroblast_expression_sample5000.csv" \
--exp-tpm2 "../data/myoblast_expression_sample5000.csv" \
--acc-peak1 "../data/fibroblast_ENCFF184KAM_peak.bed" \
--acc-peak2 "../data/myoblast_ENCFF647RNC_peak.bed" \
--acc-signal1 "../data/fibroblast_ENCFF361BTT_signal.bigwig" \
--acc-signal2 "../data/myoblast_ENCFF149ERN_signal.bigwig" \
--resolution '1kb' \
--odir output_find_driver_in_transition_load_ckpt \
--direction "2-1" 2> "./tmp/hg38_1kb.stderr.log"
Whether to train ChromBERT to predict chromatin accessibility changes in cell state transition: True
Stage 1 (acc): prepare dataset
Processing stage 1: prepare chromatin accessibility dataset
Mode: two states (fold-change label)
TSS ± flank background regions: enabled, flank distance: 10000 bp
[state1_0] Region summary - total: 287892, overlapping with ChromBERT: 284160 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 5967
[state1] merged 1 peak file(s) → 250143 unique ChromBERT regions (after overlap + dedup)
[state2_0] Region summary - total: 373422, overlapping with ChromBERT: 368260 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 7920
[state2] merged 1 peak file(s) → 324690 unique ChromBERT regions (after overlap + dedup)
[tss_bg] Region summary - total: 19260, overlapping with ChromBERT: 333463 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 19260
total region: 671022
Fast mode: downsampled to ~20000 regions
Finished Stage 1 (acc)
Stage 2 (acc): train ChromBERT to predict chromatin accessibility changes in cell state transition
Using fine-tuned checkpoint: ./output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from ./output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
Finished stage 2 (loaded checkpoint)
Finished Stage 2 (acc)
Stage 3 (acc): infer driver factors in regions with strong vs weak accessibility change
Finished stage 3 (acc)
Whether to train ChromBERT to predict expression changes in cell state transition: True
Stage 1 (exp): prepare dataset
Processing stage 1: prepare expression dataset
Mode: two states (log1p TPM fold change; per state, genes inner-merged across ';' files then TPM mean)
Finished Stage 1 (exp)
Stage 2 (exp): train ChromBERT to predict expression changes in cell state transition
Use fine-tuned ChromBERT checkpoint file: ./output_find_driver_in_transition/exp/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=4-step=129.ckpt
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from ./output_find_driver_in_transition/exp/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=4-step=129.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
Finished stage 2 (loaded checkpoint)
Finished Stage 2 (exp)
Stage 3 (exp): infer driver factors in different expression activity genes
Finished stage 3 (exp)
Merging factor ranks from expression and chromatin accessibility
Finished merging factor ranks from expression and chromatin accessibility
Finished all stages!
Used fine-tuned ChromBERT checkpoint: ./output_find_driver_in_transition/exp/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=4-step=129.ckpt
Driver factors for expression changes: output_find_driver_in_transition_load_ckpt/exp/results/factor_importance_rank.csv
Used fine-tuned ChromBERT checkpoint: ./output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Driver factors for chromatin accessibility changes: output_find_driver_in_transition_load_ckpt/acc/results/factor_importance_rank.csv
Integrated driver factor rankings: output_find_driver_in_transition_load_ckpt/merge/factor_importance_rank.csv
[ ]:
### Factor importance ranking using the transcriptome modality
odir = "./output_find_driver_in_transition_load_ckpt"
exp_odir = f'{odir}/exp'
exp_results_odir = f"{exp_odir}/results"
exp_rank_df = pd.read_csv(os.path.join(exp_results_odir, "factor_importance_rank.csv"))
exp_rank_df.head(n=10)
| factors | similarity | rank | embedding_shift | |
|---|---|---|---|---|
| 0 | histone lysine acetylation | 0.921118 | 1 | 0.078882 |
| 1 | histone lysine crotonylation | 0.923516 | 2 | 0.076484 |
| 2 | cbx6 | 0.929447 | 3 | 0.070553 |
| 3 | cbx7 | 0.933499 | 4 | 0.066501 |
| 4 | h2ak5ac | 0.943805 | 5 | 0.056195 |
| 5 | h2ax | 0.947999 | 6 | 0.052001 |
| 6 | h3k79me1 | 0.955875 | 7 | 0.044125 |
| 7 | h3k36me1 | 0.957156 | 8 | 0.042844 |
| 8 | cbx8 | 0.957597 | 9 | 0.042403 |
| 9 | h2bk20ac | 0.961643 | 10 | 0.038357 |
[ ]:
### Factor importance ranking using the chromatin accessibility modality
acc_odir = f'{odir}/acc'
acc_results_odir = f"{acc_odir}/results"
acc_rank_df = pd.read_csv(os.path.join(acc_results_odir, "factor_importance_rank.csv"))
acc_rank_df.head(n=10)
| factors | similarity | rank | embedding_shift | |
|---|---|---|---|---|
| 0 | pax3-foxo1a | 0.140829 | 1 | 0.859171 |
| 1 | myog | 0.150968 | 2 | 0.849032 |
| 2 | myf5 | 0.181639 | 3 | 0.818361 |
| 3 | myod1 | 0.267095 | 4 | 0.732905 |
| 4 | pax7 | 0.304148 | 5 | 0.695852 |
| 5 | ascl1 | 0.345684 | 6 | 0.654316 |
| 6 | dux4 | 0.352310 | 7 | 0.647690 |
| 7 | tead1 | 0.357693 | 8 | 0.642307 |
| 8 | neurod1 | 0.387828 | 9 | 0.612172 |
| 9 | neurog2 | 0.402066 | 10 | 0.597934 |
[ ]:
### Factor importance ranking using the transcriptome and chromatin accessibility modalities
import pandas as pd
merge_odir=f'{odir}/merge'
merge_df = pd.read_csv(os.path.join(merge_odir, "factor_importance_rank.csv"))
merge_df.head(n=10)
| factors | similarity_exp | rank_exp | embedding_shift_exp | similarity_acc | rank_acc | embedding_shift_acc | total_rank | |
|---|---|---|---|---|---|---|---|---|
| 0 | chd4 | 0.975052 | 25 | 0.024948 | 0.470624 | 15 | 0.529376 | 1 |
| 1 | yap1 | 0.978078 | 34 | 0.021922 | 0.420340 | 13 | 0.579660 | 2 |
| 2 | neurog2 | 0.980249 | 37 | 0.019751 | 0.402066 | 10 | 0.597934 | 2 |
| 3 | myf5 | 0.983809 | 54 | 0.016191 | 0.181639 | 3 | 0.818361 | 4 |
| 4 | h3k36me1 | 0.957156 | 8 | 0.042844 | 0.802063 | 57 | 0.197937 | 5 |
| 5 | esco2 | 0.977604 | 33 | 0.022396 | 0.703099 | 34 | 0.296901 | 6 |
| 6 | tcf21 | 0.982792 | 48 | 0.017208 | 0.538058 | 20 | 0.461942 | 7 |
| 7 | pgbd3 | 0.984118 | 56 | 0.015882 | 0.418980 | 12 | 0.581020 | 7 |
| 8 | tead1 | 0.985233 | 67 | 0.014767 | 0.357693 | 8 | 0.642307 | 9 |
| 9 | myod1 | 0.986058 | 72 | 0.013942 | 0.267095 | 4 | 0.732905 | 10 |
[27]:
merge_df = pd.merge(exp_rank_df,acc_rank_df,on='factors',how='inner',suffixes=['_exp','_acc'])
merge_df
[27]:
| factors | similarity_exp | rank_exp | embedding_shift_exp | similarity_acc | rank_acc | embedding_shift_acc | |
|---|---|---|---|---|---|---|---|
| 0 | histone lysine acetylation | 0.921118 | 1 | 0.078882 | 0.937713 | 195 | 0.062287 |
| 1 | histone lysine crotonylation | 0.923516 | 2 | 0.076484 | 0.941648 | 217 | 0.058352 |
| 2 | cbx6 | 0.929447 | 3 | 0.070553 | 0.894013 | 98 | 0.105987 |
| 3 | cbx7 | 0.933499 | 4 | 0.066501 | 0.901252 | 108 | 0.098748 |
| 4 | h2ak5ac | 0.943805 | 5 | 0.056195 | 0.913511 | 130 | 0.086489 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1067 | snapc4 | 0.996417 | 1068 | 0.003583 | 0.971879 | 823 | 0.028121 |
| 1068 | foxo4 | 0.996558 | 1069 | 0.003442 | 0.972842 | 844 | 0.027158 |
| 1069 | id2 | 0.996619 | 1070 | 0.003381 | 0.972055 | 828 | 0.027945 |
| 1070 | mafg | 0.996837 | 1071 | 0.003163 | 0.978791 | 972 | 0.021209 |
| 1071 | tfcp2 | 0.996966 | 1072 | 0.003034 | 0.977366 | 936 | 0.022634 |
1072 rows × 7 columns
[28]:
merge_df['total_rank']=((merge_df['rank_exp']+merge_df['rank_acc'])/2).rank().astype(int)
merge_df = merge_df.sort_values('total_rank').reset_index(drop=True)
merge_df
[28]:
| factors | similarity_exp | rank_exp | embedding_shift_exp | similarity_acc | rank_acc | embedding_shift_acc | total_rank | |
|---|---|---|---|---|---|---|---|---|
| 0 | chd4 | 0.975052 | 25 | 0.024948 | 0.470624 | 15 | 0.529376 | 1 |
| 1 | yap1 | 0.978078 | 34 | 0.021922 | 0.420340 | 13 | 0.579660 | 2 |
| 2 | neurog2 | 0.980249 | 37 | 0.019751 | 0.402066 | 10 | 0.597934 | 2 |
| 3 | myf5 | 0.983809 | 54 | 0.016191 | 0.181639 | 3 | 0.818361 | 4 |
| 4 | h3k36me1 | 0.957156 | 8 | 0.042844 | 0.802063 | 57 | 0.197937 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1067 | hes4 | 0.996293 | 1064 | 0.003707 | 0.982166 | 1049 | 0.017834 | 1068 |
| 1068 | znf26 | 0.996089 | 1047 | 0.003911 | 0.983500 | 1068 | 0.016500 | 1069 |
| 1069 | zbtb10 | 0.996412 | 1067 | 0.003588 | 0.982231 | 1052 | 0.017769 | 1070 |
| 1070 | zbed4 | 0.996220 | 1057 | 0.003780 | 0.984526 | 1071 | 0.015474 | 1071 |
| 1071 | pitx1 | 0.996342 | 1065 | 0.003658 | 0.985094 | 1072 | 0.014906 | 1072 |
1072 rows × 8 columns
[ ]: