Identify driver factors in cell state transitions¶

The predict_transition_driver_regulators command identifies key transcription factors that drive changes in gene expression and/or chromatin accessibility during cell state transitions (e.g., differentiation or reprogramming).

You can run this command with:

expression only,
accessibility only, or
both expression and accessibility.

Provide the corresponding input files for the analyses you want to perform.

Note: The remaining examples will only show the direct command usage.

If you need to use Apptainer container, please refer to the `apptainer_use.ipynb <apptainer_use.ipynb>`__ tutorial for detailed instructions on using apptainer exec with chrombert-tools.

For more details, please refer to the `predict_transition_driver_regulators <https://chrombert-tools.readthedocs.io/en/latest/commands/predict_transition_driver_regulators.html>`__ command documentation.

Example:¶

Find driver regulators in fibroblast-to-myoblast transition using both expression and accessibility

[1]:

import pandas as pd
import numpy as np
import os

[2]:

os.environ["CUDA_VISIBLE_DEVICES"]='1'

[4]:

!chrombert-tools predict_transition_driver_regulators -h

Usage: chrombert-tools predict_transition_driver_regulators
           [OPTIONS]

  Find driver factors in cell state transitions.

  Provide at least one of: - Expression: --exp-tpm1 and --exp-tpm2 -
  Accessibility: --acc-peak1 and --acc-signal1 (optional --acc-peak2; add
  --acc-signal2 for fold-change / acc stage 3).

  Merged rankings require both expression and two-state accessibility
  rankings.

Options:
  --exp-tpm1 FILE                 Expression (TPM) file for cell state 1. CSV
                                  format with 'gene_id' and 'tpm' columns.
  --exp-tpm2 FILE                 Expression (TPM) file for cell state 2. CSV
                                  format with 'gene_id' and 'tpm' columns.
  --acc-peak1 TEXT                BED peak file(s) for state 1; use ';' for
                                  multiple.
  --acc-peak2 TEXT                Optional BED peak file(s) for state 2; use
                                  ';' for multiple.
  --acc-signal1 TEXT              BigWig(s) for state 1; use ';' for
                                  replicates.
  --acc-signal2 TEXT              Optional. BigWig(s) for state 2; use ';' for
                                  replicates.
  --direction [2-1|1-2]           Direction of cell state transition: '2-1'
                                  means from state 1 to state 2; '1-2' means
                                  from state 2 to state 1.  [default: 2-1]
  --odir DIRECTORY                Output directory.  [default: ./output]
  --genome [hg38|mm10]            Reference genome.  [default: hg38]
  --resolution [200bp|1kb|2kb|4kb]
                                  ChromBERT resolution.  [default: 1kb]
  --mode [fast|full]              Training mode: 'fast' downsamples to 20k
                                  regions for quicker training; 'full' uses
                                  all regions.  [default: fast]
  --flank-window INTEGER          Flank window for GEP (expression) model when
                                  loading or training.  [default: 4]
  --tss-flank INTEGER             Flanking distance (bp) around TSS when TSS
                                  background is enabled (see --include-tss-
                                  background).  [default: 10000]
  --include-tss-background BOOLEAN
                                  Include protein-coding TSS±flank bins as
                                  extra background in the accessibility
                                  dataset (needs gene_meta). Default: true;
                                  pass e.g. --include-tss-background false or
                                  no to disable.  [default: True]
  --ft-ckpt-exp FILE              Fine-tuned ChromBERT checkpoint file for
                                  expression changes.
  --ft-ckpt-acc FILE              Fine-tuned ChromBERT checkpoint file for
                                  chromatin accessibility changes.
  --chrombert-cache-dir DIRECTORY
                                  ChromBERT cache directory (containing
                                  config/ and anno/ subfolders).  [default:
                                  ~/.cache/chrombert/data]
  --batch-size INTEGER            Batch size. Increase this value if you have
                                  sufficient GPU memory.  [default: 4]
  -h, --help                      Show this message and exit.

Run¶

[5]:

# Download data
import subprocess
if not os.path.exists('../data/myoblast_ENCFF647RNC_peak.bed'):
    cmd = f'wget https://www.encodeproject.org/files/ENCFF647RNC/@@download/ENCFF647RNC.bed.gz -O ../data/myoblast_ENCFF647RNC_peak.bed.gz'
    subprocess.run(cmd, shell=True)
    cmd = f"gzip -d ../data/myoblast_ENCFF647RNC_peak.bed.gz"
    subprocess.run(cmd, shell=True)

if not os.path.exists('../data/myoblast_ENCFF149ERN_signal.bigwig'):
    cmd = f'wget https://www.encodeproject.org/files/ENCFF149ERN/@@download/ENCFF149ERN.bigWig -O ../data/myoblast_ENCFF149ERN_signal.bigwig'
    subprocess.run(cmd, shell=True)


if not os.path.exists('../data/fibroblast_ENCFF184KAM_peak.bed'):
    cmd = f'wget https://www.encodeproject.org/files/ENCFF184KAM/@@download/ENCFF184KAM.bed.gz -O ../data/fibroblast_ENCFF184KAM_peak.bed.gz'
    subprocess.run(cmd, shell=True)
    cmd = f"gzip -d ../data/fibroblast_ENCFF184KAM_peak.bed.gz"
    subprocess.run(cmd, shell=True)


if not os.path.exists('../data/fibroblast_ENCFF361BTT_signal.bigwig'):
    cmd = 'wget https://www.encodeproject.org/files/ENCFF361BTT/@@download/ENCFF361BTT.bigWig -O ../data/fibroblast_ENCFF361BTT_signal.bigwig'
    subprocess.run(cmd, shell=True)

[ ]:

# Runtime estimates:
#   - fast mode: ~3-5 hours
#     (uses all ~19,620 genes for expression analysis, but downsamples
#      chromatin accessibility regions to 20k for faster training)
#
# Note: Both modes (fast and full) use the complete gene expression dataset. The 'fast' mode
# only downsamples chromatin accessibility regions, not gene data.

# So this downsampled 5000 genes for expression analysis for test (40-100 minutes)

# --exp-tpm1: expression tpm file of cell type 1
# --exp-tpm2: expression tpm file of cell type 2
# --acc-peak1: accessibility peak file of cell type 1
# --acc-peak2: accessibility peak file of cell type 2
# --acc-signal1: accessibility signal file of cell type 1
# --acc-signal2: accessibility signal file of cell type 2
# --genome: genome
# --resolution: resolution
# --odir: output directory
# --direction: direction of transition

!chrombert-tools predict_transition_driver_regulators \
  --exp-tpm1 "../data/fibroblast_expression_sample5000.csv" \
  --exp-tpm2 "../data/myoblast_expression_sample5000.csv" \
  --acc-peak1 "../data/fibroblast_ENCFF184KAM_peak.bed" \
  --acc-peak2 "../data/myoblast_ENCFF647RNC_peak.bed" \
  --acc-signal1 "../data/fibroblast_ENCFF361BTT_signal.bigwig" \
  --acc-signal2 "../data/myoblast_ENCFF149ERN_signal.bigwig" \
  --genome 'hg38' \
  --resolution '1kb' \
  --odir output_find_driver_in_transition \
  --direction "2-1" 2> "./tmp/hg38_1kb.stderr.log"

Whether to train ChromBERT to predict chromatin accessibility changes in cell state transition: True
Stage 1 (acc): prepare dataset
Processing stage 1: prepare chromatin accessibility dataset
  Mode: two states (fold-change label)
  TSS ± flank background regions: enabled, flank distance: 10000 bp
[state1_0] Region summary - total: 287892, overlapping with ChromBERT: 284160 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 5967
  [state1] merged 1 peak file(s) → 250143 unique ChromBERT regions (after overlap + dedup)
[state2_0] Region summary - total: 373422, overlapping with ChromBERT: 368260 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 7920
  [state2] merged 1 peak file(s) → 324690 unique ChromBERT regions (after overlap + dedup)
[tss_bg] Region summary - total: 19260, overlapping with ChromBERT: 333463 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 19260
 total region: 671022
  Fast mode: downsampled to ~20000 regions
Finished Stage 1 (acc)
Stage 2 (acc): train ChromBERT to predict chromatin accessibility changes in cell state transition

[Attempt 0/2] seed=55
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Epoch 0:  20%|████▍                 | 800/4000 [02:17<09:09,  5.82it/s, v_num=0]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.34it/s]
Epoch 0:  40%|▍| 1600/4000 [04:59<07:29,  5.34it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.30it/s]
Epoch 0:  60%|▌| 2400/4000 [07:42<05:08,  5.19it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.30it/s]
Epoch 0:  80%|▊| 3200/4000 [10:25<02:36,  5.12it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.27it/s]
Epoch 0: 100%|█| 4000/4000 [13:07<00:00,  5.08it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.29it/s]
Epoch 1:  20%|▏| 800/4000 [02:16<09:07,  5.85it/s, v_num=0, default_validation/r
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.30it/s]
Epoch 1:  40%|▍| 1600/4000 [04:59<07:28,  5.35it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.29it/s]
Epoch 1:  60%|▌| 2400/4000 [07:41<05:07,  5.20it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.28it/s]
Epoch 1:  80%|▊| 3200/4000 [10:23<02:35,  5.13it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.29it/s]
Epoch 1: 100%|█| 4000/4000 [13:05<00:00,  5.10it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.32it/s]
Epoch 2:  20%|▏| 800/4000 [02:16<09:06,  5.85it/s, v_num=0, default_validation/r
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.32it/s]
Epoch 2:  40%|▍| 1600/4000 [04:57<07:26,  5.37it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.32it/s]
Epoch 2:  60%|▌| 2400/4000 [07:40<05:07,  5.21it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.24it/s]
Epoch 2:  80%|▊| 3200/4000 [10:23<02:35,  5.13it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.27it/s]
Epoch 2: 100%|█| 4000/4000 [13:05<00:00,  5.09it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.30it/s]
Epoch 3:  20%|▏| 800/4000 [02:17<09:08,  5.83it/s, v_num=0, default_validation/r
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.29it/s]
Epoch 3:  40%|▍| 1600/4000 [04:59<07:28,  5.35it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.28it/s]
Epoch 3:  60%|▌| 2400/4000 [07:40<05:07,  5.21it/s, v_num=0, default_validation/
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation: |                                             | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/250 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|████████████████| 250/250 [00:24<00:00, 10.17it/s]
Epoch 3:  60%|▌| 2400/4000 [08:05<05:23,  4.94it/s, v_num=0, default_validation/
Evaluating the finetuned model performance
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
ft_ckpt: /mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt, test_metrics: {'pearsonr': 0.8857296109199524, 'spearmanr': 0.8241026997566223, 'mse': 0.32735225558280945, 'mae': 0.40540555119514465, 'r2': 0.7272971742515542}
Attempt metrics: pearsonr=0.8857296109199524
Accepted run (pearsonr=0.8857 >= 0.2).

Finished stage 2: obtained a fine-tuned ChromBERT
Best pearsonr=0.8857296109199524, metrics={'pearsonr': 0.8857296109199524, 'spearmanr': 0.8241026997566223, 'mse': 0.32735225558280945, 'mae': 0.40540555119514465, 'r2': 0.7272971742515542, 'ft_ckpt': '/mnt/Storage2/home/chenqianqian/projects/chrombert/chrombert_tools/ChromBERT-tools/examples/cli/output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt'}
Finished stage 2 (trained)

[ ]:

### Factor importance ranking using the transcriptome modality

odir = "./output_find_driver_in_transition"
exp_odir = f'{odir}/exp'
exp_results_odir = f"{exp_odir}/results"
exp_rank_df = pd.read_csv(os.path.join(exp_results_odir, "factor_importance_rank.csv"))
exp_rank_df

	factors	similarity	rank	embedding_shift
0	histone lysine acetylation	0.921118	1	0.078882
1	histone lysine crotonylation	0.923516	2	0.076484
2	cbx6	0.929447	3	0.070553
3	cbx7	0.933499	4	0.066501
4	h2ak5ac	0.943805	5	0.056195
...	...	...	...	...
1067	snapc4	0.996417	1068	0.003583
1068	foxo4	0.996558	1069	0.003442
1069	id2	0.996619	1070	0.003381
1070	mafg	0.996837	1071	0.003163
1071	tfcp2	0.996966	1072	0.003034

1072 rows × 4 columns

[ ]:

### Factor importance ranking using the chromatin accessibility modality

acc_odir = f'{odir}/acc'
acc_results_odir = f"{acc_odir}/results"
acc_rank_df = pd.read_csv(os.path.join(acc_results_odir, "factor_importance_rank.csv"))
acc_rank_df

	factors	similarity	rank	embedding_shift
0	pax3-foxo1a	0.140829	1	0.859171
1	myog	0.150968	2	0.849032
2	myf5	0.181639	3	0.818361
3	myod1	0.267095	4	0.732905
4	pax7	0.304148	5	0.695852
...	...	...	...	...
1067	znf26	0.983500	1068	0.016500
1068	tox4	0.983787	1069	0.016213
1069	znf266	0.984463	1070	0.015537
1070	zbed4	0.984526	1071	0.015474
1071	pitx1	0.985094	1072	0.014906

1072 rows × 4 columns

[ ]:

### Factor importance ranking using the transcriptome and chromatin accessibility modalities

import pandas as pd
merge_odir=f'{odir}/merge'
merge_df = pd.read_csv(os.path.join(merge_odir, "factor_importance_rank.csv"))
merge_df.head(n=10)

	factors	similarity_exp	rank_exp	embedding_shift_exp	similarity_acc	rank_acc	embedding_shift_acc	total_rank
0	chd4	0.975052	25	0.024948	0.470624	15	0.529376	1
1	yap1	0.978078	34	0.021922	0.420340	13	0.579660	2
2	neurog2	0.980249	37	0.019751	0.402066	10	0.597934	2
3	myf5	0.983809	54	0.016191	0.181639	3	0.818361	4
4	h3k36me1	0.957156	8	0.042844	0.802063	57	0.197937	5
5	esco2	0.977604	33	0.022396	0.703099	34	0.296901	6
6	tcf21	0.982792	48	0.017208	0.538058	20	0.461942	7
7	pgbd3	0.984118	56	0.015882	0.418980	12	0.581020	7
8	tead1	0.985233	67	0.014767	0.357693	8	0.642307	9
9	myod1	0.986058	72	0.013942	0.267095	4	0.732905	10

[ ]:

### merge the two modality
merge_df = pd.merge(exp_rank_df,acc_rank_df,on='factors',how='inner',suffixes=['_exp','_acc'])
merge_df

	factors	similarity_exp	rank_exp	embedding_shift_exp	similarity_acc	rank_acc	embedding_shift_acc
0	histone lysine acetylation	0.921118	1	0.078882	0.937713	195	0.062287
1	histone lysine crotonylation	0.923516	2	0.076484	0.941648	217	0.058352
2	cbx6	0.929447	3	0.070553	0.894013	98	0.105987
3	cbx7	0.933499	4	0.066501	0.901252	108	0.098748
4	h2ak5ac	0.943805	5	0.056195	0.913511	130	0.086489
...	...	...	...	...	...	...	...
1067	snapc4	0.996417	1068	0.003583	0.971879	823	0.028121
1068	foxo4	0.996558	1069	0.003442	0.972842	844	0.027158
1069	id2	0.996619	1070	0.003381	0.972055	828	0.027945
1070	mafg	0.996837	1071	0.003163	0.978791	972	0.021209
1071	tfcp2	0.996966	1072	0.003034	0.977366	936	0.022634

1072 rows × 7 columns

[18]:

merge_df['total_rank']=((merge_df['rank_exp']+merge_df['rank_acc'])/2).rank().astype(int)
merge_df = merge_df.sort_values('total_rank').reset_index(drop=True)
merge_df

[18]:

	factors	similarity_exp	rank_exp	embedding_shift_exp	similarity_acc	rank_acc	embedding_shift_acc	total_rank
0	chd4	0.975052	25	0.024948	0.470624	15	0.529376	1
1	yap1	0.978078	34	0.021922	0.420340	13	0.579660	2
2	neurog2	0.980249	37	0.019751	0.402066	10	0.597934	2
3	myf5	0.983809	54	0.016191	0.181639	3	0.818361	4
4	h3k36me1	0.957156	8	0.042844	0.802063	57	0.197937	5
...	...	...	...	...	...	...	...	...
1067	hes4	0.996293	1064	0.003707	0.982166	1049	0.017834	1068
1068	znf26	0.996089	1047	0.003911	0.983500	1068	0.016500	1069
1069	zbtb10	0.996412	1067	0.003588	0.982231	1052	0.017769	1070
1070	zbed4	0.996220	1057	0.003780	0.984526	1071	0.015474	1071
1071	pitx1	0.996342	1065	0.003658	0.985094	1072	0.014906	1072

1072 rows × 8 columns

Load the fine-tuned checkpoint to infer key regulators and TRN for myoblast (skip fine-tuning)¶

[ ]:

# Path pattern for fine-tuned checkpoints generated by the command above
ft_ckpt_exp_dir = "./output_find_driver_in_transition/exp/train/**/*.ckpt"
ft_ckpt_acc_dir = "./output_find_driver_in_transition/acc/train/**/*.ckpt"

import glob
ft_ckpt_exp = glob.glob(ft_ckpt_exp_dir, recursive=True)[0]
ft_ckpt_acc = glob.glob(ft_ckpt_acc_dir, recursive=True)[0]

[20]:

ft_ckpt_exp,ft_ckpt_acc

[20]:

('./output_find_driver_in_transition/exp/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=4-step=129.ckpt',
 './output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt')

[ ]:

# --ft-ckpt-exp: fine-tuned checkpoint for expression modality
# --ft-ckpt-acc: fine-tuned checkpoint for chromatin accessibility modality
# --genome: genome
# --exp-tpm1: expression tpm file of cell type 1
# --exp-tpm2: expression tpm file of cell type 2
# --acc-peak1: accessibility peak file of cell type 1
# --acc-peak2: accessibility peak file of cell type 2
# --acc-signal1: accessibility signal file of cell type 1
# --acc-signal2: accessibility signal file of cell type 2
# --resolution: resolution
# --odir: output directory
# --direction: direction of transition

!chrombert-tools predict_transition_driver_regulators \
  --ft-ckpt-exp {ft_ckpt_exp} \
  --ft-ckpt-acc {ft_ckpt_acc} \
  --genome 'hg38' \
  --exp-tpm1 "../data/fibroblast_expression_sample5000.csv" \
  --exp-tpm2 "../data/myoblast_expression_sample5000.csv" \
  --acc-peak1 "../data/fibroblast_ENCFF184KAM_peak.bed" \
  --acc-peak2 "../data/myoblast_ENCFF647RNC_peak.bed" \
  --acc-signal1 "../data/fibroblast_ENCFF361BTT_signal.bigwig" \
  --acc-signal2 "../data/myoblast_ENCFF149ERN_signal.bigwig" \
  --resolution '1kb' \
  --odir output_find_driver_in_transition_load_ckpt \
  --direction "2-1" 2> "./tmp/hg38_1kb.stderr.log"

Whether to train ChromBERT to predict chromatin accessibility changes in cell state transition: True
Stage 1 (acc): prepare dataset
Processing stage 1: prepare chromatin accessibility dataset
  Mode: two states (fold-change label)
  TSS ± flank background regions: enabled, flank distance: 10000 bp
[state1_0] Region summary - total: 287892, overlapping with ChromBERT: 284160 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 5967
  [state1] merged 1 peak file(s) → 250143 unique ChromBERT regions (after overlap + dedup)
[state2_0] Region summary - total: 373422, overlapping with ChromBERT: 368260 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 7920
  [state2] merged 1 peak file(s) → 324690 unique ChromBERT regions (after overlap + dedup)
[tss_bg] Region summary - total: 19260, overlapping with ChromBERT: 333463 (one region may overlap multiple ChromBERT regions, we keep overlaps with ≥50% coverage of either the ChromBERT bin or the input region), non-overlapping: 19260
 total region: 671022
  Fast mode: downsampled to ~20000 regions
Finished Stage 1 (acc)
Stage 2 (acc): train ChromBERT to predict chromatin accessibility changes in cell state transition
Using fine-tuned checkpoint: ./output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from ./output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
Finished stage 2 (loaded checkpoint)
Finished Stage 2 (acc)
Stage 3 (acc): infer driver factors in regions with strong vs weak accessibility change
Finished stage 3 (acc)
Whether to train ChromBERT to predict expression changes in cell state transition: True
Stage 1 (exp): prepare dataset
Processing stage 1: prepare expression dataset
  Mode: two states (log1p TPM fold change; per state, genes inner-merged across ';' files then TPM mean)
Finished Stage 1 (exp)
Stage 2 (exp): train ChromBERT to predict expression changes in cell state transition
Use fine-tuned ChromBERT checkpoint file: ./output_find_driver_in_transition/exp/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=4-step=129.ckpt
Load pretrained ckpt /mnt/Storage/home/chenqianqian/.cache/chrombert/data/checkpoint/hg38_6k_1kb_pretrain.ckpt successfully!
Loading checkpoint from ./output_find_driver_in_transition/exp/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=4-step=129.ckpt
Loading from pl module, remove prefix 'model.'
Loading from pl module, replace 'pretrain_model' with 'pretrain_model.chrombert'
Loaded 111/111 parameters
Finished stage 2 (loaded checkpoint)
Finished Stage 2 (exp)
Stage 3 (exp): infer driver factors in different expression activity genes
Finished stage 3 (exp)
Merging factor ranks from expression and chromatin accessibility
Finished merging factor ranks from expression and chromatin accessibility
Finished all stages!
Used fine-tuned ChromBERT checkpoint: ./output_find_driver_in_transition/exp/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=4-step=129.ckpt
Driver factors for expression changes: output_find_driver_in_transition_load_ckpt/exp/results/factor_importance_rank.csv
Used fine-tuned ChromBERT checkpoint: ./output_find_driver_in_transition/acc/train/try_00_seed_55/lightning_logs/lightning_logs/version_0/checkpoints/epoch=2-step=163.ckpt
Driver factors for chromatin accessibility changes: output_find_driver_in_transition_load_ckpt/acc/results/factor_importance_rank.csv
Integrated driver factor rankings: output_find_driver_in_transition_load_ckpt/merge/factor_importance_rank.csv

[ ]:

### Factor importance ranking using the transcriptome modality
odir = "./output_find_driver_in_transition_load_ckpt"
exp_odir = f'{odir}/exp'
exp_results_odir = f"{exp_odir}/results"
exp_rank_df = pd.read_csv(os.path.join(exp_results_odir, "factor_importance_rank.csv"))
exp_rank_df.head(n=10)

	factors	similarity	rank	embedding_shift
0	histone lysine acetylation	0.921118	1	0.078882
1	histone lysine crotonylation	0.923516	2	0.076484
2	cbx6	0.929447	3	0.070553
3	cbx7	0.933499	4	0.066501
4	h2ak5ac	0.943805	5	0.056195
5	h2ax	0.947999	6	0.052001
6	h3k79me1	0.955875	7	0.044125
7	h3k36me1	0.957156	8	0.042844
8	cbx8	0.957597	9	0.042403
9	h2bk20ac	0.961643	10	0.038357

[ ]:

### Factor importance ranking using the chromatin accessibility modality
acc_odir = f'{odir}/acc'
acc_results_odir = f"{acc_odir}/results"
acc_rank_df = pd.read_csv(os.path.join(acc_results_odir, "factor_importance_rank.csv"))
acc_rank_df.head(n=10)

	factors	similarity	rank	embedding_shift
0	pax3-foxo1a	0.140829	1	0.859171
1	myog	0.150968	2	0.849032
2	myf5	0.181639	3	0.818361
3	myod1	0.267095	4	0.732905
4	pax7	0.304148	5	0.695852
5	ascl1	0.345684	6	0.654316
6	dux4	0.352310	7	0.647690
7	tead1	0.357693	8	0.642307
8	neurod1	0.387828	9	0.612172
9	neurog2	0.402066	10	0.597934

[ ]:

### Factor importance ranking using the transcriptome and chromatin accessibility modalities
import pandas as pd
merge_odir=f'{odir}/merge'
merge_df = pd.read_csv(os.path.join(merge_odir, "factor_importance_rank.csv"))
merge_df.head(n=10)

	factors	similarity_exp	rank_exp	embedding_shift_exp	similarity_acc	rank_acc	embedding_shift_acc	total_rank
0	chd4	0.975052	25	0.024948	0.470624	15	0.529376	1
1	yap1	0.978078	34	0.021922	0.420340	13	0.579660	2
2	neurog2	0.980249	37	0.019751	0.402066	10	0.597934	2
3	myf5	0.983809	54	0.016191	0.181639	3	0.818361	4
4	h3k36me1	0.957156	8	0.042844	0.802063	57	0.197937	5
5	esco2	0.977604	33	0.022396	0.703099	34	0.296901	6
6	tcf21	0.982792	48	0.017208	0.538058	20	0.461942	7
7	pgbd3	0.984118	56	0.015882	0.418980	12	0.581020	7
8	tead1	0.985233	67	0.014767	0.357693	8	0.642307	9
9	myod1	0.986058	72	0.013942	0.267095	4	0.732905	10

[27]:

merge_df = pd.merge(exp_rank_df,acc_rank_df,on='factors',how='inner',suffixes=['_exp','_acc'])
merge_df

[27]:

	factors	similarity_exp	rank_exp	embedding_shift_exp	similarity_acc	rank_acc	embedding_shift_acc
0	histone lysine acetylation	0.921118	1	0.078882	0.937713	195	0.062287
1	histone lysine crotonylation	0.923516	2	0.076484	0.941648	217	0.058352
2	cbx6	0.929447	3	0.070553	0.894013	98	0.105987
3	cbx7	0.933499	4	0.066501	0.901252	108	0.098748
4	h2ak5ac	0.943805	5	0.056195	0.913511	130	0.086489
...	...	...	...	...	...	...	...
1067	snapc4	0.996417	1068	0.003583	0.971879	823	0.028121
1068	foxo4	0.996558	1069	0.003442	0.972842	844	0.027158
1069	id2	0.996619	1070	0.003381	0.972055	828	0.027945
1070	mafg	0.996837	1071	0.003163	0.978791	972	0.021209
1071	tfcp2	0.996966	1072	0.003034	0.977366	936	0.022634

1072 rows × 7 columns

[28]:

merge_df['total_rank']=((merge_df['rank_exp']+merge_df['rank_acc'])/2).rank().astype(int)
merge_df = merge_df.sort_values('total_rank').reset_index(drop=True)
merge_df

[28]:

	factors	similarity_exp	rank_exp	embedding_shift_exp	similarity_acc	rank_acc	embedding_shift_acc	total_rank
0	chd4	0.975052	25	0.024948	0.470624	15	0.529376	1
1	yap1	0.978078	34	0.021922	0.420340	13	0.579660	2
2	neurog2	0.980249	37	0.019751	0.402066	10	0.597934	2
3	myf5	0.983809	54	0.016191	0.181639	3	0.818361	4
4	h3k36me1	0.957156	8	0.042844	0.802063	57	0.197937	5
...	...	...	...	...	...	...	...	...
1067	hes4	0.996293	1064	0.003707	0.982166	1049	0.017834	1068
1068	znf26	0.996089	1047	0.003911	0.983500	1068	0.016500	1069
1069	zbtb10	0.996412	1067	0.003588	0.982231	1052	0.017769	1070
1070	zbed4	0.996220	1057	0.003780	0.984526	1071	0.015474	1071
1071	pitx1	0.996342	1065	0.003658	0.985094	1072	0.014906	1072

1072 rows × 8 columns

[ ]: