gene_activity_regression¶
Fine-tune a ChromBERT gene activity regression model using TPM expression tables, or use an existing checkpoint to predict gene activity from TSS-centered representations that include the promoter region and upstream/downstream flanking regions.
The command supports two prediction targets:
single-state gene activity:
log1p(TPM)two-state gene activity change:
log1p(TPM2) - log1p(TPM1)
Overview¶
gene_activity_regression can be used in two ways:
training + prediction: train a regression model from TPM tables, then predict gene activity
predict-only: load an existing fine-tuned checkpoint and predict directly
During training, ChromBERT-tools prepares the expression dataset, trains or loads a
regression model, and writes predictions to <odir>/predict/predictions.csv.
Input expression tables¶
Each TPM file should contain at least two columns:
gene_idtpm
Column names are matched case-insensitively.
Single-state mode¶
Use only --exp-tpm1.
The prediction target is:
log1p(mean TPM)
If multiple TPM files are provided, they are treated as replicates and averaged by gene.
Two-state mode¶
Use both --exp-tpm1 and --exp-tpm2.
The prediction target is:
log1p(TPM2) - log1p(TPM1)
Use --direction to control the direction of the fold change:
2-1: keeplog1p(TPM2) - log1p(TPM1)1-2: uselog1p(TPM1) - log1p(TPM2)
Replicates¶
Multiple TPM files can be provided with semicolons:
--exp-tpm1 "rep1.csv;rep2.csv;rep3.csv"
Within each state, ChromBERT-tools keeps genes present in all replicate files and averages their TPM values.
Basic Usage¶
Train from one state¶
chrombert-tools gene_activity_regression \
--exp-tpm1 state1_tpm.csv \
--genome hg38 \
--resolution 1kb \
--odir output
Train from two states¶
chrombert-tools gene_activity_regression \
--exp-tpm1 state1_tpm.csv \
--exp-tpm2 state2_tpm.csv \
--direction 2-1 \
--genome hg38 \
--resolution 1kb \
--odir output
Train with replicates¶
chrombert-tools gene_activity_regression \
--exp-tpm1 "rep1.csv;rep2.csv;rep3.csv" \
--genome hg38 \
--resolution 1kb \
--odir output
Predict only¶
Use this mode when you already have a fine-tuned gene activity regression checkpoint.
chrombert-tools gene_activity_regression \
--ft-ckpt path/to/gep_finetuned.ckpt \
--predict-file regions_genes.tsv \
--genome hg38 \
--resolution 1kb \
--odir output_predict
Run with Apptainer¶
Use --nv to enable GPU access.
apptainer exec --nv /path/to/chrombert-tools.sif chrombert-tools gene_activity_regression \
--exp-tpm1 state1_tpm.csv \
--genome hg38 \
--resolution 1kb \
--odir output
Parameters¶
Expression inputs¶
--exp-tpm1(file path or semicolon-separated paths)State-1 TPM table(s). Required for training.
--exp-tpm2(file path or semicolon-separated paths, optional)State-2 TPM table(s). Provide this option for two-state gene activity change prediction.
--direction(2-1 | 1-2, default: 2-1)Direction of the two-state target.
2-1meanslog1p(TPM2) - log1p(TPM1).1-2meanslog1p(TPM1) - log1p(TPM2).
Prediction inputs¶
--predict-file(file path, optional)Regions or genes used for prediction.
The file should contain the following columns:
chromstartendbuild_region_indexgene_idtss
If this option is not provided after training, ChromBERT-tools predicts on the test split generated during dataset preparation.
--ft-ckpt(file path, optional)Fine-tuned gene activity regression checkpoint.
When both
--ft-ckptand--predict-fileare provided, ChromBERT-tools runs in predict-only mode and skips training.
Model options¶
--flank-window(int, default: 4)Number of flanking ChromBERT windows used for gene activity prediction.
--batch-size(int, default: 4)Batch size for training and prediction.
--mode(fast | full, default: fast)Reserved for consistency with other ChromBERT-tools commands. This command always uses the prepared train, test, and validation splits.
Reference and output options¶
--genome(hg38 | mm10, default: hg38)Reference genome.
--resolution(200bp | 1kb | 2kb | 4kb, default: 1kb)ChromBERT bin resolution. For
mm10, only1kbis currently supported.--odir(directory, default: ./output)Output directory. It will be created automatically if needed.
--chrombert-cache-dir(directory, default: ~/.cache/chrombert/data)Directory for ChromBERT reference files, model files, and cached data.
Required cache files¶
The command uses the following ChromBERT cache files:
ChromBERT reference region file
ChromBERT HDF5 feature file
metadata file
gene metadata file
The gene metadata file is required for training and test-split prediction. It is not
required in predict-only mode when both --ft-ckpt and --predict-file are provided.
Outputs¶
The following files are written under --odir.
dataset/Created during training. It contains the processed expression dataset and train, test, and validation splits.
train/Created during training. It contains model training outputs and the selected checkpoint.
predict/model_input.tsvProcessed input table used for prediction.
predict/predictions.csvMain prediction output.
The file contains gene and region metadata, predicted values, and optionally true labels if labels are available in the prediction input.
model_config.jsonModel configuration used for the run.
dataset_config.jsonDataset configuration used for the run.
Predict-only mode¶
Predict-only mode is used when both of the following are provided:
--ft-ckpt--predict-file
In this mode, ChromBERT-tools does not require TPM tables and does not train a model. It
loads the checkpoint and writes predictions directly to <odir>/predict/predictions.csv.
Tips¶
Use
--exp-tpm1for single-state gene activity prediction.Use both
--exp-tpm1and--exp-tpm2for two-state gene activity change prediction.Use semicolons to provide replicate TPM files.
Use
--ft-ckpttogether with--predict-filefor predict-only mode.--modeis kept for consistency with other commands but does not change the training behavior of this command.To see all options, run:
chrombert-tools gene_activity_regression -h