interpret_regulator_effects_between_region_groups¶
Rank regulators by comparing their embeddings between two genomic region sets.
For each regulator, ChromBERT-tools calculates its mean embedding in region set 1 and region set 2, then compares the two embeddings using cosine similarity.
A larger embedding shift suggests that the regulator has more different regulatory contexts between the two region sets.
Overview¶
interpret_regulator_effects_between_region_groups helps identify regulators whose
regulatory roles may differ between two sets of regions.
The command:
overlaps each input region set with ChromBERT bins
generates regulator embeddings for each region set
averages embeddings for each regulator within each set
compares the two mean embeddings for each regulator
ranks regulators by embedding shift
The main result is written to:
<odir>/results/factor_importance_rank.csv
Basic Usage¶
Compare two region sets¶
chrombert-tools interpret_regulator_effects_between_region_groups \
--region1-file set1.bed \
--region2-file set2.bed \
--genome hg38 \
--resolution 1kb \
--odir output
Use a fine-tuned checkpoint¶
Use --ft-ckpt to compute regulator embeddings with a fine-tuned model.
chrombert-tools interpret_regulator_effects_between_region_groups \
--region1-file set1.bed \
--region2-file set2.bed \
--ft-ckpt path/to/finetuned.ckpt \
--genome hg38 \
--resolution 1kb \
--odir output
Run with Apptainer¶
Use --nv to enable GPU access.
apptainer exec --nv /path/to/chrombert-tools.sif chrombert-tools interpret_regulator_effects_between_region_groups \
--region1-file set1.bed \
--region2-file set2.bed \
--genome hg38 \
--resolution 1kb \
--odir output
Parameters¶
Required inputs¶
--region1-file(file path, required)First region set. The file should contain at least
chrom,start, andendcolumns.--region2-file(file path, required)Second region set. The file should use the same format as
--region1-file.
Embedding options¶
--ft-ckpt(file path, optional)Fine-tuned checkpoint used to generate regulator embeddings.
If this option is not provided, ChromBERT-tools uses the pre-trained ChromBERT model.
--ignore-regulator(string, optional)Regulators to mask during embedding generation, separated by semicolons.
--gep(flag, default: False)Use the GEP multi-flank-window model.
--flank-window(int, default: 4)Flank window size used with
--gep.--model-config(file path, optional)Custom model configuration file.
--data-config(file path, optional)Custom dataset configuration file.
Reference and output options¶
--genome(hg38 | mm10, default: hg38)Reference genome.
--resolution(200bp | 1kb | 2kb | 4kb, default: 1kb)ChromBERT bin resolution. For
mm10, only1kbis currently supported.--batch-size(int, default: 4)Batch size used for model inference.
--odir(directory, default: ./output)Output directory. It will be created automatically if needed.
--chrombert-cache-dir(directory, default: ~/.cache/chrombert/data)Directory for ChromBERT reference files, model files, and cached data.
Required cache files¶
The command uses the following ChromBERT cache files:
ChromBERT reference region file
ChromBERT HDF5 feature file
pre-trained ChromBERT checkpoint
mask matrix
Outputs¶
The following files are written under <odir>.
dataset/region1/Overlap results for the first region set.
dataset/region2/Overlap results for the second region set.
emb/mean_regulator_emb_region1.pklMean regulator embeddings for region set 1.
emb/mean_regulator_emb_region2.pklMean regulator embeddings for region set 2.
results/factor_importance_rank.csvMain output file.
Each row is one regulator. The table is sorted from the largest embedding shift to the smallest embedding shift.
Main columns:
factors: regulator namesimilarity: cosine similarity between the regulator embeddings from the two region setsembedding_shift:1 - similarityrank: regulator rank; 1 means the largest embedding shift
Interpretation¶
Regulators with larger embedding_shift values show stronger differences between the
two region sets.
These top-ranked regulators may represent context-specific regulators, cofactors, or candidate drivers that distinguish the two region groups.
Tips¶
Use biologically comparable region sets when possible.
Region sets with similar size and genomic scale usually give cleaner rankings.
Use
--ft-ckptwhen you want cell-type-specific or task-specific regulator embeddings.The mean regulator embedding files can be reused for downstream analyses.
To see all options, run:
chrombert-tools interpret_regulator_effects_between_region_groups -h