embed_region¶
Generate 768-dimensional region embeddings and/or gene embeddings using ChromBERT.
By default, this command uses the pre-trained ChromBERT model. It can also generate cell-type-specific embeddings when a fine-tuned checkpoint or cell-type accessibility data are provided.
Overview¶
You must provide at least one of the following inputs:
--region: a BED file of genomic regions
--gene: gene symbols or Ensembl IDs
If both are provided, ChromBERT-tools will generate both region and gene embeddings in the same run.
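As a small illustration of the two input forms, the sketch below writes a three-column BED file (chrom, start, end) and builds a semicolon-separated gene string. The helper names, file name, and coordinates are made up for the example; only the column order and the semicolon separator come from this page.

```python
from pathlib import Path


def write_bed(path, regions):
    """Write (chrom, start, end) tuples as a tab-separated BED file."""
    lines = [f"{chrom}\t{start}\t{end}" for chrom, start, end in regions]
    out = Path(path)
    out.write_text("\n".join(lines) + "\n")
    return out


def gene_argument(genes):
    """Join gene symbols or Ensembl IDs into the value passed to --gene."""
    return ";".join(genes)


# Illustrative values only; replace with your own regions and genes.
example_regions = [("chr1", 1000, 2000), ("chr2", 5000, 6000)]
example_genes = ["TP53", "BRD4"]
```

Calling write_bed("regions.bed", example_regions) and gene_argument(example_genes) yields values that can be passed to --region and --gene respectively.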
Modes¶
General mode¶
General mode is used when no cell-type-specific information is provided.
In this mode, ChromBERT-tools uses the pre-trained ChromBERT model. If precomputed genome-wide region embeddings are available, the command directly extracts the requested rows from the cached embeddings. Otherwise, it loads the model and computes embeddings.
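Conceptually, the cached path is a row selection from a genome-wide array. The sketch below illustrates that idea with NumPy. The function name, the random stand-in table, and the bin indices are hypothetical and do not reflect the actual cache layout used by ChromBERT-tools.

```python
import numpy as np

EMB_DIM = 768  # embedding width stated on this page


def extract_cached_rows(genome_wide_emb, bin_indices):
    """Return the embedding rows for the requested bins, in request order."""
    idx = np.asarray(bin_indices, dtype=np.int64)
    if idx.size and (idx.min() < 0 or idx.max() >= genome_wide_emb.shape[0]):
        raise IndexError("bin index outside the cached embedding table")
    return genome_wide_emb[idx]


# Stand-in for a precomputed genome-wide table (10 bins, random values).
rng = np.random.default_rng(0)
cached = rng.standard_normal((10, EMB_DIM)).astype(np.float32)

# Repeated indices are allowed; each request gets its own output row.
subset = extract_cached_rows(cached, [7, 2, 2])
print(subset.shape)  # (3, 768)
```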
Cell-type-specific mode¶
Cell-type-specific mode is used when either of the following is provided:
--ft-ckpt: a fine-tuned checkpoint
both --cell-type-bw and --cell-type-peak: cell-type accessibility signal and peaks
If --ft-ckpt is provided, ChromBERT-tools loads the checkpoint directly and does not
perform fine-tuning.
If --cell-type-bw and --cell-type-peak are provided without --ft-ckpt,
ChromBERT-tools first fine-tunes a cell-type-specific model, then uses it to generate
embeddings.
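The selection rules above can be summarized as a small decision function. This is only a sketch of the documented behavior, not the tool's source code: the function name and the returned labels are invented, and raising an error when only one accessibility input is given is an assumption based on the requirement that the two options be used together.

```python
def resolve_mode(ft_ckpt=None, cell_type_bw=None, cell_type_peak=None):
    """Map the supplied options to one of three illustrative mode labels."""
    if ft_ckpt:
        # A checkpoint is loaded directly; no fine-tuning is performed.
        return "cell_type_from_checkpoint"
    if cell_type_bw and cell_type_peak:
        # Both accessibility inputs: fine-tune first, then embed.
        return "cell_type_fine_tune"
    if cell_type_bw or cell_type_peak:
        # Assumption: a lone accessibility input is treated as an error.
        raise ValueError(
            "--cell-type-bw and --cell-type-peak must be given together"
        )
    # No cell-type-specific information: pre-trained model.
    return "general"
```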
Basic Usage¶
Regions only¶
chrombert-tools embed_region \
--region regions.bed \
--genome hg38 \
--resolution 1kb \
--odir output
Genes only¶
chrombert-tools embed_region \
--gene "TP53;BRD4" \
--genome hg38 \
--resolution 1kb \
--odir output
Regions and genes¶
chrombert-tools embed_region \
--region regions.bed \
--gene "TP53;BRD4" \
--genome hg38 \
--resolution 1kb \
--odir output
Generate cell-type-specific embeddings from accessibility data¶
chrombert-tools embed_region \
--region regions.bed \
--cell-type-bw cell_accessibility.bigwig \
--cell-type-peak cell_peaks.bed \
--genome hg38 \
--resolution 1kb \
--mode fast \
--odir output_cell
Generate cell-type-specific embeddings from a checkpoint¶
chrombert-tools embed_region \
--region regions.bed \
--ft-ckpt path/to/finetuned.ckpt \
--genome hg38 \
--resolution 1kb \
--odir output_ckpt
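When the command is launched from a Python pipeline, the invocations above can be assembled as an argument list. The helper below is a hypothetical convenience wrapper, not part of ChromBERT-tools; it only uses flags shown on this page and does not run anything by itself.

```python
import shlex


def build_embed_region_cmd(region=None, gene=None, genome="hg38",
                           resolution="1kb", odir="output", extra=()):
    """Assemble an argument list for `chrombert-tools embed_region`."""
    if region is None and gene is None:
        raise ValueError("provide --region, --gene, or both")
    cmd = ["chrombert-tools", "embed_region"]
    if region is not None:
        cmd += ["--region", str(region)]
    if gene is not None:
        cmd += ["--gene", gene]
    cmd += ["--genome", genome, "--resolution", resolution, "--odir", str(odir)]
    # Extra flags, e.g. ["--ft-ckpt", "path/to/finetuned.ckpt"].
    cmd += list(extra)
    return cmd


cmd = build_embed_region_cmd(region="regions.bed", gene="TP53;BRD4")
# shlex.join quotes the gene string so the ';' is not read by the shell.
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, check=True) avoids shell quoting issues with the semicolon-separated gene string.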
Run with Apptainer¶
Use --nv to enable GPU access:
apptainer exec --nv /path/to/chrombert-tools.sif chrombert-tools embed_region \
--region regions.bed \
--genome hg38 \
--resolution 1kb \
--odir output
Parameters¶
Required input¶
At least one of the following is required.
--region (file path): BED file containing genomic regions. The file should include at least chrom, start, and end columns.
--gene (string): Gene symbols or Ensembl IDs separated by semicolons, for example "TP53;BRD4".
Cell-type-specific options¶
--cell-type-bw (file path): Cell-type-specific accessibility signal in BigWig format. Must be used together with --cell-type-peak unless --ft-ckpt is provided.
--cell-type-peak (file path): Cell-type-specific accessibility peaks in BED format. Must be used together with --cell-type-bw unless --ft-ckpt is provided.
--ft-ckpt (file path, optional): Path to a fine-tuned checkpoint. When provided, ChromBERT-tools loads this checkpoint directly and skips fine-tuning.
--mode (fast | full, default: fast): Fine-tuning mode. This option is only used when training a new cell-type-specific model from --cell-type-bw and --cell-type-peak.
Reference and output options¶
--genome (hg38 | mm10, default: hg38): Reference genome.
--resolution (1kb | 200bp | 2kb | 4kb, default: 1kb): ChromBERT bin resolution. For mm10, only 1kb is currently supported.
--batch-size (int, default: 4): Batch size used during model inference. This has no effect when cached embeddings are used directly.
--odir (directory, default: ./output): Output directory. It will be created automatically if it does not exist.
--oname (string, default: embedding): Output file name prefix.
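The genome and resolution constraints listed above can be checked before launching a long run. The table and function below are a hypothetical pre-flight check written for this page; the tool performs its own validation, and its error messages may differ.

```python
# Combinations taken from the option descriptions on this page.
SUPPORTED_RESOLUTIONS = {
    "hg38": {"200bp", "1kb", "2kb", "4kb"},
    "mm10": {"1kb"},
}


def check_genome_resolution(genome="hg38", resolution="1kb"):
    """Raise ValueError for a genome/resolution pair not listed as supported."""
    if genome not in SUPPORTED_RESOLUTIONS:
        raise ValueError(f"unknown genome: {genome!r}")
    if resolution not in SUPPORTED_RESOLUTIONS[genome]:
        allowed = ", ".join(sorted(SUPPORTED_RESOLUTIONS[genome]))
        raise ValueError(
            f"resolution {resolution!r} is not listed for {genome}; "
            f"choose one of: {allowed}"
        )
    return genome, resolution
```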
Advanced options¶
--chrombert-cache-dir (directory, default: ~/.cache/chrombert/data): Directory for ChromBERT reference files and cached data.
--chrombert-region-file (file path, optional): Custom ChromBERT reference region BED file.
--chrombert-region-emb-file (file path, optional): Custom precomputed genome-wide region embedding file.
--chrombert-gene-meta (file path, optional): Custom gene metadata file.
Output Files¶
When --region is used¶
The following files are written to --odir:
region_emb_<oname>.npy: Region embedding array with shape (n_regions, 768). Each row corresponds to one ChromBERT bin overlapping the input regions.
overlap_region.bed: Input regions that overlap ChromBERT bins.
no_overlap_region.bed: Input regions that do not overlap ChromBERT bins.
model_input.tsv: Model input table, generated when model inference is needed.
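To make the split between overlap_region.bed and no_overlap_region.bed concrete, the sketch below classifies input regions against a list of reference bins using the standard half-open BED interval convention. It is a toy illustration with invented bins and a simple nested loop; how ChromBERT-tools actually computes overlaps is not described on this page.

```python
def overlaps(a, b):
    """True if two half-open (chrom, start, end) intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def split_by_overlap(regions, bins):
    """Separate regions into those that hit at least one bin and the rest."""
    hit, miss = [], []
    for region in regions:
        if any(overlaps(region, b) for b in bins):
            hit.append(region)
        else:
            miss.append(region)
    return hit, miss


# Hypothetical reference bins and input regions.
bins = [("chr1", 0, 1000), ("chr1", 1000, 2000)]
regions = [("chr1", 500, 1500), ("chr1", 5000, 5100), ("chr2", 0, 100)]
hit, miss = split_by_overlap(regions, bins)
```

Here the first region spans both bins and lands in hit, while the other two have no matching bin and land in miss.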
When --gene is used¶
The following files are written to --odir:
gene_emb_<oname>.pkl: Python dictionary mapping each matched gene to a 768-dimensional embedding.
overlap_genes_meta.tsv: Metadata for genes matched in the ChromBERT gene annotation.
model_input_gene.tsv: Model input table for gene-associated regions.
Gene embeddings are calculated by averaging the embeddings of ChromBERT bins associated with each gene promoter.
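The averaging step can be sketched with NumPy as follows. The gene names, the bin-to-gene assignment, and the embedding values are placeholders; how bins are associated with a promoter is determined by ChromBERT-tools and is not reproduced here.

```python
import numpy as np


def gene_embedding(bin_emb, bin_rows):
    """Average the embeddings of the bins assigned to one gene."""
    rows = np.asarray(bin_rows, dtype=np.int64)
    if rows.size == 0:
        raise ValueError("gene has no associated bins")
    return bin_emb[rows].mean(axis=0)


# Placeholder bin embeddings: 4 bins x 768 dimensions.
bin_emb = np.arange(4 * 768, dtype=np.float64).reshape(4, 768)

# Hypothetical mapping from gene to rows of bin_emb.
gene_to_rows = {"GENE_A": [0, 1], "GENE_B": [3]}
gene_emb = {g: gene_embedding(bin_emb, r) for g, r in gene_to_rows.items()}
```

A gene with a single associated bin simply inherits that bin's embedding, and every result is a vector of length 768.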
Load outputs in Python¶
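import pickle
import numpy as np
# Region embeddings: one row per ChromBERT bin, 768 columns.
region_emb = np.load("output/region_emb_embedding.npy")
print(region_emb.shape)
# Gene embeddings: dictionary mapping each matched gene to its embedding.
with open("output/gene_emb_embedding.pkl", "rb") as f:
    gene_emb = pickle.load(f)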
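For scripted use, it can help to load whichever outputs exist and confirm their dimensions in one step. The function below is a hypothetical helper built on the file names documented above; it assumes each dictionary value holds exactly 768 numbers.

```python
import pickle
from pathlib import Path

import numpy as np


def load_and_check(odir="output", oname="embedding", dim=768):
    """Load the embedding files found in odir and verify their dimensions."""
    odir = Path(odir)
    out = {}

    region_path = odir / f"region_emb_{oname}.npy"
    if region_path.exists():
        region_emb = np.load(region_path)
        if region_emb.ndim != 2 or region_emb.shape[1] != dim:
            raise ValueError(f"unexpected region shape: {region_emb.shape}")
        out["region"] = region_emb

    gene_path = odir / f"gene_emb_{oname}.pkl"
    if gene_path.exists():
        # Only unpickle files that you produced yourself.
        with open(gene_path, "rb") as f:
            gene_emb = pickle.load(f)
        for gene, vec in gene_emb.items():
            if np.asarray(vec).size != dim:
                raise ValueError(f"unexpected embedding size for {gene}")
        out["gene"] = gene_emb

    return out
```

The returned dictionary contains a "region" key, a "gene" key, both, or neither, depending on which files are present in the output directory.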
Tips¶
If you already have a fine-tuned checkpoint, use --ft-ckpt directly. You do not need to provide BigWig or peak files.
If you need both region and gene embeddings in cell-type-specific mode, provide --region and --gene in the same command. The model will be loaded or trained only once.
For first-time gene embedding runs, ChromBERT-tools may download the required gene metadata automatically.
In general mode, cached genome-wide embeddings may be used when available. In this case, no model forward pass is needed.
To see all available options, run:
chrombert-tools embed_region -h