zmap.reference — Reference Data Loading
Functions for downloading, caching, and loading ZMAP reference datasets
and consensus marker tables. Aliased as zmap.ref.
H5AD Loading
- zmap.reference.load_zmap_h5ad(*, kind='processed_slim_tpm', url=None, dest_dir=None, filename=None, write_to_disk=True, use_cache=True, force_download=False, backed=False, chunk_size=1048576, show_progress=True, attempt_preprocess_tpmlog=True)
Load a ZMAP reference dataset into memory, downloading it if necessary.
This is the primary entry point for accessing ZMAP reference data. On first call the file is downloaded and cached to Google Drive (if mounted) or a local directory. Subsequent calls in the same session are served from an in-memory cache and return instantly.
Load priority (fastest to slowest):
1. In-memory session cache: instantaneous, no I/O.
2. File already on disk (Drive or local): fast, no download.
3. Fresh download from the ZMAP CDN.
- Parameters:
kind – Preset dataset to load. Ignored when url is provided. One of:
  - "processed_slim_tpm": fully processed, TPM counts only. Best default for visualization and label transfer.
  - "processed_slim": fully processed, raw counts only.
  - "processed": fully processed, includes intermediate layers.
  - "raw": raw counts, unprocessed.
  - "symphony": Symphony reference used for query embedding. Required for annotate_with_zmap.
url (str | None) – Explicit download URL. Overrides kind. Use this to load a custom or external H5AD that is not in the ZMAP registry.
dest_dir (str | PathLike | None) – Directory where the H5AD file is saved. Defaults to /content/drive/MyDrive/zmap/h5ad when Google Drive is mounted, or <cwd>/zmap/h5ad otherwise.
filename (str | None) – Override the filename used when saving to disk. Inferred from the registry or URL when not provided.
write_to_disk (bool) – If False, downloads to a temporary file that is deleted after loading. Useful for one-off loads when disk space is constrained. Incompatible with backed=True.
use_cache (bool) – If True, return the cached in-memory object on repeat calls. Set to False to force a fresh load from disk (e.g. after modifying the file externally).
force_download (bool) – Re-download the file even if it already exists on disk.
backed (bool | str) – Open the H5AD in backed (memory-mapped) mode. Pass True for read-only ("r"), or a mode string (e.g. "r+") for read-write. Backed mode avoids loading the full matrix into RAM but is slower for random access. Requires write_to_disk=True.
chunk_size (int) – Download chunk size in bytes.
show_progress (bool) – Display a tqdm progress bar while downloading.
attempt_preprocess_tpmlog (bool) – If the loaded object has a raw_nolog layer but no tpm_log layer, compute tpm_log via TPM normalization + log1p and add it as a layer. Has no effect if tpm_log is already present or if backed=True.
- Returns:
The loaded reference dataset.
- Return type:
Examples
>>> adata_ref = zmap.ref.load_zmap_h5ad()  # default: processed_slim_tpm
>>> adata_ref = zmap.ref.load_zmap_h5ad(kind="symphony")  # for annotate_with_zmap
>>> adata_ref = zmap.ref.load_zmap_h5ad(url="https://.../my.h5ad", filename="my.h5ad")
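Two of the parameter rules for load_zmap_h5ad (the dest_dir default and the backed / write_to_disk constraint) can be sketched as below. This is only an illustration: the helper names are hypothetical, and two details are assumptions. The sketch detects a mounted Drive by checking that the directory exists, and it reports the conflict as a ValueError. The library's actual checks and error type are not stated in this reference.

```python
from pathlib import Path

DRIVE_ROOT = Path("/content/drive/MyDrive")  # Colab mount point named in the docs


def default_dest_dir(drive_root=DRIVE_ROOT, cwd=None):
    """Use the Drive folder when it exists, else fall back to <cwd>/zmap/h5ad."""
    # Assumption: "Drive is mounted" is approximated by a directory check.
    base = Path(drive_root) if Path(drive_root).is_dir() else Path(cwd or Path.cwd())
    return base / "zmap" / "h5ad"


def resolve_backed_mode(backed, write_to_disk):
    """Map the `backed` argument to a file mode string, or None for in-memory."""
    if backed and not write_to_disk:
        # Assumption: the incompatibility surfaces as a ValueError.
        raise ValueError("backed mode needs a persistent file; set write_to_disk=True")
    if backed is True:
        return "r"  # True means read-only backed mode
    return backed or None  # a mode string such as "r+", or None when not backed
```

Backed mode needs a file that outlives the call, which is why a temporary download (write_to_disk=False) cannot be combined with it.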
- zmap.reference.download_zmap_h5ad(*, kind='processed_slim_tpm', url=None, dest_dir=None, filename=None, write_to_disk=True, force_download=False, chunk_size=1048576, show_progress=True)
Download a ZMAP H5AD file from the CDN, with local caching.
Downloads the file to a persistent cache directory (Google Drive when available, otherwise a local directory). Subsequent calls with the same kind skip the download if the file already exists on disk.
Most users should prefer load_zmap_h5ad(), which calls this function internally and also handles loading and preprocessing.
- Parameters:
kind (str | None) – Preset dataset key. One of the keys in H5AD_SOURCES ("raw", "processed", "processed_slim", "processed_slim_tpm", "symphony"). Ignored when url is provided.
url (str | None) – Explicit download URL. Overrides the registry URL looked up via kind.
dest_dir (str | PathLike | None) – Directory to store the downloaded file. Defaults to /content/drive/MyDrive/zmap/h5ad when Google Drive is mounted, or <cwd>/zmap/h5ad otherwise.
filename (str | None) – Override the filename used when saving to disk. Inferred from the registry or URL when not provided.
write_to_disk (bool) – If False, downloads to a temporary file that is not kept after loading.
force_download (bool) – Re-download the file even if it already exists on disk.
chunk_size (int) – Download chunk size in bytes.
show_progress (bool) – Display a tqdm progress bar during download.
- Returns:
Path to the downloaded (or cached) H5AD file on disk.
- Return type:
- Raises:
ValueError – If no URL can be resolved from kind or url.
Examples
>>> path = zmap.ref.download_zmap_h5ad()
>>> path = zmap.ref.download_zmap_h5ad(kind="symphony")
>>> path = zmap.ref.download_zmap_h5ad(url="https://.../my.h5ad")
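The documented download behavior (resolve a URL from kind or url, skip the download when the file exists, stream in chunk_size pieces) can be approximated with the sketch below. It is a stand-in, not the library's code: `EXAMPLE_SOURCES` holds placeholder URLs rather than the real H5AD_SOURCES entries, `resolve_source` and `fetch_to_path` are hypothetical names, and writing to a `.part` file before renaming is a design choice made for this sketch.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse

# Hypothetical registry: keys match the documented presets, URLs are placeholders.
EXAMPLE_SOURCES = {
    "raw": "https://example.invalid/zmap/raw.h5ad",
    "processed_slim_tpm": "https://example.invalid/zmap/processed_slim_tpm.h5ad",
    "symphony": "https://example.invalid/zmap/symphony.h5ad",
}


def resolve_source(kind="processed_slim_tpm", url=None, filename=None,
                   sources=EXAMPLE_SOURCES):
    """Return (url, filename); an explicit `url` overrides `kind`."""
    resolved = url or sources.get(kind)
    if not resolved:
        raise ValueError(f"no URL for kind={kind!r}; known kinds: {sorted(sources)}")
    # Infer the filename from the URL path unless the caller overrides it.
    name = filename or PurePosixPath(urlparse(resolved).path).name
    return resolved, name


def fetch_to_path(open_stream, dest, chunk_size=1048576, force_download=False):
    """Stream bytes into `dest` in fixed-size chunks; reuse an existing file."""
    dest = Path(dest)
    if dest.exists() and not force_download:
        return dest  # cached copy on disk: skip the download entirely
    dest.parent.mkdir(parents=True, exist_ok=True)
    partial = dest.with_name(dest.name + ".part")
    with open_stream() as src, open(partial, "wb") as out:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
    partial.replace(dest)  # publish only after a complete write
    return dest
```

`open_stream` is any zero-argument callable that returns a binary file-like object, for example `lambda: urllib.request.urlopen(url)` for a real HTTP download or `lambda: io.BytesIO(data)` in a test.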
- zmap.reference.preprocess_tpmlog(adata)
Add a tpm_log layer by normalizing raw counts to TPM + log1p.
Checks whether adata.layers["raw_nolog"] exists and adata.layers["tpm_log"] does not. When both conditions are met, performs library-size normalization to counts per million followed by log1p, and stores the result as adata.layers["tpm_log"].
This is a convenience function called automatically by load_zmap_h5ad() when attempt_preprocess_tpmlog=True.
- Parameters:
adata (AnnData) – The dataset to preprocess. Modified in-place.
Notes
After this call, adata.X is cleared (set to None) so that downstream code explicitly selects a layer rather than relying on a stale .X matrix.
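The normalization described for preprocess_tpmlog can be written out in dependency-free Python as a rough illustration. This is not the library's implementation, which operates on AnnData layers. Here the layers are a plain dict of row lists, the helper names are hypothetical, and the handling of all-zero cells is an assumption.

```python
import math


def tpm_log_rows(counts, scale=1_000_000):
    """Scale each cell (row) to `scale` total counts, then apply log1p."""
    result = []
    for row in counts:
        total = sum(row)
        # Assumption: an all-zero cell stays all-zero instead of dividing by 0.
        factor = scale / total if total else 0.0
        result.append([math.log1p(value * factor) for value in row])
    return result


def add_tpm_log(layers):
    """Add 'tpm_log' only when 'raw_nolog' exists and 'tpm_log' does not."""
    if "raw_nolog" in layers and "tpm_log" not in layers:
        layers["tpm_log"] = tpm_log_rows(layers["raw_nolog"])
        return True
    return False  # nothing to do: layer missing or already computed
```

For a cell with raw counts [1, 3], the scaled values are 250000 and 750000, so the stored entries are log1p(250000) and log1p(750000).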
Consensus Markers
- zmap.reference.load_consensus_markers(level='CellType', *, groups=None, marker_type='overall', n_per_group=50, min_support_ratio=None, min_log2fc=None, min_enrich=None, omit_unannotated=False, format='dict')
Load ZMAP consensus marker genes for a chosen annotation level.
Marker tables are downloaded on first call and cached locally (on Google Drive when mounted, otherwise in ~/.cache/zmap_tools). Subsequent calls within the same session are served from an in-memory cache.
- Parameters:
level (Literal['GermLayer','Tissue','CellType','Cluster','Leiden100']) – Annotation level whose marker table to load. One of:
  - "GermLayer": broad germ-layer groupings.
  - "Tissue": tissue-level groupings.
  - "CellType": cell-type-level groupings (default).
  - "CellTypeFine": fine-grained cell-type groupings.
  - "Cluster": cluster-level groupings.
  - "Leiden100": Leiden resolution-100 cluster groupings.
groups (Optional[Sequence[str]]) – Restrict output to a specific subset of groups at the chosen level (e.g. ["Neurons", "hepatocyte"]). Returns all groups when None.
marker_type (Literal['specificity','contrast','consensus','prevalence','overall']) – Scoring criterion used to rank and select markers. One of:
  - "overall": composite overall rank (recommended default).
  - "specificity": ranked by how exclusively a gene marks one group.
  - "contrast": ranked by expression contrast vs. other groups.
  - "consensus": ranked by agreement across studies/datasets.
  - "prevalence": ranked by fraction of cells expressing the gene.
n_per_group (Optional[int]) – Maximum number of markers to return per group, taken from the top of the chosen marker_type ranking. Pass None to return all markers that pass the active filters.
min_support_ratio (Optional[float]) – Minimum support_ratio value required to retain a marker. Filters out genes that are not consistently expressed across studies.
min_log2fc (Optional[float]) – Minimum global_log2fc (fold-change vs. all other groups) required to retain a marker.
min_enrich (Optional[float]) – Minimum enrich_mean (mean enrichment score) required to retain a marker.
omit_unannotated (bool) – If True, remove genes with unannotated or placeholder names, including Ensembl IDs (ENSDARG...) and common zebrafish prefixes such as si:, zgc:, LOC, linc, wu:, bx, GRCz.
format (Literal['dict','sets','table','panel']) – Output format. One of:
  - "dict": {group: [gene1, gene2, ...]}
  - "sets": {group: {gene1, gene2, ...}}
  - "table": full filtered pd.DataFrame with all scoring columns.
  - "panel": minimal pd.DataFrame with columns ["group", "gene"], suitable for passing directly to dotplot functions.
- Returns:
Structure depends on format:
  - "dict" → Dict[str, List[str]]
  - "sets" → Dict[str, Set[str]]
  - "table" → pd.DataFrame
  - "panel" → pd.DataFrame with columns ["group", "gene"]
- Return type:
Examples
>>> markers = zmap.ref.load_consensus_markers()  # all CellType markers
>>> markers = zmap.ref.load_consensus_markers(level="Tissue", n_per_group=10)
>>> markers = zmap.ref.load_consensus_markers(groups=["Neurons", "hepatocyte"])
>>> df = zmap.ref.load_consensus_markers(format="panel")  # for dotplot
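To make the interaction between the threshold filters, n_per_group, and the "dict" / "sets" formats concrete, here is a small sketch over plain dict rows. It is illustrative only and makes several assumptions: the real function works on downloaded marker tables, the "rank" key is a hypothetical stand-in for the chosen marker_type ranking (lower is better), thresholds are treated as inclusive, and filters are applied before the per-group cut.

```python
def select_markers(rows, n_per_group=50, min_support_ratio=None,
                   min_log2fc=None, min_enrich=None, fmt="dict"):
    """Filter marker rows, keep the top n per group, and shape the output.

    `rows` is a list of dicts with the keys "group", "gene", "rank",
    "support_ratio", "global_log2fc", and "enrich_mean".
    """
    thresholds = {
        "support_ratio": min_support_ratio,
        "global_log2fc": min_log2fc,
        "enrich_mean": min_enrich,
    }
    # Keep a row only if it meets every active (non-None) threshold.
    kept = [
        row for row in rows
        if all(limit is None or row[col] >= limit
               for col, limit in thresholds.items())
    ]
    kept.sort(key=lambda row: row["rank"])  # best-ranked markers first
    grouped = {}
    for row in kept:
        genes = grouped.setdefault(row["group"], [])
        if n_per_group is None or len(genes) < n_per_group:
            genes.append(row["gene"])
    if fmt == "dict":
        return grouped
    if fmt == "sets":
        return {group: set(genes) for group, genes in grouped.items()}
    raise ValueError(f"format not covered by this sketch: {fmt!r}")
```

Because filtering happens before the cut in this sketch, a group can return fewer than n_per_group genes when strict thresholds are active.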
Data Registry
The following preset keys are available for the kind parameter:
  - "raw": raw counts, unprocessed.
  - "processed": fully processed, includes intermediate layers.
  - "processed_slim": fully processed, raw counts only.
  - "processed_slim_tpm": fully processed, TPM counts only (default).
  - "symphony": Symphony reference for query embedding and label transfer.