zmap.reference — Reference Data Loading

Functions for downloading, caching, and loading ZMAP reference datasets and consensus marker tables. Aliased as zmap.ref.

H5AD Loading

zmap.reference.load_zmap_h5ad(*, kind='processed_slim_tpm', url=None, dest_dir=None, filename=None, write_to_disk=True, use_cache=True, force_download=False, backed=False, chunk_size=1048576, show_progress=True, attempt_preprocess_tpmlog=True)

Load a ZMAP reference dataset into memory, downloading it if necessary.

This is the primary entry point for accessing ZMAP reference data. On the first call the file is downloaded and cached on Google Drive (if mounted) or in a local directory. Subsequent calls in the same session are served from an in-memory cache and involve no disk or network I/O.

Load priority (fastest to slowest):

  1. In-memory session cache — instantaneous, no I/O.

  2. File already on disk (Drive or local) — fast, no download.

  3. Fresh download from the ZMAP CDN.
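The lookup order above can be sketched as a three-tier fallback. This is a minimal illustration, not the library's implementation: `load_with_priority`, `_MEMORY_CACHE`, and the injected `fetch` callable are hypothetical names, plain bytes stand in for an AnnData object, and the assumption that `force_download` also bypasses the in-memory tier is not confirmed by the documentation.

```python
from pathlib import Path
from typing import Callable, Dict, Tuple

# Hypothetical in-memory session cache, keyed by the file path.
_MEMORY_CACHE: Dict[str, bytes] = {}


def load_with_priority(
    path: Path,
    fetch: Callable[[], bytes],
    *,
    use_cache: bool = True,
    force_download: bool = False,
) -> Tuple[bytes, str]:
    """Return (payload, source), where source names the tier that served it."""
    key = str(path)

    # Tier 1: in-memory session cache, no I/O at all.
    # Assumption: a forced download skips this tier as well.
    if use_cache and not force_download and key in _MEMORY_CACHE:
        return _MEMORY_CACHE[key], "memory"

    if path.exists() and not force_download:
        # Tier 2: file already on disk, no download needed.
        payload = path.read_bytes()
        source = "disk"
    else:
        # Tier 3: fresh download (here: any callable that returns bytes).
        payload = fetch()
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(payload)
        source = "download"

    # Remember the result so the next call in this session hits tier 1.
    _MEMORY_CACHE[key] = payload
    return payload, source
```

Passing `use_cache=False` in this sketch skips only the first tier, which mirrors the documented use of that flag to force a fresh load from disk.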

Parameters:
  • kind (str | None) –

    Preset dataset to load. One of:

    • "processed_slim_tpm" — fully processed, TPM counts only. Best default for visualization and label transfer.

    • "processed_slim" — fully processed, raw counts only.

    • "processed" — fully processed, includes intermediate layers.

    • "raw" — raw counts, unprocessed.

    • "symphony" — Symphony reference used for query embedding. Required for annotate_with_zmap.

    Ignored when url is provided.

  • url (str | None) – Explicit download URL. Overrides kind. Use this to load a custom or external H5AD not in the ZMAP registry.

  • dest_dir (str | PathLike | None) – Directory where the H5AD file is saved. Defaults to /content/drive/MyDrive/zmap/h5ad when Google Drive is mounted, or <cwd>/zmap/h5ad otherwise.

  • filename (str | None) – Override the filename used when saving to disk. Inferred from the registry or URL when not provided.

  • write_to_disk (bool) – If False, downloads to a temporary file that is deleted after loading. Useful for one-off loads when disk space is constrained. Incompatible with backed=True.

  • use_cache (bool) – If True, return the cached in-memory object on repeat calls. Set to False to force a fresh load from disk (e.g. after modifying the file externally).

  • force_download (bool) – Re-download the file even if it already exists on disk.

  • backed (bool | str) – Open the H5AD in backed (memory-mapped) mode. Pass True for read-only ("r"), or a mode string (e.g. "r+") for read-write. Backed mode avoids loading the full matrix into RAM but is slower for random access. Requires write_to_disk=True.

  • chunk_size (int) – Download chunk size in bytes.

  • show_progress (bool) – Display a tqdm progress bar while downloading.

  • attempt_preprocess_tpmlog (bool) – If the loaded object has a raw_nolog layer but no tpm_log layer, compute tpm_log via TPM normalization + log1p and add it as a layer. Has no effect if tpm_log is already present or if backed=True.

Returns:

The loaded reference dataset.

Return type:

AnnData
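The constraint between `backed` and `write_to_disk` described in the parameter list can be expressed as a small validation step. The helper below is a hypothetical sketch (the name `resolve_backed_mode` and the choice of `ValueError` are assumptions, not part of the documented API); it only shows how `True` maps to read-only mode and why a temporary file cannot back the object.

```python
from typing import Optional, Union


def resolve_backed_mode(backed: Union[bool, str], write_to_disk: bool) -> Optional[str]:
    """Translate `backed` into a file mode string, or None for in-memory loading."""
    if backed is False:
        # Regular in-memory load; write_to_disk may be either value.
        return None
    if not write_to_disk:
        # A backed object keeps reading from its file, so a temporary file
        # that is deleted after loading cannot back it.
        raise ValueError("backed mode requires write_to_disk=True")
    # True means read-only; a mode string such as "r+" is passed through unchanged.
    return "r" if backed is True else backed
```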

Examples

>>> adata_ref = zmap.ref.load_zmap_h5ad()                          # default: processed_slim_tpm
>>> adata_ref = zmap.ref.load_zmap_h5ad(kind="symphony")           # for annotate_with_zmap
>>> adata_ref = zmap.ref.load_zmap_h5ad(url="https://.../my.h5ad", filename="my.h5ad")

zmap.reference.download_zmap_h5ad(*, kind='processed_slim_tpm', url=None, dest_dir=None, filename=None, write_to_disk=True, force_download=False, chunk_size=1048576, show_progress=True)

Download a ZMAP H5AD file from the CDN, with local caching.

Downloads the file to a persistent cache directory (Google Drive when available, otherwise a local directory). Subsequent calls with the same kind skip the download if the file already exists on disk.

Most users should prefer load_zmap_h5ad(), which calls this function internally and also handles loading and preprocessing.

Parameters:
  • kind (str | None) – Preset dataset key. One of the keys in H5AD_SOURCES ("raw", "processed", "processed_slim", "processed_slim_tpm", "symphony"). Ignored when url is provided.

  • url (str | None) – Explicit download URL. Overrides the registry URL looked up via kind.

  • dest_dir (str | PathLike | None) – Directory to store the downloaded file. Defaults to /content/drive/MyDrive/zmap/h5ad when Google Drive is mounted, or <cwd>/zmap/h5ad otherwise.

  • filename (str | None) – Override the filename used when saving to disk. Inferred from the registry or URL when not provided.

  • write_to_disk (bool) – If False, downloads to a temporary file that is not kept after loading.

  • force_download (bool) – Re-download the file even if it already exists on disk.

  • chunk_size (int) – Download chunk size in bytes.

  • show_progress (bool) – Display a tqdm progress bar during download.

Returns:

Path to the downloaded (or cached) H5AD file on disk.

Return type:

Path
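The skip-if-present behaviour and the role of `chunk_size` can be illustrated with a small streaming copy. This is a sketch under stated assumptions rather than the function's actual code: `download_if_missing` and the `open_stream` callable are hypothetical, and writing to a `.part` file before renaming is a design choice of this example, not documented behaviour.

```python
from pathlib import Path
from typing import BinaryIO, Callable


def download_if_missing(
    open_stream: Callable[[], BinaryIO],
    dest: Path,
    *,
    force_download: bool = False,
    chunk_size: int = 1048576,
) -> Path:
    """Stream a source to `dest` in chunks unless the file is already cached."""
    if dest.exists() and not force_download:
        # Cached copy found on disk: skip the download entirely.
        return dest

    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a sibling ".part" file first so an interrupted transfer
    # never leaves a truncated file under the final name.
    tmp = dest.with_name(dest.name + ".part")
    with open_stream() as src, open(tmp, "wb") as out:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
    tmp.replace(dest)
    return dest
```

For a real URL, `open_stream` could be something like `lambda: urllib.request.urlopen(url)`, since the response object supports `read(n)` and the `with` statement. A larger `chunk_size` means fewer reads at the cost of a larger buffer.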

Raises:

ValueError – If no URL can be resolved from kind or url.
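How `kind`, `url`, and `filename` interact, and when the `ValueError` applies, can be sketched as follows. The registry below is a stand-in: the keys come from this page, but the URLs are placeholders on a reserved domain, and `EXAMPLE_SOURCES` and `resolve_source` are hypothetical names rather than the library's `H5AD_SOURCES` or any of its functions.

```python
import posixpath
from typing import Dict, Optional, Tuple
from urllib.parse import urlparse

# Placeholder registry: keys match the documented presets, URLs are made up.
EXAMPLE_SOURCES: Dict[str, str] = {
    "raw": "https://example.invalid/zmap/raw.h5ad",
    "processed": "https://example.invalid/zmap/processed.h5ad",
    "processed_slim": "https://example.invalid/zmap/processed_slim.h5ad",
    "processed_slim_tpm": "https://example.invalid/zmap/processed_slim_tpm.h5ad",
    "symphony": "https://example.invalid/zmap/symphony.h5ad",
}


def resolve_source(
    kind: Optional[str],
    url: Optional[str],
    filename: Optional[str] = None,
    sources: Dict[str, str] = EXAMPLE_SOURCES,
) -> Tuple[str, str]:
    """Return (url, filename); an explicit url always wins over kind."""
    if url is None:
        if kind is None or kind not in sources:
            raise ValueError(f"No URL can be resolved from kind={kind!r} or url=None")
        url = sources[kind]
    if filename is None:
        # Infer the filename from the last path segment, ignoring any query string.
        filename = posixpath.basename(urlparse(url).path)
    if not filename:
        raise ValueError("Could not infer a filename; pass filename explicitly")
    return url, filename
```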

Examples

>>> path = zmap.ref.download_zmap_h5ad()
>>> path = zmap.ref.download_zmap_h5ad(kind="symphony")
>>> path = zmap.ref.download_zmap_h5ad(url="https://.../my.h5ad")

zmap.reference.preprocess_tpmlog(adata)

Add a tpm_log layer by normalizing raw counts to TPM + log1p.

Checks whether adata.layers["raw_nolog"] exists and adata.layers["tpm_log"] does not. When both conditions are met, scales each cell's raw counts by library size to a total of one million (the normalization this module refers to as TPM), applies log1p, and stores the result as adata.layers["tpm_log"].

This is a convenience function called automatically by load_zmap_h5ad() when attempt_preprocess_tpmlog=True.

Parameters:

adata (AnnData) – The dataset to preprocess. Modified in-place.

Notes

After this call, adata.X is cleared (set to None) so that downstream code explicitly selects a layer rather than relying on a stale .X matrix.
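The transformation amounts to tpm_log[i][j] = log(1 + 1e6 * x[i][j] / sum_j x[i][j]) for each cell i. The sketch below reproduces the layer check and the arithmetic on plain Python lists; a dict of nested lists stands in for `adata.layers`, `add_tpm_log` is a hypothetical name, and the handling of all-zero rows is an assumption of this example.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]  # rows are cells, columns are genes


def add_tpm_log(layers: Dict[str, Matrix], target_sum: float = 1e6) -> bool:
    """Add layers["tpm_log"] from layers["raw_nolog"]; return True if it was added."""
    # Only act when the raw layer is present and the target layer is missing.
    if "raw_nolog" not in layers or "tpm_log" in layers:
        return False

    result: Matrix = []
    for row in layers["raw_nolog"]:
        total = sum(row)
        # Assumption: cells with zero total counts stay at zero instead of dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        result.append([math.log1p(value * scale) for value in row])

    layers["tpm_log"] = result
    return True
```

A cell with raw counts `[1, 3]` is scaled to `[250000, 750000]` before `log1p` is applied, so relative expression within the cell is preserved while library-size differences between cells are removed.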

Consensus Markers

zmap.reference.load_consensus_markers(level='CellType', *, groups=None, marker_type='overall', n_per_group=50, min_support_ratio=None, min_log2fc=None, min_enrich=None, omit_unannotated=False, format='dict')

Load ZMAP consensus marker genes for a chosen annotation level.

Marker tables are downloaded on first call and cached locally (on Google Drive when mounted, otherwise in ~/.cache/zmap_tools). Subsequent calls within the same session are served from an in-memory cache.

Parameters:
  • level (Literal['GermLayer', 'Tissue', 'CellType', 'Cluster', 'Leiden100']) –

    Annotation level whose marker table to load. One of:

    • "GermLayer" — broad germ-layer groupings.

    • "Tissue" — tissue-level groupings.

    • "CellType" — cell-type-level groupings (default).

    • "CellTypeFine" — fine-grained cell-type groupings.

    • "Cluster" — cluster-level groupings.

    • "Leiden100" — Leiden resolution-100 cluster groupings.

  • groups (Optional[Sequence[str]]) – Restrict output to a specific subset of groups at the chosen level (e.g. ["Neurons", "hepatocyte"]). Returns all groups when None.

  • marker_type (Literal['specificity', 'contrast', 'consensus', 'prevalence', 'overall']) –

    Scoring criterion used to rank and select markers. One of:

    • "overall" — composite overall rank (recommended default).

    • "specificity" — ranked by how exclusively a gene marks one group.

    • "contrast" — ranked by expression contrast vs. other groups.

    • "consensus" — ranked by agreement across studies/datasets.

    • "prevalence" — ranked by fraction of cells expressing the gene.

  • n_per_group (Optional[int]) – Maximum number of markers to return per group, taken from the top of the chosen marker_type ranking. Pass None to return all markers that pass the active filters.

  • min_support_ratio (Optional[float]) – Minimum support_ratio value required to retain a marker. Filters out genes that are not consistently expressed across studies.

  • min_log2fc (Optional[float]) – Minimum global_log2fc (fold-change vs. all other groups) required to retain a marker.

  • min_enrich (Optional[float]) – Minimum enrich_mean (mean enrichment score) required to retain a marker.

  • omit_unannotated (bool) – If True, remove genes with unannotated or placeholder names, including Ensembl IDs (ENSDARG...) and common zebrafish prefixes such as si:, zgc:, LOC, linc, wu:, bx, GRCz.

  • format (Literal['dict', 'sets', 'table', 'panel']) –

    Output format. One of:

    • "dict" → {group: [gene1, gene2, ...]}

    • "sets" → {group: {gene1, gene2, ...}}

    • "table" — full filtered pd.DataFrame with all scoring columns.

    • "panel" — minimal pd.DataFrame with columns ["group", "gene"], suitable for passing directly to dotplot functions.
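The filtering parameters above combine roughly as shown in this sketch. It is an illustration under several assumptions: rows are plain dicts rather than a DataFrame, `rank_overall` is a hypothetical ranking column, thresholds are applied before the per-group cut, prefix matching is case-sensitive, and the rows in the usage note are toy values rather than real ZMAP scores. The column names `support_ratio`, `global_log2fc`, and `enrich_mean` are the ones named in the parameter descriptions.

```python
from typing import Dict, Iterable, List, Optional

# Prefixes listed for omit_unannotated; matched case-sensitively in this sketch.
UNANNOTATED_PREFIXES = ("ENSDARG", "si:", "zgc:", "LOC", "linc", "wu:", "bx", "GRCz")


def select_markers(
    rows: Iterable[dict],
    *,
    rank_col: str = "rank_overall",  # hypothetical column; lower rank is better
    n_per_group: Optional[int] = 50,
    min_support_ratio: Optional[float] = None,
    min_log2fc: Optional[float] = None,
    min_enrich: Optional[float] = None,
    omit_unannotated: bool = False,
) -> List[dict]:
    """Filter marker rows, then keep the top n_per_group rows of each group."""
    thresholds = {
        "support_ratio": min_support_ratio,
        "global_log2fc": min_log2fc,
        "enrich_mean": min_enrich,
    }
    kept = []
    for row in rows:
        if omit_unannotated and row["gene"].startswith(UNANNOTATED_PREFIXES):
            continue
        # Drop the row if any active threshold is not met.
        if any(t is not None and row[col] < t for col, t in thresholds.items()):
            continue
        kept.append(row)

    # Order by group, then by rank, and cap each group at n_per_group rows.
    kept.sort(key=lambda r: (r["group"], r[rank_col]))
    selected: List[dict] = []
    counts: Dict[str, int] = {}
    for row in kept:
        seen = counts.get(row["group"], 0)
        if n_per_group is None or seen < n_per_group:
            selected.append(row)
            counts[row["group"]] = seen + 1
    return selected
```

With toy rows for two groups, `select_markers(rows, n_per_group=1)` keeps the best-ranked gene of each group, and `n_per_group=None` returns every row that passes the active filters.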

Returns:

Structure depends on format:

  • "dict" → Dict[str, List[str]]

  • "sets" → Dict[str, Set[str]]

  • "table" → pd.DataFrame

  • "panel" → pd.DataFrame with columns ["group", "gene"]
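The "dict" and "sets" shapes carry the same information as the two-column "panel" layout, just keyed by group. The helpers below are hypothetical and work on plain (group, gene) pairs instead of a DataFrame; they are meant only to show how the shapes relate, not how the library builds them.

```python
from typing import Dict, Iterable, List, Set, Tuple


def pairs_to_dict(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (group, gene) pairs into {group: [gene, ...]}, preserving order."""
    out: Dict[str, List[str]] = {}
    for group, gene in pairs:
        out.setdefault(group, []).append(gene)
    return out


def dict_to_sets(markers: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Convert ranked lists to sets, trading rank order for fast membership tests."""
    return {group: set(genes) for group, genes in markers.items()}


def shared_markers(sets: Dict[str, Set[str]], a: str, b: str) -> Set[str]:
    """Genes that appear in the marker sets of both groups."""
    return sets[a] & sets[b]
```

The list form is the natural choice when rank order matters, for example when taking the first few genes per group for a plot, while the set form suits overlap and membership checks.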

Return type:

Any

Examples

>>> markers = zmap.ref.load_consensus_markers()                          # top 50 markers per CellType group (default n_per_group=50)
>>> markers = zmap.ref.load_consensus_markers(level="Tissue", n_per_group=10)
>>> markers = zmap.ref.load_consensus_markers(groups=["Neurons", "hepatocyte"])
>>> df = zmap.ref.load_consensus_markers(format="panel")                # for dotplot

Data Registry

The following preset keys are available for the kind parameter:

  • "raw" — raw counts, unprocessed.

  • "processed" — fully processed, includes intermediate layers.

  • "processed_slim" — fully processed, raw counts only.

  • "processed_slim_tpm" — fully processed, TPM counts only (default).

  • "symphony" — Symphony reference for query embedding and label transfer.