zmap.predict — Label Transfer & Annotation

End-to-end pipeline and lower-level functions for transferring ZMAP reference labels to query single-cell datasets via kNN voting in Symphony/Harmony embedding space.

Full Pipeline

zmap.predict.annotate_with_zmap(adata_query, *, query_raw_counts_source, adata_ref=None, ref_kind='symphony', ref_label_col='ZMAP_CellType', label_space=None, query_truth_col=None, query_label_col=None, cluster_col=None, do_preprocess=True, do_map_embedding=True, do_ingest=True, tissue_aware=False, evaluate=False, n_neighbors=25, marker_validation=True, preprocess_kwargs=None, predict_kwargs=None, verbosity=2, debug=False, print_summary=None, show_plots=None, save_outputs=True, output_dir='zmap_predict')

End-to-end ZMAP annotation pipeline: preprocess → embed → transfer labels → plot.

This is the primary entry point for annotating a new single-cell dataset with ZMAP reference labels. It chains the following steps:

  1. Preprocess — normalize raw counts to TPM + log1p (preprocess_adata_query).

  2. Embed — map the query into the ZMAP Symphony PCA embedding and ingest into the reference UMAP (requires symphonypy).

  3. Label transfer — kNN voting to assign cell-type, tissue, and time labels (predict_labels_kNN; optional tissue-aware mode via predict_labels_tissue_kNN).

  4. Summarize — store a simplified run summary in adata_query.uns['zmap_labels'][<space>]['Run Summary Simple'].

  5. Plot — overlay query cells on the reference UMAP with on-data labels (plot_embedding_with_ondata_labels).

  6. Map labels (optional) — cross-tabulate ZMAP labels against an existing query labeling (e.g. Leiden clusters) via map_query_labels.
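The cross-tabulation in step 6 can be pictured as counting cells for every (query label, ZMAP label) pair. The sketch below is illustrative only: the helper name crosstab and the toy labels are hypothetical, and map_query_labels may compute or normalize the table differently.

```python
from collections import Counter


def crosstab(query_labels, zmap_labels):
    """Hypothetical helper: count cells for every (query label, ZMAP label) pair."""
    counts = Counter(zip(query_labels, zmap_labels))
    rows = sorted(set(query_labels))
    cols = sorted(set(zmap_labels))
    # Nested dict: table[query_label][zmap_label] -> number of cells.
    return {r: {c: counts.get((r, c), 0) for c in cols} for r in rows}


# Toy data: five cells with a Leiden cluster ID and a transferred label each.
leiden = ["0", "0", "1", "1", "1"]
zmap_pred = ["muscle", "muscle", "neuron", "neuron", "muscle"]

for cluster, row in crosstab(leiden, zmap_pred).items():
    print(cluster, row)
```

Rows dominated by a single column indicate clusters that map cleanly onto one reference label.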

All run parameters are stored in adata_query.uns['zmap_labels'][label_space]['_run_config'] so that on-demand accessors (plot_qc, plot_embedding, plot_time, plot_overlap_matrix, show_summary) can reproduce pipeline outputs with just adata_query — no extra arguments needed.

Parameters:
  • adata_query (ad.AnnData) – Query dataset to annotate. Modified in-place.

  • query_raw_counts_source (str) – Where raw integer counts are stored in adata_query. Pass "X" to use adata_query.X, or a layer name (e.g. "counts") to use adata_query.layers[query_raw_counts_source]. Required — no default.

  • adata_ref (ad.AnnData | None) – Pre-loaded ZMAP reference object. When None, the reference is loaded automatically using load_zmap_h5ad(kind=ref_kind).

  • ref_kind (str) – Which reference preset to load when adata_ref=None. Passed to load_zmap_h5ad. Use "symphony" for label transfer.

  • ref_label_col (str) – Column in the reference obs whose labels are transferred to the query. Also controls which UMAP overlay plot is generated.

  • label_space (str | None) – Namespace for output columns and uns keys. Defaults to ref_label_col.

  • query_truth_col (str | None) – Ground-truth label column in adata_query.obs, used for evaluation metrics when evaluate=True.

  • query_label_col (str | None) – Column in adata_query.obs containing user-defined cluster or label IDs (e.g. "leiden"). When provided, enables cluster-level consensus aggregation and the label-overlap matrix. Recommended for most workflows.

  • cluster_col (str | None) – Deprecated alias for query_label_col.

  • do_preprocess (bool) – Run TPM normalization + log1p on the query before mapping. Set to False if adata_query.X is already log-normalized.

  • do_map_embedding (bool) – Run Symphony embedding mapping. Requires symphonypy. Set to False if the query already has a X_pca_harmony embedding.

  • do_ingest (bool) – Ingest the query into the reference UMAP after Symphony mapping. Only applies when do_map_embedding=True.

  • tissue_aware (bool) – Use tissue-aware kNN transfer (predict_labels_tissue_kNN). Equivalent to predict_kwargs={"use_tissue_aware_knn": True, "auto_pseudo_tissue": True}. When True, any additional tissue-aware options can still be passed via predict_kwargs.

  • evaluate (bool) – Compute accuracy and evaluation metrics against query_truth_col. Requires query_truth_col to be set. Equivalent to predict_kwargs={"evaluate": True, "plot_eval_curves": True}.

  • n_neighbors (int) – Number of nearest neighbors for kNN label voting. With Gaussian distance weighting (the default), 25 is robust — distant neighbors are downweighted automatically, so the effective neighborhood adapts to local density.

  • marker_validation (bool) – Validate predicted labels by comparing DE markers against the ZMAP consensus marker ledger. Discovers the top 20 DE genes per predicted group and measures overlap with the top 100 reference markers. Results are stored in adata_query.uns['zmap_labels'][label_space]['Marker Validation'].

  • preprocess_kwargs (Mapping[str, Any] | None) – Extra keyword arguments forwarded to preprocess_adata_query (e.g. {"strict_counts": True}).

  • predict_kwargs (Mapping[str, Any] | None) – Extra keyword arguments forwarded to predict_labels_kNN. For common options, prefer the top-level tissue_aware and evaluate parameters instead of passing dicts manually.

  • verbosity (int) –

    Controls how much output is printed and displayed inline:

    • 0 — silent (no print, no inline plots).

    • 1 — progress lines only ([ZMAP] Step complete (Xs)).

    • 2 — compact summary + UMAP overlay + combined QC figure.

    • 3 — full display: all tables via display(), all plots including heatmap.

  • debug (bool) – If True, re-raise exceptions from plotting and aggregation steps instead of catching them. Useful for development and troubleshooting.

  • print_summary (bool | None) – Deprecated. Use verbosity instead. When explicitly set, False caps verbosity at 0.

  • show_plots (bool | None) – Deprecated. Use verbosity instead. When explicitly set, False caps verbosity at 1.

  • save_outputs (bool) – Save cell annotations CSV, cluster summary CSV, and all figures to {output_dir}/{label_space}/.

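  • output_dir (str) – Root directory for saved outputs. Files are written to {output_dir}/{label_space}/ when save_outputs=True.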
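The interaction between verbosity and the deprecated print_summary and show_plots flags, as described in the parameter list above, can be sketched as follows. The helper name resolve_verbosity is hypothetical and not part of zmap.predict; the actual resolution logic may differ.

```python
from typing import Optional


def resolve_verbosity(
    verbosity: int = 2,
    print_summary: Optional[bool] = None,
    show_plots: Optional[bool] = None,
) -> int:
    """Hypothetical helper: fold the deprecated flags into one verbosity level."""
    level = verbosity
    # show_plots=False caps verbosity at 1 (progress lines only).
    if show_plots is False:
        level = min(level, 1)
    # print_summary=False caps verbosity at 0 (silent).
    if print_summary is False:
        level = min(level, 0)
    return level


print(resolve_verbosity(3, show_plots=False))     # 1
print(resolve_verbosity(2, print_summary=False))  # 0
print(resolve_verbosity(2))                       # 2
```

Leaving a deprecated flag at None has no effect, so only an explicit False lowers the level.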

Returns:

The annotated query dataset (same object, modified in-place). Key additions to adata_query:

  • .obs[f"{label_space}_predicted"] — transferred cell labels.

  • .obs[f"{label_space}_prob"] — label confidence (0–1).

  • .obs["ZMAP_time_id_predicted"] — predicted time (hpf).

  • .obsm["X_umap"] — UMAP coordinates (if ingested).

  • .uns['zmap_labels']['_last_space'] — most recent label_space.

  • .uns['zmap_labels'][label_space]['_run_config'] — stored run parameters for zero-arg on-demand plot accessors.

  • .uns['zmap_labels'][label_space]['Run Summary Simple'] — key/value run summary.

  • .uns['zmap_labels'][label_space]['Cell Annotations'] — per-cell table.

  • .uns['zmap_labels'][label_space]['Cluster Summary'] — cluster consensus table (only when query_label_col is provided).

  • .uns['zmap_labels'][label_space]['Label Mapping'] — label overlap matrix (only when query_label_col is provided).

  • .uns['zmap_labels'][label_space]['Marker Validation'] — DE marker overlap with ZMAP reference ledger (only when marker_validation=True).

Return type:

ad.AnnData
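As a rough illustration of the 'Marker Validation' entry, the sketch below measures how many of the top 20 query DE genes appear among the top 100 reference markers. The overlap score used here (a simple fraction), the helper name marker_overlap, and the toy gene lists are assumptions; the documentation states only that overlap is measured.

```python
def marker_overlap(query_markers, reference_markers, n_query=20, n_ref=100):
    """Hypothetical helper: fraction of top query DE genes found in the reference.

    Both inputs are gene lists ranked from strongest to weakest marker.
    """
    top_query = list(query_markers)[:n_query]
    top_ref = set(list(reference_markers)[:n_ref])
    if not top_query:
        return 0.0
    shared = [gene for gene in top_query if gene in top_ref]
    return len(shared) / len(top_query)


# Toy ranked gene lists (illustrative values only).
query = ["myod1", "myf5", "actc1b", "tnnt2a"]
reference = ["myod1", "actc1b", "desma", "myog"]
print(marker_overlap(query, reference))  # 0.5
```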

Examples

Minimal usage:

>>> adata = zmap.predict.annotate_with_zmap(
...     adata_query,
...     query_raw_counts_source="counts",
...     query_label_col="leiden",
... )

Tissue-aware mode:

>>> adata = zmap.predict.annotate_with_zmap(
...     adata_query,
...     query_raw_counts_source="counts",
...     tissue_aware=True,
... )

Evaluation mode with ground-truth labels:

>>> adata = zmap.predict.annotate_with_zmap(
...     adata_query,
...     query_raw_counts_source="counts",
...     query_truth_col="manual_annotation",
...     evaluate=True,
... )

Re-plot any output with zero arguments:

>>> zmap.predict.plot_qc(adata)
>>> zmap.predict.plot_embedding(adata)
>>> zmap.predict.plot_overlap_matrix(adata)
>>> zmap.predict.show_summary(adata)

Preprocessing

zmap.predict.preprocess_adata_query(adata_query, *, counts_source, target_sum=1000000.0, inplace=True, integer_tol=0.001, strict_counts=False)

Normalize raw counts in a query AnnData for ZMAP/Symphony label transfer.

Reads raw counts from the specified location, performs library-size normalization (TPM-style) followed by log1p, and writes the result into adata_query.X. Preprocessing metadata is recorded in adata_query.uns['ZMAP_preprocessing']['query'].

This function is called automatically by annotate_with_zmap when do_preprocess=True. Call it manually only if you need fine-grained control over normalization before running the pipeline.

Parameters:
  • adata_query (AnnData) – Query dataset. Modified in-place when inplace=True.

  • counts_source (str) – Where raw integer counts are stored. Pass "X" to use adata_query.X, or a layer name (e.g. "counts") to use adata_query.layers[counts_source]. This parameter is required and has no default, so the source of raw counts must always be stated explicitly.

  • target_sum (float) – Library size each cell is normalized to before log1p. The default of 1e6 scales every cell to one million total counts (counts per million, referred to here as TPM-style).

  • inplace (bool) – If True, modify adata_query in-place and return it. If False, operate on a copy and return the copy.

  • integer_tol (float) – Tolerance used when checking whether values are integer-like. Values deviating from the nearest integer by more than this amount count towards the non-integer fraction.

  • strict_counts (bool) – If True, raise a ValueError when the data contains NaN/inf, negative values, or appears non-integer-like (> 1% of non-zero values deviate from an integer). If False, emit a warning instead.

Returns:

The preprocessed AnnData (same object when inplace=True).

Return type:

AnnData

Raises:
  • KeyError – If counts_source is not "X" and is not found in adata_query.layers.

  • TypeError – If the raw data is not numeric.

  • ValueError – If strict_counts=True and data quality checks fail.
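The data quality checks behind strict_counts can be sketched in plain Python as shown below. This is a sketch under stated assumptions: the helper names check_counts and validate are hypothetical, the input is a flat list rather than a matrix, and the 1% threshold is taken from the strict_counts description above.

```python
import math


def check_counts(values, integer_tol=1e-3, max_nonint_frac=0.01):
    """Hypothetical helper: list the problems found in a flat iterable of counts."""
    values = list(values)
    problems = []
    if any(not math.isfinite(v) for v in values):
        problems.append("contains NaN/inf")
    finite = [v for v in values if math.isfinite(v)]
    if any(v < 0 for v in finite):
        problems.append("contains negative values")
    nonzero = [v for v in finite if v != 0]
    if nonzero:
        # Count non-zero values further than integer_tol from the nearest integer.
        n_off = sum(abs(v - round(v)) > integer_tol for v in nonzero)
        if n_off / len(nonzero) > max_nonint_frac:
            problems.append("values are not integer-like")
    return problems


def validate(values, strict_counts=False):
    """Raise on problems in strict mode; otherwise just report them."""
    problems = check_counts(values)
    if problems and strict_counts:
        raise ValueError("; ".join(problems))
    return problems  # a non-strict implementation would emit a warning here


print(validate([0, 1, 2, 3]))      # []
print(validate([0.5, 1.2, 3.0]))   # ['values are not integer-like']
```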

Notes

After this call, adata_query.X contains log-normalized (TPM + log1p) values regardless of what it held before. The original counts in counts_source are not modified.
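The normalization itself amounts to scaling every cell to target_sum total counts and then applying log1p. The sketch below shows the arithmetic on nested lists; it is a minimal illustration with a hypothetical helper name (normalize_log1p), not the function's actual implementation, which operates on AnnData matrices.

```python
import math


def normalize_log1p(counts, target_sum=1_000_000.0):
    """Hypothetical helper: scale each cell (row) to target_sum, then apply log1p."""
    out = []
    for cell in counts:
        total = sum(cell)
        # Cells with zero total counts are left as all zeros.
        scale = target_sum / total if total > 0 else 0.0
        out.append([math.log1p(v * scale) for v in cell])
    return out


raw = [[1, 3, 0], [10, 0, 10]]
norm = normalize_log1p(raw)

# Undoing log1p recovers values that sum to target_sum in every cell.
print([round(sum(math.expm1(v) for v in cell)) for cell in norm])  # [1000000, 1000000]
```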

kNN Label Transfer

zmap.predict.predict_labels_tissue_kNN(adata_query, adata_ref, *, ref_label_col, label_space=None, query_truth_col=None, ref_basis='X_pca_harmony', query_basis='X_pca_harmony', label_suffix=None, time_labels='time_id', n_neighbors=25, metric='cosine', ref_latent_key=None, query_latent_key=None, k=None, knn_metric=None, tissue_col=None, tissue_mode='hard', ref_tissue_col='ZMAP_Tissue', query_tissue_col='ZMAP_Tissue', tissue_penalty_lambda=1.0, hard_fallback_min_cells=10, knn_backend='auto', knn_device='auto', knn_nprobe=None, knn_l2norm=False, class_prior_alpha=0.0, pseudo_tissue_k=None, pseudo_tissue_threshold=0.0, pseudo_tissue_margin_threshold=0.0, auto_pseudo_tissue=True, fallback_to_plain_knn=True, pseudo_tissue_unknown_label='unknown', reuse_knn_cache=True, confidence_threshold=None, margin_threshold=0.0, include_unassigned=False, run_time_prediction=False, time_col='time_group_id', time_order=None, time_topk=5, time_hard_topk=5, time_trim_extremes=1, time_tau=0.0, time_monotone_delta=0, time_monotone_gamma=1.0, omit_labels=['unknown', 'nan', 'unassigned'], class_balance=None, time_balance=None, balance_gamma=1, balance_eps=1e-09, vote_weighting='gaussian', vote_sigma=None, time_stat_function='trimmed_mean', time_trim_alpha=0.25, time_winsor_alpha=0.25, time_distance='gaussian', time_sigma=None, time_inv_eps=1e-06, time_inv_power=1.0, evaluate=False, plot_eval_curves=False, plot_mapping_qc=True, save_mapping_qc=True, show_qc_plots=True, p_thresh=0.8, d_thresh=None, min_cells_per_label=15, apply_filters=True, output_dir='zmap_predict')

Tissue-aware variant of step-3 label transfer.

This function computes a tissue-aware neighbor graph from the step-2 embedding (query_basis), caches it in adata_query.uns['zmap_neighbors'], then reuses predict_labels_kNN(…) for voting, QC, and summary so that step-4 inputs remain unchanged.

Parameters:
  • ref_label_col (str)

  • label_space (str | None)

  • query_truth_col (str | None)

  • ref_basis (str)

  • query_basis (str)

  • label_suffix (str | None)

  • time_labels (str)

  • n_neighbors (int)

  • metric (str)

  • ref_latent_key (str | None)

  • query_latent_key (str | None)

  • k (int | None)

  • knn_metric (str | None)

  • tissue_col (str | None)

  • tissue_mode (str)

  • ref_tissue_col (str)

  • query_tissue_col (str)

  • tissue_penalty_lambda (float)

  • hard_fallback_min_cells (int | None)

  • knn_backend (str)

  • knn_device (str)

  • knn_nprobe (int | None)

  • knn_l2norm (bool)

  • class_prior_alpha (float)

  • pseudo_tissue_k (int | None)

  • pseudo_tissue_threshold (float)

  • pseudo_tissue_margin_threshold (float)

  • auto_pseudo_tissue (bool)

  • fallback_to_plain_knn (bool)

  • pseudo_tissue_unknown_label (str)

  • reuse_knn_cache (bool)

  • confidence_threshold (float | None)

  • margin_threshold (float)

  • include_unassigned (bool)

  • run_time_prediction (bool)

  • time_col (str)

  • time_order (str | list[str] | None)

  • time_topk (int)

  • time_hard_topk (int)

  • time_trim_extremes (int)

  • time_tau (float)

  • time_monotone_delta (int)

  • time_monotone_gamma (float)

  • omit_labels (list[str] | None)

  • class_balance (str | None)

  • time_balance (str | None)

  • balance_gamma (float)

  • balance_eps (float)

  • vote_weighting (str | None)

  • vote_sigma (float | None)

  • time_stat_function (str)

  • time_trim_alpha (float)

  • time_winsor_alpha (float)

  • time_distance (str | None)

  • time_sigma (float | None)

  • time_inv_eps (float)

  • time_inv_power (float)

  • evaluate (bool)

  • plot_eval_curves (bool)

  • plot_mapping_qc (bool)

  • save_mapping_qc (bool)

  • show_qc_plots (bool)

  • p_thresh (float | None)

  • d_thresh (float | None)

  • min_cells_per_label (int)

  • apply_filters (bool)

  • output_dir (str)

zmap.predict.predict_labels_kNN(adata_query, adata_ref, *, ref_label_col, label_space=None, query_truth_col=None, ref_basis='X_pca_harmony', query_basis='X_pca_harmony', label_suffix=None, time_labels='time_id', n_neighbors=25, metric='cosine', knn_backend='auto', knn_device='auto', knn_nprobe=None, omit_labels=['unknown', 'nan', 'unassigned'], class_balance=None, time_balance=None, balance_gamma=1, balance_eps=1e-09, vote_weighting='gaussian', vote_sigma=None, time_stat_function='trimmed_mean', time_trim_alpha=0.25, time_winsor_alpha=0.25, time_distance='gaussian', time_sigma=None, time_inv_eps=1e-06, time_inv_power=1.0, evaluate=False, plot_eval_curves=False, plot_mapping_qc=True, save_mapping_qc=True, show_qc_plots=True, p_thresh=0.8, d_thresh=None, min_cells_per_label=15, apply_filters=True, output_dir='zmap_predict', expected_cache_mode='none')

Transfer cell-type labels from a reference to a query dataset using kNN voting.

Builds a kNN index over the reference embedding, votes on labels using distance-weighted nearest neighbors, and writes per-cell predictions and confidence scores into adata_query.obs. Reference cells with excluded labels (omit_labels) are removed from the index before building it, ensuring clean 1/k probability steps in the vote tallies.

Results are stored under adata_query.uns['zmap_labels'][label_space].

Parameters:
  • adata_query (anndata.AnnData) – Query dataset to annotate.

  • adata_ref (anndata.AnnData) – Reference dataset providing labels and the embedding basis.

  • ref_label_col (str) – Column in adata_ref.obs containing the labels to transfer.

  • label_space (str | None) – Namespace used for output columns and uns keys. Defaults to ref_label_col when None.

  • query_truth_col (str | None) – Optional ground-truth label column in adata_query.obs used for evaluation metrics when evaluate=True.

  • ref_basis (str) – obsm key in adata_ref containing the reference embedding.

  • query_basis (str) – obsm key in adata_query containing the query embedding.

  • label_suffix (str | None) – Suffix appended to the predicted label column name in adata_query.obs.

  • time_labels (str) – Column in adata_ref.obs containing numeric developmental time values for time-score aggregation.

  • n_neighbors (int) – Number of nearest neighbors used for voting.

  • metric (str) – Distance metric for the kNN index. Passed directly to the underlying nearest-neighbor library.

  • omit_labels (list[str] | None) – Labels in ref_label_col to exclude from the kNN index entirely. Cells carrying these labels are removed before index construction.

  • class_balance (str | None) – Strategy for reweighting votes by class frequency. None applies no reweighting; "global_inverse" upweights underrepresented classes.

  • time_balance (str | None) – Strategy for reweighting votes by time-point frequency. Options mirror class_balance.

  • balance_gamma (float) – Exponent applied to inverse-frequency weights. Higher values increase the strength of balancing.

  • vote_weighting (str | None) – Distance weighting scheme applied to neighbor votes during label transfer. None uses uniform 1/k voting (discrete probabilities); "gaussian" applies a Gaussian kernel (continuous probabilities, recommended); "inverse" uses inverse-distance weights. Gaussian weighting produces better-calibrated confidence scores, smoother ROC/PR curves, and makes d_thresh unnecessary.

  • vote_sigma (float | None) – Bandwidth for the Gaussian kernel when vote_weighting="gaussian". If None, uses the per-cell median neighbor distance (adaptive).

  • time_stat_function (str) – Aggregation function for predicting a continuous time score per cell. One of "mean", "median", "trimmed_mean", "winsor_mean".

  • time_trim_alpha (float) – Trim fraction used when time_stat_function="trimmed_mean". Must be in [0, 0.5).

  • time_winsor_alpha (float) – Winsorization fraction used when time_stat_function="winsor_mean". Must be in [0, 0.5).

  • time_distance (str | None) – Distance weighting scheme applied to neighbors when computing the time score. None uses uniform weights; "gaussian" applies a Gaussian kernel; "inverse" uses inverse-distance weights.

  • time_sigma (float | None) – Bandwidth for the Gaussian kernel. If None, uses the per-cell median neighbor distance.

  • evaluate (bool) – Compute accuracy and other evaluation metrics against query_truth_col. Requires query_truth_col to be set.

  • plot_eval_curves (bool) – Plot confidence-threshold curves when evaluate=True.

  • plot_mapping_qc (bool) – Plot per-cell confidence and distance QC distributions after prediction.

  • save_mapping_qc (bool) – Save QC plots under output_dir (default 'zmap_predict').

  • show_qc_plots (bool) – Call plt.show() for QC plots. Set to False when display is managed by a higher-level wrapper (e.g. annotate_with_zmap).

  • p_thresh (float | None) – Minimum vote probability required to assign a label. Cells below this threshold are marked as unassigned. With vote_weighting="gaussian", this is the only filter needed.

  • d_thresh (float | None) – Deprecated. Maximum allowable mean distance to neighbors. Kept for backward compatibility but redundant when vote_weighting is set, as distance information is already incorporated into the vote probabilities.

  • min_cells_per_label (int) – Minimum number of reference cells a label must have to be included in voting. Labels with fewer cells are treated as omit_labels.

  • apply_filters (bool) – Apply p_thresh filter to produce the final predicted label column. Set to False to retain raw predictions.

  • knn_backend (str)

  • knn_device (str)

  • knn_nprobe (int | None)

  • balance_eps (float)

  • time_inv_eps (float)

  • time_inv_power (float)

  • output_dir (str)

  • expected_cache_mode (str)
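To make the vote_weighting, vote_sigma, and p_thresh parameters concrete, the sketch below runs a Gaussian-weighted vote for a single query cell. It is a sketch under stated assumptions: the kernel form exp(-d^2 / (2 * sigma^2)) and the helper name gaussian_vote are not confirmed by this reference; only the adaptive median bandwidth and the probability threshold are taken from the parameter descriptions.

```python
import math
from collections import defaultdict
from statistics import median


def gaussian_vote(distances, labels, sigma=None, p_thresh=0.8):
    """Hypothetical helper: weighted kNN vote for one query cell.

    distances, labels: the k nearest reference neighbors of the cell.
    Returns (label or None, probability of the top label).
    """
    if sigma is None:
        # Adaptive bandwidth: median distance to this cell's neighbors.
        sigma = median(distances)
    sigma = max(sigma, 1e-12)  # guard against a zero bandwidth
    scores = defaultdict(float)
    for d, lab in zip(distances, labels):
        # Assumed kernel: closer neighbors contribute larger weights.
        scores[lab] += math.exp(-(d ** 2) / (2 * sigma ** 2))
    total = sum(scores.values())
    top = max(scores, key=scores.get)
    prob = scores[top] / total
    # Cells whose top probability falls below p_thresh stay unassigned.
    return (top if prob >= p_thresh else None), prob


label, prob = gaussian_vote([0.1, 0.1, 0.1, 0.9], ["A", "A", "A", "B"])
print(label, round(prob, 3))
```

In this toy case the single distant "B" neighbor receives almost no weight, which is the behavior the n_neighbors description attributes to Gaussian weighting.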

Returns:

Results are written directly into adata_query:

  • adata_query.obs[f"{label_space}_predicted"] — predicted labels.

  • adata_query.obs[f"{label_space}_prob"] — top-label vote probability.

  • adata_query.obs["ZMAP_time_id_predicted"] — predicted developmental time.

  • adata_query.uns['zmap_labels'][label_space] — full run metadata.

Return type:

None
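The options of time_stat_function differ in how they treat outlying neighbor times. The sketch below compares them on one cell's neighbors. The helper name time_score is hypothetical, and the tail convention (floor of n * alpha values per tail) is an assumption; distance weighting via time_distance is omitted for brevity.

```python
from statistics import mean, median


def time_score(times, stat="trimmed_mean", trim_alpha=0.25, winsor_alpha=0.25):
    """Hypothetical helper: aggregate neighbor times (e.g. hpf) into one score."""
    vals = sorted(times)
    n = len(vals)
    if stat == "mean":
        return mean(vals)
    if stat == "median":
        return median(vals)
    if stat == "trimmed_mean":
        cut = int(n * trim_alpha)  # values dropped from each tail
        return mean(vals[cut:n - cut])
    if stat == "winsor_mean":
        cut = int(n * winsor_alpha)  # values clamped in each tail
        lo, hi = vals[cut], vals[n - 1 - cut]
        return mean(min(max(v, lo), hi) for v in vals)
    raise ValueError(f"unknown stat: {stat}")


# Eight neighbors, one of them an outlier at 72 hpf.
neighbor_times = [10.0, 12.0, 12.0, 14.0, 14.0, 16.0, 18.0, 72.0]
print(time_score(neighbor_times, "mean"))          # 21.0
print(time_score(neighbor_times, "trimmed_mean"))  # 14.0
print(time_score(neighbor_times, "winsor_mean"))   # 14.0
```

The plain mean is pulled toward the outlier, while the trimmed and winsorized means stay near the bulk of the neighbors.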

Post-processing & Summaries

zmap.predict.summarize_knn_run(adata_query, label_key)

Return a concise summary table for a completed kNN label-transfer run.

Reads the run metadata stored in adata_query.uns['zmap_labels'][label_key] and formats the key statistics as a two-column DataFrame.

Parameters:
  • adata_query (anndata.AnnData) – Query dataset that has been annotated by predict_labels_kNN or annotate_with_zmap.

  • label_key (str) – The label_space used when the prediction was run (matches the key under adata_query.uns['zmap_labels']).

Returns:

Two-column table with columns ["Key", "Value"] containing:

  • label_space — label namespace used.

  • n_neighbors — number of neighbors in the kNN run.

  • metric — distance metric used.

  • p_thresh — probability threshold applied.

  • n_assigned — number of cells that received a label.

  • pct_assigned — percentage of cells that received a label.

Return type:

pd.DataFrame

Raises:

KeyError – If label_key is not found in adata_query.uns['zmap_labels'], or if the run metadata is missing a "Run Summary" entry.

zmap.predict.aggregate_by_cluster(adata_query, cluster_col, label_space, *, save_csv=True, output_dir='zmap_predict')

Aggregate cell-level ZMAP annotations to cluster-level consensus calls.

For each cluster in cluster_col, identifies the plurality label among all QC-assigned (non-NA) cells, computes the fraction of assigned cells carrying that label (consensus fraction), the mean per-cell kNN vote probability for those cells, and the margin over the second-ranked label. Also reports raw coverage counts so the user can assess per-cluster annotation quality (e.g., clusters where most cells were rejected).

Parameters:
  • adata_query (AnnData) – Query dataset annotated by predict_labels_kNN or annotate_with_zmap.

  • cluster_col (str) – Column in adata_query.obs containing user-defined cluster IDs (e.g. "leiden").

  • label_space (str) – Label namespace used during prediction (must match adata_query.uns['zmap_labels'][label_space]). Used to derive the predicted-label and probability column names.

  • save_csv (bool) – Write the cluster summary table to {output_dir}/{label_space}_cluster_summary.csv.

  • output_dir (str)

Returns:

One row per cluster, sorted by cluster ID, with columns:

  • cluster — cluster identifier.

  • n_cells_total — total cells in cluster.

  • n_cells_assigned — cells with a non-NA predicted label (passed QC).

  • pct_assigned — percentage of cells that passed QC.

  • top_label — plurality ZMAP label among assigned cells.

  • top_fraction — fraction of assigned cells carrying the top label.

  • mean_prob — mean kNN vote probability of top-label cells.

  • margin — top_fraction minus second_fraction; NaN when fewer than 2 distinct labels are present.

  • second_label — second-ranked label; NaN when only one label is present.

  • second_fraction — fraction of second-ranked label; NaN when only one label is present.

Return type:

DataFrame

Raises:

KeyError – If cluster_col or the predicted-label column derived from label_space is not found in adata_query.obs.

Notes

The aggregation operates only on cells whose predicted label is non-NA (i.e., cells that passed QC filters in predict_labels_kNN). Rejected cells are counted in n_cells_total but excluded from voting, so that top_fraction and margin reflect the confidence of the accepted predictions rather than being diluted by noise.

mean_prob reflects the mean per-cell kNN vote probability for top-label cells only, and is distinct from top_fraction. top_fraction captures cluster-level consensus (how unanimously assigned cells agree); mean_prob captures how confident the kNN classifier was for those individual cells.
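The distinction between top_fraction, mean_prob, and margin can be illustrated for a single cluster as follows. This is a minimal sketch: the helper name cluster_consensus is hypothetical, rejected cells are represented by None rather than NA, and missing values are reported as None instead of NaN.

```python
from collections import Counter


def cluster_consensus(labels, probs):
    """Hypothetical helper: summarize one cluster from per-cell labels and probabilities."""
    # Rejected cells (label None) count toward the total but do not vote.
    assigned = [(lab, p) for lab, p in zip(labels, probs) if lab is not None]
    summary = {
        "n_cells_total": len(labels),
        "n_cells_assigned": len(assigned),
        "top_label": None,
        "top_fraction": None,
        "mean_prob": None,
        "margin": None,
    }
    if not assigned:
        return summary
    ranked = Counter(lab for lab, _ in assigned).most_common()
    top_label, top_n = ranked[0]
    top_probs = [p for lab, p in assigned if lab == top_label]
    summary["top_label"] = top_label
    summary["top_fraction"] = top_n / len(assigned)
    summary["mean_prob"] = sum(top_probs) / len(top_probs)
    if len(ranked) > 1:
        # Margin: top fraction minus second-ranked fraction.
        summary["margin"] = (top_n - ranked[1][1]) / len(assigned)
    return summary


labels = ["muscle", "muscle", "muscle", "neuron", None]
probs = [0.9, 1.0, 0.8, 0.9, 0.4]
print(cluster_consensus(labels, probs))
```

Here three of the four assigned cells agree (top_fraction 0.75), while mean_prob averages only the three "muscle" probabilities.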

zmap.predict.build_cell_annotations_table(adata_query, label_space, *, cluster_col=None, time_col='ZMAP_time_id_predicted', save_csv=True, output_dir='zmap_predict')

Build a concise per-cell annotation table from a completed ZMAP run.

Extracts the annotation-relevant columns from adata_query.obs into a clean, self-contained DataFrame suitable for inspection, CSV export, or downstream analysis. Only annotation columns produced by ZMAP are included — the full obs is not copied.

Parameters:
  • adata_query (AnnData) – Annotated query dataset.

  • label_space (str) – Label namespace used during prediction (matches adata_query.uns['zmap_labels'][label_space]).

  • cluster_col (str | None) – If provided, include this column (e.g. "leiden") as the first data column so that cells can be linked back to user-defined clusters.

  • time_col (str) – Column in adata_query.obs containing predicted developmental time. Must match the column written by predict_labels_kNN (which depends on time_labels and label_suffix).

  • save_csv (bool) – Write the table to {output_dir}/{label_space}_cell_annotations.csv.

  • output_dir (str)

Returns:

One row per cell. cell_id is the obs index (cell barcode). Additional columns are included when present in adata_query.obs:

  • {cluster_col} — user-defined cluster ID (if provided).

  • {label_space}_predicted — assigned label (NA if rejected).

  • {label_space}_prob — kNN vote probability (0–1).

  • {label_space}_reject_flag — True if cell failed QC.

  • {label_space}_reason — which filter triggered rejection.

  • {time_col} — predicted developmental time (hpf).

Return type:

DataFrame

Visualization

zmap.predict.plot_embedding_with_ondata_labels(adata_ref, adata_test, *, color_key='ZMAP_Tissue_predicted', basis='X_umap', filter_na=True, palette=None, palette_uns_key=None, show_time_strip=True, time_key='ZMAP_time_id', time_strip_width_ratio=0.03, time_strip_kwargs=None, figsize=(6, 6), dpi=200, ref_size=2, ref_alpha=0.3, test_size=2, test_alpha=1.0, cmap='jet', frameon=False, sort_order=True, legend_loc='on data', legend_fontsize=5, legend_fontweight='normal', show_labels=True, recolor_labels_from_palette=True, text_stroke_width=1.0, replace_underscores=True, linebreak_from='_', linebreak_to='\\n', adjust_expand=(1.2, 1.5), arrowprops=None, min_arrow_len=0, match_arrow_color_to_text=True, arrow_alpha=0.8, ref_kwargs=None, test_kwargs=None, show=False, save=True, return_ax=False, output_dir='zmap_predict')

Plot a query dataset overlaid on the reference embedding, with on-data labels and an optional vertical time distribution strip.

Renders two layers: (1) the full reference embedding as a faint grey background for spatial context, and (2) the query cells colored by a predicted label column. Labels are drawn directly on the embedding using adjustText to minimize overlap. A vertical colorbar histogram of predicted developmental time (ZMAP_time_id) can optionally be added as a strip on the right side of the figure.

Parameters:
  • adata_ref (anndata.AnnData) – Reference dataset, used only for the background embedding.

  • adata_test (anndata.AnnData) – Query dataset with predicted labels to overlay.

  • color_key (str) – Column in adata_test.obs containing the categorical labels to color and annotate. Typically a _predicted column from predict_labels_kNN.

  • basis (str) – obsm key used for the 2D embedding coordinates in both datasets.

  • filter_na (bool) – Drop query cells with NaN in color_key before plotting.

  • palette (dict | None) – Explicit {label: color} mapping. When None, the palette is resolved via sync_zmap_colors.

  • palette_uns_key (str | None) – uns key to look up the palette in adata_test. Inferred from color_key when None.

  • show_time_strip (bool) – Draw a vertical colorbar histogram of adata_test.obs[time_key] on the right side of the figure.

  • time_key (str) – Column in adata_test.obs containing predicted developmental time values (hours post-fertilization) for the time strip.

  • time_strip_width_ratio (float) – Width of the time strip as a fraction of the total figure width.

  • time_strip_kwargs (dict | None) – Additional keyword arguments forwarded to plot_colorbar_histogram.

  • figsize (tuple[float, float]) – Figure size in inches (width, height).

  • dpi (int) – Figure resolution.

  • ref_size (float) – Scatter point size for reference background cells.

  • ref_alpha (float) – Opacity of reference background points. Lower values push the reference further into the background.

  • test_size (float) – Scatter point size for query (projected) cells.

  • test_alpha (float) – Opacity of query overlay points.

  • cmap (str) – Colormap used for the reference background scatter.

  • legend_loc (str) – Where to place the category legend. "on data" draws labels directly at centroid positions; other values follow matplotlib legend conventions. Ignored when show_labels=False (forced to "none").

  • legend_fontsize (float) – Font size for on-data legend labels.

  • legend_fontweight (str) – Font weight for on-data legend labels.

  • show_labels (bool) – If True, draw on-data text labels at category centroids with adjustText repositioning and optional arrow connectors. If False, suppress all text labels and arrows — only the colored scatter is shown, which is useful for clean figures or when the number of categories is too large for readable labels.

  • replace_underscores (bool) – Replace underscores in label strings with line breaks for cleaner on-data annotation.

  • adjust_expand (tuple[float, float]) – (x_expand, y_expand) passed to adjustText for label placement.

  • match_arrow_color_to_text (bool) – Color annotation arrows to match their corresponding text label.

  • ref_kwargs (dict | None) – Extra keyword arguments forwarded to the reference sc.pl.embedding call. Explicit ref_alpha takes priority over alpha in this dict.

  • test_kwargs (dict | None) – Extra keyword arguments forwarded to the query sc.pl.embedding call. Explicit test_alpha takes priority over alpha in this dict.

  • show (bool) – Call plt.show() after rendering.

  • save (bool) – Save the figure as PNG and PDF to output_dir.

  • return_ax (bool) – Return the figure and axes handles (see Returns) instead of None.

  • frameon (bool)

  • sort_order (bool)

  • recolor_labels_from_palette (bool)

  • text_stroke_width (float)

  • linebreak_from (str)

  • linebreak_to (str)

  • arrowprops (dict | None)

  • min_arrow_len (float)

  • arrow_alpha (float)

  • output_dir (str)

Returns:

(fig, ax_umap, ax_strip) when return_ax=True, otherwise None.

Return type:

tuple or None

zmap.predict.plot_colorbar_histogram(values, *, bins=100, hist_range=None, value_min=None, value_max=None, cmap='Greys', vmin=0.0, vmax=1.0, bar_height=1.0, y_min=0, y_max=120, fig_width=8, fig_height=0.6, xlabel='Predicted Time (hpf)', xlabel_size=15, tick_label_size=15, title=None, title_size=13, log=False, nan_policy='drop', box=True, box_lw=1.2, box_color='black', ax=None)

Plot a colorbar-styled horizontal histogram strip for a distribution of values.

Renders a single thin bar in which each bin is colored by bin density using a colormap, giving a compact “colorbar histogram” suitable for showing developmental time distributions alongside UMAP embeddings.

Used internally by plot_embedding_with_ondata_labels to draw the vertical time strip, but can also be called standalone.

Parameters:
  • values (array-like) – Numeric values to histogram (e.g. predicted time in hpf). Non-finite values are handled according to nan_policy.

  • bins (int or array-like, default 100) – Number of histogram bins, or explicit bin edges.

  • hist_range (tuple of float or None, default None) – (min, max) range for the histogram. Inferred from data when None.

  • value_min (float or None, default None) – Lower clipping bound. If provided, values are clipped to [value_min, value_max] before binning. Also sets hist_range when both bounds are given and hist_range is None.

  • value_max (float or None, default None) – Upper clipping bound. If provided, values are clipped to [value_min, value_max] before binning. Also sets hist_range when both bounds are given and hist_range is None.

  • cmap (str, default :py:class:``”Greys”:py:class:``) – Matplotlib colormap name used to color bins by density.

  • vmin (float, default 0.0) – Lower bound of the colormap normalization range (applied to normalized bin counts).

  • vmax (float, default 1.0) – Upper bound of the colormap normalization range (applied to normalized bin counts).

  • bar_height (float, default 1.0) – Height of the histogram bar in data units.

  • y_min (float, default 0) – Lower y-axis limit for the plot.

  • y_max (float, default 120) – Upper y-axis limit for the plot. Defaults to y_min + bar_height when set to None.

  • fig_width (float, default 8) – Figure width in inches. Only used when ax=None.

  • fig_height (float, default 0.6) – Figure height in inches. Only used when ax=None.

  • xlabel (str, default "Predicted Time (hpf)") – X-axis label.

  • xlabel_size (float, default 15) – Font size for the x-axis label.

  • tick_label_size (float, default 15) – Font size for the tick labels.

  • title (str or None, default None) – Optional title drawn above the strip.

  • title_size (float, default 13) – Font size for the title.

  • log (bool, default False) – If True, apply log1p to bin counts before coloring.

  • nan_policy (str, default "drop") – How to handle non-finite values. Currently only "drop" is supported.

  • box (bool, default True) – Draw a bounding box around the strip.

  • box_lw (float, default 1.2) – Line width of the bounding box.

  • box_color (str, default "black") – Color of the bounding box.

  • ax (matplotlib.axes.Axes or None, default None) – Axes to draw into. If None, a new figure and axes are created.

Returns:

The axes containing the colorbar histogram strip.

Return type:

matplotlib.axes.Axes
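
The density-colored strip described above can be approximated with NumPy and Matplotlib alone. The following is a minimal sketch, not the zmap implementation: the function name colorbar_histogram_sketch and its reduced argument list are hypothetical, and peak normalization of the bin counts is an assumption about what "normalized bin counts" means.

```python
import numpy as np
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def colorbar_histogram_sketch(values, bins=100, value_range=None,
                              cmap="Greys", log=False, ax=None):
    """Hypothetical stand-in for a density-colored histogram strip."""
    values = np.asarray(values, dtype=float)
    values = values[np.isfinite(values)]  # same idea as nan_policy="drop"

    counts, edges = np.histogram(values, bins=bins, range=value_range)
    counts = counts.astype(float)
    if log:
        counts = np.log1p(counts)

    # Assumption: scale by the peak bin so densities fall in [0, 1].
    peak = counts.max() if counts.max() > 0 else 1.0
    density = counts / peak

    if ax is None:
        _, ax = plt.subplots(figsize=(8, 0.6))

    # A single row of colored cells: x spans the bin edges, y is a unit-height bar.
    ax.pcolormesh(edges, [0.0, 1.0], density[np.newaxis, :],
                  cmap=cmap, vmin=0.0, vmax=1.0)
    ax.set_yticks([])
    ax.set_xlabel("Predicted Time (hpf)")
    return ax, density, edges


# Synthetic values centered near 24 hpf, plus two non-finite entries that get dropped.
rng = np.random.default_rng(0)
demo_values = np.concatenate([rng.normal(24, 3, 500), [np.nan, np.inf]])
demo_ax, demo_density, demo_edges = colorbar_histogram_sketch(
    demo_values, bins=20, value_range=(0, 120)
)
```

Returning the densities and edges alongside the axes is a convenience of this sketch only; the documented function returns just the axes.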

zmap.predict.sync_zmap_colors(adata, obs_key='ZMAP_CellType', *, ref_adata=None, ref_obs_key=None, unknown_color='#BDBDBD')[source]

Synchronize a categorical color palette between a query and reference AnnData.

Ensures that adata.uns[f"{obs_key}_colors"] is populated and aligned with the categories in adata.obs[obs_key]. The palette is sourced from adata.uns directly if already present, or copied from ref_adata if provided.

Called automatically by plot_embedding_with_ondata_labels. Call manually when you need consistent colors across multiple plots or custom figure code.

Parameters:
  • adata (anndata.AnnData) – Dataset whose color palette to set or update. Modified in-place.

  • obs_key (str, default "ZMAP_CellType") – Column in adata.obs whose categories need a synchronized palette.

  • ref_adata (anndata.AnnData or None, default None) – Reference dataset from which to copy the palette when adata does not already have one. Looks for ref_adata.uns[f"{ref_obs_key}_color_map"] or ref_adata.uns[f"{ref_obs_key}_colors"].

  • ref_obs_key (str or None, default None) – Column in ref_adata.obs to use as the color source. Defaults to obs_key when None.

  • unknown_color (str, default "#BDBDBD") – Hex color assigned to any category not found in the palette.

Returns:

Ordered list of hex color strings, one per category in adata.obs[obs_key].cat.categories.

Return type:

list of str

Raises:

KeyError – If no palette is found in adata.uns and ref_adata is either not provided or does not contain a matching palette.
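
The lookup order described above (query palette first, then reference palette, then unknown_color for unmatched categories) can be illustrated without AnnData. This is a rough sketch under stated assumptions: resolve_palette is a hypothetical helper, and the palettes are modeled as plain label-to-hex dictionaries rather than entries in adata.uns.

```python
def resolve_palette(categories, query_palette=None, ref_palette=None,
                    unknown_color="#BDBDBD"):
    """Hypothetical helper: return one hex color per category, in category order."""
    # Prefer the query's own palette; fall back to the reference palette.
    palette = query_palette if query_palette else ref_palette
    if not palette:
        raise KeyError("no palette found in the query or the reference")
    # Categories missing from the palette get the fallback color.
    return [palette.get(cat, unknown_color) for cat in categories]


categories = ["epidermis", "muscle", "neuron"]
ref_palette = {"muscle": "#D62728", "neuron": "#1F77B4"}

colors = resolve_palette(categories, ref_palette=ref_palette)
# colors == ["#BDBDBD", "#D62728", "#1F77B4"]
```

Keeping the result ordered by category is what lets the same list be reused across plots that share adata.obs[obs_key].cat.categories.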

zmap.predict.map_query_labels(adata_query, obs_A, obs_B, *, normalize='row', title=None, reorder_columns=True, reorder_rows=True, cmap=matplotlib.pyplot.cm.Blues, overlay_values=False, vmin=None, vmax=None, show_plot=True, return_df=False, figsize=8, save_plots=True, save_mapping=True, file_prefix=None, output_dir='zmap_predict')[source]

Compute and visualize the overlap between two label columns in a query AnnData.

Builds a contingency matrix comparing two categorical obs columns (e.g. ZMAP predicted labels vs. Leiden clusters), applies optional row- or column-wise normalization, and plots the result as a heatmap. Also computes a per-group best-match mapping table.

Parameters:
  • adata_query (anndata.AnnData) – Annotated query dataset containing both label columns.

  • obs_A (str) – Column in adata_query.obs used as the reference labeling (appears as columns in the overlap matrix).

  • obs_B (str) – Column in adata_query.obs used as the query labeling (appears as rows in the overlap matrix).

  • normalize (str or None, default "row") –

    Normalization applied to the raw overlap counts before plotting. One of:

    • "row" — each row sums to 1 (fraction of obs_B in each obs_A).

    • "column" — each column sums to 1 (fraction of obs_A in each obs_B).

    • None — plot raw cell counts.

    True is treated as "row" and False as None for backward compatibility.

  • title (str or None, default None) – Plot title. Auto-generated from obs_A and obs_B when None.

  • reorder_columns (bool, default True) – Sort columns by the position of their best-matching row.

  • reorder_rows (bool, default True) – Sort rows by the position of their best-matching column.

  • cmap (matplotlib colormap, default plt.cm.Blues) – Colormap for the heatmap.

  • overlay_values (bool, default False) – Overlay numeric values in each heatmap cell.

  • vmin (float or None, default None) – Lower colormap normalization limit.

  • vmax (float or None, default None) – Upper colormap normalization limit.

  • show_plot (bool, default True) – Display the plot immediately.

  • return_df (bool, default False) – Return the best-match mapping table as a pd.DataFrame.

  • figsize (float, default 8) – Figure size (passed as both width and height in inches).

  • save_plots (bool, default True) – Save PNG and PDF of the heatmap to output_dir.

  • save_mapping (bool, default True) – Save the best-match mapping table as a CSV to output_dir.

  • file_prefix (str | None)

  • output_dir (str)

Returns:

When return_df=True, a per-group best-match table mapping each obs_B label to its most-overlapping obs_A label. None otherwise.

Return type:

pd.DataFrame or None
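
The contingency matrix, row normalization, and best-match table can be reproduced in outline with pandas. The sketch below is illustrative only and does not call zmap: the toy obs table, the "leiden" column name, and the best_match / fraction column names are assumptions, and the heatmap and file output are omitted.

```python
import pandas as pd

# Toy stand-in for adata_query.obs with two label columns.
obs = pd.DataFrame({
    "leiden": ["0", "0", "0", "1", "1", "2", "2", "2"],
    "ZMAP_CellType": ["neuron", "neuron", "muscle", "muscle",
                      "muscle", "neuron", "epidermis", "epidermis"],
})

# Rows follow the obs_B role ("leiden"), columns the obs_A role ("ZMAP_CellType").
counts = pd.crosstab(obs["leiden"], obs["ZMAP_CellType"])

# normalize="row": divide each row by its total so every row sums to 1.
row_frac = counts.div(counts.sum(axis=1), axis=0)

# Per-group best match: the column with the largest fraction in each row.
best = pd.DataFrame({
    "best_match": row_frac.idxmax(axis=1),
    "fraction": row_frac.max(axis=1),
})
print(best)
```

Column normalization would divide by counts.sum(axis=0) along axis=1 instead, so that each column sums to 1.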