Indexers

autoindexers is a sibling package that ships inside the same distribution as autoencoders.

Use it when you already have a full embedding table and want a classical indexing or hashing backend without training another neural model.

What It Covers

Current built-in indexers are:

  • lsh: random-hyperplane locality-sensitive hashing with multiple hash tables
  • simhash: single-table random-hyperplane binary hashing
  • pcahash: principal-component hashing with optional median thresholds
  • itq: iterative quantization over PCA-projected embeddings

All current indexers:

  • consume a full embedding table with shape (num_items, dim)
  • require sample_spec=TensorSpec(shape=(dim,))
  • expose build(), query(), save_pretrained(), and from_pretrained()
  • store their state on CPU for portable serialization

Relationship To autoencoders

The two packages intentionally solve different problems:

  • autoencoders learns latent representations with trainable neural models
  • autoindexers builds classical retrieval backends over a finished embedding table

Typical flows look like:

raw inputs -> autoencoders model -> learned embeddings/latents -> autoindexers backend

or:

existing embedding table -> autoindexers backend

Quick Start

import torch

from autoencoders.data.base import TensorSpec
from autoindexers import load_indexer

embeddings = torch.randn(10000, 256)
item_ids = [f"item-{index}" for index in range(embeddings.shape[0])]

indexer = load_indexer(
    "lsh",
    sample_spec=TensorSpec(shape=(256,)),
    num_bits=24,
    num_tables=6,
).build(embeddings, item_ids=item_ids)

results = indexer.query(embeddings[0], top_k=5)
print(results[0].item_ids)
print(results[0].scores)

Shared API

load_indexer(name, sample_spec=..., **kwargs)

Create an indexer by name.

Supported names currently match the package folders:

  • lsh
  • simhash
  • pcahash
  • itq

build(embeddings, item_ids=None)

Build index-specific state from an embedding table.

  • embeddings must be a torch.Tensor with shape (num_items, dim)
  • item_ids is optional; if omitted, each row index is converted to a string and used as that row's ID

If normalize_inputs=True, embeddings are L2-normalized before indexing.
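
To make the normalization step concrete, here is a minimal pure-Python sketch of row-wise L2 normalization. The helper name l2_normalize_rows and the eps guard are assumptions made for illustration; the package's own implementation may differ.

```python
import math


def l2_normalize_rows(rows, eps=1e-12):
    """Scale every row to unit Euclidean length (hypothetical helper)."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(value * value for value in row))
        scale = 1.0 / max(norm, eps)  # eps avoids division by zero for all-zero rows
        normalized.append([value * scale for value in row])
    return normalized


rows = [[3.0, 4.0], [0.0, 2.0]]
unit_rows = l2_normalize_rows(rows)
```

Once rows have unit length, the dot product between two rows equals their cosine similarity.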

query(queries, top_k=10)

Query with one vector or a query matrix.

  • accepted shapes: (dim,) for a single query, or (num_queries, dim) for a batch
  • returns a list of IndexerQueryResult, one per query

Each result contains:

  • item_ids
  • indices
  • scores

save_pretrained(path) and from_pretrained(path)

Every indexer can be serialized with the same checkpoint-style interface used by autoencoders.

Saved directories contain:

  • config.json
  • sample_spec.json
  • index.pt

Algorithm Notes

LSH

LocalitySensitiveHashingIndexer implements multi-table random-hyperplane LSH.

Key config fields:

  • num_bits
  • num_tables
  • projection_distribution: gaussian or rademacher
  • seed
  • normalize_inputs

Use lsh when you want classic bucket-based approximate retrieval with several independent tables.
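
The idea behind multi-table random-hyperplane LSH can be sketched in a few lines of pure Python. This is an illustration of the general technique, not the code of LocalitySensitiveHashingIndexer; all function names below are hypothetical, and only Gaussian projections are shown.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, rng):
    """Draw num_bits Gaussian hyperplane normals, each of length dim."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_vector(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0.0 else 0
        for plane in hyperplanes
    )


def build_tables(vectors, num_bits, num_tables, seed=0):
    """Build num_tables independent hash tables mapping bit codes to row indices."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    tables = []
    for _ in range(num_tables):
        planes = make_hyperplanes(num_bits, dim, rng)
        buckets = defaultdict(list)
        for index, vector in enumerate(vectors):
            buckets[hash_vector(vector, planes)].append(index)
        tables.append((planes, buckets))
    return tables


def candidates(query, tables):
    """Union of the buckets that the query falls into across all tables."""
    found = set()
    for planes, buckets in tables:
        found.update(buckets.get(hash_vector(query, planes), []))
    return found


data_rng = random.Random(1)
vectors = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(50)]
tables = build_tables(vectors, num_bits=6, num_tables=4)
hits = candidates(vectors[0], tables)
```

More bits per table make buckets smaller and more selective, while more tables raise the chance that a true neighbor shares at least one bucket with the query. A full indexer would typically re-rank the candidate set by an exact similarity score.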

SimHash

SimHashIndexer uses one random-hyperplane binary code per item.

Key config fields:

  • num_bits
  • projection_distribution
  • seed
  • normalize_inputs

Use simhash for a simple binary hashing baseline over dense embeddings.
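
As a rough illustration of single-table sign hashing, the sketch below encodes vectors as bit lists and compares them by Hamming distance. The names simhash_code and hamming are hypothetical and are not part of the autoindexers API.

```python
import random


def simhash_code(vector, hyperplanes):
    """Binary code: one sign bit per random hyperplane."""
    return [
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0.0 else 0
        for plane in hyperplanes
    ]


def hamming(code_a, code_b):
    """Number of positions where two codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


rng = random.Random(0)
dim, num_bits = 4, 16
hyperplanes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

base = [1.0, 0.5, -0.25, 2.0]
scaled = [2.0 * value for value in base]   # same direction, different length
opposite = [-value for value in base]      # opposite direction

code_base = simhash_code(base, hyperplanes)
code_scaled = simhash_code(scaled, hyperplanes)
code_opposite = simhash_code(opposite, hyperplanes)
```

Because each bit depends only on the sign of a projection, the code reflects a vector's direction and ignores its length: base and scaled receive identical codes, while opposite differs from base in every bit.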

PCA Hashing

PrincipalComponentHashingIndexer projects embeddings onto their top principal components and thresholds each projection to produce one bit per component.

Key config fields:

  • num_bits
  • use_median_thresholds
  • normalize_inputs

Use pcahash when you want a deterministic linear hashing baseline that adapts to the embedding covariance structure.
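
A minimal NumPy sketch of PCA hashing follows. It assumes that use_median_thresholds switches the per-bit threshold from zero to the median of each projected coordinate; that reading, and the helper names fit_pca_hash and encode, are assumptions rather than documented behavior.

```python
import numpy as np


def fit_pca_hash(embeddings, num_bits, use_median_thresholds=False):
    """Fit the mean, top principal directions, and per-bit thresholds."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:num_bits].T  # shape (dim, num_bits)
    projected = centered @ components
    if use_median_thresholds:
        thresholds = np.median(projected, axis=0)  # balances each bit
    else:
        thresholds = np.zeros(num_bits)
    return mean, components, thresholds


def encode(embeddings, mean, components, thresholds):
    """Project onto the fitted components and threshold to bits."""
    return ((embeddings - mean) @ components > thresholds).astype(np.uint8)


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 16))
mean, components, thresholds = fit_pca_hash(
    embeddings, num_bits=8, use_median_thresholds=True
)
codes = encode(embeddings, mean, components, thresholds)
```

With median thresholds, each bit splits the indexed items roughly in half, which tends to spread items more evenly across codes than a fixed zero threshold when the projected data is skewed.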

ITQ

IterativeQuantizationIndexer projects embeddings with PCA and then iteratively refines an orthogonal rotation of the projected data to reduce the error introduced by binarization.

Key config fields:

  • num_bits
  • num_iterations
  • normalize_inputs

Use itq when you want a stronger classical binary hashing baseline than raw random projection.
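
The rotation refinement can be sketched with NumPy as an alternating procedure: fix the rotation and take signs, then fix the signs and solve an orthogonal Procrustes problem for the rotation. This follows the standard published ITQ formulation and is only an assumption about what IterativeQuantizationIndexer does internally; fit_itq and quantization_loss are hypothetical names.

```python
import numpy as np


def quantization_loss(projected, rotation):
    """Squared distance between rotated data and its sign codes in {-1, +1}."""
    rotated = projected @ rotation
    codes = np.where(rotated >= 0, 1.0, -1.0)
    return float(np.sum((codes - rotated) ** 2))


def fit_itq(embeddings, num_bits, num_iterations=20, seed=0):
    rng = np.random.default_rng(seed)
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:num_bits].T  # shape (num_items, num_bits)
    # Start from a random orthogonal matrix (QR of a Gaussian matrix).
    rotation, _ = np.linalg.qr(rng.normal(size=(num_bits, num_bits)))
    initial_loss = quantization_loss(projected, rotation)
    for _ in range(num_iterations):
        # Step 1: with the rotation fixed, the best codes are the signs.
        codes = np.where(projected @ rotation >= 0, 1.0, -1.0)
        # Step 2: with the codes fixed, the best rotation solves an
        # orthogonal Procrustes problem via an SVD.
        u, _, wt = np.linalg.svd(projected.T @ codes)
        rotation = u @ wt
    return projected, rotation, initial_loss


rng = np.random.default_rng(1)
embeddings = rng.normal(size=(300, 32))
projected, rotation, initial_loss = fit_itq(embeddings, num_bits=8)
final_loss = quantization_loss(projected, rotation)
bits = (projected @ rotation >= 0).astype(np.uint8)
```

Neither step can increase the quantization loss, so the loss after refinement is no larger than the loss under the initial random rotation.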

DataSpec Requirements

Current autoindexers implementations accept only one-dimensional tensor samples:

TensorSpec(shape=(dim,))

This keeps them aligned with full embedding-table workflows. Pairwise, multimodal, or structured-table indexers can be added later without changing the current simple contract.