Indexers¶
autoindexers is a sibling package that ships inside the same distribution as autoencoders.
Use it when you already have a full embedding table and want a classical indexing or hashing backend without training another neural model.
What It Covers¶
Current built-in indexers are:
- lsh: random-hyperplane locality-sensitive hashing with multiple hash tables
- simhash: single-table random-hyperplane binary hashing
- pcahash: principal-component hashing with optional median thresholds
- itq: iterative quantization over PCA-projected embeddings
All current indexers:
- consume a full embedding table with shape (num_items, dim)
- require sample_spec=TensorSpec(shape=(dim,))
- support build(), query(), save_pretrained(), and from_pretrained()
- store their state on CPU for portable serialization
Relationship To autoencoders¶
The two packages intentionally solve different problems:
- autoencoders learns latent representations with trainable neural models
- autoindexers builds classical retrieval backends over a finished embedding table
Typical flows look like:
raw inputs -> autoencoders model -> learned embeddings/latents -> autoindexers backend
or:
existing embedding table -> autoindexers backend
Quick Start¶
import torch

from autoencoders.data.base import TensorSpec
from autoindexers import load_indexer

# A (num_items, dim) embedding table with one string ID per row.
embeddings = torch.randn(10000, 256)
item_ids = [f"item-{index}" for index in range(embeddings.shape[0])]

# Create the indexer by name, then build its state from the table.
indexer = load_indexer(
    "lsh",
    sample_spec=TensorSpec(shape=(256,)),
    num_bits=24,
    num_tables=6,
).build(embeddings, item_ids=item_ids)

# query() returns one result per query vector, so index 0 is the only entry here.
results = indexer.query(embeddings[0], top_k=5)
print(results[0].item_ids)
print(results[0].scores)
Shared API¶
load_indexer(name, sample_spec=..., **kwargs)¶
Create an indexer by namespace name.
Supported names currently match the package folders:
- lsh
- simhash
- pcahash
- itq
build(embeddings, item_ids=None)¶
Build index-specific state from an embedding table.
- embeddings must be a torch.Tensor with shape (num_items, dim)
- item_ids is optional; if omitted, row indices are converted to strings
If normalize_inputs=True, embeddings are L2-normalized before indexing.
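As a rough illustration of what L2 normalization does to an embedding table, here is a minimal sketch. It uses NumPy rather than torch so it runs without the package, and the helper name l2_normalize and the eps guard are assumptions made for this example, not part of the autoindexers API.

```python
import numpy as np


def l2_normalize(embeddings: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit L2 norm; eps avoids division by zero for all-zero rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, eps)


# Two rows with norms 5 and 2 become unit-length rows.
table = np.array([[3.0, 4.0], [0.0, 2.0]])
normalized = l2_normalize(table)
print(normalized)
```

After normalization, dot products between rows equal cosine similarities, which is the quantity that random-hyperplane hashing approximates.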
query(queries, top_k=10)¶
Query with one vector or a query matrix.
- accepted shapes: (dim,) or (num_queries, dim)
- returns a list of IndexerQueryResult
Each result contains:
- item_ids
- indices
- scores
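One plausible way to support both accepted query shapes is to treat a single vector as a batch of one, which would also explain why query() returns a list even for a single query. The sketch below shows that idea with NumPy; the helper as_query_matrix is hypothetical and is not claimed to be how the package handles shapes internally.

```python
import numpy as np


def as_query_matrix(queries, dim: int) -> np.ndarray:
    """Return queries as a (num_queries, dim) matrix, promoting a single (dim,) vector."""
    matrix = np.asarray(queries, dtype=float)
    if matrix.ndim == 1:
        matrix = matrix[None, :]  # (dim,) -> (1, dim)
    if matrix.ndim != 2 or matrix.shape[1] != dim:
        raise ValueError(f"expected shape ({dim},) or (num_queries, {dim}), got {matrix.shape}")
    return matrix


single = as_query_matrix(np.zeros(4), dim=4)
batch = as_query_matrix(np.zeros((3, 4)), dim=4)
print(single.shape, batch.shape)
```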
save_pretrained(path) and from_pretrained(path)¶
Every indexer can be serialized with the same checkpoint-style interface used by autoencoders.
Saved directories contain:
- config.json
- sample_spec.json
- index.pt
Algorithm Notes¶
LSH¶
LocalitySensitiveHashingIndexer implements multi-table random-hyperplane LSH.
Key config fields:
- num_bits
- num_tables
- projection_distribution: gaussian or rademacher
- seed
- normalize_inputs
Use lsh when you want classic bucket-based approximate retrieval with several independent tables.
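To make the multi-table idea concrete, here is a small NumPy sketch of random-hyperplane LSH with Gaussian projections. It is an illustration of the general technique under stated assumptions, not the code behind LocalitySensitiveHashingIndexer: the function names build_lsh_tables and lsh_candidates are invented for this example, and candidates are re-ranked by a plain dot product.

```python
from collections import defaultdict

import numpy as np


def build_lsh_tables(embeddings, num_bits=8, num_tables=4, seed=0):
    """Hash every row into one bucket per table using independent hyperplanes."""
    rng = np.random.default_rng(seed)
    dim = embeddings.shape[1]
    planes = rng.standard_normal((num_tables, num_bits, dim))
    tables = []
    for table_planes in planes:
        bits = (embeddings @ table_planes.T) > 0  # (num_items, num_bits) sign bits
        buckets = defaultdict(list)
        for row, code in enumerate(bits):
            buckets[code.tobytes()].append(row)
        tables.append(buckets)
    return planes, tables


def lsh_candidates(query, planes, tables):
    """Union of the buckets the query falls into across all tables."""
    candidates = set()
    for table_planes, buckets in zip(planes, tables):
        code = (table_planes @ query) > 0
        candidates.update(buckets.get(code.tobytes(), []))
    return candidates


rng = np.random.default_rng(1)
embeddings = rng.standard_normal((200, 16))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

planes, tables = build_lsh_tables(embeddings)
query = embeddings[0]
candidates = lsh_candidates(query, planes, tables)
# Re-rank the candidate rows by cosine similarity (rows are unit length).
ranked = sorted(candidates, key=lambda row: -float(embeddings[row] @ query))
print(ranked[:5])
```

More bits per table make buckets smaller and more precise, while more tables raise the chance that a true neighbor shares at least one bucket with the query.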
SimHash¶
SimHashIndexer uses one random-hyperplane binary code per item.
Key config fields:
- num_bits
- projection_distribution
- seed
- normalize_inputs
Use simhash for a simple binary hashing baseline over dense embeddings.
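The single-table variant can be sketched as one binary code per item plus a Hamming-distance scan. The example below is a NumPy illustration under assumptions: simhash_codes and hamming_top_k are hypothetical names, and ranking by Hamming distance is one common choice that the source does not confirm for SimHashIndexer.

```python
import numpy as np


def simhash_codes(embeddings, num_bits=16, seed=0):
    """Return one sign-bit code per row, plus the hyperplanes that produced it."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((num_bits, embeddings.shape[1]))
    return (embeddings @ planes.T) > 0, planes


def hamming_top_k(query_code, codes, top_k=5):
    """Rank items by the number of bits that differ from the query code."""
    distances = np.count_nonzero(codes != query_code, axis=1)
    order = np.argsort(distances, kind="stable")[:top_k]
    return order, distances[order]


rng = np.random.default_rng(2)
embeddings = rng.standard_normal((100, 32))
codes, planes = simhash_codes(embeddings)
indices, distances = hamming_top_k(codes[0], codes, top_k=3)
print(indices, distances)
```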
PCA Hashing¶
PrincipalComponentHashingIndexer projects embeddings onto principal components and thresholds each bit.
Key config fields:
- num_bits
- use_median_thresholds
- normalize_inputs
Use pcahash when you want a deterministic linear hashing baseline that adapts to the embedding covariance structure.
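The projection-and-threshold step can be sketched in a few lines of NumPy. This is an illustration of PCA hashing in general, assuming that a zero threshold on centered projections is used when use_median_thresholds is off; the helper names fit_pca_hash and pca_hash are hypothetical, and num_bits must not exceed min(num_items, dim).

```python
import numpy as np


def fit_pca_hash(embeddings, num_bits, use_median_thresholds=True):
    """Learn the top principal directions and one threshold per bit."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:num_bits]
    projected = centered @ components.T
    if use_median_thresholds:
        thresholds = np.median(projected, axis=0)  # balances each bit roughly 50/50
    else:
        thresholds = np.zeros(num_bits)
    return mean, components, thresholds


def pca_hash(vectors, mean, components, thresholds):
    """Project onto the learned components and threshold each coordinate."""
    return ((vectors - mean) @ components.T) > thresholds


rng = np.random.default_rng(3)
embeddings = rng.standard_normal((200, 16))
mean, components, thresholds = fit_pca_hash(embeddings, num_bits=8)
codes = pca_hash(embeddings, mean, components, thresholds)
print(codes.sum(axis=0))  # number of items with each bit set
```

Median thresholds split the items roughly in half on every bit, which tends to spread items more evenly across codes than a fixed zero threshold.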
ITQ¶
IterativeQuantizationIndexer performs PCA followed by orthogonal rotation refinement.
Key config fields:
- num_bits
- num_iterations
- normalize_inputs
Use itq when you want a stronger classical binary hashing baseline than raw random projection.
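The rotation refinement can be sketched as an alternating procedure: fix the rotation and take sign bits, then fix the bits and solve an orthogonal Procrustes problem for the rotation. The NumPy example below follows that standard ITQ recipe as an illustration; fit_itq and itq_hash are hypothetical names, and the random orthogonal initialization is an assumption rather than a confirmed detail of IterativeQuantizationIndexer.

```python
import numpy as np


def fit_itq(embeddings, num_bits, num_iterations=50, seed=0):
    """PCA projection followed by iterative refinement of an orthogonal rotation."""
    rng = np.random.default_rng(seed)
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:num_bits]
    projected = centered @ components.T  # (num_items, num_bits)

    # Start from a random orthogonal matrix obtained via QR decomposition.
    rotation, _ = np.linalg.qr(rng.standard_normal((num_bits, num_bits)))
    for _ in range(num_iterations):
        # Step 1: with the rotation fixed, the best codes are the signs.
        codes = np.where(projected @ rotation >= 0, 1.0, -1.0)
        # Step 2: with the codes fixed, the best rotation solves orthogonal Procrustes.
        u, _, wt = np.linalg.svd(projected.T @ codes)
        rotation = u @ wt
    return mean, components, rotation


def itq_hash(vectors, mean, components, rotation):
    """Center, project, rotate, and take the sign of each coordinate."""
    return ((vectors - mean) @ components.T @ rotation) >= 0


rng = np.random.default_rng(4)
embeddings = rng.standard_normal((200, 16))
mean, components, rotation = fit_itq(embeddings, num_bits=8, num_iterations=20)
codes = itq_hash(embeddings, mean, components, rotation)
print(codes.shape)
```

Each iteration can only lower or keep the quantization error between the rotated projections and their binary codes, which is why a modest number of iterations is usually enough.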
DataSpec Requirements¶
Current autoindexers implementations accept only one-dimensional tensor samples:
TensorSpec(shape=(dim,))
This keeps them aligned with full embedding-table workflows. Pairwise, multimodal, or structured-table indexers can be added later without changing the current simple contract.