Indexers¶

autoindexers is a sibling package that ships inside the same distribution as autoencoders.

Use it when you already have a full embedding table and want a classical indexing or hashing backend without training another neural model.

What It Covers¶

Current built-in indexers are:

lsh: random-hyperplane locality-sensitive hashing with multiple hash tables
simhash: single-table random-hyperplane binary hashing
pcahash: principal-component hashing with optional median thresholds
itq: iterative quantization over PCA-projected embeddings

All current indexers:

consume a full embedding table with shape (num_items, dim)
require sample_spec=TensorSpec(shape=(dim,))
can build(), query(), save_pretrained(), and from_pretrained()
store their state on CPU for portable serialization

Relationship To `autoencoders`¶

The two packages intentionally solve different problems:

autoencoders learns latent representations with trainable neural models
autoindexers builds classical retrieval backends over a finished embedding table

Typical flows look like:

raw inputs -> autoencoders model -> learned embeddings/latents -> autoindexers backend

or:

existing embedding table -> autoindexers backend

Quick Start¶

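```python
import torch

from autoencoders.data.base import TensorSpec
from autoindexers import load_indexer

# A full embedding table: 10,000 items, 256 dimensions each.
embeddings = torch.randn(10000, 256)
item_ids = [f"item-{index}" for index in range(embeddings.shape[0])]

# Create a multi-table LSH indexer and build it over the table.
indexer = load_indexer(
    "lsh",
    sample_spec=TensorSpec(shape=(256,)),
    num_bits=24,
    num_tables=6,
).build(embeddings, item_ids=item_ids)

# query() returns a list of results, so a single query vector is read at index 0.
results = indexer.query(embeddings[0], top_k=5)
print(results[0].item_ids)
print(results[0].scores)
```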

Shared API¶

`load_indexer(name, sample_spec=..., **kwargs)`¶

Create an indexer by namespace name.

Supported names currently match the package folders:

lsh
simhash
pcahash
itq

`build(embeddings, item_ids=None)`¶

Build index-specific state from an embedding table.

embeddings must be a torch.Tensor with shape (num_items, dim)
item_ids is optional; if omitted, row indices are converted to strings

If normalize_inputs=True, embeddings are L2-normalized before indexing.
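
As a rough illustration of the two defaults described above, the sketch below shows what string row ids and L2 normalization amount to on plain Python lists. The helper name `prepare_build_inputs` is hypothetical and not part of autoindexers, and the handling of all-zero rows is an assumption; the library's internal behavior may differ.

```python
import math


def prepare_build_inputs(rows, item_ids=None, normalize_inputs=True):
    """Hypothetical helper mirroring the documented defaults of build()."""
    if item_ids is None:
        # Row indices converted to strings: "0", "1", ...
        item_ids = [str(index) for index in range(len(rows))]
    if normalize_inputs:
        normalized = []
        for row in rows:
            norm = math.sqrt(sum(value * value for value in row))
            # Assumption: all-zero rows are left untouched to avoid dividing by zero.
            normalized.append([value / norm for value in row] if norm > 0 else list(row))
        rows = normalized
    return rows, item_ids


rows, ids = prepare_build_inputs([[3.0, 4.0], [0.0, 2.0]])
```

After this call every non-zero row has unit length, so dot products between rows equal cosine similarities.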

`query(queries, top_k=10)`¶

Query with one vector or a query matrix.

accepted shapes: (dim,) or (num_queries, dim)
returns a list of IndexerQueryResult

Each result contains:

item_ids
indices
scores
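
To make the shape rule concrete, here is a minimal sketch of promoting a single `(dim,)` vector to a batch of one, so that the return value is always a list of results. `QueryResultSketch`, `as_query_batch`, and the brute-force dot-product scoring are stand-ins invented for illustration; they are not the library's `IndexerQueryResult` or its actual scoring.

```python
from dataclasses import dataclass


@dataclass
class QueryResultSketch:
    """Stand-in carrying the three documented fields of a query result."""
    item_ids: list
    indices: list
    scores: list


def as_query_batch(queries):
    """Promote a single (dim,) vector to a (1, dim) batch; keep batches as-is."""
    if queries and not isinstance(queries[0], (list, tuple)):
        return [list(queries)]
    return [list(query) for query in queries]


def brute_force_query(queries, rows, item_ids, top_k=10):
    """Assumed scoring: plain dot product, highest first (illustration only)."""
    results = []
    for query in as_query_batch(queries):
        scored = [
            (sum(q * x for q, x in zip(query, row)), index)
            for index, row in enumerate(rows)
        ]
        # Sort by descending score, breaking ties by row index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        top = scored[:top_k]
        results.append(
            QueryResultSketch(
                item_ids=[item_ids[index] for _, index in top],
                indices=[index for _, index in top],
                scores=[score for score, _ in top],
            )
        )
    return results


rows = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
ids = ["a", "b", "c"]
single = brute_force_query([1.0, 0.0], rows, ids, top_k=2)
batch = brute_force_query([[1.0, 0.0], [0.0, 1.0]], rows, ids, top_k=2)
```

In this sketch `indices` are row positions in the embedding table and `item_ids` are the ids stored for those rows, which is how the two fields relate in the Quick Start output.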

`save_pretrained(path)` and `from_pretrained(path)`¶

Every indexer can be serialized with the same checkpoint-style interface used by autoencoders.

Saved directories contain:

config.json
sample_spec.json
index.pt
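
A small sanity check on a saved directory can be sketched with the standard library alone. The helper `missing_checkpoint_files` is hypothetical and not part of autoindexers; it relies only on the three file names listed above.

```python
from pathlib import Path

# File names taken from the list of saved directory contents.
EXPECTED_FILES = ("config.json", "sample_spec.json", "index.pt")


def missing_checkpoint_files(path):
    """Return the expected checkpoint files that are absent from `path`."""
    directory = Path(path)
    return [name for name in EXPECTED_FILES if not (directory / name).is_file()]
```

Running such a check before `from_pretrained(path)` can turn an incomplete copy into an early, readable error.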

Algorithm Notes¶

LSH¶

LocalitySensitiveHashingIndexer implements multi-table random-hyperplane LSH.

Key config fields:

num_bits
num_tables
projection_distribution: gaussian or rademacher
seed
normalize_inputs

Use lsh when you want classic bucket-based approximate retrieval with several independent tables.
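
The idea behind multi-table random-hyperplane LSH can be sketched in pure Python. This is a toy illustration under stated assumptions, not the code of LocalitySensitiveHashingIndexer: the function names are invented here, and only the parameter names `num_bits`, `num_tables`, and `seed` plus the two projection distributions come from the config fields above.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, rng, distribution="gaussian"):
    """Draw num_bits random hyperplane normals of length dim."""
    if distribution == "gaussian":
        return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]
    if distribution == "rademacher":
        return [[rng.choice((-1.0, 1.0)) for _ in range(dim)] for _ in range(num_bits)]
    raise ValueError(f"unknown distribution: {distribution}")


def hash_code(vector, hyperplanes):
    """One bit per hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        int(sum(w * x for w, x in zip(plane, vector)) >= 0.0) for plane in hyperplanes
    )


def build_tables(vectors, num_bits, num_tables, seed=0, distribution="gaussian"):
    """Build num_tables independent bucket maps from code to row indices."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    tables = []
    for _ in range(num_tables):
        planes = make_hyperplanes(num_bits, dim, rng, distribution)
        buckets = defaultdict(list)
        for row, vector in enumerate(vectors):
            buckets[hash_code(vector, planes)].append(row)
        tables.append((planes, buckets))
    return tables


def candidates(query, tables):
    """Union of the rows sharing the query's bucket in at least one table."""
    found = set()
    for planes, buckets in tables:
        found.update(buckets.get(hash_code(query, planes), ()))
    return found


data_rng = random.Random(1)
vectors = [[data_rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]
tables = build_tables(vectors, num_bits=8, num_tables=4, seed=0)
hits = candidates(vectors[0], tables)
```

More bits per table make buckets smaller and more selective, while more tables raise the chance that a true neighbor shares a bucket with the query in at least one of them.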

SimHash¶

SimHashIndexer uses one random-hyperplane binary code per item.

Key config fields:

num_bits
projection_distribution
seed
normalize_inputs

Use simhash for a simple binary hashing baseline over dense embeddings.
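
A single-table binary code and Hamming-distance ranking can be sketched as follows. The function names are hypothetical, and ranking by Hamming distance is an assumption about how such codes are typically compared, not a statement about how SimHashIndexer scores results.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits Gaussian hyperplane normals of length dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def simhash_code(vector, hyperplanes):
    """Pack one sign bit per hyperplane into a single integer code."""
    code = 0
    for plane in hyperplanes:
        bit = sum(w * x for w, x in zip(plane, vector)) >= 0.0
        code = (code << 1) | int(bit)
    return code


def hamming_distance(code_a, code_b):
    """Number of bit positions in which the two codes differ."""
    return bin(code_a ^ code_b).count("1")


def rank_by_hamming(query, codes, hyperplanes, top_k=5):
    """Return the top_k (distance, row) pairs with the smallest Hamming distance."""
    query_code = simhash_code(query, hyperplanes)
    distances = [(hamming_distance(query_code, code), row) for row, code in enumerate(codes)]
    return sorted(distances)[:top_k]


data_rng = random.Random(1)
vectors = [[data_rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(100)]
hyperplanes = make_hyperplanes(num_bits=32, dim=16, seed=0)
codes = [simhash_code(vector, hyperplanes) for vector in vectors]
nearest = rank_by_hamming(vectors[0], codes, hyperplanes, top_k=5)
```

Because each bit depends only on the sign of a projection, the code does not change when a vector is scaled by a positive constant.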

PCA Hashing¶

PrincipalComponentHashingIndexer projects embeddings onto principal components and thresholds each projection to obtain one bit per component.

Key config fields:

num_bits
use_median_thresholds
normalize_inputs

Use pcahash when you want a deterministic linear hashing baseline that adapts to the embedding covariance structure.
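
The projection-and-threshold step can be sketched with NumPy. This is an illustration under assumptions, not the code of PrincipalComponentHashingIndexer: the function names are invented, PCA is computed with an SVD of the centered table, and a zero threshold is assumed when `use_median_thresholds` is off.

```python
import numpy as np


def fit_pca_hash(embeddings, num_bits, use_median_thresholds=True):
    """Fit the mean, the top num_bits principal directions, and per-bit thresholds."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are principal directions sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:num_bits].T  # shape (dim, num_bits)
    projected = centered @ components
    if use_median_thresholds:
        # Median thresholds split the items roughly in half on every bit.
        thresholds = np.median(projected, axis=0)
    else:
        # Assumption: threshold at zero on the mean-centered projections.
        thresholds = np.zeros(num_bits)
    return mean, components, thresholds


def encode(embeddings, mean, components, thresholds):
    """Binary codes of shape (num_items, num_bits) with values 0 or 1."""
    return ((embeddings - mean) @ components > thresholds).astype(np.uint8)


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 32))
mean, components, thresholds = fit_pca_hash(embeddings, num_bits=8)
codes = encode(embeddings, mean, components, thresholds)
```

Unlike the random-hyperplane methods, nothing here depends on a seed: the same embedding table always yields the same directions up to sign.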

ITQ¶

IterativeQuantizationIndexer performs PCA, then iteratively refines an orthogonal rotation of the projected embeddings to reduce the error introduced by binarization.

Key config fields:

num_bits
num_iterations
normalize_inputs

Use itq when you want a stronger classical binary hashing baseline than raw random projection.
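
The alternating refinement used by iterative quantization can be sketched with NumPy. This is a minimal sketch, not the code of IterativeQuantizationIndexer: the function names, the random orthogonal initialization, and the `seed` argument are assumptions made here, while `num_bits` and `num_iterations` follow the config fields above.

```python
import numpy as np


def quantization_loss(projected, rotation):
    """Squared distance between rotated projections and their +/-1 codes."""
    rotated = projected @ rotation
    codes = np.where(rotated >= 0, 1.0, -1.0)
    return float(np.sum((codes - rotated) ** 2))


def fit_itq(embeddings, num_bits, num_iterations=50, seed=0):
    """PCA projection followed by alternating code and rotation updates."""
    rng = np.random.default_rng(seed)
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:num_bits].T  # shape (dim, num_bits)
    projected = centered @ components

    # Assumption: start from a random orthogonal matrix obtained via QR.
    rotation, _ = np.linalg.qr(rng.normal(size=(num_bits, num_bits)))
    for _ in range(num_iterations):
        # Step 1: with the rotation fixed, the best codes are the signs.
        codes = np.where(projected @ rotation >= 0, 1.0, -1.0)
        # Step 2: with the codes fixed, solve the orthogonal Procrustes problem.
        u, _, wt = np.linalg.svd(codes.T @ projected)
        rotation = wt.T @ u.T
    return mean, components, rotation


def encode(embeddings, mean, components, rotation):
    """Binary codes of shape (num_items, num_bits) with values 0 or 1."""
    return ((embeddings - mean) @ components @ rotation >= 0).astype(np.uint8)


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 32))
mean, components, rotation = fit_itq(embeddings, num_bits=8, num_iterations=50)
codes = encode(embeddings, mean, components, rotation)
```

Each of the two steps cannot increase the quantization loss, which is why a handful of iterations is usually enough for the rotation to settle.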

DataSpec Requirements¶

Current autoindexers implementations accept only one-dimensional tensor samples:

TensorSpec(shape=(dim,))

This keeps them aligned with full embedding-table workflows. Pairwise, multimodal, or structured-table indexers can be added later without changing the current simple contract.
