Skip to content

Datasets

For dataset config parameter meanings, see Configuration Reference.

Current Dataset Surface

The repository now ships with a dataset layer that mirrors the model/module architecture:

  • autoencoders/data/base.py: base contracts, caching, deterministic splits, and DataSpec
  • autoencoders/data/glove.py: downloadable GloVe embeddings
  • autoencoders/data/fasttext.py: official fastText English vectors
  • autoencoders/data/numberbatch.py: ConceptNet Numberbatch vectors
  • autoencoders/data/text.py: shared infrastructure for encoder-backed text datasets
  • autoencoders/data/snli.py: SNLI embeddings
  • autoencoders/data/multinli.py: MultiNLI embeddings
  • autoencoders/data/clip.py: shared CLIP-backed multimodal infrastructure
  • autoencoders/data/flickr30k.py: Flickr30k CLIP embeddings
  • autoencoders/data/cifar10.py: CIFAR-10 image tensors for CNN- and ViT-backed experiments
  • For embedding-first deterministic and variational experiments: glove.6B.50d
  • For stronger word-level coverage: fasttext
  • For semantically enriched embeddings: numberbatch
  • For sentence-level latent experiments: snli or multinli
  • For image-text representation experiments: flickr30k
  • For image-backed quantized or transformer experiments: cifar10

Python Usage

Load a dataset directly:

from autoencoders.data import load_dataset

dataset = load_dataset("glove", dim=50, max_vectors=50000)
loaders = dataset.get_dataloaders(batch_size=256)
print(dataset.get_sample_spec())  # TensorSpec(shape=(50,))

Sentence datasets materialize embeddings through a configured encoder:

dataset = load_dataset(
    "snli",
    encoder_name="sentence-transformers/all-MiniLM-L6-v2",
    max_examples=50000,
)
print(dataset.get_sample_spec())

Image datasets expose H x W x C specs:

dataset = load_dataset("cifar10", max_examples=10000)
print(dataset.get_sample_spec())  # TensorSpec(shape=(32, 32, 3))

YAML Training Flow

Training now goes through one YAML-first entrypoint:

python examples/trainer.py --config examples/configs/glove/ae.yaml --epoch 5
python examples/trainer.py --config examples/configs/glove/vae.yaml --epoch 5
python examples/trainer.py --config examples/configs/glove/vqvae.yaml --epoch 5
python examples/trainer.py --config examples/configs/cifar10/vqvae.yaml --epoch 5
python examples/trainer.py --config examples/configs/cifar10/vqvae_vit.yaml --epoch 5

Each config is structured as:

  • dataset.name
  • dataset.config
  • model.name
  • model.config
  • encoder.name
  • encoder.config
  • decoder.name
  • decoder.config
  • trainer

decoder may be null, but only when reversing the encoder yields a decoder whose runtime input spec matches the model's decoder input spec. Hierarchical and latent-shape-changing models should declare an explicit decoder.

Caching

Downloaded datasets are cached under:

~/.cache/autoencoders

Override globally with:

export AUTOENCODERS_CACHE=/your/cache/path

The first preparation step downloads raw assets, converts them into torch-friendly artifacts, and reuses those artifacts on later runs.

Notes

  • Keep max_vectors or max_examples modest for quick smoke tests.
  • Encoder-backed text datasets usually take longer on the first run because they materialize embeddings before training starts.
  • cifar10 now retries mirrored downloads and validates cached archives before extraction, which makes interrupted downloads easier to recover from.
  • Explicit image decoders should set transpose: true when they are intended to upsample back to image space.