Datasets
For the meaning of each dataset config parameter, see the Configuration Reference.
Current Dataset Surface
The repository now ships with a dataset layer that mirrors the model/module architecture:
- autoencoders/data/base.py: base contracts, caching, deterministic splits, and DataSpec
- autoencoders/data/glove.py: downloadable GloVe embeddings
- autoencoders/data/fasttext.py: official fastText English vectors
- autoencoders/data/numberbatch.py: ConceptNet Numberbatch vectors
- autoencoders/data/text.py: shared infrastructure for encoder-backed text datasets
- autoencoders/data/snli.py: SNLI embeddings
- autoencoders/data/multinli.py: MultiNLI embeddings
- autoencoders/data/clip.py: shared CLIP-backed multimodal infrastructure
- autoencoders/data/flickr30k.py: Flickr30k CLIP embeddings
- autoencoders/data/cifar10.py: CIFAR-10 image tensors for CNN- and ViT-backed experiments
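The deterministic splits mentioned for base.py can be pictured with a small standalone sketch. It is not the repository's implementation: the helper name deterministic_split, the default fractions, and the seed are all assumptions made for illustration. It only shows the general idea that a seeded shuffle yields the same train/validation/test partition on every run.

```python
import random


def deterministic_split(num_items, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Hypothetical helper: partition item indices reproducibly."""
    indices = list(range(num_items))
    # A dedicated Random instance keeps the split independent of global RNG state.
    random.Random(seed).shuffle(indices)
    num_val = int(num_items * val_fraction)
    num_test = int(num_items * test_fraction)
    val = indices[:num_val]
    test = indices[num_val:num_val + num_test]
    train = indices[num_val + num_test:]
    return train, val, test


train, val, test = deterministic_split(1000)
print(len(train), len(val), len(test))
```

Calling the helper twice with the same seed returns identical index lists, which is the property that keeps cached artifacts and evaluation sets comparable across runs.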
Recommended Starting Points
- For embedding-first deterministic and variational experiments: glove.6B.50d
- For stronger word-level coverage: fasttext
- For semantically enriched embeddings: numberbatch
- For sentence-level latent experiments: snli or multinli
- For image-text representation experiments: flickr30k
- For image-backed quantized or transformer experiments: cifar10
Python Usage
Load a dataset directly:
from autoencoders.data import load_dataset
dataset = load_dataset("glove", dim=50, max_vectors=50000)
loaders = dataset.get_dataloaders(batch_size=256)
print(dataset.get_sample_spec()) # TensorSpec(shape=(50,))
Sentence datasets materialize embeddings through a configured encoder:
dataset = load_dataset(
    "snli",
    encoder_name="sentence-transformers/all-MiniLM-L6-v2",
    max_examples=50000,
)
print(dataset.get_sample_spec())
Image datasets expose H x W x C specs:
dataset = load_dataset("cifar10", max_examples=10000)
print(dataset.get_sample_spec()) # TensorSpec(shape=(32, 32, 3))
YAML Training Flow
Training now goes through one YAML-first entrypoint:
python examples/trainer.py --config examples/configs/glove/ae.yaml --epoch 5
python examples/trainer.py --config examples/configs/glove/vae.yaml --epoch 5
python examples/trainer.py --config examples/configs/glove/vqvae.yaml --epoch 5
python examples/trainer.py --config examples/configs/cifar10/vqvae.yaml --epoch 5
python examples/trainer.py --config examples/configs/cifar10/vqvae_vit.yaml --epoch 5
Each config is structured as:
- dataset.name
- dataset.config
- model.name
- model.config
- encoder.name
- encoder.config
- decoder.name
- decoder.config
- trainer
decoder may be null, but only when reversing the encoder yields a decoder whose runtime input spec matches the model's decoder input spec. Hierarchical and latent-shape-changing models should declare an explicit decoder.
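The config layout above can be checked with a few lines of Python once the YAML has been parsed into a plain dictionary. This is a minimal sketch under stated assumptions: check_config is a hypothetical helper rather than part of the repository, a null decoder is assumed to appear as None after parsing, and every value in the example dictionary other than the dataset name glove is a placeholder.

```python
def check_config(config):
    """Hypothetical check: return a list of structural problems in a parsed config."""
    problems = []
    for section in ("dataset", "model", "encoder", "decoder"):
        if section not in config:
            problems.append(f"missing section: {section}")
            continue
        block = config[section]
        if block is None:
            # Only the decoder may be null (assumed to parse to None).
            if section != "decoder":
                problems.append(f"{section} must not be null")
            continue
        for key in ("name", "config"):
            if key not in block:
                problems.append(f"missing key: {section}.{key}")
    if "trainer" not in config:
        problems.append("missing section: trainer")
    return problems


# Placeholder values for illustration only; real names come from the shipped configs.
config = {
    "dataset": {"name": "glove", "config": {}},
    "model": {"name": "<model-name>", "config": {}},
    "encoder": {"name": "<encoder-name>", "config": {}},
    "decoder": None,
    "trainer": {},
}
print(check_config(config))
```

The sketch covers structure only. Whether a null decoder is actually valid still depends on the spec-matching rule described above, which this check does not attempt to reproduce.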
Caching
Downloaded datasets are cached under:
~/.cache/autoencoders
Override globally with:
export AUTOENCODERS_CACHE=/your/cache/path
The first preparation step downloads raw assets, converts them into torch-friendly artifacts, and reuses those artifacts on later runs.
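The lookup order described above (environment override first, default location otherwise) can be sketched as follows. The function resolve_cache_dir is a hypothetical name used for illustration, and the library's real resolution logic may differ in detail; only the AUTOENCODERS_CACHE variable and the ~/.cache/autoencoders default come from this page.

```python
import os
from pathlib import Path


def resolve_cache_dir(environ=None):
    """Hypothetical helper: pick the cache directory as described in this section."""
    environ = os.environ if environ is None else environ
    override = environ.get("AUTOENCODERS_CACHE")
    if override:
        # An explicit override wins over the default location.
        return Path(override).expanduser()
    return Path.home() / ".cache" / "autoencoders"


print(resolve_cache_dir())
```

Passing a dictionary instead of reading os.environ directly keeps the helper easy to exercise in tests without touching the real environment.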
Notes
- Keep max_vectors or max_examples modest for quick smoke tests.
- Encoder-backed text datasets usually take longer on the first run because they materialize embeddings before training starts.
- cifar10 now retries mirrored downloads and validates cached archives before extraction, which makes interrupted downloads easier to recover from.
- Explicit image decoders should set transpose: true when they are intended to upsample back to image space.
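To illustrate what validating a cached archive before extraction can look like, here is a minimal sketch using only the standard library. It is an assumption about the general technique, not the cifar10 loader's actual code: archive_is_readable and the example file name are hypothetical, and the check only confirms that the archive can be opened and walked, which is lighter than a checksum comparison.

```python
import tarfile
import zlib


def archive_is_readable(path):
    """Hypothetical check: True if the tar archive opens and all members can be listed."""
    try:
        with tarfile.open(path) as archive:
            # Walking the members surfaces truncated or corrupted archives.
            for _ in archive:
                pass
    except (tarfile.TarError, EOFError, OSError, zlib.error):
        return False
    return True


# Hypothetical file name; a missing or damaged file yields False,
# which a caller could treat as a signal to download the archive again.
print(archive_is_readable("example-archive.tar.gz"))
```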