Quantized Family

Base tree

BaseVectorQuantizedAutoencoderConfig
├── BaseAutoencoderConfig
│   ├── latent_dim
│   └── reconstruction_loss
├── codebook_size
├── commitment_weight
├── codebook_weight
├── assignment_strategy
├── sinkhorn_epsilon
├── sinkhorn_iters
├── kmeans_init
├── kmeans_iters
├── use_ema_codebook
├── ema_decay
├── ema_epsilon
└── dead_code_reset
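
For orientation, the base tree can be read as a pair of nested config classes. The sketch below is only an illustration, not the library's actual definitions: it assumes the tree denotes inheritance, that model configs extend the base config, and every type annotation is a guess (only the field names come from the tree above). No default values are given because none are documented here.

```python
from dataclasses import dataclass, fields
from typing import List, Union


@dataclass
class BaseAutoencoderConfig:
    latent_dim: int            # type assumed
    reconstruction_loss: str   # type assumed (e.g. a loss name)


@dataclass
class BaseVectorQuantizedAutoencoderConfig(BaseAutoencoderConfig):
    codebook_size: int                            # type assumed
    commitment_weight: float                      # type assumed
    codebook_weight: float                        # type assumed
    assignment_strategy: str                      # e.g. "sinkhorn"
    sinkhorn_epsilon: Union[float, List[float]]   # single value or per-codebook list
    sinkhorn_iters: int                           # type assumed
    kmeans_init: bool
    kmeans_iters: int                             # type assumed
    use_ema_codebook: bool                        # type assumed
    ema_decay: float                              # type assumed
    ema_epsilon: float                            # type assumed
    dead_code_reset: bool                         # type assumed


@dataclass
class ResidualQuantizedAutoencoderConfig(BaseVectorQuantizedAutoencoderConfig):
    num_quantizers: int                           # the only model-specific field


# List every field a model config would expose under these assumptions.
field_names = [f.name for f in fields(ResidualQuantizedAutoencoderConfig)]
print(field_names)
```

Under this reading, a model listed with "no additional model fields" would expose exactly the base fields, and every other model adds its own fields after them.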

Models

VectorQuantizedAutoencoderConfig
└── no additional model fields

GumbelQuantizedAutoencoderConfig
├── temperature
├── min_temperature
└── anneal_rate

FiniteScalarQuantizedAutoencoderConfig
└── num_levels

ResidualFiniteScalarQuantizedAutoencoderConfig
├── num_levels
└── num_quantizers

ProductQuantizedAutoencoderConfig
└── num_codebooks

OptimizedProductQuantizedAutoencoderConfig
└── no additional model fields

ResidualQuantizedAutoencoderConfig
└── num_quantizers

HierarchicalVectorQuantizedAutoencoderConfig
└── top_latent_dim

Notes

  • Quantized models are sensitive to codebook initialization, codebook usage balance, and the weighting between reconstruction and quantization losses.
  • kmeans_init: true initializes learned codebooks from the first training batch of encoder latents instead of uniform random weights.
  • assignment_strategy: sinkhorn switches learned vector-codebook quantizers from nearest-neighbor assignment to balanced Sinkhorn assignment. sinkhorn_epsilon may be a single value or a per-codebook list. Any slot set to 0.0 falls back to nearest-neighbor assignment for that codebook, which matches the original RQ-VAE semantics.
  • OPQVAE applies a learned orthogonal rotation before product quantization and inverts that rotation after quantized reconstruction. It accepts the same num_codebooks, kmeans_init, EMA, and Sinkhorn options as PQVAE.
  • Sinkhorn assignment uses float32 on Apple mps devices because that backend does not support float64. CPU and CUDA paths still use float64 working precision for extra numerical headroom.
  • Hierarchical models such as VQVAE2 decode from a latent space that differs from the encoder output space, so they should be configured with explicit decoders.
  • FSQ and RFSQ use fixed scalar levels rather than learned vector codebooks, so their dead-code handling semantics differ from VQVAE, PQVAE, and RQVAE.
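
The difference between nearest-neighbor and balanced Sinkhorn assignment described in the notes above can be illustrated with a small pure-Python sketch. This is not the library's implementation: the function names, the toy distance matrix, and the choice of a uniform target load per code are assumptions made for illustration. It does follow the documented rule that an epsilon of 0.0 falls back to nearest-neighbor assignment.

```python
import math


def nearest_assign(dist):
    """Assign each point (row) to the code (column) with the smallest distance."""
    return [min(range(len(row)), key=row.__getitem__) for row in dist]


def sinkhorn_assign(dist, epsilon, iters):
    """Balanced assignment via Sinkhorn normalization (illustrative sketch).

    dist is an n x k list of point-to-code distances. Assumes a uniform
    target load of n / k points per code.
    """
    if epsilon == 0.0:
        # Documented fallback: epsilon 0.0 means plain nearest-neighbor.
        return nearest_assign(dist)

    n, k = len(dist), len(dist[0])
    plan = []
    for row in dist:
        lo = min(row)  # shift by the row minimum for numerical stability
        plan.append([math.exp(-(d - lo) / epsilon) for d in row])

    for _ in range(iters):
        # Column step: every code should receive total mass n / k.
        for j in range(k):
            col_sum = sum(plan[i][j] for i in range(n))
            scale = (n / k) / col_sum
            for i in range(n):
                plan[i][j] *= scale
        # Row step: every point carries total mass 1.
        for i in range(n):
            row_sum = sum(plan[i])
            plan[i] = [p / row_sum for p in plan[i]]

    # Harden the soft transport plan into one code index per point.
    return [max(range(k), key=row.__getitem__) for row in plan]


# Toy case: all four points are closer to code 0, by different margins.
toy_dist = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.3, 0.5],
    [0.4, 0.45],
]

nearest = nearest_assign(toy_dist)
balanced = sinkhorn_assign(toy_dist, epsilon=0.1, iters=50)
fallback = sinkhorn_assign(toy_dist, epsilon=0.0, iters=50)
print(nearest, balanced, fallback)
```

On this toy input, nearest-neighbor assignment sends all four points to code 0 and leaves code 1 unused, whereas the balanced variant moves the two points with the smallest margin over to code 1. With a per-codebook epsilon list, the same function would simply be called once per codebook with that codebook's value.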