Uniformity on the Sphere. Our Sphere Encoder maps the natural image distribution uniformly onto a global sphere. The decoder then generates an image by decoding a point on that sphere. Here, latents from training samples of three random CIFAR-10 classes are projected into 3D via a random Gaussian matrix and normalized to unit length. The distribution reveals highly uniform coverage of the sphere within each class, a trend consistent across datasets such as ImageNet, Animal-Faces, and Oxford-Flowers. This uniformity holds for both conditional and unconditional models.
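A minimal sketch of how such a visualization could be produced, assuming latents is an (N, d) NumPy array of encoder outputs for one class (the function name and seed are placeholders, not from the released code):

import numpy as np

def project_to_3d_sphere(latents, seed=0):
    # Random Gaussian projection matrix from the latent dimension down to 3D.
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((latents.shape[1], 3))
    pts = latents @ proj
    # Normalize each projected latent to unit length so it lies on the unit sphere.
    return pts / np.linalg.norm(pts, axis=1, keepdims=True)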
We introduce the Sphere Encoder, an efficient generative framework that produces images in a single forward pass and competes with many-step diffusion models using fewer than five steps. Our approach learns an encoder that maps natural images uniformly onto a spherical latent space, and a decoder that maps random latent vectors back to image space. Trained solely with image reconstruction losses, the model generates an image by simply decoding a random point on the sphere. The architecture naturally supports conditional generation, and looping the encoder/decoder a few times further enhances image quality. Across several datasets, the Sphere Encoder yields performance competitive with state-of-the-art diffusion models at a small fraction of the inference cost.
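As a rough sketch of this sampling procedure (decoder, latent_dim, and class_labels are placeholder names rather than identifiers from the released code), generation amounts to decoding a uniformly random point on the sphere:

import torch

@torch.no_grad()
def sample_images(decoder, latent_dim, num_samples, class_labels=None):
    # A uniform point on the unit sphere is obtained by normalizing a Gaussian sample.
    z = torch.randn(num_samples, latent_dim)
    v = z / z.norm(dim=1, keepdim=True)
    # A single decoder forward pass produces the images; class_labels enables conditional generation.
    return decoder(v, class_labels)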
ImageNet (256x256, 4-step generation)
Animal-Faces (256x256, 1-step generation)
Oxford-Flowers (256x256, 2-step generation)
CIFAR-10 (32x32, 1-step generation)
Sphere Encoder, trained entirely from scratch, can generate sharp and high-fidelity images within 4 steps.
Spherifying the latent space with noise. Encoder E maps image x to a latent, which f projects to a point v on the sphere S. During training, random Gaussian noise σ·e with jittered magnitude σ is added to v. Decoder D then reconstructs the image from the re-projected noisy latent f(v + σ·e).
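A minimal sketch of one training step under this description; the encoder and decoder are placeholders, and the uniform jitter on σ and the MSE reconstruction loss are assumptions for illustration:

import torch
import torch.nn.functional as F

def sphere_project(z):
    # f: project a latent onto the unit sphere S.
    return z / z.norm(dim=1, keepdim=True)

def training_step(encoder, decoder, x, sigma_max=0.5):
    v = sphere_project(encoder(x))                 # v = f(E(x)), a point on the sphere
    sigma = sigma_max * torch.rand(x.size(0), 1)   # jittered noise magnitude (assumed uniform)
    e = torch.randn_like(v)                        # random Gaussian noise
    v_noisy = sphere_project(v + sigma * e)        # re-project the noisy latent, f(v + σ·e)
    x_hat = decoder(v_noisy)                       # D reconstructs the image
    return F.mse_loss(x_hat, x)                    # reconstruction loss (placeholder choice)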
Comparison of different models on the "Posterior Hole" problem.
Columns show:
(1) input images;
(2) autoencoder reconstructions;
(3) samples from the standard Gaussian prior;
(4) samples from the Gaussian posterior estimated on the Animal-Faces training set.
Variational Autoencoders (VAEs) face a fundamental trade-off: the divergence loss (matching a Gaussian prior) and the reconstruction loss are often at odds. Minimizing one typically degrades the other, leading to "posterior holes"—regions in the latent space that do not map to valid images.
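For reference, the standard VAE objective makes this tension explicit; the formulation below is the textbook ELBO rather than notation from this work:

\mathcal{L}(\theta, \phi; x) = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right]}_{\text{reconstruction}} - \underbrace{D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, \mathcal{N}(0, I)\right)}_{\text{prior matching}}

Tightening the KL term toward the prior typically loosens the reconstruction term, and vice versa, which is exactly the trade-off described above.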
Attempts to force latents into a Gaussian distribution. This creates a conflict in which the learned posterior fails to match the prior, making direct sampling unreliable.
Forces latents onto a uniform spherical manifold. By spreading embeddings away from each other on a bounded sphere, we achieve uniformity without sacrificing reconstruction accuracy.
The Sphere Encoder enables versatile image editing across a range of scenarios, from out-of-distribution (OOD) transformations to composite harmonization. A key benefit of this approach is that the entire editing process is training-free, allowing high-quality manipulation without additional fine-tuning or task-specific optimization.
Given an image far outside the training distribution, we repeatedly encode and decode it, conditioning on different ImageNet classes.
We observe that a single step captures the primary object from the input while adapting its texture to match the target class. By increasing the iterations (e.g., 4-step generation), the model further refines the object's texture and key characteristics to align with the target class—all while maintaining the structural integrity of the original image.
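A rough sketch of this iterative, class-conditional editing loop, reusing the placeholder encoder, decoder, and sphere_project names from the training sketch above:

import torch

@torch.no_grad()
def edit_toward_class(encoder, decoder, x, target_class, num_steps=4):
    # Each iteration re-encodes the current image onto the sphere and decodes it
    # conditioned on the target ImageNet class, progressively refining texture.
    for _ in range(num_steps):
        v = sphere_project(encoder(x))
        x = decoder(v, target_class)
    return x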
We further demonstrate the model's editing capabilities by manually stitching together two distinct sources, Image A and Image B. By repeatedly encoding and decoding this stitched composite, the model naturally smooths the boundaries and harmonizes the content.
The process forces the manipulated image to converge to a valid point on the learned spherical manifold. Notably, unlike diffusion models (which require noise injection to hallucinate details), our encoder directly projects the stitched image into the latent space without adding noise, preserving the semantic integrity of both original images while creating a seamless transition.
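As an illustration of the harmonization loop (image_a and image_b are placeholder tensors of the same shape, and the left/right split is just one possible way to stitch), no noise is injected at inference time:

import torch

@torch.no_grad()
def harmonize_composite(encoder, decoder, image_a, image_b, num_steps=2):
    # Stitch the left half of image A to the right half of image B.
    w = image_a.shape[-1]
    x = torch.cat([image_a[..., : w // 2], image_b[..., w // 2:]], dim=-1)
    # Repeated encoding/decoding pulls the composite onto the learned spherical manifold.
    for _ in range(num_steps):
        x = decoder(sphere_project(encoder(x)))
    return x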
We would like to thank Tan Wang, Chenyang Zhang, Tian Xie, Wei Liu, Felix Juefei Xu, and Andrej Risteski for valuable discussion and feedback.
@article{kai2026sphere,
title = {Image Generation with a Sphere Encoder},
author = {Yue, Kaiyu and Jia, Menglin and Hou, Ji and Goldstein, Tom},
journal = {arXiv preprint arXiv:2602.15030},
year = {2026}
}