Stable Diffusion (SD) is a text-to-image latent diffusion model developed by the CompVis group together with Stability AI.
Stable Diffusion consists of three parts: a variational autoencoder (VAE), a U-Net, and an optional text encoder.
- User provides text description
- CLIP text encoder converts the text description into a meaningful vector representation
  - CLIP (Contrastive Language-Image Pre-training) is a model developed by OpenAI
  - Encoder converts text ⇒ vector representation (captures semantic meaning)
- Draw the rest of the owl
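
The text-encoding step in the list above can be sketched roughly as follows. This is a minimal illustration, not the actual Stable Diffusion source: it assumes the Hugging Face `transformers` library and PyTorch are installed, and it assumes the `openai/clip-vit-large-patch14` checkpoint (the CLIP variant used by SD v1.x; other SD versions use different text encoders). The function name `encode_prompt` is made up for this example.

```python
# Sketch of "text description => vector representation" using CLIP.
# Assumptions (not from the notes): Hugging Face `transformers` + PyTorch,
# and the checkpoint below, which matches SD v1.x only.
CLIP_MODEL_ID = "openai/clip-vit-large-patch14"


def encode_prompt(prompt, model_id=CLIP_MODEL_ID):
    """Return per-token CLIP text embeddings for `prompt`.

    For the default checkpoint the result is a tensor of shape
    (1, 77, 768): one batch entry, 77 token positions, 768 features.
    """
    # Imported lazily so that defining the function does not require
    # the heavy dependencies or a model download.
    import torch
    from transformers import CLIPTextModel, CLIPTokenizer

    tokenizer = CLIPTokenizer.from_pretrained(model_id)
    text_encoder = CLIPTextModel.from_pretrained(model_id)

    # Pad or truncate every prompt to the tokenizer's fixed context length
    # so the output has the same shape regardless of prompt length.
    tokens = tokenizer(
        prompt,
        padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    )

    # Inference only, so gradient tracking is switched off.
    with torch.no_grad():
        output = text_encoder(tokens.input_ids)

    # One embedding vector per token position; this sequence is what
    # would condition the later stages of the pipeline.
    return output.last_hidden_state
```

Calling `encode_prompt("a drawing of an owl")` downloads the checkpoint on first use and returns the embedding tensor. Returning the full per-token sequence rather than a single pooled vector is a choice made here because the rest of the pipeline conditions on each token position.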