Stable Diffusion (SD) is a text-to-image model developed by Stability AI and the CompVis group.

Stable Diffusion consists of three parts: a variational autoencoder (VAE), a U-Net, and an optional text encoder.

  • User provides text description
  • CLIP text encoder converts text description into meaningful vector representation
    • CLIP (Contrastive Language-Image Pre-training) is a model developed by OpenAI
    • The encoder converts the text into a vector representation that captures its semantic meaning
  • Draw the rest of the owl
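
The text-encoding step above can be sketched as a toy stand-in, to show the shape of the data flow only (prompt, then fixed-length token list, then one vector per token). This is not CLIP: the names `tokenize`, `embed_token`, and `encode_prompt`, the sizes `MAX_TOKENS` and `EMBED_DIM`, and the hash-based vectors are all assumptions made up for illustration. A real text encoder uses a learned tokenizer and learned weights, which is where the semantic meaning comes from; the hashed vectors here carry none.

```python
import hashlib
import math

# Toy sizes, chosen for readability (assumption: real encoders use a much
# longer fixed context and far wider embeddings).
MAX_TOKENS = 8
EMBED_DIM = 4


def tokenize(prompt: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    """Lowercase, split on whitespace, then truncate or pad to a fixed length."""
    tokens = prompt.lower().split()[:max_tokens]
    return tokens + ["<pad>"] * (max_tokens - len(tokens))


def embed_token(token: str, dim: int = EMBED_DIM) -> list[float]:
    """Map a token to a deterministic unit-length vector.

    Hypothetical stand-in for a learned embedding: the values come from a
    hash, so similar words do NOT get similar vectors.
    """
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    raw = [digest[i] / 255.0 - 0.5 for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in raw)) or 1.0
    return [x / norm for x in raw]


def encode_prompt(prompt: str) -> list[list[float]]:
    """Return one vector per token position: MAX_TOKENS rows of EMBED_DIM floats."""
    return [embed_token(token) for token in tokenize(prompt)]


if __name__ == "__main__":
    embedding = encode_prompt("a photo of an owl")
    print(len(embedding), "token positions x", len(embedding[0]), "dimensions")
```

The point to take away is that the prompt becomes a fixed-size sequence of vectors regardless of its length (short prompts are padded, long ones truncated), and that sequence is what the rest of the pipeline receives as its text conditioning.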