starting with diffuse noise and iterating until the image matches the text prompt, you are essentially running the neural net ...
While image diffusion models add continuous noise to pixel values, text diffusion models can't apply continuous noise to discrete tokens (chunks of text data). Instead, they replace tokens with ...