jLuger.de - Stable Diffusion on M1 Mac

First things first: if you just want an app that generates images from text and you have an Apple silicon M series device, you'd better check out Diffusers or DiffusionBee. This article is about using Stable Diffusion through Python code to get more control over the steps and the models used. To start you need to install some dependencies via pip:
pip install diffusers transformers torch

The following code is taken directly from https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion_2

from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
import torch

repo_id = "stabilityai/stable-diffusion-2-base"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")

The last line loads the model in half precision (float16), which at the time of writing only works on Nvidia GPUs. So you need to change it to
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float32)
The code goes on with
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
The last line requires an Nvidia or AMD GPU (with ROCm installed). You can omit it to use the CPU, but on Apple silicon devices you can change cuda to mps to use the GPU.
pipe = pipe.to("mps")
The example then finishes with
prompt = "High quality photo of an astronaut riding a horse in space"
image = pipe(prompt, guidance_scale=9, num_inference_steps=25).images[0]
image.save("astronaut.png")
For more parameters of the pipe call in the second-to-last line see https://huggingface.co/docs/diffusers/v0.15.0/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline.__call__
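For example, the call also accepts a negative prompt and a seeded generator for reproducible results (a minimal sketch based on the parameters documented there; the seed value is arbitrary):

generator = torch.Generator("cpu").manual_seed(42)  # fixed seed so the same image can be reproduced
image = pipe(prompt,
             negative_prompt="blurry, low quality",
             guidance_scale=9,
             num_inference_steps=25,
             generator=generator).images[0]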
The default output resolution, however, is determined by the model rather than by these parameters (there are height and width arguments, but a model gives the best results at its native training resolution). To generate 768x768 pixel images you have to switch the repo_id to stable-diffusion-2; the stable-diffusion-2-base model used above generates 512x512.
If you want to use other models from huggingface.co just change the repo_id.

Please note that models that generate larger images need much more time per image. Since you usually need several tries until you get the image you want, this extra time adds up. To get around this you can use upscalers. One is an x2 upscaler that can turn 512x512 into 1024x1024, and another is an x4 upscaler that can turn 512x512 into 2048x2048. Of course the x4 needs much more resources, and so I have only tested the x2.

Here is some sample code that generates an image and scales it up:
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler, StableDiffusionLatentUpscalePipeline
import torch

repo_id = "stabilityai/stable-diffusion-2-base"
# Load in full precision (no fp16 revision) and move the pipeline to the Apple GPU.
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float32)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("mps")

prompt = "High quality photo of an astronaut riding a horse in space"
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("astronaut.png")

# The x2 latent upscaler takes the prompt and the generated image and doubles the resolution.
upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained("stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float32)
upscaler = upscaler.to("mps")
upscaled_image = upscaler(prompt=prompt, image=image, num_inference_steps=20).images[0]
upscaled_image.save("astronaut_up.png")
I have also tried upscaling the upscaled_image. While the inference steps were quite fast, I got an out-of-memory error on any computer with less than 64GB of memory, and the post-processing (after the inference steps) took very long.
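If memory is the bottleneck, diffusers pipelines offer enable_attention_slicing(), which trades some speed for a smaller memory footprint. I can't say whether it is enough for a second upscaling pass, so treat this as a sketch to experiment with:

# Reduce peak memory usage during inference at the cost of some speed.
pipe.enable_attention_slicing()
# The same call may be worth trying on the upscaler pipeline as well.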

Speaking of performance: my MacBook M1 generates an image in about 40 seconds on the GPU versus about 2 minutes on the CPU. This is great, but 40 seconds can still feel very long. So when I learned that starting with macOS 13.1 you can convert the models to run on the Neural Engine, I wanted to give it a try.

I've cloned the code from https://github.com/apple/ml-stable-diffusion. The README states that you have to download the models. The good news is that if you have executed the previous Python code you already have them.
See ~/.cache/huggingface/hub. For the example above the model was in models--stabilityai--stable-diffusion-2-base. There is a snapshots directory containing directories named with 40 hexadecimal characters. If you have run the example only once there will be only one directory. In my case it was ~/.cache/huggingface/hub/models--stabilityai--stable-diffusion-2-base/snapshots/1cb61502fc8b634cdb04e7cd69e06051a728bedf.
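If you don't want to search the cache by hand, the snapshot path can also be resolved programmatically with the huggingface_hub library (a minimal sketch; snapshot_download reuses the cached files if they are already present):

from huggingface_hub import snapshot_download

# Resolves to ~/.cache/huggingface/hub/models--stabilityai--stable-diffusion-2-base/snapshots/<hash>
model_path = snapshot_download("stabilityai/stable-diffusion-2-base")
print(model_path)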

So my call to convert the models was
python -m python_coreml_stable_diffusion.torch2coreml --convert-unet \
    --convert-text-encoder --convert-vae-decoder --convert-safety-checker \
    -o ../conv_output --bundle-resources-for-swift-cli \
    --model-version ~/.cache/huggingface/hub/models--stabilityai--stable-diffusion-2-base/snapshots/1cb61502fc8b634cdb04e7cd69e06051a728bedf
Please note the parameter --bundle-resources-for-swift-cli. It is necessary to use the image generation with Swift. You want to use Swift because the Python code takes a very long time to load a model on every run, while the Swift tool only does this for the very first run. After that the model loads very fast and the image generation drops to about 25 seconds.
So using the Neural Engine gave me a good boost.

Here is a sample call of the Swift tool:
swift run StableDiffusionSample "High quality photo of an astronaut riding a horse in space" --resource-path ../conv_output/Resources/ \
    --step-count 35 --disable-safety --image-count 4 --output-path ../generated_images

A final tip: DiffusionPipeline.from_pretrained always connects to huggingface.co to check for updates. If you don't want this, you can pass the local path as the repo_id. See this example:
from os.path import expanduser
repo_id = expanduser("~")+"/.cache/huggingface/hub/models--stabilityai--stable-diffusion-2-base/snapshots/1cb61502fc8b634cdb04e7cd69e06051a728bedf"
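Alternatively, from_pretrained accepts a local_files_only flag, so you can keep the plain repo_id and still avoid the network check (a short sketch; the flag is documented for the Hugging Face loading functions):

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-base",
    torch_dtype=torch.float32,
    local_files_only=True)  # only use the local cache, never contact huggingface.co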