Image to image - Support for other sizes/dimensions than 512 x 512
From what I understand, we can currently only generate 512 x 512 images?
With a model that was trained for 704x960, are we still constrained to 512 x 512, or can we generate 704x960?
Thanks!
The only option right now is to hardcode the resolution during conversion, and it only works with the ORIGINAL attention implementation. To do so, you can use the --latent-w <size> and --latent-h <size> flags. For example:
python -m python_coreml_stable_diffusion.torch2coreml --latent-w 64 --latent-h 96 --compute-unit CPU_AND_GPU --convert-vae-decoder --convert-vae-encoder --convert-unet --convert-text-encoder --model-version <model-name>_diffusers --bundle-resources-for-swift-cli --attention-implementation ORIGINAL -o <model-name>_original_512x768
You have to choose a resolution where each side is divisible by 64, and you have to specify it divided by 8 (e.g. 768 / 8 = 96).
In the example above, the model will always output images at a resolution of 512x768.
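To make the arithmetic concrete, here is a minimal sketch (not part of the conversion script) that turns a target output resolution into the values to pass to --latent-w and --latent-h:

def latent_flags(width, height):
    # Each side of the output resolution must be divisible by 64 (per the rule above),
    # and the latent size is the pixel size divided by 8.
    for side in (width, height):
        if side % 64 != 0:
            raise ValueError(f"{side} is not divisible by 64")
    return width // 8, height // 8

# Example: a 512x768 (portrait) model -> --latent-w 64 --latent-h 96
latent_w, latent_h = latent_flags(512, 768)
print(f"--latent-w {latent_w} --latent-h {latent_h}")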
Thanks @Zabriskije! Would this also work for image to image?
I don't have direct experience with that, but from what I know, it's not supported yet.
Thanks again @Zabriskije, I will try to convert a custom-size model and run image-to-image inference to test it. Will post results here if I can make it work.
@Zabriskije I tried image to image using this model: https://huggingface.co/coreml/coreml-Grapefruit/blob/main/original/512x768/grapefruit41_original_512x768.zip
but when I pass a 512x768 image as the starting input, I get an error from this line:
https://github.com/apple/ml-stable-diffusion/blob/2c4e9de73c9e723de264356f9563706ea9104212/swift/StableDiffusion/pipeline/Encoder.swift#L89
It turns out that the shape the Encoder accepts is [1, 3, 768, 512], but the shape of my image is [1, 3, 512, 768].
And here is the model description from VAEEncoder.mlmodelc:
[
{
"shortDescription" : "Stable Diffusion generates images conditioned on text and\/or other images as input through the diffusion process. Please refer to https:\/\/arxiv.org\/abs\/2112.10752 for details.",
"metadataOutputVersion" : "3.0",
"outputSchema" : [
{
"hasShapeFlexibility" : "0",
"isOptional" : "0",
"dataType" : "Float32",
"formattedType" : "MultiArray (Float32)",
"shortDescription" : "The latent embeddings from the unet model from the input image.",
"shape" : "[]",
"name" : "latent_dist",
"type" : "MultiArray"
}
],
"version" : ".\/diffusers",
"modelParameters" : [
],
"author" : "Please refer to the Model Card available at huggingface.co\/.\/diffusers",
"specificationVersion" : 7,
"storagePrecision" : "Float16",
"license" : "OpenRAIL (https:\/\/huggingface.co\/spaces\/CompVis\/stable-diffusion-license)",
"mlProgramOperationTypeHistogram" : {
"Transpose" : 7,
"Ios16.exp" : 1,
"Ios16.reduceMean" : 44,
"Ios16.softmax" : 1,
"Split" : 1,
"Ios16.linear" : 4,
"Ios16.add" : 35,
"Ios16.realDiv" : 22,
"Ios16.square" : 22,
"Pad" : 3,
"Ios16.sub" : 22,
"Ios16.cast" : 1,
"Ios16.clip" : 1,
"Ios16.conv" : 28,
"Ios16.matmul" : 2,
"Ios16.reshape" : 54,
"Ios16.batchNorm" : 22,
"Ios16.silu" : 21,
"Ios16.sqrt" : 22,
"Ios16.mul" : 6
},
"computePrecision" : "Mixed (Float32, Float16, Int32)",
"isUpdatable" : "0",
"availability" : {
"macOS" : "13.0",
"tvOS" : "16.0",
"watchOS" : "9.0",
"iOS" : "16.0",
"macCatalyst" : "16.0"
},
"modelType" : {
"name" : "MLModelType_mlProgram"
},
"inputSchema" : [
{
"hasShapeFlexibility" : "0",
"isOptional" : "0",
"dataType" : "Float16",
"formattedType" : "MultiArray (Float16 1 × 3 × 768 × 512)",
"shortDescription" : "An image of the correct size to create the latent space with, image2image and in-painting.",
"shape" : "[1, 3, 768, 512]",
"name" : "sample",
"type" : "MultiArray"
},
{
"hasShapeFlexibility" : "0",
"isOptional" : "0",
"dataType" : "Float16",
"formattedType" : "MultiArray (Float16 1 × 4 × 96 × 64)",
"shortDescription" : "Latent noise for `DiagonalGaussianDistribution` operation.",
"shape" : "[1, 4, 96, 64]",
"name" : "diagonal_noise",
"type" : "MultiArray"
},
{
"hasShapeFlexibility" : "0",
"isOptional" : "0",
"dataType" : "Float16",
"formattedType" : "MultiArray (Float16 1 × 4 × 96 × 64)",
"shortDescription" : "Latent noise for use with strength parameter of image2image",
"shape" : "[1, 4, 96, 64]",
"name" : "noise",
"type" : "MultiArray"
},
{
"hasShapeFlexibility" : "0",
"isOptional" : "0",
"dataType" : "Float16",
"formattedType" : "MultiArray (Float16 1 × 1)",
"shortDescription" : "Precalculated `sqrt_alphas_cumprod` value based on strength and the current schedular's alphasCumprod values",
"shape" : "[1, 1]",
"name" : "sqrt_alphas_cumprod",
"type" : "MultiArray"
},
{
"hasShapeFlexibility" : "0",
"isOptional" : "0",
"dataType" : "Float16",
"formattedType" : "MultiArray (Float16 1 × 1)",
"shortDescription" : "Precalculated `sqrt_one_minus_alphas_cumprod` value based on strength and the current schedular's alphasCumprod values",
"shape" : "[1, 1]",
"name" : "sqrt_one_minus_alphas_cumprod",
"type" : "MultiArray"
}
],
"userDefinedMetadata" : {
"com.github.apple.coremltools.version" : "6.2",
"com.github.apple.coremltools.source" : "torch==1.13.1"
},
"generatedClassName" : "Stable_Diffusion_version___diffusers_vae_encoder",
"method" : "predict"
}
]
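For reference, a minimal sketch of how to read that description programmatically and check the expected "sample" shape (the exact metadata.json path inside the .mlmodelc bundle is an assumption; adjust it to your own setup):

import json

# Sketch: print the image2image input shape the compiled VAE encoder expects.
# The path below is an assumption; point it at your own .mlmodelc bundle.
path = "<model-name>_original_512x768/Resources/VAEEncoder.mlmodelc/metadata.json"
with open(path) as f:
    metadata = json.load(f)[0]

sample = next(i for i in metadata["inputSchema"] if i["name"] == "sample")
print(sample["shape"])  # prints "[1, 3, 768, 512]", i.e. (batch, channels, height, width)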
"formattedType" : "MultiArray (Float16 1 × 3 × 768 × 512)",
@jo32 Looks like width and height are switched In the json files
@Zabriskije I am not sure whether it is a problem with the converted model or with the code in the current ml-stable-diffusion repo, because the model works fine in text-to-image mode.
@jo32 I don't think it's actually a problem; they just put height first. It reads a bit strangely when you talk about resolution, since width normally comes first (e.g. 1920x1080), but it may make sense from a code point of view. Since text-to-image works fine, the resolution is read correctly from the JSON file. The only problem, in this case, would be if the resolution gets read in the inverted order while using image-to-image.
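A quick cross-check from the same dump supports this: the latent inputs are [1, 4, 96, 64], and 96 = 768 / 8 while 64 = 512 / 8, so both arrays are ordered (batch, channels, height, width). A trivial sketch of that check:

# Sketch: verify that the shapes in the metadata dump are NCHW (height before width).
sample_shape = (1, 3, 768, 512)   # "sample" input from the dump above
latent_shape = (1, 4, 96, 64)     # "noise" / "diagonal_noise" inputs
assert sample_shape[2] // 8 == latent_shape[2]   # 768 / 8 == 96 -> height
assert sample_shape[3] // 8 == latent_shape[3]   # 512 / 8 == 64 -> width
print("NCHW confirmed: the model is 512 wide x 768 tall")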
@Zabriskije Agreed.
I have tested several models. For example, in txt2img mode the output size is 512x768; however, in img2img mode, you need to submit a start image with a size of 768x512. This is likely a bug.
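Until the ordering issue is resolved, one stopgap for testing is to pre-size the start image to whatever the encoder currently accepts (768 wide x 512 tall here). A minimal sketch with PIL; note that resizing this way distorts the picture, so it only serves to confirm the shape mismatch:

from PIL import Image

# Sketch: force the start image to the size the encoder currently accepts.
# This distorts the image and is only meant to confirm the shape mismatch.
img = Image.open("start_512x768.png")  # hypothetical input file
img = img.resize((768, 512))           # PIL takes (width, height)
img.save("start_768x512.png")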
Thanks @Zabriskije