
[wip] kakaobrain unCLIP

williamberman opened this issue • 4 comments

Scoping
  • [x] Scope prior Transformer
  • [x] Scope decoder Unet
  • [x] Scope super resolution 64->256 Unet
  • [x] Scope super resolution 256->1024 Unet
  • [x] Scope scheduler
  • [x] Scope pipeline

Current output: our schedulers pick different timesteps than the original implementation. I separately hardcoded our scheduler to the same timesteps and confirmed equivalent outputs (a rough sketch of this check follows the images).

Diffusers port: out-0 [image]

Original: out_large_orig [image]
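
For reference, a minimal sketch of the kind of alignment check described above; this is not the PR's code, and `reference_timesteps` stands in for values dumped from the original karlo sampler:

```python
import torch

def align_scheduler_timesteps(scheduler, reference_timesteps):
    # Debugging hack only: overwrite the diffusers scheduler's timesteps with
    # the ones the original implementation picked, so per-step outputs can be
    # diffed. Assumes the sampling loop reads `scheduler.timesteps` directly.
    scheduler.timesteps = torch.tensor(reference_timesteps, dtype=torch.long)
    return scheduler
```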

TODO
  • [x] e2e verification - only the decoder is left; I believe I incorrectly added the mask in the attention block!
  • [x] add information about the masking added to the existing CrossAttention block
  • [x] if necessary, add docs on discrepancies between the sample coefficients used in the scheduler
  • [ ] docs
  • [ ] tests
  • [ ] add mask to alternative attention mechanisms in CrossAttention
scheduler/pipeline
  • Note that this model runs a separate diffusion process for each of the prior, the decoder, and the super resolution unet. The super resolution unet also uses a separate unet as the "last step unet". A rough usage sketch follows this list.
  • TODO fill in more info here :)
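
For orientation, a sketch of how the staged pipeline could be driven end to end; the class name, checkpoint id, and per-stage step counts below are assumptions based on this WIP port, not a finalized API:

```python
import torch
from diffusers import UnCLIPPipeline

pipe = UnCLIPPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha", torch_dtype=torch.float16
).to("cuda")

# three separate diffusion processes: prior -> decoder -> super resolution
image = pipe(
    "a photo of a red panda drinking a cup of coffee",
    prior_num_inference_steps=25,
    decoder_num_inference_steps=25,
    super_res_num_inference_steps=7,
).images[0]
image.save("unclip_sample.png")
```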
prior transformer
  • [x] New transformer class based on our existing 2D transformer. This transformer maps over CLIP embeddings and so won't have the 2D components. There are a few additional parameters around the textual embeddings and for mapping the output to the CLIP embedding dimension (a toy sketch follows this list).
  • [x] Write script porting weights
  • [x] Verify against original implementation
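
To make the bullet above concrete, here is a toy sketch (not the diffusers class, and not the real parameter names) of what the prior transformer needs: it runs over a 1D sequence of CLIP text-token embeddings plus a timestep token and the noisy image-embedding token, and projects its output back to the CLIP embedding dimension:

```python
import torch
import torch.nn as nn

class ToyPriorTransformer(nn.Module):
    def __init__(self, clip_dim=768, inner_dim=2048, num_layers=20, num_heads=32):
        super().__init__()
        self.proj_in = nn.Linear(clip_dim, inner_dim)
        layer = nn.TransformerEncoderLayer(d_model=inner_dim, nhead=num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=num_layers)
        # extra projection back down to the CLIP image-embedding dimension
        self.proj_out = nn.Linear(inner_dim, clip_dim)

    def forward(self, text_token_embeds, timestep_embed, noisy_image_embed):
        # sequence = [text tokens, timestep token, noisy image-embedding token];
        # no 2D/spatial components anywhere.
        tokens = torch.cat(
            [text_token_embeds, timestep_embed.unsqueeze(1), noisy_image_embed.unsqueeze(1)],
            dim=1,
        )
        hidden = self.blocks(self.proj_in(tokens))
        # the denoised CLIP image embedding is read off the last token
        return self.proj_out(hidden[:, -1])
```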
Decoder Unet
  • [x] Pass an additional flag to the down/up blocks indicating whether the down/up sample should be a resnet. Currently, the down/upsamples are Downsample2D/Upsample2D modules. We want to be able to use a resnet that wraps the sampling instead.
  • [x] Pass a flag to ResnetBlock2D to use the time embedding projection to scale and shift the normalized hidden states instead of just adding them together (see the sketch after this list). It looks like this flag already existed but wasn't implemented yet.
  • [x] UnCLIPEmbeddingUtils: unet conditioning + additional conditioning embeddings added to the time embeddings.
  • [x] Port the attention block and split the combined conv block weights. This is ported but gives small discrepancies (on the order of 1e-3 to 1e-4). These discrepancies propagate to larger discrepancies when the whole unet is run (see the weight-splitting sketch after this list).
  • [x] Write script porting weights
  • [x] Verify against original implementation
  • [x] Make new {Down,Mid,Up} block types. The new configuration ends up making existing blocks too hacky, so we'll add new block definitions instead.
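
A minimal sketch of the two time-embedding injection modes mentioned above (additive vs. scale/shift of the normalized hidden states); the class and flag names are illustrative, not the actual ResnetBlock2D code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeConditionedBlock(nn.Module):
    def __init__(self, channels, temb_dim, time_embedding_norm="default"):
        super().__init__()
        # channels assumed divisible by 32 for the group norm
        self.time_embedding_norm = time_embedding_norm
        self.norm = nn.GroupNorm(32, channels)
        out_dim = 2 * channels if time_embedding_norm == "scale_shift" else channels
        self.time_proj = nn.Linear(temb_dim, out_dim)

    def forward(self, hidden_states, temb):
        temb = self.time_proj(F.silu(temb))[:, :, None, None]
        if self.time_embedding_norm == "scale_shift":
            # unCLIP-style: scale and shift the norm'ed hidden states
            scale, shift = temb.chunk(2, dim=1)
            return self.norm(hidden_states) * (1 + scale) + shift
        # existing default behaviour: just add the projected time embedding
        return self.norm(hidden_states + temb)
```

And an equally rough sketch of splitting a fused qkv projection from the original checkpoint into the separate q/k/v weights diffusers' attention expects; shapes are illustrative and the real conversion script handles the exact layout:

```python
def split_fused_qkv(qkv_weight: torch.Tensor, qkv_bias: torch.Tensor):
    # qkv_weight: fused (3 * channels, channels, 1, 1) 1x1-conv weight
    q_w, k_w, v_w = qkv_weight.flatten(start_dim=1).chunk(3, dim=0)
    q_b, k_b, v_b = qkv_bias.chunk(3, dim=0)
    return (q_w, q_b), (k_w, k_b), (v_w, v_b)
```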
super resolution 64->256 Unet
  • Unconditional Unet.
  • Latents are upsampled (TODO: how) before being passed in.
  • The super resolution unet looks like it actually wraps two separate unets and has a modified sampling function - https://github.com/kakaobrain/karlo/blob/e105e7643c4e9f30b1b17c7e4354d8474455dcb3/karlo/modules/diffusion/gaussian_diffusion.py#L596 (see the model_aux argument). A rough sketch of this two-unet sampling follows this list.
  • Does not contain any attention mechanism (including self-attention).
  • [x] New block types for the modified resnet up/down sample, similar to the decoder unet.
  • [x] Modify the porting code from the decoder unet. The unet has basically the same structure as the decoder except there's no cross- or self-attention mechanism; will re-use methods from the decoder unet.
  • [x] Verify against original implementation
  • [x] Port and verify "last step unet"
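
A rough sketch of the two-unet sampling scheme referenced above (cf. the model_aux argument in the linked karlo code): the main super-resolution unet handles every step except the final one, which goes through the separate "last step unet". Names and signatures here are illustrative:

```python
import torch

@torch.no_grad()
def super_res_sample(scheduler, unet, last_step_unet, latents, upscaled_low_res):
    for i, t in enumerate(scheduler.timesteps):
        # swap in the "last step unet" only for the final denoising step
        model = last_step_unet if i == len(scheduler.timesteps) - 1 else unet
        # the upsampled low-res image is concatenated with the latents channel-wise
        model_input = torch.cat([latents, upscaled_low_res], dim=1)
        noise_pred = model(model_input, t).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```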
super resolution 256->1024 Unet

not released with this model!

williamberman, Nov 25 '22 21:11


Very nice progress!

This all looks good to me except for the unet blocks - we should add a new block here.

Note that the blocks are the part of the modeling code where it's 100% fine to just add a new type - we don't want to add parameters like:

        up_down_sample_type: str = "default",
        attention_block="Transformer2DModel",

to the UNet2DConditionModel config - for "type"-like parameters we really should use "down_block_types", "up_block_types" and "mid_block_type".
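
For illustration, this is the pattern being pointed to: block variants are selected by type strings in the existing config fields, so new unCLIP blocks would simply be registered as additional type names. The concrete argument values below are an example, not the unCLIP configuration:

```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel(
    block_out_channels=(320, 640),
    down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
    mid_block_type="UNetMidBlock2DCrossAttn",
    up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
    cross_attention_dim=768,
)
```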

Sounds good! Updated the TODO list in the PR description.

williamberman, Dec 05 '22 19:12

Let me know if you need help with anything :-)

patrickvonplaten, Dec 07 '22 14:12

Let me know if you need help with anything :-)

Some help with the prior transformer would be great! cc @patil-suraj I think you said you might be able to help there

williamberman, Dec 07 '22 17:12