Add a linear layer to squeeze all patch embeddings into a single image embedding?
The current bottleneck of the UNet architecture is the patch embeddings, one for each section of the image. When we create the embedding of the image, we use an average of all patch embeddings.
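For concreteness, a minimal sketch of that pooling step, assuming a PyTorch implementation; the shapes B, N, D below are placeholders rather than the model's actual dimensions:

```python
import torch

# Placeholder shapes: a batch of B images, each split into N patches,
# with D-dimensional patch embeddings coming out of the encoder.
B, N, D = 8, 64, 256
patch_embeddings = torch.randn(B, N, D)

# Current approach: the image embedding is the mean over all patch embeddings.
image_embedding = patch_embeddings.mean(dim=1)  # (B, D)
```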
However, this approach is very lossy and yields only a smoothed version of the image semantics, especially when trying to reconstruct an image from its embedding (plus location and time). Moreover, each patch embedding does not capture the semantics of its own patch; rather, it is trained to capture the self-attention-weighted semantics of all other available patches within the image. This also makes the collection of patch embeddings highly redundant.
Can we introduce one more feedforward layer to aggregate all patch embeddings, location, and time data into a single, image-wide embedding? (One in the encoder to go down from patch semantics to image semantics, and one in the decoder to expand from image semantics back to patch semantics.)
This embedding would encapsulate the entire semantic context of the image at a specific location and time, and it would allow us to reconstruct the image from the image embedding alone. I suspect it would also need to convey where in the image the semantics are, so the decoder can place those differences within the image correctly. This is also highly desirable for downstream tasks that need to locate inter-image semantics.
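A rough sketch of what those two extra heads could look like, again assuming PyTorch; the class names, dimensions, the choice of flattening the patches and concatenating a location/time vector, and the learned per-patch queries on the decoder side are all illustrative assumptions, not the model's actual code. The per-patch queries are one way to give the expansion step the positional information discussed above.

```python
import torch
import torch.nn as nn

class PatchToImage(nn.Module):
    """Encoder-side head: aggregate all patch embeddings plus a
    location/time conditioning vector into one image-wide embedding."""
    def __init__(self, num_patches: int, patch_dim: int, cond_dim: int, image_dim: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(num_patches * patch_dim + cond_dim, image_dim),
            nn.GELU(),
            nn.Linear(image_dim, image_dim),
        )

    def forward(self, patches: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, D); cond: (B, cond_dim) with location/time features
        x = torch.cat([patches.flatten(1), cond], dim=-1)
        return self.ff(x)  # (B, image_dim)


class ImageToPatches(nn.Module):
    """Decoder-side head: expand the image embedding back into per-patch
    embeddings. Learned per-patch queries tell the decoder where each
    reconstructed patch sits within the image."""
    def __init__(self, num_patches: int, patch_dim: int, image_dim: int):
        super().__init__()
        self.patch_pos = nn.Parameter(torch.randn(num_patches, image_dim) * 0.02)
        self.ff = nn.Sequential(
            nn.Linear(2 * image_dim, patch_dim),
            nn.GELU(),
            nn.Linear(patch_dim, patch_dim),
        )

    def forward(self, image_emb: torch.Tensor) -> torch.Tensor:
        B, N = image_emb.shape[0], self.patch_pos.shape[0]
        pos = self.patch_pos.unsqueeze(0).expand(B, N, -1)      # (B, N, image_dim)
        img = image_emb.unsqueeze(1).expand(B, N, -1)           # (B, N, image_dim)
        return self.ff(torch.cat([img, pos], dim=-1))           # (B, N, patch_dim)
```

Concatenating a learned positional query with the broadcast image embedding is only one option; a small cross-attention block, or a sinusoidal encoding of the patch index, would serve the same purpose of letting the decoder place semantics at the right location.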