HiSultryMan issues

Results: 2 issues of HiSultryMan

In your demo code, the dim of q is 64 while the dim of RotaryEmbedding is 32. I checked the code, and q with position index larger than 32 will not be rotate...
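The behavior described in this issue looks like partial rotary embedding, where only the leading features of each vector are rotated and the rest pass through untouched. Below is a minimal pure-Python sketch of that idea, not the library's actual code. It reads the "position index" in the issue as a feature index, which is an assumption. The function name `partial_rotary`, the adjacent-pair convention, and the frequency base of 10000 are also assumptions made for illustration.

```python
import math


def partial_rotary(vec, position, rot_dim, base=10000.0):
    """Rotate the first `rot_dim` features of `vec`; leave the rest unchanged.

    Hypothetical helper: features are rotated in adjacent pairs (2i, 2i+1),
    each pair by the angle position * base ** (-2i / rot_dim).
    """
    assert rot_dim % 2 == 0 and rot_dim <= len(vec)
    out = list(vec)
    for i in range(rot_dim // 2):
        theta = position * base ** (-2.0 * i / rot_dim)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * math.cos(theta) - y * math.sin(theta)
        out[2 * i + 1] = x * math.sin(theta) + y * math.cos(theta)
    return out


# Same shapes as in the issue: q has 64 features, the rotary dim is 32.
q = [1.0] * 64
rotated = partial_rotary(q, position=5, rot_dim=32)

changed_head = rotated[:32] != q[:32]    # the first 32 features are rotated
unchanged_tail = rotated[32:] == q[32:]  # the last 32 features pass through
```

Under these assumptions, a rotary dim smaller than the dim of q is a design choice rather than a bug: the unrotated tail carries no positional signal. Whether the library in question intends this is for its maintainers to confirm.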

Can we just use text as input to enforce the joint learning of image appearance, spatial relationship, and geometry in a unified network?