CLIP-fine-tune
Request for Citation Format for Research Use
Hi @zer0int ,
Thank you for providing the open-source Long-CLIP model—we've found it very useful for our research.
We are currently using your model in our research work and would like to properly cite your contribution. Could you please provide your preferred citation format or BibTeX entry? Below is a template!
@misc{zer0int2024clipgmp,
  author       = {zer0int},
  title        = {CLIP-GmP-ViT-L-14: Fine-tuned CLIP Model},
  year         = {2024},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {https://github.com/zer0int/CLIP-fine-tune},
  note         = {Accessed: 2025-04-10}
}
Thanks again for your valuable contribution!
Hi @whats2000,
thank you for your interest - I'm happy to hear you find the code useful!
Your proposed citation looks good to me; feel free to use it as-is.
However, I just want to emphasize that the original Long-CLIP implementation (i.e. expanding the text encoder from 77 to 248 tokens) is by github.com/beichenzbc/Long-CLIP - my code just adds Geometric Parametrization, a custom loss function, etc. for fine-tuning, building on the original author's work.
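To give a rough idea of what Geometric Parametrization does: it re-expresses each weight row of a linear layer as a learned magnitude plus a learned direction (similar in spirit to weight normalization). Below is a minimal, hypothetical PyTorch sketch of that idea - the class name and all details are illustrative only, not the exact code from this repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Sketch of a GmP-style linear layer: each weight row w_i is stored as a
    magnitude r_i = ||w_i|| and a direction theta_i = w_i / ||w_i||, which are
    then trained as separate parameters. Illustrative only."""

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        w = torch.empty(out_features, in_features)
        nn.init.kaiming_uniform_(w, a=5 ** 0.5)
        norms = w.norm(dim=1, keepdim=True)
        self.r = nn.Parameter(norms)          # radial component (magnitudes)
        self.theta = nn.Parameter(w / norms)  # angular component (directions)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Re-normalize the direction so training only changes its angle,
        # then rescale by the learned magnitude.
        weight = self.r * F.normalize(self.theta, dim=1)
        return F.linear(x, weight, self.bias)
```

In a fine-tune, layers like this would be initialized from the pretrained CLIP weights (split into magnitudes and directions as above) and then trained with the custom loss.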
In any case, I wish you maximum success with your research - kind regards!
Thank you for your response! I’ll definitely make sure to cite the original LongCLIP paper.
By the way, we’ve come across several interesting results beyond the diffusion task—perhaps you might consider writing a paper to share those findings, if you have the time!
I'm curious, what are your results beyond diffusion tasks / using the Text Encoder? I'd be keen to hear more about that, if you're able to share!
I am aware of one research application in which my CLIP-GmP-ViT-L/14 model was used for visual feature extraction, as part of an AI system that automatically classifies GPS interference signals.
Our task involves zero-shot retrieval across multiple modalities, and we're currently working on a paper in this area.
Also, your https://huggingface.co/zer0int/LongCLIP-Registers-Gated_MLP-ViT-L-14 caught our eye!
We're considering fine-tuning the G-14 and B-32 variants of LongCLIP, and your code could be incredibly helpful.
Also, would you mind sharing your name? Most papers prefer citing real names rather than usernames.
In that case, you might be interested in also experimenting with my most recent CLIP model, which has MLP gates in the Vision Transformer. Its zero-shot (ImageNet/ObjectNet) accuracy is slightly lower than that of the model without gating (for the 77-token CLIP: OpenAI ~85% < my gated MLP ~88% < my 'normal' GmP ~91%). However, the modality gap is much lower with MLP gates, which might be beneficial for retrieval:
Modality gap (Euclidean):
Original Long-CLIP-L: 1.07
Long-CLIP-L-GmP (mine): ~0.8
LongCLIP-Registers-Gated (mine): 0.58
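(For reference, the Euclidean modality gap can be measured as the distance between the centroids of the L2-normalized image and text embeddings over a paired evaluation set. A minimal sketch, assuming you already have the embedding tensors:)

```python
import torch
import torch.nn.functional as F

def modality_gap(image_embs: torch.Tensor, text_embs: torch.Tensor) -> float:
    """Euclidean distance between the centroids of the L2-normalized image
    and text embeddings - one common way to quantify the CLIP modality gap.

    image_embs, text_embs: (N, D) tensors, e.g. collected via
    model.encode_image(...) / model.encode_text(...) over a paired eval set.
    """
    img_centroid = F.normalize(image_embs, dim=-1).mean(dim=0)
    txt_centroid = F.normalize(text_embs, dim=-1).mean(dim=0)
    return (img_centroid - txt_centroid).norm().item()
```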
Git Repo / Code -- HuggingFace model
It uses the same Geometric Parametrization approach etc.; it just adds one further significant modification to the model architecture (and roughly +20M parameters in ViT-L/14).
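Just to convey the general idea of gating an MLP branch, here is a highly simplified, hypothetical sketch - not the actual LongCLIP-Registers-Gated architecture, whose gating is more substantial (hence the ~20M extra parameters). One simple pattern is a learned gate on the MLP residual of each transformer block:

```python
import torch
import torch.nn as nn

class GatedMLPBlock(nn.Module):
    """Hypothetical transformer MLP branch with a learned per-channel gate.

    Sketch only - the real model's gating mechanism may be implemented
    differently and involves far more parameters."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )
        # Learnable per-channel gate; sigmoid keeps it in (0, 1).
        self.gate = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + torch.sigmoid(self.gate) * self.mlp(self.norm(x))
```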
As for sharing my real name, I'd rather not. AI models like CLIP are inherently 'dual use' - from beneficial research and cyber-defense to malicious uses such as non-consensual 'deepfakes'.
With my name public, even if it were only shared once via your research paper, I could easily be linked to the model - and I would then be at the receiving end of potential moral outrage if somebody were ever to use my model for nefarious purposes. I have no 'corporate backup'; I'm just an indie dev doing this in my free time, so I'd be on my own against an anonymous mob. That is a situation I'd rather avoid - thank you for your understanding!
Kind regards!
Edit: It seems you edited your response while I had this tab open, and you're already aware of my other model. Sorry about the redundancy! :)
OK, I understand! In that case, I will cite you by your username. Thanks for your work!