Nice work! A few questions:
Can I use your method without providing a text prompt? I only need to segment a 3D object into its parts, without labeling each segment. Can I skip passing part names to the network entirely? Also, do you think this is achievable without any fine-tuning, given that the correspondence between part names and segments is not needed?
Yes, I think SAM's "everything" mode is a good fit for your requirement. You can replace the masks generated by GLIP+SAM with the output of SAM's everything mode. However, the result depends heavily on the granularity of SAM's masks, which I think is hard to control.
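For reference, here is a rough sketch of what that swap might look like for a single rendered view. It assumes the `segment_anything` package is installed and a SAM checkpoint has been downloaded locally. `generate_everything_masks` and `masks_to_label_map` are hypothetical helper names for illustration, not functions from this repo, and how the resulting masks plug into the rest of the pipeline is left out.

```python
import numpy as np


def generate_everything_masks(image, checkpoint_path, model_type="vit_h",
                              points_per_side=32):
    """Run SAM's automatic ("everything") mask generator on one view.

    image: HxWx3 uint8 RGB array (e.g. one rendered view of the object).
    checkpoint_path: path to a local SAM checkpoint (assumed to exist).
    """
    # Imported lazily so the helper below works without SAM installed.
    from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

    sam = sam_model_registry[model_type](checkpoint=checkpoint_path)
    generator = SamAutomaticMaskGenerator(sam, points_per_side=points_per_side)
    # Returns a list of dicts; each has a boolean "segmentation" array.
    return generator.generate(image)


def masks_to_label_map(masks, height, width):
    """Merge SAM mask dicts into one integer label map (0 = unassigned)."""
    labels = np.zeros((height, width), dtype=np.int32)
    # Paint large masks first so smaller (finer) masks overwrite them.
    ordered = sorted(
        masks,
        key=lambda m: int(np.asarray(m["segmentation"]).sum()),
        reverse=True,
    )
    for index, mask in enumerate(ordered, start=1):
        labels[np.asarray(mask["segmentation"], dtype=bool)] = index
    return labels
```

The label map has no part names attached; each integer is just an anonymous segment ID. Parameters such as `points_per_side` influence how many masks SAM proposes, so they give some indirect handle on granularity, but the outcome still varies by object and view.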