
text feature visualization

Open betterze opened this issue 3 years ago • 4 comments

Dear OpenAI team,

Thank you for sharing this great implementation, I really like it.

In the section 'Understanding Language', you mention:

If we fix a neuron on the vision side, we can search for the text that maximizes the logit. We do this with a hill climbing algorithm to find what amounts to the text maximally corresponding to that neuron.

Would you mind specifying how you do this? Do you use a dataset of sentences, or optimize the sentences with a pretrained language model? The vision model only connects to the language model at the output, so how do you get the gradients needed for optimization? How do you do text feature visualization?

Thank you for your help.

Best Wishes,

Alex

betterze avatar Mar 06 '21 11:03 betterze

Not sure if this is the same question as Alex's or slightly different, but as I read and reread "multimodal": did the text that maximizes the logit come from validation or training data the network had already seen, or did you take the sentence fragments from a completely different dataset, both to control for overfitting and to have a dataset reserved specifically for the stochastic ascent process of finding the neurons/channels/facets that most strongly correspond to different text fragments? (Or did you train some GPT-3/RL model to be a neuron maximizer and synthetically pump out sentence strings?)

I wonder if Alex is maybe trying to understand whether the general, quasi-comprehensible "word salad gibberish" flavor of the maximizers is an artifact of the data collection process, i.e. of sifting through data that was never explicitly designed to become a de facto compressed visio-linguistic database, or in fact an artifact of an artificial optimization process that reliably finds strong maximizer strings by "evolving" them, akin to some AutoML, until they elicit a strong reaction from a particular neuron. Or both.

Apologies in advance if this is already answered; I'm doing a deeper dive on this paper than perhaps any I can remember and taking my time, so great job! (I am curious whether anyone has figured out how to make visualizers work with transformers (VR goggles :-), but I assumed not, since there's no map in Microscope. And I assumed that was part of why the CLIP team repeated its experiments on ResNet architectures, in part for the more mature visualization options and the opportunity for comparative analysis.) Thanks.

ModMorph avatar Mar 10 '21 23:03 ModMorph

@gabgoh any reply? Thx in advance.

betterze avatar Mar 19 '21 23:03 betterze

FWIW, note that it does mention some of the equations for "faceted feature visualization" and the datasets of images they used to find visual activations: "We then do feature visualization by maximizing the penalized objective f(g(x)) + w^T(g(x) ⊙ ∇f(g(x))), where w are the weights of that linear probe, and f ∘ g is the original feature visualization objective, composed of two functions: g, which takes the input into intermediate activations, and f, which takes those intermediate activations into the final objective."
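
To make sure I'm reading that objective right, here is a rough PyTorch-style sketch of how I interpret it (my own reading, not the authors' code; `faceted_objective`, `g`, `f`, and `w` are stand-ins for the activation function, the neuron objective, and the linear-probe weights):

```python
import torch

def faceted_objective(x, g, f, w):
    """Penalized objective f(g(x)) + w^T (g(x) * grad_f(g(x))) as quoted above.
    x is the image being optimized (it should require grad); g maps the image
    to intermediate activations, f maps those activations to the scalar neuron
    objective, and w are probe weights shaped like the activations (all placeholders)."""
    acts = g(x)                 # intermediate activations
    obj = f(acts)               # original feature visualization objective (scalar)
    # gradient of the objective with respect to the intermediate activations
    grad = torch.autograd.grad(obj, acts, create_graph=True)[0]
    # probe weights dotted with (activation * gradient); * is the elementwise product ⊙
    penalty = (w * (acts * grad)).sum()
    return obj + penalty
```

If I understand the quote correctly, one would then run ordinary gradient ascent on x against this objective, exactly as in standard feature visualization.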

I still have a lot to learn on feature viz, but I was going to take a look at Captum and play around to see if I can adapt it, since it seems able to analyze visual question answering models and can probably be made to work with CLIP: https://captum.ai/tutorials/Multimodal_VQA_Interpret
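
Something like the following is roughly what I have in mind (an untested sketch, not anything from the CLIP or Captum docs; the "a photo of a dog" prompt, the ViT-B/32 checkpoint, and the random image tensor are just placeholders): wrap CLIP's image-text similarity as a scalar forward function and hand it to Captum's IntegratedGradients.

```python
import torch
import clip                                   # OpenAI's CLIP package
from captum.attr import IntegratedGradients

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# fixed text embedding to attribute against (placeholder prompt)
text = clip.tokenize(["a photo of a dog"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(text).float()
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

def clip_similarity(image):
    # scalar score per image: cosine similarity with the fixed text embedding
    image_features = model.encode_image(image).float()
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    return (image_features * text_features).sum(dim=-1)

ig = IntegratedGradients(clip_similarity)
# stand-in for a real preprocessed image (normally preprocess(PIL image), batched)
image = torch.randn(1, 3, 224, 224, device=device)
attributions = ig.attribute(image, n_steps=32)   # pixel-level attribution map
```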

ModMorph avatar Mar 20 '21 01:03 ModMorph

On the Distill Slack, there was a post from Gabriel who said the following regarding text feature visualization:

Here's a simple high-level explanation of text feature viz: text feature visualization is a fairly simple search algorithm that, given a series of tokens, either

  • adds a token
  • deletes a token
  • replaces a token

The choice of what to do depends on what maximizes the objective, and is done naively by feeding all options to the network. That's it! This vanilla version already works very well, but you can speed it up a bit by:

  • Guiding the search using a small BERT model
  • Adding a small LM penalty on all sentences
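
For anyone who wants something concrete, here is my own rough sketch of what that search loop could look like (this is just my reading of the description above, not Gabriel's actual code; `score_fn` and `vocab` are stand-ins for the real objective, e.g. a neuron's activation on the encoded text, and the token vocabulary):

```python
import random

def hill_climb_text(score_fn, vocab, n_iters=200, seed_tokens=None):
    """Greedy token-level hill climbing: at each step, propose adding,
    deleting, or replacing a single token, score every candidate with the
    network, and keep the best one; stop at a local maximum."""
    tokens = list(seed_tokens) if seed_tokens else [random.choice(vocab)]
    best_score = score_fn(tokens)
    for _ in range(n_iters):
        candidates = []
        for i in range(len(tokens) + 1):                  # add a token
            candidates += [tokens[:i] + [t] + tokens[i:] for t in vocab]
        if len(tokens) > 1:                               # delete a token
            candidates += [tokens[:i] + tokens[i + 1:] for i in range(len(tokens))]
        for i in range(len(tokens)):                      # replace a token
            candidates += [tokens[:i] + [t] + tokens[i + 1:] for t in vocab]
        best_cand = max(candidates, key=score_fn)
        cand_score = score_fn(best_cand)
        if cand_score <= best_score:
            break                                         # local maximum reached
        tokens, best_score = best_cand, cand_score
    return tokens
```

Scoring every candidate over CLIP's full BPE vocabulary (tens of thousands of tokens) each step would be expensive, which is presumably where the speedups come in: the BERT guidance would shrink the add/replace candidates to a shortlist of proposed tokens, and the LM penalty would just be an extra fluency term inside score_fn.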

ProGamerGov avatar May 07 '22 19:05 ProGamerGov