
Add Mixtral to `test_match_huggingface` test

Open joelburget opened this issue 1 year ago • 1 comment

Description

This PR is a demonstration / for discussion. I've extended the existing test_match_huggingface test, adding a Mixtral case. As implemented, I'm sure this breaks other things; the PR is intended as a proof-of-concept and a demonstration of an unexpected difficulty in making it work ((2) below).

A few things of note:

  1. GPT2 and Mixtral don't use the same parameter names in transformers, so I had to remove GPT2 from the test (see the commented-out hf_out lines). This is easily fixed by not parametrizing the test, but that probably leads to some duplicated code.
  2. I had to change W_in, W_gate, and W_out to be the transpose of how they were previously represented. See https://stackoverflow.com/q/78745180/383958 for the reasoning.
  3. I had to rewrite most of forward to get the tests to pass. I think there was an actual difference in logic compared to Mixtral, but I'm not entirely sure (see the sketch after this list).
  4. The test uses a model I made (see the script demonstrating how), which just contains the first layer of Mixtral-7B. This makes it possible to actually run the test on a laptop (the second sketch below shows one way such a model could be built).
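
To make points (2) and (3) concrete, here is a minimal sketch of a Mixtral-style MoE forward pass. This is hypothetical illustration code, not the PR's diff: `moe_forward`, the `experts` dict layout, and the exact weight shapes are assumptions. Each token is routed by a linear gate to its top-k experts, and each expert is an unbiased gated (SwiGLU-style) MLP whose weights are assumed to be stored so that the forward is a plain `x @ W`, i.e. the transposed layout from point (2).

```python
import torch
import torch.nn.functional as F


def moe_forward(x, W_gate_router, experts, num_experts_per_tok=2):
    # x: [batch, pos, d_model]; W_gate_router: [d_model, n_experts]
    # experts: list of dicts with W_gate / W_in ([d_model, d_mlp]) and W_out ([d_mlp, d_model])
    batch, pos, d_model = x.shape
    flat = x.reshape(-1, d_model)                       # [batch*pos, d_model]
    router_logits = flat @ W_gate_router                # [batch*pos, n_experts]

    # Pick the top-k experts per token and softmax their logits to get mixing weights.
    weights, chosen = torch.topk(router_logits, num_experts_per_tok, dim=-1)
    weights = F.softmax(weights, dim=-1, dtype=torch.float).to(flat.dtype)

    out = torch.zeros_like(flat)
    for e, expert in enumerate(experts):
        token_idx, slot_idx = torch.where(chosen == e)  # which (token, slot) pairs chose expert e
        if token_idx.numel() == 0:
            continue
        h = flat[token_idx]                             # [n_tok, d_model]
        # Unbiased gated MLP: silu(h @ W_gate) * (h @ W_in), then project back down with W_out.
        gated = F.silu(h @ expert["W_gate"]) * (h @ expert["W_in"])
        out[token_idx] += weights[token_idx, slot_idx, None] * (gated @ expert["W_out"])
    return out.reshape(batch, pos, d_model)
```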

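For point (4), this is one way such a single-layer test model could be produced. It is a hypothetical sketch, not necessarily the script referenced above; the source model id and output directory are assumptions, and building the model requires enough memory to load the full checkpoint once.

```python
from transformers import AutoConfig, AutoModelForCausalLM

SOURCE = "mistralai/Mixtral-8x7B-v0.1"  # assumed source checkpoint

# Shrink the architecture to a single decoder layer.
config = AutoConfig.from_pretrained(SOURCE)
config.num_hidden_layers = 1
small = AutoModelForCausalLM.from_config(config)

# Copy over only the embedding, first layer, final norm, and unembedding weights.
full_state = AutoModelForCausalLM.from_pretrained(SOURCE, torch_dtype="auto").state_dict()
wanted = {
    k: v
    for k, v in full_state.items()
    if k.startswith(("model.embed_tokens", "model.layers.0.", "model.norm", "lm_head"))
}
small.load_state_dict(wanted, strict=False)
small.save_pretrained("mixtral-1-layer")
```

The resulting single-layer model is small enough that the HuggingFace-vs-TransformerLens comparison can run on a laptop.
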
Type of change

Testing / discussion. This is a follow-up to #570 / #641.

Checklist:

  • [x] I have commented my code, particularly in hard-to-understand areas
  • [x] I have made corresponding changes to the documentation
  • [x] My changes generate no new warnings
  • [x] I have added tests that prove my fix is effective or that my feature works
  • [ ] New and existing unit tests pass locally with my changes
  • [ ] I have not rewritten tests relating to key interfaces which would affect backward compatibility

joelburget · Jul 15 '24

There are quite a few differences compared to the current released version of TransformerLens, and I would be curious to see this test run against that version. In the current version, the MLP used for MoE is a specialized MLP that only exists within the MoE component. That MLP could be split out into an UnbiasedGatedMLP, and that will probably happen, but the point is that the current implementation is quite a bit different from what is here, or from the starting point for this PR.

That being said, would you mind if I updated your branch to the current dev branch? It would probably overwrite everything you did in the Gated MLP and MoE components. If you prefer, I can do it in a separate PR against your branch, so that we can look at the two side by side.

Also, for the bigger picture: my current plan for when I am free to work on TransformerLens (probably starting next week) is to complete a major revision of weight conversions. Basically, I am going to rewrite the whole module so that, instead of every mapping being a line of code, we have a base class that takes a key mapping. It will then use those key mappings to convert weights in a generalized fashion, with an optional callback to handle more complicated individual key conversions. The goal is to make the key mappings themselves reusable: we can then build benchmark utilities that take the weight-conversion key mappings and automatically match outputs from the source model to where they exist within TransformerLens. That change may help make this test easier to generalize across multiple models.
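
As a rough illustration of that plan (a hypothetical sketch, not the planned API: `WeightConversion`, the key-template format, and the callback signature are all assumptions), the converter base could look something like this:

```python
from typing import Callable, Dict, Optional

import torch


class WeightConversion:
    """Converts a HuggingFace state dict into TransformerLens names via a reusable key mapping."""

    def __init__(
        self,
        key_map: Dict[str, str],
        callbacks: Optional[Dict[str, Callable[[torch.Tensor], torch.Tensor]]] = None,
    ):
        # e.g. {"model.layers.{l}.mlp.gate_proj.weight": "blocks.{l}.mlp.W_gate"}
        self.key_map = key_map
        # Optional per-key hooks for anything beyond a straight copy (transposes, splits, ...).
        self.callbacks = callbacks or {}

    def convert(self, hf_state_dict: Dict[str, torch.Tensor], n_layers: int) -> Dict[str, torch.Tensor]:
        out = {}
        for hf_template, tl_template in self.key_map.items():
            for layer in range(n_layers):
                hf_key = hf_template.format(l=layer)
                if hf_key not in hf_state_dict:
                    continue
                tensor = hf_state_dict[hf_key]
                callback = self.callbacks.get(hf_template)
                out[tl_template.format(l=layer)] = callback(tensor) if callback else tensor
        return out
```

Because the key mapping is plain data, the same mapping could also drive benchmark utilities that pair up source-model tensors or activations with their TransformerLens counterparts automatically.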

bryce13950 · Jul 17 '24