benchmarks MosaicBERT: Convert composer weights to HF

MosaicBERT: Convert composer weights to HF

Open stefan-it opened this issue 1 year ago • 1 comments

Hi,

we could sucessfully pretrain various MosaicBERT models and evaluations with composer-based fine-tuning look really good :)

However, when using a/the conversion script llm-foundry/scripts/inference/convert_composer_to_hf.py the converted HF model seems to be initialized randomly and the MLM predictions are looking super random.

I used the conversion script from the llm-foundry repository like this:

$ python3 /mnt/llm-foundry/scripts/inference/convert_composer_to_hf.py --composer_path ep111-ba125000-rank0.pt --hf_output_path ./converted-3 --output_precision fp32

It then shows, that various weights are not correctly initalized:

HF checkpoint folder successfully created at ./converted-3.                                                              
Loading model from ./converted-3                                                                                         
If you want to use `BertLMHeadModel` as a standalone, add `is_decoder=True.`                                             
Some weights of BertLMHeadModel were not initialized from the model checkpoint at ./converted-3 and are newly initialized
: ['bert.encoder.layer.7.attention.self.key.bias', 'bert.encoder.layer.11.output.LayerNorm.weight', 'bert.encoder.layer.7
.attention.self.query.weight', 'bert.encoder.layer.10.output.LayerNorm.bias', 'bert.encoder.layer.4.output.dense.bias', '
bert.encoder.layer.8.attention.self.key.bias', 'bert.encoder.layer.5.output.LayerNorm.bias', 'bert.encoder.layer.1.output
.dense.weight', 'bert.encoder.layer.2.output.dense.bias', 'bert.encoder.layer.8.attention.self.value.bias', 'bert.encoder
.layer.5.intermediate.dense.weight', 'bert.encoder.layer.0.attention.self.value.bias', 'bert.encoder.layer.1.intermediate
.dense.bias', 'bert.encoder.layer.1.attention.self.query.weight', 'bert.encoder.layer.8.attention.self.query.weight', 'be
rt.encoder.layer.2.attention.self.key.weight', 'bert.encoder.layer.2.output.LayerNorm.weight', 'bert.encoder.layer.3.atte
ntion.self.query.bias', 'bert.encoder.layer.11.attention.self.value.weight', 'bert.encoder.layer.2.attention.self.value.b
ias', 'bert.encoder.layer.4.attention.self.value.bias', 'bert.encoder.layer.0.attention.self.key.weight', 'bert.encoder.l
ayer.2.attention.self.key.bias', 'bert.encoder.layer.6.attention.self.key.weight', 'bert.encoder.layer.5.attention.self.k
ey.bias', 'bert.encoder.layer.9.attention.self.query.weight', 'bert.encoder.layer.7.attention.self.value.weight', 'bert.e
ncoder.layer.8.output.dense.weight', 'bert.encoder.layer.4.attention.self.key.bias', 'bert.encoder.layer.11.attention.sel
f.value.bias', 'bert.encoder.layer.4.attention.self.key.weight', 'bert.encoder.layer.7.intermediate.dense.bias', 'bert.en
coder.layer.5.output.dense.bias', 'bert.encoder.layer.8.attention.self.value.weight', 'bert.encoder.layer.5.attention.sel
f.query.weight', 'bert.encoder.layer.4.attention.self.value.weight', 'bert.encoder.layer.9.intermediate.dense.weight', 'b
ert.encoder.layer.3.output.LayerNorm.bias', 'bert.encoder.layer.6.intermediate.dense.bias', 'bert.encoder.layer.3.interme
diate.dense.weight', 'bert.encoder.layer.9.attention.self.value.bias', 'bert.encoder.layer.4.output.LayerNorm.weight', 'b
ert.encoder.layer.3.output.LayerNorm.weight', 'bert.encoder.layer.5.attention.self.value.weight', 'bert.encoder.layer.10.
attention.self.key.weight', 'bert.encoder.layer.3.intermediate.dense.bias', 'bert.encoder.layer.9.output.LayerNorm.bias',
 'bert.encoder.layer.11.attention.self.query.bias', 'bert.encoder.layer.11.intermediate.dense.bias', 'bert.encoder.layer.
0.attention.self.key.bias', 'bert.encoder.layer.7.output.LayerNorm.bias', 'bert.encoder.layer.0.output.dense.weight', 'be
rt.encoder.layer.6.attention.self.query.weight', 'bert.encoder.layer.11.output.LayerNorm.bias', 'bert.encoder.layer.5.out
put.LayerNorm.weight', 'bert.encoder.layer.9.output.dense.bias', 'bert.encoder.layer.6.attention.self.key.bias', 'bert.en
coder.layer.1.intermediate.dense.weight', 'bert.encoder.layer.10.attention.self.query.weight', 'bert.encoder.layer.3.atte
ntion.self.query.weight', 'bert.encoder.layer.9.output.dense.weight', 'bert.encoder.layer.1.attention.self.key.weight', '
bert.encoder.layer.10.output.LayerNorm.weight', 'bert.encoder.layer.0.attention.self.value.weight', 'bert.encoder.layer.2
.attention.self.query.bias', 'bert.encoder.layer.8.output.dense.bias', 'bert.encoder.layer.0.output.LayerNorm.weight'
[...]
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Is there any special conversion script/hints for converting a MosaicBERT composer checkpoint :thinking:

Any help is highly appreciated!

Jan 25 '24 08:01 stefan-it

Hi, the conversion script in LLM Foundry is not intended for MosaicBERT, which still lives here in examples repo. To export it properly with the code files, you'll need to do some manual movement of the code files. See my other answer as well: https://github.com/mosaicml/examples/issues/401#issuecomment-1629846290

Jan 25 '24 22:01 dakinggg

benchmarks benchmarks copied to clipboard

MosaicBERT: Convert composer weights to HF

benchmarks
benchmarks copied to clipboard