feat: enable loading tokenizer from YAML file for models not supported in transformers
Related Issues
- fixes #issue-number
Proposed Changes:
In the `_prepare_pipeline_kwargs` method, we can check whether transformers would load the tokenizer if we pass it as a string. If not, we can load the tokenizer ourselves and pass it as a tokenizer object to the pipeline.
```python
# Requires these imports in addition to the existing ones:
#   from transformers import AutoConfig, AutoTokenizer
#   from transformers.models.auto.tokenization_auto import TOKENIZER_MAPPING
def _prepare_pipeline_kwargs(self, **kwargs) -> Dict[str, Any]:
    """
    Sanitizes and prepares the kwargs passed to the transformers pipeline function.
    For more details about pipeline kwargs in general, see Hugging Face
    [documentation](https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.pipeline).
    """
    # as device and device_map are mutually exclusive, we set device to None if device_map is provided
    device_map = kwargs.get("device_map", None)
    device = kwargs.get("device") if device_map is None else None
    # prepare torch_dtype for pipeline invocation
    torch_dtype = self._extract_torch_dtype(**kwargs)
    # and the model (prefer model instance over model_name_or_path str identifier)
    model = kwargs.get("model") or kwargs.get("model_name_or_path")
    # +++++++++++++++++++++++++++++++++++++++++++
    trust_remote_code = kwargs.get("trust_remote_code", False)
    tokenizer = kwargs.get("tokenizer", None)
    if isinstance(tokenizer, str):
        # check whether transformers.pipeline would load the tokenizer by itself;
        # if not, load it here and pass the tokenizer object to the pipeline
        model_config = AutoConfig.from_pretrained(model, trust_remote_code=trust_remote_code)
        load_tokenizer = type(model_config) in TOKENIZER_MAPPING or model_config.tokenizer_class is not None
        if not load_tokenizer:
            tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=trust_remote_code)
    # +++++++++++++++++++++++++++++++++++++++++++
    pipeline_kwargs = {
        "task": kwargs.get("task", None),
        "model": model,
        "config": kwargs.get("config", None),
        ###################################################
        "tokenizer": tokenizer,
        ###################################################
        "feature_extractor": kwargs.get("feature_extractor", None),
        "revision": kwargs.get("revision", None),
        "use_auth_token": kwargs.get("use_auth_token", None),
        "device_map": device_map,
        "device": device,
        "torch_dtype": torch_dtype,
        ###################################################
        "trust_remote_code": trust_remote_code,
        ###################################################
        "model_kwargs": kwargs.get("model_kwargs", {}),
        "pipeline_class": kwargs.get("pipeline_class", None),
    }
    return pipeline_kwargs
```
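For context, this is roughly what the change automates. Below is a minimal sketch of doing the same thing manually with transformers, assuming a remote-code model such as `mosaicml/mpt-7b-chat` (the model id is only an example, and running this downloads a large model):

```python
# Manual equivalent of what the patched _prepare_pipeline_kwargs enables:
# load the tokenizer explicitly and hand the object to transformers.pipeline,
# instead of relying on the pipeline to resolve it from the model name alone.
from transformers import AutoTokenizer, pipeline

model_name = "mosaicml/mpt-7b-chat"  # example of a model that needs trust_remote_code
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
generator = pipeline(
    task="text-generation",
    model=model_name,
    tokenizer=tokenizer,  # tokenizer object instead of a string identifier
    trust_remote_code=True,
)
print(generator("Berlin is", max_new_tokens=20)[0]["generated_text"])
```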
How did you test it?
By running the example code mentioned in the related issue, I got an answer from the model. If a unit test is needed here, just let me know; I am happy to add one. But I would like to wait for your feedback first.
Notes for the reviewer
The example in the related issue won't run through because other related issues are not merged into main yet. In order to make the example work, you need to overwrite `model_input_kwargs` with the defined `generate_kwargs`.
Checklist
- I have read the contributors guidelines and the code of conduct
- I have updated the related issue with new insights and changes
- I added unit tests and updated the docstrings
- I've used one of the conventional commit types for my PR title: `fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:`.
- I documented my code
- I ran pre-commit hooks and fixed any issue
Pull Request Test Coverage Report for Build 5736945472
- 0 of 0 changed or added relevant lines in 0 files are covered.
- 14 unchanged lines in 2 files lost coverage.
- Overall coverage increased (+0.03%) to 46.529%
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| utils/context_matching.py | 1 | 91.67% |
| nodes/prompt/invocation_layer/hugging_face.py | 13 | 89.73% |
| Total: | 14 | |
| Totals | |
|---|---|
| Change from base Build 5736927683: | 0.03% |
| Covered Lines: | 10960 |
| Relevant Lines: | 23555 |
💛 - Coveralls
Hi @bogdankostic, any feedback on this PR?
@faaany Can you maybe explain why this change is necessary? In the Notes for the reviewer section, you mention a related issue, but no related issue is linked to this PR.
Oh, sorry, I explained the issue in a separate link here. Basically, the current Haystack code doesn't support loading a tokenizer via the pipeline YAML file.
Thanks for linking the issue, this definitely helped me to understand the purpose of this PR!
Hi @bogdankostic, I have refactored my code based on your comments. Could you take a look?
@bogdankostic I kept thinking about whether my current implementation is the best approach. and today I got another idea to enable this feature. Will update my code tomorrow. Thanks for reviewing!
@bogdankostic code updated and unit tests are added.
Compared to the last version, I added two additional pieces of logic (sketched below):
- `_prepare_tokenizer` will only be called when the user passed `trust_remote_code=False` and no tokenizer is passed
- `_prepare_tokenizer` accepts models passed both as a string and as an instantiated object.
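A rough, hypothetical sketch of the `_prepare_tokenizer` logic described above (the function name comes from the PR discussion; the signature and standalone placement are assumptions):

```python
# Hypothetical sketch of _prepare_tokenizer: load the tokenizer explicitly only if
# transformers.pipeline would not resolve it on its own; accepts a model name or instance.
from typing import Optional, Union

from transformers import AutoConfig, AutoTokenizer, PreTrainedModel, PreTrainedTokenizer
from transformers.models.auto.tokenization_auto import TOKENIZER_MAPPING


def _prepare_tokenizer(
    model: Union[str, PreTrainedModel], trust_remote_code: bool = False
) -> Optional[PreTrainedTokenizer]:
    # resolve the model name whether a string identifier or an instantiated model was passed
    model_name = model if isinstance(model, str) else model.name_or_path
    config = AutoConfig.from_pretrained(model_name, trust_remote_code=trust_remote_code)
    # same check transformers.pipeline uses to decide whether to load a tokenizer itself
    if type(config) in TOKENIZER_MAPPING or config.tokenizer_class is not None:
        return None
    return AutoTokenizer.from_pretrained(model_name, trust_remote_code=trust_remote_code)
```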
Note on the tests:
- Actually, I wanted to add two tests. One is for models whose tokenizer won't be loaded by `transformers.pipeline`; in this case, the `_prepare_tokenizer` method will load the tokenizer. The other one is for models whose tokenizer will be loaded; in this case, the method will return None. But I don't know a good way to write the test for the second case. Please let me know if you have some ideas. Thanks!
@bogdankostic the tests with the specified model run through on my local machine but failed here. It seems to be an environment issue. Maybe the way I wrote the test is not what you had in mind. What do you think?
Hey @faaany, I'm not sure why you changed the logic for which cases we need to load the model manually - can you elaborate a bit on this? Regarding the test, I'd prefer to have a unit test instead of an integration test - for this, we should mock the model loading. Let me know if I should help here. :)
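One possible shape for such a mocked unit test, assuming the hypothetical `_prepare_tokenizer` sketch above is importable (names and the repository id are illustrative, not the actual Haystack test):

```python
# Mocks AutoConfig/AutoTokenizer loading so no model weights are downloaded;
# exercises the "tokenizer not resolvable by the pipeline" branch.
from unittest.mock import MagicMock, patch


def test_prepare_tokenizer_loads_tokenizer_for_unsupported_model():
    fake_config = MagicMock()
    fake_config.tokenizer_class = None  # config of a model transformers can't map to a tokenizer
    fake_tokenizer = MagicMock()

    with patch("transformers.AutoConfig.from_pretrained", return_value=fake_config), patch(
        "transformers.AutoTokenizer.from_pretrained", return_value=fake_tokenizer
    ) as tokenizer_mock:
        tokenizer = _prepare_tokenizer("some-org/remote-code-model", trust_remote_code=True)

    assert tokenizer is fake_tokenizer
    tokenizer_mock.assert_called_once_with("some-org/remote-code-model", trust_remote_code=True)
```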
Sure! There are two types of models: models supported by the transformers library and models that are not. As you can see here, the transformers library already supports a lot of models. In each model folder (e.g. if you click the bert folder), you will see files like `modeling_bert.py` and `tokenization_bert.py`. But there are also models that are not supported by transformers, e.g. the mpt-7b-chat model. As you can see here, `modeling_mpt.py` lives on the Hugging Face Model Hub instead of in the transformers library. In order to use this kind of model, you need to set `trust_remote_code` to True when using the pipeline.
For the first type of model, there is almost no need to pass an extra tokenizer, because the pipeline can load it automatically, as you can see here. For the second type of model, we need to check whether the transformers pipeline will load the tokenizer or not. Therefore, I added `trust_remote_code` to the if statement.
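For illustration, the check described above can be run standalone. A small sketch (model ids are examples; the MPT remark assumes a transformers version from before native MPT support):

```python
# Standalone version of the "will the pipeline load the tokenizer?" check.
from transformers import AutoConfig
from transformers.models.auto.tokenization_auto import TOKENIZER_MAPPING


def pipeline_would_load_tokenizer(model_name: str, trust_remote_code: bool = False) -> bool:
    config = AutoConfig.from_pretrained(model_name, trust_remote_code=trust_remote_code)
    return type(config) in TOKENIZER_MAPPING or config.tokenizer_class is not None


print(pipeline_would_load_tokenizer("bert-base-uncased"))  # True: natively supported model
# For a remote-code-only model (e.g. mosaicml/mpt-7b-chat on older transformers versions)
# this can return False, which is exactly the case where the tokenizer must be loaded explicitly.
```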
@bogdankostic And yes! I need help with the unit tests. If you could point me to an example unit test or some pseudo code, that would be really awesome!
@bogdankostic please note that there is a slight change in my proposal. My initial suggestion was to let users specify the tokenizer name in the YAML file. But later, I found this step can be skipped because the tokenizer name is the same as the provided model name anyway. So the problem this PR tries to solve is enabling the current Haystack pipeline to support models that are not supported by the transformers library.
Hey @faaany, I investigated a bit and found this transformers PR (https://github.com/huggingface/transformers/pull/24629) that adds support for MPT models which got merged just a couple of minutes ago. Given that transformers added support for MPT models, I don't think this PR is needed anymore, do you agree?
@bogdankostic yes, for my use case I think the newly merged PR in transformers should fix the error. But let me try out the merged transformers library. I will give you an update tomorrow.
But there will always be models that are not supported by the local HF APIs and need remote code execution. I think we still need these knobs here so that Haystack can always support the latest models that are not yet natively integrated into HF. @bogdankostic
@bogdankostic done, do you know how to work around this release note failure? I just merged the latest code from main and then kept getting this release note error while running the automated tests. Thanks so much for the unit test!
Thanks for making the changes, @faaany!
About the failing release note check: We recently changed the way our release notes are generated. We are now using a tool called `reno`, which requires each PR to provide a short description for the release notes. You can find more details about this decision in #5408.
To fix the failing check, you need to generate a release note file. You can find instructions on how to do this in our contributing guidelines. Let me know if I should help you with that. :)
Thanks so much! I updated my PR accordingly. Please take a look.