feat: enable loading tokenizer from YAML file for models not supported in transformers
Related Issues
- fixes #issue-number
Proposed Changes:
In the `_prepare_pipeline_kwargs` method, we can check whether transformers would load the tokenizer if we pass it as a string. If not, we can load the tokenizer ourselves and pass it as a tokenizer object to the pipeline.
```python
# Requires these imports in addition to the existing ones:
#   from transformers import AutoConfig, AutoTokenizer
#   from transformers.models.auto.tokenization_auto import TOKENIZER_MAPPING
def _prepare_pipeline_kwargs(self, **kwargs) -> Dict[str, Any]:
    """
    Sanitizes and prepares the kwargs passed to the transformers pipeline function.
    For more details about pipeline kwargs in general, see Hugging Face
    [documentation](https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.pipeline).
    """
    # as device and device_map are mutually exclusive, we set device to None if device_map is provided
    device_map = kwargs.get("device_map", None)
    device = kwargs.get("device") if device_map is None else None
    # prepare torch_dtype for pipeline invocation
    torch_dtype = self._extract_torch_dtype(**kwargs)
    # and the model (prefer model instance over model_name_or_path str identifier)
    model = kwargs.get("model") or kwargs.get("model_name_or_path")
    # +++++++++++++++++++++++++++++++++++++++++++
    trust_remote_code = kwargs.get("trust_remote_code", False)
    tokenizer = kwargs.get("tokenizer", None)
    if isinstance(tokenizer, str):
        # check whether transformers.pipeline would load the tokenizer by itself;
        # if not, load it here and pass the tokenizer object to the pipeline
        model_config = AutoConfig.from_pretrained(model, trust_remote_code=trust_remote_code)
        load_tokenizer = type(model_config) in TOKENIZER_MAPPING or model_config.tokenizer_class is not None
        if not load_tokenizer:
            tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=trust_remote_code)
    # +++++++++++++++++++++++++++++++++++++++++++
    pipeline_kwargs = {
        "task": kwargs.get("task", None),
        "model": model,
        "config": kwargs.get("config", None),
        ###################################################
        "tokenizer": tokenizer,
        ###################################################
        "feature_extractor": kwargs.get("feature_extractor", None),
        "revision": kwargs.get("revision", None),
        "use_auth_token": kwargs.get("use_auth_token", None),
        "device_map": device_map,
        "device": device,
        "torch_dtype": torch_dtype,
        ###################################################
        "trust_remote_code": trust_remote_code,
        ###################################################
        "model_kwargs": kwargs.get("model_kwargs", {}),
        "pipeline_class": kwargs.get("pipeline_class", None),
    }
    return pipeline_kwargs
```
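For context, this is roughly what the change automates. Below is a minimal sketch of doing the same thing manually with transformers, assuming a remote-code model such as `mosaicml/mpt-7b-chat` (the model id is only an example, and running this downloads a large model):

```python
# Manual equivalent of what the patched _prepare_pipeline_kwargs enables:
# load the tokenizer explicitly and hand the object to transformers.pipeline,
# instead of relying on the pipeline to resolve it from the model name alone.
from transformers import AutoTokenizer, pipeline

model_name = "mosaicml/mpt-7b-chat"  # example of a model that needs trust_remote_code
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
generator = pipeline(
    task="text-generation",
    model=model_name,
    tokenizer=tokenizer,  # tokenizer object instead of a string identifier
    trust_remote_code=True,
)
print(generator("Berlin is", max_new_tokens=20)[0]["generated_text"])
```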
How did you test it?
By running the example code mentioned in the related issue, I got an answer from the model. If a unit test is needed here, just let me know; I am happy to add one. But I would like to wait for your feedback first.
Notes for the reviewer
The example in the related issue won't run through because other related issues are not merged into main yet. In order to make the example work, you need to overwrite `model_input_kwargs` with the defined `generate_kwargs`.
Checklist
- I have read the contributors guidelines and the code of conduct
- I have updated the related issue with new insights and changes
- I added unit tests and updated the docstrings
- I've used one of the conventional commit types for my PR title: `fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:`.
- I documented my code
- I ran pre-commit hooks and fixed any issue
Pull Request Test Coverage Report for Build 5736945472
- 0 of 0 changed or added relevant lines in 0 files are covered.
- 14 unchanged lines in 2 files lost coverage.
- Overall coverage increased (+0.03%) to 46.529%
| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| utils/context_matching.py | 1 | 91.67% |
| nodes/prompt/invocation_layer/hugging_face.py | 13 | 89.73% |
| Total: | 14 | |
| Totals | |
|---|---|
| Change from base Build 5736927683: | 0.03% |
| Covered Lines: | 10960 |
| Relevant Lines: | 23555 |
💛 - Coveralls
Hi @bogdankostic, any feedback on this PR?
@faaany Can you maybe explain why this change is necessary? In the Notes for the reviewer section, you mention a related issue, but no related issue is linked to this PR.
Oh, sorry, I explained the issue in a separate link here. Basically, the current Haystack code doesn't support loading a tokenizer via the pipeline YAML file.
Thanks for linking the issue, this definitely helped me to understand the purpose of this PR!
Hi @bogdankostic, I have refactored my code based on your comments. Could you take a look?
@bogdankostic I kept thinking about whether my current implementation is the best approach. and today I got another idea to enable this feature. Will update my code tomorrow. Thanks for reviewing!
@bogdankostic code updated and unit tests are added.
Compared to the last version, I added two additional pieces of logic (sketched below):
- `_prepare_tokenizer` will only be called when the user passed `trust_remote_code=False` and no tokenizer is passed
- `_prepare_tokenizer` accepts models passed both as a string and as an instantiated object.
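A rough, hypothetical sketch of the `_prepare_tokenizer` logic described above (the function name comes from the PR discussion; the signature and standalone placement are assumptions):

```python
# Hypothetical sketch of _prepare_tokenizer: load the tokenizer explicitly only if
# transformers.pipeline would not resolve it on its own; accepts a model name or instance.
from typing import Optional, Union

from transformers import AutoConfig, AutoTokenizer, PreTrainedModel, PreTrainedTokenizer
from transformers.models.auto.tokenization_auto import TOKENIZER_MAPPING


def _prepare_tokenizer(
    model: Union[str, PreTrainedModel], trust_remote_code: bool = False
) -> Optional[PreTrainedTokenizer]:
    # resolve the model name whether a string identifier or an instantiated model was passed
    model_name = model if isinstance(model, str) else model.name_or_path
    config = AutoConfig.from_pretrained(model_name, trust_remote_code=trust_remote_code)
    # same check transformers.pipeline uses to decide whether to load a tokenizer itself
    if type(config) in TOKENIZER_MAPPING or config.tokenizer_class is not None:
        return None
    return AutoTokenizer.from_pretrained(model_name, trust_remote_code=trust_remote_code)
```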
Note on the tests:
- Actually, I wanted to add two tests. One is for models whose tokenizer won't be loaded by `transformers.pipeline`; in this case, the `_prepare_tokenizer` method will load the tokenizer. The other one is for models whose tokenizer will be loaded; in this case, the method will return None. But I don't know a good way to write the test for the second case. Please let me know if you have some ideas. Thanks!
@bogdankostic the tests with the specified model run through on my local machine but failed here. It seems to be an environment issue. Maybe the way I wrote the test is not what you had in mind. What do you think?
Hey @faaany, I'm not sure why you changed the logic for which cases we need to load the model manually - can you elaborate a bit on this? Regarding the test, I'd prefer to have a unit test instead of an integration test - for this, we should mock the model loading. Let me know if I should help here. :)
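One possible shape for such a mocked unit test, assuming the hypothetical `_prepare_tokenizer` sketch above is importable (names and the repository id are illustrative, not the actual Haystack test):

```python
# Mocks AutoConfig/AutoTokenizer loading so no model weights are downloaded;
# exercises the "tokenizer not resolvable by the pipeline" branch.
from unittest.mock import MagicMock, patch


def test_prepare_tokenizer_loads_tokenizer_for_unsupported_model():
    fake_config = MagicMock()
    fake_config.tokenizer_class = None  # config of a model transformers can't map to a tokenizer
    fake_tokenizer = MagicMock()

    with patch("transformers.AutoConfig.from_pretrained", return_value=fake_config), patch(
        "transformers.AutoTokenizer.from_pretrained", return_value=fake_tokenizer
    ) as tokenizer_mock:
        tokenizer = _prepare_tokenizer("some-org/remote-code-model", trust_remote_code=True)

    assert tokenizer is fake_tokenizer
    tokenizer_mock.assert_called_once_with("some-org/remote-code-model", trust_remote_code=True)
```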
Sure! There are two types of models: models supported by the transformers library and models that are not. As you can see here, the transformers library already supports a lot of models. In each model folder (e.g. if you click the bert folder), you will see files like `modeling_bert.py` and `tokenization_bert.py`. But there are also models that are not supported by transformers, e.g. the mpt-7b-chat model. As you can see here, `modeling_mpt.py` lives on the Hugging Face Model Hub instead of in the transformers library. In order to use this kind of model, you need to set `trust_remote_code` to True when using the pipeline.
For the first type of model, there is almost no need to pass an extra tokenizer, because the pipeline can load it automatically, as you can see here. For the second type of model, we need to check whether the transformers pipeline will load the tokenizer or not. Therefore, I added `trust_remote_code` to the if statement.
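For illustration, the check described above can be run standalone. A small sketch (model ids are examples; the MPT remark assumes a transformers version from before native MPT support):

```python
# Standalone version of the "will the pipeline load the tokenizer?" check.
from transformers import AutoConfig
from transformers.models.auto.tokenization_auto import TOKENIZER_MAPPING


def pipeline_would_load_tokenizer(model_name: str, trust_remote_code: bool = False) -> bool:
    config = AutoConfig.from_pretrained(model_name, trust_remote_code=trust_remote_code)
    return type(config) in TOKENIZER_MAPPING or config.tokenizer_class is not None


print(pipeline_would_load_tokenizer("bert-base-uncased"))  # True: natively supported model
# For a remote-code-only model (e.g. mosaicml/mpt-7b-chat on older transformers versions)
# this can return False, which is exactly the case where the tokenizer must be loaded explicitly.
```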
@bogdankostic And yes! I need help with the unit tests. If you could point me to an example unit test or some pseudo code, that would be really awesome!
@bogdankostic please note that there is a slight change in my proposal. My initial suggestion was to let users specify the tokenizer name in the YAML file. But later, I found this step can be skipped because the tokenizer name is the same as the provided model name anyway. So the problem this PR tries to solve is enabling the current Haystack pipeline to support models that are not supported by the transformers library.
Hey @faaany, I investigated a bit and found this transformers PR (https://github.com/huggingface/transformers/pull/24629) that adds support for MPT models which got merged just a couple of minutes ago. Given that transformers added support for MPT models, I don't think this PR is needed anymore, do you agree?
@bogdankostic yes, for my use case I think the newly merged PR in transformers should fix the error. But let me try out the merged transformers library. I will give you an update tomorrow.
But there will always be models that are not supported by the local HF APIs and need remote code execution. I think we still need these knobs here so that Haystack can always support the latest models that are not yet natively integrated into HF. @bogdankostic
@bogdankostic done, do you know how to work around this release note failure? I just merged the latest code from main and then kept getting this release note error while running the automated tests. Thanks so much for the unit test!
Thanks for making the changes, @faaany!
About the failing release note check: We recently changed the way our release notes are generated. We are now using a tool called `reno`, which requires each PR to provide a short description for the release notes. You can find more details about this decision in #5408.
To fix the failing check, you need to generate a release note file. You can find instructions on how to do this in our contributing guidelines. Let me know if I should help you with that. :)
Thanks so much! I updated my PR accordingly. Please take a look.