
feat: enable loading tokenizer from YAML file for models not supported in transformers

Open · faaany opened this pull request 2 years ago

Related Issues

Proposed Changes:

In the _prepare_pipeline_kwargs method, we can check whether transformers would load the tokenizer on its own when it is passed as a string. If it would not, we load the tokenizer ourselves and pass it to the pipeline as a tokenizer object.

def _prepare_pipeline_kwargs(self, **kwargs) -> Dict[str, Any]:
        """
        Sanitizes and prepares the kwargs passed to the transformers pipeline function.
        For more details about pipeline kwargs in general, see Hugging Face
        [documentation](https://huggingface.co/docs/transformers/en/main_classes/pipelines#transformers.pipeline).
        """
        # as device and device_map are mutually exclusive, we set device to None if device_map is provided
        device_map = kwargs.get("device_map", None)
        device = kwargs.get("device") if device_map is None else None
        # prepare torch_dtype for pipeline invocation
        torch_dtype = self._extract_torch_dtype(**kwargs)
        # and the model (prefer model instance over model_name_or_path str identifier)
        model = kwargs.get("model") or kwargs.get("model_name_or_path")
        #+++++++++++++++++++++++++++++++++++++++++++
        # (assumes module-level imports: from transformers import AutoConfig, AutoTokenizer
        #  and from transformers.models.auto.tokenization_auto import TOKENIZER_MAPPING)
        trust_remote_code = kwargs.get("trust_remote_code", False)
        tokenizer = kwargs.get("tokenizer", None)

        if isinstance(tokenizer, str):
            model_config = AutoConfig.from_pretrained(model, trust_remote_code=trust_remote_code)
            # pipeline() loads the tokenizer itself only if the config class is in
            # TOKENIZER_MAPPING or the config names an explicit tokenizer_class
            load_tokenizer = type(model_config) in TOKENIZER_MAPPING or model_config.tokenizer_class is not None
            if not load_tokenizer:
                tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=trust_remote_code)
        #+++++++++++++++++++++++++++++++++++++++++++

        pipeline_kwargs = {
            "task": kwargs.get("task", None),
            "model": model,
            "config": kwargs.get("config", None),
            ###################################################
            "tokenizer": tokenizer,
            ###################################################
            "feature_extractor": kwargs.get("feature_extractor", None),
            "revision": kwargs.get("revision", None),
            "use_auth_token": kwargs.get("use_auth_token", None),
            "device_map": device_map,
            "device": device,
            "torch_dtype": torch_dtype,
            ###################################################
            "trust_remote_code": trust_remote_code,
            ###################################################
            "model_kwargs": kwargs.get("model_kwargs", {}),
            "pipeline_class": kwargs.get("pipeline_class", None),
        }

        return pipeline_kwargs
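
For context, for a Hub-only model the prepared kwargs amount to roughly the following direct transformers call (a sketch only; the model id is the MPT example discussed later in this thread, and the behavior described in the comments is an assumption for illustration):

    from transformers import AutoTokenizer, pipeline

    model_name = "mosaicml/mpt-7b-chat"
    # pipeline() may not be able to infer the tokenizer for such a Hub-only model,
    # so we load it explicitly and hand the object over together with trust_remote_code=True
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    generator = pipeline(
        task="text-generation",
        model=model_name,
        tokenizer=tokenizer,
        trust_remote_code=True,
    )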

How did you test it?

By running the example code mentioned in the related issue, I got an answer from the model. If a unit test is needed here, just let me know; I am happy to add one, but I would like to wait for your feedback first.

Notes for the reviewer

The example in the related issue won't run through yet because other related issues are not merged into main. To make the example work, you need to overwrite model_input_kwargs with the defined generate_kwargs.

Checklist

faaany avatar Jul 11 '23 01:07 faaany

Pull Request Test Coverage Report for Build 5736945472

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 14 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.03%) to 46.529%

Files with Coverage Reduction                   New Missed Lines   %
utils/context_matching.py                       1                  91.67%
nodes/prompt/invocation_layer/hugging_face.py   13                 89.73%
Total: 14

Totals Coverage Status
Change from base Build 5736927683: +0.03%
Covered Lines: 10960
Relevant Lines: 23555

💛 - Coveralls

coveralls avatar Jul 11 '23 01:07 coveralls

Hi @bogdankostic, any feedback on this PR?

faaany avatar Jul 13 '23 07:07 faaany

@faaany Can you maybe explain why this change is necessary? In the Notes for the reviewer section, you mention a related issue, but no related issue is linked to this PR.

bogdankostic avatar Jul 13 '23 13:07 bogdankostic

Oh, sorry, I explained the issue in a separate link here. Basically, the current Haystack code doesn't support loading a tokenizer via the pipeline YAML file.

faaany avatar Jul 13 '23 13:07 faaany

Thanks for linking the issue, this definitely helped me to understand the purpose of this PR!

bogdankostic avatar Jul 13 '23 16:07 bogdankostic

Hi @bogdankostic, I have refactored my code based on your comments. Could you take a look?

faaany avatar Jul 17 '23 18:07 faaany

@bogdankostic I kept thinking about whether my current implementation is the best approach, and today I got another idea for enabling this feature. I will update my code tomorrow. Thanks for reviewing!

faaany avatar Jul 20 '23 15:07 faaany

@bogdankostic code updated and unit tests are added.

Compared to the last version, I added two additional pieces of logic:

  • _prepare_tokenizer is only called when the user passes trust_remote_code=False and no tokenizer is passed.
  • _prepare_tokenizer accepts the model passed either as a string or as an instantiated object (see the sketch after this list).
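
For illustration, such a helper could look roughly like the sketch below; the exact name, signature, and behavior are assumptions for illustration, not necessarily the code in this PR:

from typing import Optional, Union

from transformers import AutoConfig, AutoTokenizer, PreTrainedModel, PreTrainedTokenizerBase
from transformers.models.auto.tokenization_auto import TOKENIZER_MAPPING


def _prepare_tokenizer(
    model: Union[str, PreTrainedModel], trust_remote_code: bool = False
) -> Optional[PreTrainedTokenizerBase]:
    # accept either a model identifier string or an already instantiated model
    model_name = model if isinstance(model, str) else model.name_or_path
    config = AutoConfig.from_pretrained(model_name, trust_remote_code=trust_remote_code)
    # if transformers' pipeline() would load the tokenizer itself, do nothing
    if type(config) in TOKENIZER_MAPPING or config.tokenizer_class is not None:
        return None
    # otherwise load it manually so it can be passed to the pipeline as an object
    return AutoTokenizer.from_pretrained(model_name, trust_remote_code=trust_remote_code)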

Notes on the tests:

  • Actually, I wanted to add two tests. One is for models whose tokenizer won't be loaded by transformers.pipeline; in this case, the _prepare_tokenizer method loads the tokenizer. The other is for models whose tokenizer will be loaded; in this case, the method returns None. But I don't know a good way to write the test for the second case. Please let me know if you have some ideas. Thanks!

faaany avatar Jul 21 '23 10:07 faaany

@bogdankostic the tests with the specified model run through on my local machine but fail here. It seems to be an environment issue. Maybe the way I wrote the test is not what you want. What do you think?

faaany avatar Jul 21 '23 13:07 faaany

Hey @faaany, I'm not sure why you changed the logic for which cases we need to load the model manually, can you elaborate a bit on this? Regarding the test, I'd prefer to have a unit test instead of an integration test - for this, we should mock the model loading, let me know if I should help here. :)

Sure! There are two types of models: models supported by the transformers library and models that are not. As you can see here, the transformers library already supports a lot of models. In each model folder (e.g., if you click the bert folder), you will see files like modeling_bert.py and tokenization_bert.py. But there are also models that are not supported by transformers, e.g. the mpt-7b-chat model. As you can see here, modeling_mpt.py lives on the Hugging Face Model Hub instead of in the transformers library. To use this kind of model, you need to set trust_remote_code to True when using the pipeline.

For the first type of model, there is almost no need to pass an extra tokenizer, because the pipeline can load it automatically, as you can see here. For the second type of model, we need to check whether the transformers pipeline will load the tokenizer or not. Therefore, I added trust_remote_code to the if statement.
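
To make that concrete, here is a small check of whether pipeline() could map a config to a tokenizer on its own (model ids are only examples, and the MPT result depends on the installed transformers version, as discussed further below):

from transformers import AutoConfig
from transformers.models.auto.tokenization_auto import TOKENIZER_MAPPING

# natively supported model: tokenization_bert.py ships with transformers
bert_config = AutoConfig.from_pretrained("bert-base-uncased")
print(type(bert_config) in TOKENIZER_MAPPING)  # True

# Hub-only model: the modeling/config code is downloaded from the model repo
mpt_config = AutoConfig.from_pretrained("mosaicml/mpt-7b-chat", trust_remote_code=True)
print(type(mpt_config) in TOKENIZER_MAPPING)  # False on transformers versions without native MPT support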

faaany avatar Jul 25 '23 07:07 faaany

@bogdankostic And yes! I need help with the unit tests. If you could point me to an example unit test or some pseudo code, that would be really awesome!
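
For what it's worth, a mocked unit test along those lines might look roughly like this; it exercises the hypothetical _prepare_tokenizer sketched above, and all names and import paths are illustrative assumptions, not the PR's actual tests:

from unittest.mock import MagicMock, patch

# assumed location of the helper; the import path is illustrative
from haystack.nodes.prompt.invocation_layer.hugging_face import _prepare_tokenizer


def test_prepare_tokenizer_loads_tokenizer_when_pipeline_would_not():
    with patch("transformers.AutoConfig.from_pretrained") as mock_config, patch(
        "transformers.AutoTokenizer.from_pretrained"
    ) as mock_tokenizer:
        # simulate a config whose class is not in TOKENIZER_MAPPING and that
        # does not declare a tokenizer_class
        mock_config.return_value = MagicMock(tokenizer_class=None)
        mock_tokenizer.return_value = MagicMock()

        tokenizer = _prepare_tokenizer("some-org/some-remote-code-model", trust_remote_code=True)

        # no weights are downloaded; the helper should have fallen back to
        # loading the tokenizer explicitly
        mock_tokenizer.assert_called_once()
        assert tokenizer is mock_tokenizer.return_value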

faaany avatar Jul 25 '23 07:07 faaany

@bogdankostic please note that there is a slight change in my proposal. My initial suggestion was to let users specify the tokenizer name in the YAML file. But later I found that this step can be skipped, because the tokenizer name is the same as the provided model name anyway. So the problem this PR tries to solve is enabling the current Haystack pipeline to support models that are not supported by the transformers library.

faaany avatar Jul 25 '23 07:07 faaany

Hey @faaany, I investigated a bit and found this transformers PR (https://github.com/huggingface/transformers/pull/24629) that adds support for MPT models which got merged just a couple of minutes ago. Given that transformers added support for MPT models, I don't think this PR is needed anymore, do you agree?

bogdankostic avatar Jul 25 '23 12:07 bogdankostic

@bogdankostic yes, for my use case I think the newly merged PR in transformers should fix the error. But let me try out the merged transformers library. I will give you an update tomorrow.

faaany avatar Jul 25 '23 12:07 faaany

But there will always be models that are not supported by the local HF APIs and need remote code execution. I think we still need these knobs here so that Haystack can always support the latest models that are not yet integrated into HF natively. @bogdankostic

yao-matrix avatar Jul 28 '23 05:07 yao-matrix

@bogdankostic done, do you know how to work around this release note failure? I just merged the latest code from main, and then I kept getting this release note error while running the automatic tests. Thanks so much for the unit test!

faaany avatar Jul 31 '23 11:07 faaany

Thanks for making the changes, @faaany!

About the failing release note check: We recently changed the way our release notes are generated. We are now using a tool called reno, which requires each PR to provide a short description for the release notes. You can find more details about this decision in #5408.

To fix the failing check, you need to generate a release note file. You can find instructions on how to do this in our contributing guidelines. Let me know if I should help you with that. :)

bogdankostic avatar Jul 31 '23 16:07 bogdankostic

Thanks so much! I updated my PR accordingly. Please have a look.

faaany avatar Aug 01 '23 02:08 faaany