RAGatouille icon indicating copy to clipboard operation
RAGatouille copied to clipboard

Export to Huggingface

Open sutyum opened this issue 1 year ago • 7 comments

  • [x] Pull Index form Hub
  • [x] Push Index to Hub

Will close #37


Are there any other concerns this PR should look into. Waiting to understand how best to organise the repo in order to define the push to hub part.

sutyum avatar Jan 14 '24 20:01 sutyum

Hey!

Thanks for taking this on, very appreciated! I'll review the changes tomorrow.

I don't think there's any major concerns with this, it's very straightforward and (should) be plug&play with the existing methods.

I think pushing to the hub should be implemented as a utils function (similar to how we implemented pushing a model to the hub and converting to vespa format), but there should also be a method in RAGPretrainedModel to invoke it for the currently loaded index.

bclavie avatar Jan 14 '24 21:01 bclavie

Just a quick thought: I think loading from the hub could be implemented in from_index rather than being its own function: first check if the path exists locally, otherwise attempt downloading from the hub, and OSError if it's not found there either.

This follows the style of the huggingface .from_pretrained() loaders


Also, I think it may be good to allow the loading function to take a local_dir: Optional[str] = None argument, which if present we'd pass to hf_hub_download(), and if not present we'd set it to f".ragatouille/indexes/{NAME_OF_THE_REMOTE_INDEX}", ensuring the default behaviour is consistent with the rest of the lib.

bclavie avatar Jan 14 '24 21:01 bclavie

converting to vespa format: Not sure how this format works.

So is the idea to have the repo structured in the form of this sample directory?: https://github.com/vespa-engine/sample-apps/tree/master/colbert

Containing:

services.xml
ext/ <-- the contents of ./ragatouille/indexes/index-name goes here
 something.pt
 something-else.json
 etc.pt
schemas/
  doc.sd (same one as in the sample folder)

Is this useful broadly or something specific to the Vespa project?

sutyum avatar Jan 15 '24 14:01 sutyum

Hey, my bad -- I just wanted to cite the Vespa example as how I think export functions should be implemented: https://github.com/bclavie/RAGatouille/blob/main/ragatouille/models/utils.py

No need to match Vespa's schema at all, or even export the Vespa ONNX file here at all!

You can probably copy off the code from export_to_huggingface_hub(), remove the vespa related arguments/checks and make it export the content of an index folder rather than a model folder.

However, I think there are a handful of things to take into account here:

  • If using a model available on the huggingface hub, just pushing the index as-is is enough, because metadata.json logs the checkpoint used to create an index, and loading it will fetch the model with that name for the HF hub. E.g. if you create an index with colbert-ir/colbertv2.0, that'll be in your metadata.json, so it'll properly fetch the weights from HF if you don't have them locally.
  • But, this is not the case for any model not on the hub! I think for safety's sake, exporting an index should also involve exporting the associated model, maybe to a subfolder called model? This code should be easy to tweak to do it:
[...]
      colbert_model = ColBERT(
            colbert_path,
            colbert_config=colbert_config,
        )
        print(f"Model loaded... saving export files to disk at {export_path}/model")
        try:
            save_model = colbert_model.save
        except Exception:
            save_model = colbert_model.module.save
        save_model(str(Path(export_path) / model))
[...]

Then, before exporting the index, we'd need to update the Index's config (in metadata.json) to make sure checkpoint points to path/to/your/index/model, so it loads properly!

bclavie avatar Jan 15 '24 17:01 bclavie

Hey @sutyum, let me know if you need any assistance with this! The exporting is a bit tricky because it's basically the behaviour we have now, except we need to actively load the ColBERT instance from colbert.modeling.colbert so we can dump the model with the index, without overwriting the artifact metadata. (or rather, only overwriting what we need to for the index in some cases)

I think a structure like the one you used for your manual export, but reversed, could make sense? e.g. the main folder would be the index, and the subfolder /colbert_checkpoint/ would be where we dump the model, before uploading the whole thing to the HF hub? Happy to help with the implementation of this!

bclavie avatar Jan 25 '24 21:01 bclavie

Hey @sutyum, let me know if you need any assistance with this! The exporting is a bit tricky because it's basically the behaviour we have now, except we need to actively load the ColBERT instance from colbert.modeling.colbert so we can dump the model with the index, without overwriting the artifact metadata. (or rather, only overwriting what we need to for the index in some cases)

Would be great if could take a look and see if I am doing something wrong.

  • [ ] Need to figure out how to interface with the metadata.json in order to update the checkpoint. Couldn't find an equivalent of ColbertConfig which load it into a dict
  • [ ] Need to find out how to get the absolute path of the model snapshot stored by huggingface-cli
  • [ ] Need to manually test the various cases (existing model on hf + new index, fine-tuned model + new index)

I think a structure like the one you used for your manual export, but reversed, could make sense? e.g. the main folder would be the index, and the subfolder /colbert_checkpoint/ would be where we dump the model, before uploading the whole thing to the HF hub? Happy to help with the implementation of this!

If you try the addition to the 01-basic notebook, I have added a section for uploading and using index from hf. Currently it stores the model in /model/ and the index in repo root

sutyum avatar Jan 26 '24 06:01 sutyum

Hey! Overall I think it's looking like it's shaping up pretty nicely. I think we could probably merge export_to_huggingface_hub and upload_index_and_model into a single export_to_huggingface_hub function, that'd conditionally check/take-in an is_index: bool argument to decide which steps must be performed. Happy to take a stab at this in a bit!

Need to figure out how to interface with the metadata.json in order to update the checkpoint. Couldn't find an equivalent of ColbertConfig which load it into a dict

At its core it's just a normal json file, I think for this specific case there is no issue with loading it a json file (RAGatouille tends to use srsly for file ops, so srsly.read_json/srsly.write_json), modifying the attribute and saving it back to disk.

Need to find out how to get the absolute path of the model snapshot stored by huggingface-cli

I'm not sure I understand what you mean here... do you mean the path of a model downloaded from the hub?

Need to manually test the various cases (existing model on hf + new index, fine-tuned model + new index)

Yeah! Once this is done I think we'll be nearing this being releasable, which opens up a lot of niceties! (e.g. easily download a pre-indexed wikipedia for everyone to try out, etc) Thanks for taking this on 🙏

bclavie avatar Jan 26 '24 12:01 bclavie