Separate object construction from file reading
🚀 Feature
Tracking issue for this discussion. File reading should probably be handled in a classmethod so that files don't need to be written to disk just to construct certain objects.
One example would be CLIPTokenizer, which currently accepts file paths. Ideally a user could provide their own merges directly, but to keep the convenience of reading from a file, we could add a from_pretrained classmethod or similar:
tokenizer = CLIPTokenizer.from_pretrained(encoder_json_path, vocab_bpe_path)
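The split could look like the following minimal sketch. This is illustrative only, not the actual torchtext API: the class name `BPETokenizer` and its attributes are hypothetical, and the merges-file format (first line is a version header) mirrors the common BPE convention. The constructor takes only in-memory data; all file I/O lives in the classmethod.

```python
import json
from typing import Dict, List


class BPETokenizer:
    """Hypothetical sketch: construction is separated from file reading."""

    def __init__(self, encoder: Dict[str, int], merges: List[str]):
        # Pure construction from in-memory data -- no file access here.
        self.encoder = encoder
        self.merges = merges

    @classmethod
    def from_pretrained(cls, encoder_json_path: str, vocab_bpe_path: str) -> "BPETokenizer":
        # File reading is confined to this convenience constructor.
        with open(encoder_json_path, "r", encoding="utf-8") as f:
            encoder = json.load(f)
        with open(vocab_bpe_path, "r", encoding="utf-8") as f:
            # Skip the version header line, as is common for BPE merges files.
            merges = f.read().split("\n")[1:]
        return cls(encoder, merges)


# Users who already hold the data in memory never touch the filesystem:
tok = BPETokenizer(encoder={"a": 0}, merges=["a b"])
```

With this shape, the file-based path is still one call (`BPETokenizer.from_pretrained(...)`), while tests and remote-data workflows can construct the object directly.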
Thanks @erip for creating this issue. It would be great if you could also provide some mock code/proposals to elaborate on the idea and give direction to the discussion/requested feature :)
Ideally one could do something like this to avoid having to deal with differing file paths on different computers:
url_path = "..."  # Download URL for the BPE merges file (left elided here).
# Note: torch.hub.load_state_dict_from_url returns a state dict, which has no
# .read(); fetching the raw merges text needs a plain HTTP read instead:
import urllib.request
clip_merges = urllib.request.urlopen(url_path).read().decode("utf-8").split('\n')[1:]
# Proposed usage: pass the in-memory merges list instead of a file path.
clip_tokenizer = torchtext.transforms.CLIPTokenizer(merges_path=clip_merges, num_merges=49152 - 256 - 2 + 1)