llm-foundry

Remote JSONL IFT data

Open samhavens opened this issue 2 years ago • 1 comments

Support remote jsonl files for finetuning.

samhavens avatar Jun 02 '23 21:06 samhavens

When I run this, the S3 download fails at 29% with:

Downloading ift/jsonl_test:   0%|          | 0.00/906k [00:00<?, ?iB/s]
Downloading ift/jsonl_test:  29%|██▉       | 262k/906k [00:00<00:00, 3.01MiB/s]

Downloading ift/jsonl_test:   0%|          | 0.00/906k [00:00<?, ?iB/s]
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/composer/utils/object_store/s3_object_store.py", line 120, in get_object_size
    obj = self.client.get_object(Bucket=self.bucket, Key=self.get_key(object_name))
  File "/usr/lib/python3/dist-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/lib/python3/dist-packages/botocore/client.py", line 964, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/composer/utils/file_helpers.py", line 504, in get_file
    return get_file(
  File "/usr/lib/python3/dist-packages/composer/utils/file_helpers.py", line 471, in get_file
    _get_file(
  File "/usr/lib/python3/dist-packages/composer/utils/file_helpers.py", line 526, in _get_file
    total_size_in_bytes = object_store.get_object_size(path)
  File "/usr/lib/python3/dist-packages/composer/utils/object_store/s3_object_store.py", line 122, in get_object_size
    _ensure_not_found_errors_are_wrapped(self.get_uri(object_name), e)
  File "/usr/lib/python3/dist-packages/composer/utils/object_store/s3_object_store.py", line 26, in _ensure_not_found_errors_are_wrapped
    raise FileNotFoundError(f'Object {uri} not found') from e
FileNotFoundError: Object s3://mosaicml-internal-checkpoints-shared/ift/jsonl_test.symlink not found

samhavens avatar Jun 03 '23 07:06 samhavens

I think all that's left here, @vchiley, is a quick test with a remote JSON file (e.g. upload the sample data to S3 and try pointing at it). Then we add a sentence to the finetuning instructions clarifying that yes, you can point cfg.dataset.hf_name at a remote URI.
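
To make "point cfg.dataset.hf_name at a remote URI" concrete, here is a minimal sketch of how a value could be classified as a remote object-store URI versus a local path or hub dataset name. It is an illustration only, not the llm-foundry or Composer implementation: the helper name `split_remote_uri`, the set of schemes, and the bucket and key in the usage note are all assumptions.

```python
from urllib.parse import urlparse

# Assumed set of object-store schemes; the real supported list may differ.
REMOTE_SCHEMES = {"s3", "gs", "oci"}


def split_remote_uri(uri: str):
    """Hypothetical helper: return (scheme, bucket, key) for a remote URI.

    Returns None when the value has no recognized remote scheme, e.g. a
    local path or a hub-style dataset name.
    """
    parsed = urlparse(uri)
    if parsed.scheme not in REMOTE_SCHEMES:
        return None
    # netloc holds the bucket; the path carries a leading slash we drop.
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")
```

With a made-up value such as `s3://my-bucket/ift/jsonl_test`, the helper yields the bucket `my-bucket` and the key `ift/jsonl_test`, while a hub-style name such as `org/dataset` yields `None`. Checking that the resulting key actually exists in the bucket is a separate step, which is where the traceback above fails.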

abhi-mosaic avatar Jun 20 '23 14:06 abhi-mosaic

https://wandb.ai/mosaic-ml/domain-adapt-remote-jsonl-ift/runs/jox5h54l shows training with this functionality. [Screenshot 2023-06-21 at 10 13 03 PM] Seems legit.

vchiley avatar Jun 22 '23 05:06 vchiley