llm-foundry
Remote JSONL IFT data
Support remote jsonl files for finetuning.
When I run this, the S3 download fails at 29% with:
Downloading ift/jsonl_test: 0%| | 0.00/906k [00:00<?, ?iB/s]
Downloading ift/jsonl_test: 29%|██▉ | 262k/906k [00:00<00:00, 3.01MiB/s]
Downloading ift/jsonl_test: 0%| | 0.00/906k [00:00<?, ?iB/s]
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/composer/utils/object_store/s3_object_store.py", line 120, in get_object_size
    obj = self.client.get_object(Bucket=self.bucket, Key=self.get_key(object_name))
  File "/usr/lib/python3/dist-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/lib/python3/dist-packages/botocore/client.py", line 964, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/composer/utils/file_helpers.py", line 504, in get_file
    return get_file(
  File "/usr/lib/python3/dist-packages/composer/utils/file_helpers.py", line 471, in get_file
    _get_file(
  File "/usr/lib/python3/dist-packages/composer/utils/file_helpers.py", line 526, in _get_file
    total_size_in_bytes = object_store.get_object_size(path)
  File "/usr/lib/python3/dist-packages/composer/utils/object_store/s3_object_store.py", line 122, in get_object_size
    _ensure_not_found_errors_are_wrapped(self.get_uri(object_name), e)
  File "/usr/lib/python3/dist-packages/composer/utils/object_store/s3_object_store.py", line 26, in _ensure_not_found_errors_are_wrapped
    raise FileNotFoundError(f'Object {uri} not found') from e
FileNotFoundError: Object s3://mosaicml-internal-checkpoints-shared/ift/jsonl_test.symlink not found
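One thing that stands out in the log: the progress bar names `ift/jsonl_test`, while the final error names a key ending in `.symlink`. When debugging, it may help to check both keys in the bucket. Below is a minimal sketch that splits the URI from the error into bucket and key and lists both candidates. The helper names `split_s3_uri` and `candidate_keys` are hypothetical and not part of Composer or llm-foundry.

```python
from __future__ import annotations

from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key). Hypothetical helper."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URI, got {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def candidate_keys(uri: str, suffix: str = ".symlink") -> list[str]:
    """Return the key named in the error, plus the same key without the suffix."""
    _, key = split_s3_uri(uri)
    keys = [key]
    if key.endswith(suffix):
        keys.append(key[: -len(suffix)])
    return keys


# URI copied from the FileNotFoundError above.
uri = "s3://mosaicml-internal-checkpoints-shared/ift/jsonl_test.symlink"
bucket, _ = split_s3_uri(uri)
for key in candidate_keys(uri):
    print(f"check: s3://{bucket}/{key}")
```

Each printed URI can then be checked by hand, for example with `aws s3 ls`.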
I think all that's left here, @vchiley, is a quick test with a remote JSONL file (e.g., upload the sample data to S3 and try pointing at it). Then we add a sentence to the finetuning instructions clarifying that, yes, you can point cfg.dataset.hf_name at a remote URI.
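For the docs sentence, something like the sketch below could illustrate the idea. Only the `hf_name` key comes from this thread. The bucket and path are placeholders, and `is_remote_uri` and its scheme list are hypothetical and may not match how llm-foundry actually detects remote paths.

```python
from urllib.parse import urlparse

# Assumed set of remote schemes, for illustration only.
REMOTE_SCHEMES = {"s3", "gs", "oci"}


def is_remote_uri(name: str) -> bool:
    """Return True if `name` looks like a remote object-store URI."""
    return urlparse(name).scheme in REMOTE_SCHEMES


# Placeholder bucket and path; substitute the location of your uploaded data.
dataset_cfg = {"hf_name": "s3://my-bucket/ift/train.jsonl"}

if is_remote_uri(dataset_cfg["hf_name"]):
    print("hf_name points at a remote URI:", dataset_cfg["hf_name"])
else:
    print("hf_name is a local path or hub dataset name:", dataset_cfg["hf_name"])
```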
https://wandb.ai/mosaic-ml/domain-adapt-remote-jsonl-ift/runs/jox5h54l shows training with this functionality
seems legit