Support Cloudflare R2 for object store check-pointing

Open vedantroy opened this issue 2 years ago • 7 comments

Cloudflare R2 is cheaper than S3 (no egress fees) and is very easy to use. It is fully S3 compatible, so it should be possible to just use the S3 object store logger, provided I can specify the Cloudflare endpoint.

vedantroy avatar Aug 22 '22 21:08 vedantroy

Hi @vedantroy -- In fact, you already can specify the endpoint URL for Cloudflare R2. From the comments in the source here:

endpoint_url (str, optional): The URL to an S3-Compatible object store. Must be specified if using something
            other than Amazon S3, like Google Cloud Storage. Defaults to None.

Using this option, I believe you can initialize your S3ObjectStore object like this:

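# Import path taken from the module shown in the tracebacks in this thread.
from composer.utils.object_store.s3_object_store import S3ObjectStore

# s3_bucket_name holds your R2 bucket name; replace <ACCOUNT_ID> with your Cloudflare account ID.
train_remote = S3ObjectStore(
    endpoint_url="https://<ACCOUNT_ID>.r2.cloudflarestorage.com",
    bucket=s3_bucket_name,
    prefix='/train',
)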

This is supported in the current version of Composer. Please give this a try and confirm whether it works for you.

If it does work, I'll be happy to keep an eye out for how we can make that information more discoverable in our documentation. Thanks in advance for your feedback!
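As a rough sketch of how the settings above could be assembled in one place: the helper names `r2_endpoint_url` and `r2_object_store_kwargs` are hypothetical (they are not part of Composer), and the alphanumeric check on the account ID is an assumption. Only the URL pattern and the `endpoint_url`, `bucket`, and `prefix` keys come from the snippet above.

```python
def r2_endpoint_url(account_id: str) -> str:
    """Build the R2 endpoint URL using the pattern quoted in this thread."""
    account_id = account_id.strip()
    # Assumption: account IDs are plain alphanumeric strings.
    if not account_id.isalnum():
        raise ValueError(f"unexpected account ID: {account_id!r}")
    return f"https://{account_id}.r2.cloudflarestorage.com"


def r2_object_store_kwargs(account_id: str, bucket: str, prefix: str = "") -> dict:
    """Collect the keyword arguments used in the S3ObjectStore example above."""
    return {
        "endpoint_url": r2_endpoint_url(account_id),
        "bucket": bucket,  # note: the key is `bucket`
        "prefix": prefix,
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(r2_object_store_kwargs("abc123", "my-checkpoints", "train"))
```

The resulting dict would then be unpacked into the object store constructor, e.g. `S3ObjectStore(**kwargs)`, assuming Composer is installed and credentials are configured.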

kobindra avatar Aug 22 '22 23:08 kobindra

@kobindra

contrastive_train-contrastive_train-1  | Traceback (most recent call last):
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "contrastive_train.py", line 63, in <module>
contrastive_train-contrastive_train-1  |     app()
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "contrastive_train.py", line 52, in train
contrastive_train-contrastive_train-1  |     run_trainer(
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "/app/util/run.py", line 19, in run_trainer
contrastive_train-contrastive_train-1  |     trainer = make_trainer(
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "/app/trainer/trainer.py", line 127, in make_trainer
contrastive_train-contrastive_train-1  |     return Trainer(
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/trainer/trainer.py", line 918, in __init__
contrastive_train-contrastive_train-1  |     self.engine.run_event(Event.INIT)
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/core/engine.py", line 235, in run_event
contrastive_train-contrastive_train-1  |     self._run_callbacks(event)
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/core/engine.py", line 414, in _run_callbacks
contrastive_train-contrastive_train-1  |     cb.run_event(event, self.state, self.logger)
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/core/callback.py", line 96, in run_event
contrastive_train-contrastive_train-1  |     return event_cb(state, logger)
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/loggers/object_store_logger.py", line 271, in init
contrastive_train-contrastive_train-1  |     retry(ObjectStoreTransientError,
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/utils/retrying.py", line 87, in new_func
contrastive_train-contrastive_train-1  |     return func(*args, **kwargs)
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/loggers/object_store_logger.py", line 272, in <lambda>
contrastive_train-contrastive_train-1  |     self.num_attempts)(lambda: _validate_credentials(self.object_store, object_name_to_test))()
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/loggers/object_store_logger.py", line 254, in object_store
contrastive_train-contrastive_train-1  |     self._object_store = _build_object_store(self.object_store_cls, self.object_store_kwargs)
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/loggers/object_store_logger.py", line 40, in _build_object_store
contrastive_train-contrastive_train-1  |     return object_store_cls(**object_store_kwargs)  # type: ignore
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  | TypeError: __init__() got an unexpected keyword argument 'bucket_name'

vedantroy avatar Aug 25 '22 19:08 vedantroy

Oh, it's bucket, not bucket_name.
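For anyone hitting the same TypeError: a mistyped key like this can be caught before the Trainer starts by comparing the kwargs against the constructor signature. This is only a sketch. `StandInObjectStore` is a hypothetical stand-in so the example runs without Composer, and `unexpected_kwargs` is an illustrative helper, not a Composer function.

```python
import inspect


class StandInObjectStore:
    """Hypothetical stand-in mirroring the keyword names used in this thread."""

    def __init__(self, bucket, prefix="", endpoint_url=None):
        self.bucket = bucket
        self.prefix = prefix
        self.endpoint_url = endpoint_url


def unexpected_kwargs(cls, kwargs):
    """Return the keys in kwargs that cls's constructor would reject."""
    params = inspect.signature(cls).parameters
    # A **kwargs parameter accepts any key, so nothing can be flagged.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []
    return sorted(key for key in kwargs if key not in params)


if __name__ == "__main__":
    bad = {"bucket_name": "my-checkpoints", "prefix": "train"}
    good = {"bucket": "my-checkpoints", "prefix": "train"}
    print(unexpected_kwargs(StandInObjectStore, bad))
    print(unexpected_kwargs(StandInObjectStore, good))
```

With the real class, the same check would presumably be `unexpected_kwargs(S3ObjectStore, object_store_kwargs)`.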

vedantroy avatar Aug 25 '22 19:08 vedantroy

@kobindra Is there a way to specify the folder name for the checkpoints? For example, I don't really want it to be "some random integer + a word"; I would much prefer to set the folder name to some meaningful ID.

vedantroy avatar Aug 25 '22 19:08 vedantroy

Hi @vedantroy, yes, you can provide the save_folder argument to the Trainer. For more details on checkpointing, see our guide here: https://docs.mosaicml.com/en/v0.9.0/trainer/checkpointing.html, and let me know if anything there is unclear or confusing!
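A minimal sketch of what that could look like. `save_folder` is the argument mentioned above; the `run_name` argument and the `{run_name}` format variable are assumptions here and should be checked against the linked guide for your Composer version. The helper `checkpoint_naming_kwargs` is hypothetical and only builds a dict, so it runs without Composer installed.

```python
def checkpoint_naming_kwargs(experiment_id: str) -> dict:
    """Build Trainer keyword arguments that give checkpoints a meaningful folder."""
    experiment_id = experiment_id.strip()
    if not experiment_id or "/" in experiment_id:
        raise ValueError(f"invalid experiment ID: {experiment_id!r}")
    return {
        # Assumption: Trainer accepts run_name and expands {run_name} in save_folder.
        "run_name": experiment_id,
        "save_folder": "{run_name}/checkpoints",
    }


if __name__ == "__main__":
    kwargs = checkpoint_naming_kwargs("contrastive-v1")
    # Shows the folder the format string would resolve to.
    print(kwargs["save_folder"].format(run_name=kwargs["run_name"]))
```

These would then be passed along with the other arguments, e.g. `Trainer(..., **checkpoint_naming_kwargs("contrastive-v1"))`.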

hanlint avatar Aug 25 '22 19:08 hanlint

Doesn't work, see:

contrastive_train-contrastive_train-1  | Traceback (most recent call last):
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/boto3/s3/transfer.py", line 288, in upload_file
contrastive_train-contrastive_train-1  |     future.result()
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/s3transfer/futures.py", line 103, in result
contrastive_train-contrastive_train-1  |     return self._coordinator.result()
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/s3transfer/futures.py", line 266, in result
contrastive_train-contrastive_train-1  |     raise self._exception
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/httpsession.py", line 448, in send
contrastive_train-contrastive_train-1  |     urllib_response = conn.urlopen(
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
contrastive_train-contrastive_train-1  |     httplib_response = self._make_request(
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/urllib3/connectionpool.py", line 398, in _make_request
contrastive_train-contrastive_train-1  |     conn.request(method, url, **httplib_request_kw)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/urllib3/connection.py", line 239, in request
contrastive_train-contrastive_train-1  |     super(HTTPConnection, self).request(method, url, body=body, headers=headers)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 1256, in request
contrastive_train-contrastive_train-1  |     self._send_request(method, url, body, headers, encode_chunked)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 94, in _send_request
contrastive_train-contrastive_train-1  |     rval = super()._send_request(
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 1302, in _send_request
contrastive_train-contrastive_train-1  |     self.endheaders(body, encode_chunked=encode_chunked)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 1251, in endheaders
contrastive_train-contrastive_train-1  |     self._send_output(message_body, encode_chunked=encode_chunked)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 130, in _send_output
contrastive_train-contrastive_train-1  |     self._handle_expect_response(message_body)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
contrastive_train-contrastive_train-1  |     self._send_message_body(message_body)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
contrastive_train-contrastive_train-1  |     self.send(message_body)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 218, in send
contrastive_train-contrastive_train-1  |     return super().send(str)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 964, in send
contrastive_train-contrastive_train-1  |     datablock = data.read(self.blocksize)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/s3transfer/utils.py", line 511, in read
contrastive_train-contrastive_train-1  |     data = self._fileobj.read(amount_to_read)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/s3transfer/upload.py", line 90, in read
contrastive_train-contrastive_train-1  |     raise self._transfer_coordinator.exception
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/httpsession.py", line 448, in send
contrastive_train-contrastive_train-1  |     urllib_response = conn.urlopen(
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
contrastive_train-contrastive_train-1  |     httplib_response = self._make_request(
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/urllib3/connectionpool.py", line 398, in _make_request
contrastive_train-contrastive_train-1  |     conn.request(method, url, **httplib_request_kw)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/urllib3/connection.py", line 239, in request
contrastive_train-contrastive_train-1  |     super(HTTPConnection, self).request(method, url, body=body, headers=headers)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 1256, in request
contrastive_train-contrastive_train-1  |     self._send_request(method, url, body, headers, encode_chunked)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 94, in _send_request
contrastive_train-contrastive_train-1  |     rval = super()._send_request(
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 1302, in _send_request
contrastive_train-contrastive_train-1  |     self.endheaders(body, encode_chunked=encode_chunked)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 1251, in endheaders
contrastive_train-contrastive_train-1  |     self._send_output(message_body, encode_chunked=encode_chunked)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 130, in _send_output
contrastive_train-contrastive_train-1  |     self._handle_expect_response(message_body)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
contrastive_train-contrastive_train-1  |     self._send_message_body(message_body)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
contrastive_train-contrastive_train-1  |     self.send(message_body)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 218, in send
contrastive_train-contrastive_train-1  |     return super().send(str)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 964, in send
contrastive_train-contrastive_train-1  |     datablock = data.read(self.blocksize)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/s3transfer/utils.py", line 511, in read
contrastive_train-contrastive_train-1  |     data = self._fileobj.read(amount_to_read)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/s3transfer/upload.py", line 90, in read
contrastive_train-contrastive_train-1  |     raise self._transfer_coordinator.exception
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/httpsession.py", line 448, in send
contrastive_train-contrastive_train-1  |     urllib_response = conn.urlopen(
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
contrastive_train-contrastive_train-1  |     httplib_response = self._make_request(
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/urllib3/connectionpool.py", line 398, in _make_request
contrastive_train-contrastive_train-1  |     conn.request(method, url, **httplib_request_kw)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/urllib3/connection.py", line 239, in request
contrastive_train-contrastive_train-1  |     super(HTTPConnection, self).request(method, url, body=body, headers=headers)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 1256, in request
contrastive_train-contrastive_train-1  |     self._send_request(method, url, body, headers, encode_chunked)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 94, in _send_request
contrastive_train-contrastive_train-1  |     rval = super()._send_request(
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 1302, in _send_request
contrastive_train-contrastive_train-1  |     self.endheaders(body, encode_chunked=encode_chunked)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 1251, in endheaders
contrastive_train-contrastive_train-1  |     self._send_output(message_body, encode_chunked=encode_chunked)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 130, in _send_output
contrastive_train-contrastive_train-1  |     self._handle_expect_response(message_body)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
contrastive_train-contrastive_train-1  |     self._send_message_body(message_body)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
contrastive_train-contrastive_train-1  |     self.send(message_body)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 218, in send
contrastive_train-contrastive_train-1  |     return super().send(str)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 964, in send
contrastive_train-contrastive_train-1  |     datablock = data.read(self.blocksize)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/s3transfer/utils.py", line 511, in read
contrastive_train-contrastive_train-1  |     data = self._fileobj.read(amount_to_read)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/s3transfer/upload.py", line 90, in read
contrastive_train-contrastive_train-1  |     raise self._transfer_coordinator.exception
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/httpsession.py", line 448, in send
contrastive_train-contrastive_train-1  |     urllib_response = conn.urlopen(
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
contrastive_train-contrastive_train-1  |     httplib_response = self._make_request(
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/urllib3/connectionpool.py", line 398, in _make_request
contrastive_train-contrastive_train-1  |     conn.request(method, url, **httplib_request_kw)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/urllib3/connection.py", line 239, in request
contrastive_train-contrastive_train-1  |     super(HTTPConnection, self).request(method, url, body=body, headers=headers)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 1256, in request
contrastive_train-contrastive_train-1  |     self._send_request(method, url, body, headers, encode_chunked)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 94, in _send_request
contrastive_train-contrastive_train-1  |     rval = super()._send_request(
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 1302, in _send_request
contrastive_train-contrastive_train-1  |     self.endheaders(body, encode_chunked=encode_chunked)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 1251, in endheaders
contrastive_train-contrastive_train-1  |     self._send_output(message_body, encode_chunked=encode_chunked)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 130, in _send_output
contrastive_train-contrastive_train-1  |     self._handle_expect_response(message_body)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
contrastive_train-contrastive_train-1  |     self._send_message_body(message_body)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
contrastive_train-contrastive_train-1  |     self.send(message_body)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 218, in send
contrastive_train-contrastive_train-1  |     return super().send(str)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 964, in send
contrastive_train-contrastive_train-1  |     datablock = data.read(self.blocksize)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/s3transfer/utils.py", line 511, in read
contrastive_train-contrastive_train-1  |     data = self._fileobj.read(amount_to_read)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/s3transfer/upload.py", line 90, in read
contrastive_train-contrastive_train-1  |     raise self._transfer_coordinator.exception
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/httpsession.py", line 448, in send
contrastive_train-contrastive_train-1  |     urllib_response = conn.urlopen(
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/urllib3/connectionpool.py", line 703, in urlopen
contrastive_train-contrastive_train-1  |     httplib_response = self._make_request(
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/urllib3/connectionpool.py", line 398, in _make_request
contrastive_train-contrastive_train-1  |     conn.request(method, url, **httplib_request_kw)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/urllib3/connection.py", line 239, in request
contrastive_train-contrastive_train-1  |     super(HTTPConnection, self).request(method, url, body=body, headers=headers)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 1256, in request
contrastive_train-contrastive_train-1  |     self._send_request(method, url, body, headers, encode_chunked)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 94, in _send_request
contrastive_train-contrastive_train-1  |     rval = super()._send_request(
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 1302, in _send_request
contrastive_train-contrastive_train-1  |     self.endheaders(body, encode_chunked=encode_chunked)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 1251, in endheaders
contrastive_train-contrastive_train-1  |     self._send_output(message_body, encode_chunked=encode_chunked)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 130, in _send_output
contrastive_train-contrastive_train-1  |     self._handle_expect_response(message_body)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 176, in _handle_expect_response
contrastive_train-contrastive_train-1  |     self._send_message_body(message_body)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 209, in _send_message_body
contrastive_train-contrastive_train-1  |     self.send(message_body)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/awsrequest.py", line 218, in send
contrastive_train-contrastive_train-1  |     return super().send(str)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/http/client.py", line 964, in send
contrastive_train-contrastive_train-1  |     datablock = data.read(self.blocksize)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/s3transfer/utils.py", line 511, in read
contrastive_train-contrastive_train-1  |     data = self._fileobj.read(amount_to_read)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/s3transfer/upload.py", line 90, in read
contrastive_train-contrastive_train-1  |     raise self._transfer_coordinator.exception
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/s3transfer/tasks.py", line 139, in __call__
contrastive_train-contrastive_train-1  |     return self._execute_main(kwargs)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/s3transfer/tasks.py", line 162, in _execute_main
contrastive_train-contrastive_train-1  |     return_value = self._main(**kwargs)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/s3transfer/upload.py", line 787, in _main
contrastive_train-contrastive_train-1  |     response = client.upload_part(
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/client.py", line 512, in _api_call
contrastive_train-contrastive_train-1  |     return self._make_api_call(operation_name, kwargs)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/botocore/client.py", line 919, in _make_api_call
contrastive_train-contrastive_train-1  |     raise error_class(parsed_response, operation_name)
contrastive_train-contrastive_train-1  | botocore.exceptions.ClientError: An error occurred (ServiceUnavailable) when calling the UploadPart operation (reached max retries: 4): Reduce your concurrent request rate for the same object.
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  | During handling of the above exception, another exception occurred:
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  | Traceback (most recent call last):
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
contrastive_train-contrastive_train-1  |     self.run()
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/multiprocessing/process.py", line 108, in run
contrastive_train-contrastive_train-1  |     self._target(*self._args, **self._kwargs)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/loggers/object_store_logger.py", line 534, in _upload_worker
contrastive_train-contrastive_train-1  |     upload_file()
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/utils/retrying.py", line 87, in new_func
contrastive_train-contrastive_train-1  |     return func(*args, **kwargs)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/loggers/object_store_logger.py", line 526, in upload_file
contrastive_train-contrastive_train-1  |     object_store.upload_object(
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/utils/object_store/s3_object_store.py", line 130, in upload_object
contrastive_train-contrastive_train-1  |     self.client.upload_file(Bucket=self.bucket,
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/boto3/s3/inject.py", line 143, in upload_file
contrastive_train-contrastive_train-1  |     return transfer.upload_file(
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/boto3/s3/transfer.py", line 294, in upload_file
contrastive_train-contrastive_train-1  |     raise S3UploadFailedError(
contrastive_train-contrastive_train-1  | boto3.exceptions.S3UploadFailedError: Failed to upload /tmp/tmp03iu288a/8705ce1f-8ffd-4ecc-8724-2f5dfe7f2913 to checkpoints/train/1661480783-olive-dragonfly/checkpoints/ep0-ba40-rank0: An error occurred (ServiceUnavailable) when calling the UploadPart operation (reached max retries: 4): Reduce your concurrent request rate for the same object.
contrastive_train-contrastive_train-1  |                Traceback (most recent call last):
contrastive_train-contrastive_train-1  |   File "contrastive_train.py", line 62, in <module>
contrastive_train-contrastive_train-1  |     app()
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/typer/main.py", line 328, in __call__
contrastive_train-contrastive_train-1  |     raise e
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/typer/main.py", line 311, in __call__
contrastive_train-contrastive_train-1  |     return get_command(self)(*args, **kwargs)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/click/core.py", line 1130, in __call__
contrastive_train-contrastive_train-1  |     return self.main(*args, **kwargs)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/typer/core.py", line 778, in main
contrastive_train-contrastive_train-1  |     return _main(
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/typer/core.py", line 216, in _main
contrastive_train-contrastive_train-1  |     rv = self.invoke(ctx)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/click/core.py", line 1657, in invoke
contrastive_train-contrastive_train-1  |     return _process_result(sub_ctx.command.invoke(sub_ctx))
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
contrastive_train-contrastive_train-1  |     return ctx.invoke(self.callback, **ctx.params)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/click/core.py", line 760, in invoke
contrastive_train-contrastive_train-1  |     return __callback(*args, **kwargs)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/typer/main.py", line 683, in wrapper
contrastive_train-contrastive_train-1  |     return callback(**use_params)  # type: ignore
contrastive_train-contrastive_train-1  |   File "contrastive_train.py", line 51, in train
contrastive_train-contrastive_train-1  |     run_trainer(
contrastive_train-contrastive_train-1  |   File "/app/util/run.py", line 26, in run_trainer
contrastive_train-contrastive_train-1  |     trainer.fit()
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/trainer/trainer.py", line 1386, in fit
contrastive_train-contrastive_train-1  |     self._train_loop()
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/trainer/trainer.py", line 1600, in _train_loop
contrastive_train-contrastive_train-1  |     self.engine.run_event(Event.BATCH_END)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/core/engine.py", line 239, in run_event
contrastive_train-contrastive_train-1  |     self._run_callbacks(event)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/core/engine.py", line 414, in _run_callbacks
contrastive_train-contrastive_train-1  |     cb.run_event(event, self.state, self.logger)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/core/callback.py", line 96, in run_event
contrastive_train-contrastive_train-1  |     return event_cb(state, logger)
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/loggers/object_store_logger.py", line 294, in batch_end
contrastive_train-contrastive_train-1  |     self._check_workers()
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/loggers/object_store_logger.py", line 311, in _check_workers
contrastive_train-contrastive_train-1  |     raise RuntimeError('Upload worker crashed. Please check the logs.')
contrastive_train-contrastive_train-1  | RuntimeError: Upload worker crashed. Please check the logs.
contrastive_train-contrastive_train-1  | Traceback (most recent call last):
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "contrastive_train.py", line 62, in <module>
contrastive_train-contrastive_train-1  |     app()
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "contrastive_train.py", line 51, in train
contrastive_train-contrastive_train-1  |     run_trainer(
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "/app/util/run.py", line 26, in run_trainer
contrastive_train-contrastive_train-1  |     trainer.fit()
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/trainer/trainer.py", line 1386, in fit
contrastive_train-contrastive_train-1  |     self._train_loop()
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/trainer/trainer.py", line 1600, in _train_loop
contrastive_train-contrastive_train-1  |     self.engine.run_event(Event.BATCH_END)
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/core/engine.py", line 239, in run_event
contrastive_train-contrastive_train-1  |     self._run_callbacks(event)
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/core/engine.py", line 414, in _run_callbacks
contrastive_train-contrastive_train-1  |     cb.run_event(event, self.state, self.logger)
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/core/callback.py", line 96, in run_event
contrastive_train-contrastive_train-1  |     return event_cb(state, logger)
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/loggers/object_store_logger.py", line 294, in batch_end
contrastive_train-contrastive_train-1  |     self._check_workers()
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |   File "/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/loggers/object_store_logger.py", line 311, in _check_workers
contrastive_train-contrastive_train-1  |     raise RuntimeError('Upload worker crashed. Please check the logs.')
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  | RuntimeError: Upload worker crashed. Please check the logs.
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  | wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
contrastive_train-contrastive_train-1  | wandb: 
contrastive_train-contrastive_train-1  | wandb: Run history:
contrastive_train-contrastive_train-1  | wandb:               epoch ▁
contrastive_train-contrastive_train-1  | wandb:          loss/train ▅▁▅█▅▆▆▅▆▅▆▆▆▆▆▆▅▅▇▅▆▆▆▆▆▆▆▆▆▆▆▆▅▆▆▅▆▅▅▅
contrastive_train-contrastive_train-1  | wandb:     lr-AdamW/group0 ▁▂▃▄▆▆██████████████████████████████████
contrastive_train-contrastive_train-1  | wandb:      rank_zero_seed ▁
contrastive_train-contrastive_train-1  | wandb:         temperature ▁▁
contrastive_train-contrastive_train-1  | wandb:   trainer/batch_idx ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████
contrastive_train-contrastive_train-1  | wandb: trainer/global_step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████
contrastive_train-contrastive_train-1  | wandb:  trainer/grad_accum ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
contrastive_train-contrastive_train-1  | wandb:    wall_clock/total ▁▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▆▆▆▆▆▆▇▇▇▇▇▇█████
contrastive_train-contrastive_train-1  | wandb:    wall_clock/train ▁▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▆▆▆▆▆▆▇▇▇▇▇▇█████
contrastive_train-contrastive_train-1  | wandb:      wall_clock/val ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
contrastive_train-contrastive_train-1  | wandb: 
contrastive_train-contrastive_train-1  | wandb: Run summary:
contrastive_train-contrastive_train-1  | wandb:               epoch 0
contrastive_train-contrastive_train-1  | wandb:          loss/train 3.95309
contrastive_train-contrastive_train-1  | wandb:     lr-AdamW/group0 0.0005
contrastive_train-contrastive_train-1  | wandb:      rank_zero_seed 1971488380
contrastive_train-contrastive_train-1  | wandb:         temperature 0.07
contrastive_train-contrastive_train-1  | wandb:   trainer/batch_idx 62
contrastive_train-contrastive_train-1  | wandb: trainer/global_step 62
contrastive_train-contrastive_train-1  | wandb:  trainer/grad_accum 1
contrastive_train-contrastive_train-1  | wandb:    wall_clock/total 101.53677
contrastive_train-contrastive_train-1  | wandb:    wall_clock/train 101.53677
contrastive_train-contrastive_train-1  | wandb:      wall_clock/val 0.0
contrastive_train-contrastive_train-1  | wandb: 
contrastive_train-contrastive_train-1  | wandb: Synced 1661480783-olive-dragonfly: https://wandb.ai/vroomerify/app/runs/3qptaxw8
contrastive_train-contrastive_train-1  | wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
contrastive_train-contrastive_train-1  | wandb: Find logs at: ./wandb/run-20220826_022624-3qptaxw8/logs
contrastive_train-contrastive_train-1  | 
contrastive_train-contrastive_train-1  |                    | 62/3600 [02:04<42:06,  1.40ba/s, loss/train=3.9531]         
/root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/loggers/object_store_logger.py:436: RuntimeWarning: The following objects may not have been uploaded, likely due to a worker crash: 1661480783-olive-dragonfly/checkpoints/ep0-ba40-rank0
contrastive_train-contrastive_train-1  |   warnings.warn(
contrastive_train-contrastive_train-1  | /root/miniconda3/envs/video-rec/lib/python3.8/site-packages/composer/cli/launcher.py:223: UserWarning: AutoSelectPortWarning: The distributed key-value port was auto-selected. This may lead to race conditions when launching multiple training processes simultaneously. To eliminate this race condition, explicitly specify a port with --master_port PORT_NUMBER
contrastive_train-contrastive_train-1  |   warnings.warn('AutoSelectPortWarning: The distributed key-value port was auto-selected. '
contrastive_train-contrastive_train-1  | ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
contrastive_train-contrastive_train-1  | ERROR:composer.cli.launcher:Global rank 0 (PID 94) exited with code 1
contrastive_train-contrastive_train-1  | Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
contrastive_train-contrastive_train-1  | Global rank 0 (PID 94) exited with code 1
contrastive_train-contrastive_train-1  | ERROR conda.cli.main_run:execute(41): `conda run composer -n 2 contrastive_train.py train /config.yml /data` failed. (See above for error)

vedantroy avatar Aug 26 '22 02:08 vedantroy

Setting num_concurrent_uploads=1 doesn't help

vedantroy avatar Aug 26 '22 02:08 vedantroy

Hey @vedantroy, we're looking into adding Cloudflare R2 as an ObjectStore. We will let you know when we have added it!

eracah avatar Oct 25 '22 17:10 eracah

hey @vedantroy, Cloudflare R2 is supported for checkpointing now. We added the support in https://github.com/mosaicml/composer/pull/2215 and https://github.com/mosaicml/composer/pull/1915

The docs are here

Thanks for your suggestion! Let us know if you have any questions!

eracah avatar May 26 '23 23:05 eracah
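
For anyone who hits the same "Upload worker crashed" error, one rough way to narrow it down is to talk to R2 directly through boto3, outside of Composer, using the endpoint format given earlier in this thread. The sketch below is only that: it is not how Composer implements its R2 support, the helper names and the `R2_*` environment variables are hypothetical, and it assumes boto3 is installed and that the bucket already exists.

```python
import os


def r2_endpoint_url(account_id: str) -> str:
    """Build the S3-compatible endpoint URL for a Cloudflare account ID."""
    return f"https://{account_id}.r2.cloudflarestorage.com"


def r2_client_kwargs(account_id: str, access_key_id: str, secret_access_key: str) -> dict:
    """Keyword arguments for boto3.client('s3', ...) pointed at R2."""
    return {
        "endpoint_url": r2_endpoint_url(account_id),
        "aws_access_key_id": access_key_id,
        "aws_secret_access_key": secret_access_key,
        "region_name": "auto",  # R2 accepts the placeholder region "auto"
    }


def smoke_test_upload(bucket: str, key: str, **client_kwargs) -> int:
    """Upload a tiny object, read back its size, then delete it."""
    import boto3  # imported lazily so the helpers above work without boto3

    client = boto3.client("s3", **client_kwargs)
    payload = b"composer-r2-smoke-test"
    client.put_object(Bucket=bucket, Key=key, Body=payload)
    size = client.head_object(Bucket=bucket, Key=key)["ContentLength"]
    client.delete_object(Bucket=bucket, Key=key)
    return size


# Hypothetical environment variable names; only runs when they are set.
if __name__ == "__main__" and "R2_ACCOUNT_ID" in os.environ:
    kwargs = r2_client_kwargs(
        os.environ["R2_ACCOUNT_ID"],
        os.environ["R2_ACCESS_KEY_ID"],
        os.environ["R2_SECRET_ACCESS_KEY"],
    )
    print(smoke_test_upload(os.environ["R2_BUCKET"], "smoke-test/hello.txt", **kwargs))
```

If the upload fails here, the exception raised by boto3 (bad credentials, missing bucket, wrong endpoint) is likely to be more specific than the worker-crash message in the traceback above. If it succeeds, the problem is more likely in how the object store is configured inside the training job.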