`dvc push` fails with "Timeout on reading data from socket" with HTTP storage
Bug Report
Description
When trying to push some moderately large amount of data (e.g. 1 GB) to a GitLab generic packages repository (via the HTTP remote storage type), the upload fails after a few (e.g. 5) minutes with the following timeout error:
$ dvc push -v
2022-07-28 17:33:07,919 DEBUG: Preparing to transfer data from '/home/user/dev/dvc-test/.dvc/cache' to 'https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data'
2022-07-28 17:33:07,919 DEBUG: Preparing to collect status from 'https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data'
2022-07-28 17:33:07,919 DEBUG: Collecting status from 'https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data'
2022-07-28 17:33:07,920 DEBUG: Querying 2 hashes via object_exists
2022-07-28 17:33:08,249 DEBUG: Preparing to collect status from '/home/user/dev/dvc-test/.dvc/cache'
2022-07-28 17:33:08,250 DEBUG: Collecting status from '/home/user/dev/dvc-test/.dvc/cache'
2022-07-28 17:33:08,331 DEBUG: Uploading '/home/user/dev/dvc-test/.dvc/cache/1c/90d4b78d043e2a58e0746231c9024f' to 'https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/1c/90d4b78d043e2a58e0746231c9024f'
2022-07-28 17:38:09,183 ERROR: failed to transfer 'md5: 1c90d4b78d043e2a58e0746231c9024f' - : Timeout on reading data from socket
------------------------------------------------------------
Traceback (most recent call last):
File "<site-packages>/fsspec/asyn.py", line 25, in _runner
result[0] = await coro
File "<site-packages>/fsspec/implementations/http.py", line 294, in _put_file
async with meth(rpath, data=gen_chunks(), **kw) as resp:
File "<site-packages>/aiohttp_retry/client.py", line 98, in __aenter__
return await self._do_request()
File "<site-packages>/aiohttp_retry/client.py", line 86, in _do_request
raise e
File "<site-packages>/aiohttp_retry/client.py", line 71, in _do_request
response: ClientResponse = await self._request(
File "<site-packages>/aiohttp/client.py", line 559, in _request
await resp.start(conn)
File "<site-packages>/aiohttp/client_reqrep.py", line 898, in start
message, payload = await protocol.read() # type: ignore[union-attr]
File "<site-packages>/aiohttp/streams.py", line 616, in read
await self._waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<site-packages>/dvc/data/transfer.py", line 25, in wrapper
func(fs_path, *args, **kwargs)
File "<site-packages>/dvc/data/transfer.py", line 162, in func
return dest.add(
File "<site-packages>/dvc/objects/db.py", line 126, in add
self._add_file(
File "<site-packages>/dvc/objects/db.py", line 98, in _add_file
return fs.utils.transfer(
File "<site-packages>/dvc/fs/utils.py", line 96, in transfer
_try_links(links, from_fs, from_path, to_fs, to_path)
File "<site-packages>/dvc/fs/utils.py", line 66, in _try_links
return _copy(from_fs, from_path, to_fs, to_path)
File "<site-packages>/dvc/fs/utils.py", line 44, in _copy
return to_fs.upload(from_path, to_path)
File "<site-packages>/dvc/fs/base.py", line 386, in upload
return self.put_file(from_info, to_info, callback=cb, size=size)
File "<site-packages>/dvc/fs/http.py", line 151, in put_file
super().put_file(
File "<site-packages>/dvc/fs/base.py", line 329, in put_file
self.fs.put_file(
File "<site-packages>/fsspec/asyn.py", line 85, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "<site-packages>/fsspec/asyn.py", line 63, in sync
raise FSTimeoutError from return_result
fsspec.exceptions.FSTimeoutError
------------------------------------------------------------
2022-07-28 17:38:09,197 ERROR: failed to push data to the cloud - 1 files failed to upload
------------------------------------------------------------
Traceback (most recent call last):
File "<site-packages>/dvc/repo/push.py", line 68, in push
pushed += self.cloud.push(
File "<site-packages>/dvc/data_cloud.py", line 85, in push
return transfer(
File "<site-packages>/dvc/data/transfer.py", line 176, in transfer
_do_transfer(
File "<site-packages>/dvc/data/transfer.py", line 116, in _do_transfer
raise FileTransferError(total_fails)
dvc.exceptions.FileTransferError: 1 files failed to transfer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<site-packages>/dvc/commands/data_sync.py", line 58, in run
processed_files_count = self.repo.push(
File "<site-packages>/dvc/repo/__init__.py", line 48, in wrapper
return f(repo, *args, **kwargs)
File "<site-packages>/dvc/repo/push.py", line 72, in push
raise UploadError(exc.amount)
dvc.exceptions.UploadError: 1 files failed to upload
Reproduce
- Create a new project on gitlab.com and take a note of the project ID.
- Create a deploy token under Settings > Repository > Deploy tokens with scope write_package_registry.
- Clone the project from gitlab.com locally and change the current working directory to the cloned project.
- Initialize DVC and configure the remote storage:
dvc init
dvc remote add -d gitlab https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data
dvc remote modify gitlab method PUT
dvc remote modify gitlab auth custom
dvc remote modify --local gitlab custom_auth_header 'DEPLOY-TOKEN'
dvc remote modify --local gitlab password <deploy_token>
- Create some data and track it with DVC:
mkdir data
head -c 1G /dev/urandom > data/data.txt
dvc add data/data.txt
- Push the data to the DVC remote storage:
dvc push -v
Expected
Successful upload of the data.
When I upload the same data to the GitLab generic packages repository via cURL, it works fine:
curl \
--header "DEPLOY-TOKEN: <deploy_token>" \
--upload-file data/data.txt \
https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 2.10.2 (snap)
---------------------------------
Platform: Python 3.8.10 on Linux-5.13.0-48-generic-x86_64-with-glibc2.29
Supports:
azure (adlfs = 2022.4.0, knack = 0.9.0, azure-identity = 1.10.0),
gdrive (pydrive2 = 1.10.1),
gs (gcsfs = 2022.3.0),
webhdfs (fsspec = 2022.3.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
s3 (s3fs = 2022.3.0, boto3 = 1.21.21),
ssh (sshfs = 2022.3.1)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/vgubuntu-root
Caches: local
Remotes: https, https
Workspace directory: ext4 on /dev/mapper/vgubuntu-root
Repo: dvc, git
Additional Information (if any):
n/a
@sisp, the GitLab generic packages repository is new to me; does it support uploads to an arbitrary prefix?
Also, would it be possible to try uploading with fsspec's HTTPFileSystem and see if it works? Thanks.
Thanks for taking the time to reply to my issue, @skshetry! :pray:
[...] does it support uploads to arbitrary prefix?
You've actually brought up a limitation of GitLab's generic packages repository for usage as a DVC storage backend that I hadn't realized at first. The generic packages repository - from a general point of view - has exactly 3 nesting levels:
- package name
- package version
- file name
That's because the URL path is structured like this:
/projects/<project_id>/packages/generic/<package_name>/<package_version>/<file_name>
When I use it as a DVC data storage (i.e. not run-cache storage, I'll get to that in a second), I define those 3 nesting levels as follows (according to the cache directory structure spec):
- package name -> dvc-data (hardcoded, arbitrarily chosen name)
- package version -> first 2 characters of the file hash
- file name -> remaining characters of the file hash
When you asked about an "arbitrary prefix", I stumbled over the run-cache structure spec and realized the run-cache may also be pushed, and it seems to have more nesting levels. So this might need some thinking - possibly a topic for a dedicated issue on using GitLab's generic packages repository as a storage type, including why I think there would be some nice benefits if this were officially supported.
Also, would it be possible to try uploading with fsspec's HTTPFileSystem and see if it works?
It doesn't work either. Here's the code snippet for reproduction:
import fsspec
fs = fsspec.filesystem("http", headers={"DEPLOY-TOKEN": "<token>"})
with open("data/data.txt", "rb") as src:
fs.put_file(src, "https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt", method="PUT")
Running it gives the following error after (in my case) ~5 min:
$ python upload.py
Traceback (most recent call last):
File "<site-packages>/fsspec/asyn.py", line 53, in _runner
result[0] = await coro
File "<site-packages>/fsspec/implementations/http.py", line 291, in _put_file
async with meth(rpath, data=gen_chunks(), **kw) as resp:
File "<site-packages>/aiohttp/client.py", line 1138, in __aenter__
self._resp = await self._coro
File "<site-packages>/aiohttp/client.py", line 559, in _request
await resp.start(conn)
File "<site-packages>/aiohttp/client_reqrep.py", line 913, in start
self._continue = None
File "<site-packages>/aiohttp/helpers.py", line 721, in __exit__
raise asyncio.TimeoutError from None
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/user/dev/dvc/upload.py", line 7, in <module>
fs.put_file(src, "https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt", method="PUT")
File "<site-packages>/fsspec/asyn.py", line 111, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "<site-packages>/fsspec/asyn.py", line 94, in sync
raise FSTimeoutError from return_result
fsspec.exceptions.FSTimeoutError
@skshetry I've been doing some digging and think the error originates from the aiohttp client timeout configuration in dvc-objects, where the sock_read=self.REQUEST_TIMEOUT setting seems to cause the timeout after exactly 5 minutes. Although self.REQUEST_TIMEOUT is set to 60 seconds, I suspect the 5-minute timeout results from retries, but I'm not 100% sure. In any case, when I set sock_read=None, the 1 GB test data upload finishes successfully and no longer times out after 5 minutes. So empirically it appears to fix the problem. From my understanding, sock_read specifies how long the client waits to read data from the server, which may take a while when uploading a large data file. So it makes sense to me to set sock_read=None to disable that timeout.
What do you think? And can you confirm that my understanding of the internals is correct? If that's the cause and sock_read=None is a valid fix, I'd be happy to send a PR to dvc-objects.
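For context, here is a minimal sketch of the aiohttp timeout configuration I have in mind (REQUEST_TIMEOUT mirrors the dvc-objects setting discussed above; this is illustrative only, not the actual dvc-objects code):
import aiohttp

REQUEST_TIMEOUT = 60  # seconds; mirrors the existing per-request timeout

# Current behaviour (simplified): sock_read aborts the request when no data has been
# read from the server for REQUEST_TIMEOUT seconds, which seems to be what happens
# here during a long upload.
current = aiohttp.ClientTimeout(
    total=None,
    connect=REQUEST_TIMEOUT,
    sock_connect=REQUEST_TIMEOUT,
    sock_read=REQUEST_TIMEOUT,
)

# Suggested change: disable only the read timeout so that uploads taking longer than
# REQUEST_TIMEOUT are not aborted while waiting for the server's response.
suggested = aiohttp.ClientTimeout(
    total=None,
    connect=REQUEST_TIMEOUT,
    sock_connect=REQUEST_TIMEOUT,
    sock_read=None,
)

# The client session would then be created roughly like:
# session = aiohttp.ClientSession(timeout=suggested)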
@sisp, we added timeout because large downloads were freezing. See #7414 and #7460.
However, we are open to introducing a timeout=0 option that disables any timeout. This would require support in the HTTP filesystem in dvc-objects and a new config option in DVC.
@skshetry Thanks for the pointers, I hadn't seen those issues. In that case, disabling all timeouts seems to be only a workaround for cases that don't suffer from the random hangs and it may not be robust. Nevertheless, it would at least enable large file uploads to HTTP remote storage when network conditions are good which simply doesn't work right now when the transfer time of a file exceeds 5 minutes.
I think introducing data blocks (see #829) could avoid the problem (in most cases, i.e. when a certain minimum upload bandwidth is available) as well. When each block is sufficiently small such that its transfer time is significantly less than 5 minutes, the timeout would not kick in. Besides, it might be more storage-efficient due to block-level deduplication.
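To illustrate the block idea, here is a purely conceptual sketch of block-level, content-addressed storage (not DVC's actual design or the exact proposal in #829):
import hashlib
from pathlib import Path

BLOCK_SIZE = 32 * 1024 * 1024  # 32 MiB; small enough to transfer well within a read timeout

def store_blocks(path, store: Path) -> list:
    # Split a file into fixed-size blocks and store each block under its MD5 hash.
    # Identical blocks (within or across files) are stored only once.
    hashes = []
    store.mkdir(parents=True, exist_ok=True)
    with open(path, "rb") as fobj:
        while block := fobj.read(BLOCK_SIZE):
            digest = hashlib.md5(block).hexdigest()
            target = store / digest
            if not target.exists():
                target.write_bytes(block)
            hashes.append(digest)
    return hashes  # the original file is the concatenation of these blocks in order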
That said, would you like the configurable timeout you suggested in any case? I could send some PRs that add this feature. After all, an implementation of #829 might not become available for a while.
@sisp, my understanding of the sock_read timeout is that it's a timeout waiting to read data. Either the server is misconfigured, there's some bug in aiohttp, or your internet connection must be poor. There's no response for 60 seconds, which is strange.
How much time does it take if you don't set the timeout? Is it possible for you to try with small files and see if the issue still occurs? Also, can you try uploading with requests and see if it also times out:
import requests

with open("data/data.txt", "rb") as src:
    requests.put(
        "https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt",
        data=src,
        headers={"DEPLOY-TOKEN": "<token>"},
        timeout=60,
    )
If you can create a PR for the timeouts, we can review and merge that, but I am not convinced it is the main issue here.
@skshetry When I don't set the timeout, it takes ~15 minutes and the transfer then succeeds. I have an upload bandwidth of ~10 Mbit/s. When I upload smaller files which take less than 5 minutes to transfer per file, there is no problem even with sock_read=self.REQUEST_TIMEOUT. When I run the upload script you posted that uses requests, I get a timeout error after ~15 minutes:
$ python upload_requests.py
Traceback (most recent call last):
File "<site-packages>/urllib3/connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "<site-packages>/urllib3/connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/http/client.py", line 1371, in getresponse
response.begin()
File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/http/client.py", line 319, in begin
version, status, reason = self._read_status()
File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/http/client.py", line 280, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/socket.py", line 704, in readinto
return self._sock.recv_into(b)
File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/ssl.py", line 1241, in recv_into
return self.read(nbytes, buffer)
File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/ssl.py", line 1099, in read
return self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<site-packages>/requests/adapters.py", line 489, in send
resp = conn.urlopen(
File "<site-packages>/urllib3/connectionpool.py", line 787, in urlopen
retries = retries.increment(
File "<site-packages>/urllib3/util/retry.py", line 550, in increment
raise six.reraise(type(error), error, _stacktrace)
File "<site-packages>/urllib3/packages/six.py", line 770, in reraise
raise value
File "<site-packages>/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "<site-packages>/urllib3/connectionpool.py", line 451, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "<site-packages>/urllib3/connectionpool.py", line 340, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='gitlab.com', port=443): Read timed out. (read timeout=60)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/user/dev/dvc-test/upload_requests.py", line 5, in <module>
requests.put(
File "<site-packages>/requests/api.py", line 130, in put
return request("put", url, data=data, **kwargs)
File "<site-packages>/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "<site-packages>/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "<site-packages>/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "<site-packages>/requests/adapters.py", line 578, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='gitlab.com', port=443): Read timed out. (read timeout=60)
Since requests also fails, it might be a server-side issue. But cURL works. :thinking: Any suggestions for further debugging?
Not sure whether it's helpful, but when I upload the 1 GB file with cURL, this is the verbose output:
$ curl -v -i \
--header "DEPLOY-TOKEN: <deploy_token>" \
--upload-file data/data.txt \
https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt
* Trying 2606:4700:90:0:f22e:fbec:5bed:a9b9:443...
* TCP_NODELAY set
* Connected to gitlab.com (2606:4700:90:0:f22e:fbec:5bed:a9b9) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
* subject: C=US; ST=California; L=San Francisco; O=Cloudflare, Inc.; CN=gitlab.com
* start date: Jul 4 00:00:00 2022 GMT
* expire date: Oct 2 23:59:59 2022 GMT
* subjectAltName: host "gitlab.com" matched cert's "gitlab.com"
* issuer: C=US; O=Cloudflare, Inc.; CN=Cloudflare Inc ECC CA-3
* SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x561a178e25f0)
> PUT /api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt HTTP/2
> Host: gitlab.com
> user-agent: curl/7.68.0
> accept: */*
> deploy-token: <deploy_token>
> content-length: 1073741824
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
* We are completely uploaded and fine
< HTTP/2 201
HTTP/2 201
< date: Thu, 04 Aug 2022 11:22:38 GMT
date: Thu, 04 Aug 2022 11:22:38 GMT
< content-type: application/json
content-type: application/json
< content-length: 25
content-length: 25
< cache-control: max-age=0, private, must-revalidate
cache-control: max-age=0, private, must-revalidate
< content-security-policy: default-src 'none'
content-security-policy: default-src 'none'
< etag: W/"6025f9a740871d09d0230776e3dbc7bf"
etag: W/"6025f9a740871d09d0230776e3dbc7bf"
< vary: Origin
vary: Origin
< x-content-type-options: nosniff
x-content-type-options: nosniff
< x-frame-options: SAMEORIGIN
x-frame-options: SAMEORIGIN
< x-request-id: 9b42ccd4fa4b83bc8e20e18455a6a620
x-request-id: 9b42ccd4fa4b83bc8e20e18455a6a620
< x-runtime: 0.612624
x-runtime: 0.612624
< strict-transport-security: max-age=31536000
strict-transport-security: max-age=31536000
< referrer-policy: strict-origin-when-cross-origin
referrer-policy: strict-origin-when-cross-origin
< gitlab-lb: fe-25-lb-gprd
gitlab-lb: fe-25-lb-gprd
< gitlab-sv: api-gke-us-east1-c
gitlab-sv: api-gke-us-east1-c
< cf-cache-status: DYNAMIC
cf-cache-status: DYNAMIC
< expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
< report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=n6pP%2FSYOkDh%2BknPkRth0TioYB8ffnOcVtSOLbxRA2wLm0TBHNH48tqx8eWSkccCCtoIxmllEWc4UeGxVfLM3gGGxalWm3yS0yEGMOCqnozJB6zocN5gkIqFyn51%2FnhdaEVNNNXLYP08%3D"}],"group":"cf-nel","max_age":604800}
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=n6pP%2FSYOkDh%2BknPkRth0TioYB8ffnOcVtSOLbxRA2wLm0TBHNH48tqx8eWSkccCCtoIxmllEWc4UeGxVfLM3gGGxalWm3yS0yEGMOCqnozJB6zocN5gkIqFyn51%2FnhdaEVNNNXLYP08%3D"}],"group":"cf-nel","max_age":604800}
< nel: {"success_fraction":0.01,"report_to":"cf-nel","max_age":604800}
nel: {"success_fraction":0.01,"report_to":"cf-nel","max_age":604800}
< server: cloudflare
server: cloudflare
< cf-ray: 7356d45c1cc29bce-FRA
cf-ray: 7356d45c1cc29bce-FRA
<
* Connection #0 to host gitlab.com left intact
{"message":"201 Created"}
~~I just noticed that the upload using requests succeeded, i.e. the file is present in GitLab's generic packages repository, although a timeout exception was raised. :thinking:~~
@sisp, I don't see any response from the server in the above curl output.
@skshetry Sorry, I copied the output before cURL was actually done, somehow not thinking it would contain helpful information. :man_facepalming: I've edited the above comment with the server response output.
Could you please check if httpx solves this issue? Just trying to confirm whether it is an HTTP/1.1 issue (curl is using HTTP/2):
# pip install httpx[http2]
import httpx
client = httpx.Client(http2=True)
with open("data/data.txt", "rb") as src:
r = client.put(
"https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt",
content=src,
headers={"DEPLOY-TOKEN": "<token>"},
timeout=60,
)
r.raise_for_status()
@skshetry The httpx-based upload script also fails with a timeout exception after ~15 minutes:
$ python upload_httpx.py
Traceback (most recent call last):
File "<site-packages>/httpcore/_exceptions.py", line 8, in map_exceptions
yield
File "<site-packages>/httpcore/backends/sync.py", line 26, in read
return self._sock.recv(max_bytes)
File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/ssl.py", line 1226, in recv
return self.read(buflen)
File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/ssl.py", line 1101, in read
return self._sslobj.read(len)
socket.timeout: The read operation timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<site-packages>/httpx/_transports/default.py", line 60, in map_httpcore_exceptions
yield
File "<site-packages>/httpx/_transports/default.py", line 218, in handle_request
resp = self._pool.handle_request(req)
File "<site-packages>/httpcore/_sync/connection_pool.py", line 253, in handle_request
raise exc
File "<site-packages>/httpcore/_sync/connection_pool.py", line 237, in handle_request
response = connection.handle_request(request)
File "<site-packages>/httpcore/_sync/connection.py", line 90, in handle_request
return self._connection.handle_request(request)
File "<site-packages>/httpcore/_sync/http2.py", line 146, in handle_request
raise exc
File "<site-packages>/httpcore/_sync/http2.py", line 114, in handle_request
status, headers = self._receive_response(
File "<site-packages>/httpcore/_sync/http2.py", line 231, in _receive_response
event = self._receive_stream_event(request, stream_id)
File "<site-packages>/httpcore/_sync/http2.py", line 262, in _receive_stream_event
self._receive_events(request, stream_id)
File "<site-packages>/httpcore/_sync/http2.py", line 283, in _receive_events
events = self._read_incoming_data(request)
File "<site-packages>/httpcore/_sync/http2.py", line 343, in _read_incoming_data
raise exc
File "<site-packages>/httpcore/_sync/http2.py", line 329, in _read_incoming_data
data = self._network_stream.read(self.READ_NUM_BYTES, timeout)
File "<site-packages>/httpcore/backends/sync.py", line 26, in read
return self._sock.recv(max_bytes)
File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/contextlib.py", line 137, in __exit__
self.gen.throw(typ, value, traceback)
File "<site-packages>/httpcore/_exceptions.py", line 12, in map_exceptions
raise to_exc(exc)
httpcore.ReadTimeout: The read operation timed out
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/user/dev/dvc-test/upload_httpx.py", line 5, in <module>
r = client.put(
File "<site-packages>/httpx/_client.py", line 1167, in put
return self.request(
File "<site-packages>/httpx/_client.py", line 815, in request
return self.send(request, auth=auth, follow_redirects=follow_redirects)
File "<site-packages>/httpx/_client.py", line 902, in send
response = self._send_handling_auth(
File "<site-packages>/httpx/_client.py", line 930, in _send_handling_auth
response = self._send_handling_redirects(
File "<site-packages>/httpx/_client.py", line 967, in _send_handling_redirects
response = self._send_single_request(request)
File "<site-packages>/httpx/_client.py", line 1003, in _send_single_request
response = transport.handle_request(request)
File "<site-packages>/httpx/_transports/default.py", line 218, in handle_request
resp = self._pool.handle_request(req)
File "/home/user/.pyenv/versions/3.9.13/lib/python3.9/contextlib.py", line 137, in __exit__
self.gen.throw(typ, value, traceback)
File "<site-packages>/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
raise mapped_exc(message) from exc
httpx.ReadTimeout: The read operation timed out
~~In contrast to the requests-based upload script, no file is present in GitLab's generic packages repository afterwards.~~
@sisp, my understanding of the sock_read timeout is that it's a timeout waiting to read data.
@skshetry I'm wondering, isn't it expected that this timeout kicks in when a file's upload time exceeds the timeout? After the connection to the server has been established, no read happens until the upload has completed. Right? Perhaps the reason cURL works fine is that it doesn't seem to set a read timeout (or even have one available).
How about the following change to dvc-objects? When DVC downloads files, sock_read remains as it is now. When DVC uploads files, sock_read=None is set, because the delay until data is read back from the server may be long. #7414 and #7460 are only about download issues, so I'd imagine this fix should not affect them. A rough sketch of the idea is below.
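A minimal sketch of the distinction I mean, assuming direct use of aiohttp (the real code path goes through fsspec's HTTPFileSystem, so this is only illustrative, not the actual dvc-objects implementation):
import aiohttp

# Keep the read timeout for downloads (it guards against frozen downloads, see #7414),
# but drop it for uploads, where nothing is read back from the server until the body is sent.
DOWNLOAD_TIMEOUT = aiohttp.ClientTimeout(sock_connect=60, sock_read=60)
UPLOAD_TIMEOUT = aiohttp.ClientTimeout(sock_connect=60, sock_read=None)

async def upload(session: aiohttp.ClientSession, url: str, path: str) -> None:
    # aiohttp allows overriding the session-level timeout per request.
    with open(path, "rb") as fobj:
        async with session.put(url, data=fobj, timeout=UPLOAD_TIMEOUT) as resp:
            resp.raise_for_status()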
@skshetry In addition to my previous suggestion, or as an alternative, how about adding support for chunked transfer encoding? I've adapted the requests-based upload script to use chunked transfer encoding, and it works fine with the timeout setting:
import requests

with open("data/data.txt", "rb") as src:

    def gen(chunk_size):
        chunk = src.read(chunk_size)
        while chunk:
            yield chunk
            chunk = src.read(chunk_size)

    requests.put(
        "https://gitlab.com/api/v4/projects/<project_id>/packages/generic/dvc-data/test/data.txt",
        data=gen(1024 * 1024 * 32),  # 32 MB
        headers={"DEPLOY-TOKEN": "<deploy_token>"},
        timeout=60,
    )
Not every HTTP remote storage may support chunked transfer encoding, or it may not be preferred for some reason, so we could make it configurable. The default would remain as it is now.
One downside of chunked transfer encoding seems to be that HTTP/2 doesn't support it, so it sounds like it's not future-proof:
Note: HTTP/2 doesn't support HTTP 1.1's chunked transfer encoding mechanism, as it provides its own, more efficient, mechanisms for data streaming.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Transfer-Encoding
Just to let you know that I have been having these problems lately, which are really slowing down our team since we cannot rely on dvc. We are not using github packages, but dagshub remotes. The issue is even worse with 2.16.0, since in that version it does not even attempt to push the data to the remote. It seems that everything starts with release 2.10.0 where the timeout was introduced.
https://github.com/iterative/dvc/pull/7460#discussion_r826173013
If you repeat the process with dvc 2.9.5, everything works as expected.
The problem I am having is that dvc does not complain. It just finishes, and when I look at the remote, the file is smaller than expected. This leads to deep trouble, since the md5 is there in the remote and the only option to modify it is to use curl and manually post the right file to the corresponding md5 url.
To me it is much more important that when I push something, it is correctly pushed, than having some hangs now and then while downloading. I strongly support removing the sock_read timeout. In my team we are back to 2.9.5 just for this reason.
@skshetry In addition to my previous suggestion, or as an alternative one, how about adding support for chunked transfer encoding?
We already do chunking uploads. See https://github.com/fsspec/filesystem_spec/blob/5d6c87e562396f352645df7cd555a2b4b6eec84a/fsspec/implementations/http.py#L274-L278. I was wondering if it was chunking uploads causing timeouts, hence I asked to check with httpx, but that also fails.
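For reference, the linked helper is roughly equivalent to the following (a simplified paraphrase; the real gen_chunks() in fsspec also handles file-like objects and progress callbacks):
# The request body is produced by an async generator, so aiohttp streams the file in chunks.
async def gen_chunks(lpath, chunk_size=64 * 1024):
    with open(lpath, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

# The upload then looks roughly like:
#   async with session.put(rpath, data=gen_chunks(lpath), **kw) as resp:
#       ...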
Just to let you know that I have been having these problems lately, which are really slowing down our team since we cannot rely on dvc. We are not using github packages, but dagshub remotes. The issue is even worse with 2.16.0, since in that version it does not even attempt to push the data to the remote. It seems that everything starts with release 2.10.0 where the timeout was introduced.
@cerquide, do uploads work correctly if you set sock_read=None? I think it's a different issue than the timeout.
It does if you go back to dvc 2.9.5. I have not specifically checked setting sock_read=None in a newer version.
We are not using github packages [...]
@cerquide GitLab, but I think that's what you meant. Just making sure we're talking about the same thing. :wink:
The issue is even worse with 2.16.0, since in that version it does not even attempt to push the data to the remote.
@cerquide That's likely related to https://github.com/iterative/dvc-objects/pull/116. It should be fixed in the next DVC release.
We already do chunking uploads. [...]
@skshetry Ah, I hadn't noticed that. Strange that it works with requests and chunked transfer encoding but not with aiohttp. :thinking:
@cerquide GitLab, but I think that's what you meant. Just making sure we're talking about the same thing. wink
Oops, sorry, you are right. Fortunately I was only saying we are not using them; it would have been worse if I had said we are experts.
@cerquide, would it be possible to try with sock_read=None and see if it fixes the issue?
@cerquide, would it be possible to try with sock_read=None and see if it fixes the issue?
@skshetry, in my case (using a DagsHub HTTPS remote), changing to sock_read=None fixes the issue both for dvc 2.10.2 and for dvc 2.15.0 (in the latter, the file has moved to fs/implementations/http.py). With dvc 2.16.0 I think @sisp is right and I am suffering from a different disease.
@cerquide, can you try installing dvc-objects==0.1.7 with dvc==2.16.0, and see if sock_read=None fixes the issue:
pip install dvc-objects==0.1.7
This will complain that it is incompatible, but you can ignore for now. Thanks.
@skshetry If I read the requests library code right, there is no read timeout for chunked requests. I've tried understanding how aiohttp's chunked transfer encoding works for requests, but it looks a bit convoluted at first glance. I'm not deeply familiar with the exact protocol of chunked transfer encoding, but I wonder whether the read timeout should be reset after each successfully sent chunk. I think aiohttp doesn't do that, but I also don't know whether that would actually be correct.
@cerquide, can you try installing dvc-objects==0.1.7 with dvc==2.16.0, and see if sock_read=None fixes the issue:
@skshetry, with sock_read=None, pushing works like a charm for me with dvc-objects==0.1.7 and dvc==2.16.0.
When pulling, for some of the big files I received errors like
ERROR: failed to transfer 'md5: 57bba9a44a71691b2bb2ec11159f55e7' - Response payload is not completed
but this was maybe due to the number of jobs being too high; with -j2 it worked flawlessly.
DVC over HTTP works like a charm, but only for small files. When I try to push a bigger file (say 20 MB), I get Timeout on reading data from socket. I can see there was an issue on GitHub and it should be fixed as of dvc-objects==0.1.7; however, I have dvc_objects = 0.7.0 and the problem still occurs.
2022-10-27 13:22:39,243 ERROR: failed to transfer 'md5: 07d00c327f993560a68ba6c02127320d' - : Timeout on reading data from socket
------------------------------------------------------------
Traceback (most recent call last):
File "fsspec/asyn.py", line 53, in _runner
File "fsspec/implementations/http.py", line 297, in _put_file
File "aiohttp_retry/client.py", line 149, in __aenter__
File "aiohttp_retry/client.py", line 134, in _do_request
File "aiohttp_retry/client.py", line 100, in _do_request
File "aiohttp/client.py", line 560, in _request
File "aiohttp/client_reqrep.py", line 899, in start
File "aiohttp/streams.py", line 616, in read
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "dvc_data/hashfile/transfer.py", line 40, in wrapper
File "dvc_data/hashfile/transfer.py", line 202, in func
File "dvc_data/hashfile/db/__init__.py", line 105, in add
File "dvc_objects/db.py", line 136, in add
File "dvc_objects/fs/generic.py", line 132, in transfer
File "dvc_objects/fs/generic.py", line 93, in _try_links
File "dvc_objects/fs/generic.py", line 69, in copy
File "dvc_objects/fs/callbacks.py", line 58, in func
File "dvc_objects/fs/callbacks.py", line 41, in wrapped
File "dvc_http/__init__.py", line 146, in put_file
File "dvc_objects/fs/base.py", line 412, in put_file
File "fsspec/asyn.py", line 111, in wrapper
File "fsspec/asyn.py", line 94, in sync
fsspec.exceptions.FSTimeoutError
------------------------------------------------------------
2022-10-27 13:22:39,344 ERROR: failed to push data to the cloud - 1 files failed to upload
------------------------------------------------------------
Traceback (most recent call last):
File "dvc/commands/data_sync.py", line 59, in run
File "dvc/repo/__init__.py", line 48, in wrapper
File "dvc/repo/push.py", line 117, in push
dvc.exceptions.UploadError: 1 files failed to upload
------------------------------------------------------------
We should probably get to implementing the configuration of timeout.
Original report from @michuhu: https://discord.com/channels/485586884165107732/563406153334128681/1035466365659779093