gitea icon indicating copy to clipboard operation
gitea copied to clipboard

[1.17.0] Context deadline exceeded (timeout) cloning large repo

Open parnic opened this issue 3 years ago • 4 comments

Description

Starting with v1.17.0, one of our large repos (70+ GB, using LFS) is failing trying to clone on some connections.

When the clone fails, the client reports some variation of:

error: 24576 bytes of body are still expected0 MiB | 1.83 MiB/s
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output

and the server logs show, e.g.:

2022/08/05 03:27:25 ...ers/web/repo/http.go:484:serviceRPC() [E] [62ec8cb5-3] Fail to serve RPC(upload-pack) in /mnt/gitea/data/repositories/org/repo.git: context deadline exceeded -

I tracked this down to a change in commit 35fdefc1ff253522f101ffb1337437b59676c302 for #18363 where serviceRPC() was no longer allowed to run indefinitely, but carries a context with an unchangeable default timeout of 360 seconds (once I found this, I confirmed that our clones were dying around the 6 minute mark).

edit: collapsed this workaround because a PR is up with a different approach

We are locally using a custom build with this quick patch in order to set an extremely large timeout to allow us to work around the issue:

diff --git a/modules/setting/setting.go b/modules/setting/setting.go
index f5e624946..585c4589c 100644
--- a/modules/setting/setting.go
+++ b/modules/setting/setting.go
@@ -125,6 +125,7 @@ var (
        StartupTimeout       time.Duration
        PerWriteTimeout      = 30 * time.Second
        PerWritePerKbTimeout = 10 * time.Second
+       ServiceRpcTimeout    time.Duration
        StaticURLPrefix      string
        AbsoluteAssetURL     string

@@ -723,6 +724,7 @@ func loadFromConf(allowEmpty bool, extraConfig string) {
        StartupTimeout = sec.Key("STARTUP_TIMEOUT").MustDuration(0 * time.Second)
        PerWriteTimeout = sec.Key("PER_WRITE_TIMEOUT").MustDuration(PerWriteTimeout)
        PerWritePerKbTimeout = sec.Key("PER_WRITE_PER_KB_TIMEOUT").MustDuration(PerWritePerKbTimeout)
+       ServiceRpcTimeout = sec.Key("SERVICE_RPC_TIMEOUT").MustDuration(ServiceRpcTimeout)

        defaultAppURL := string(Protocol) + "://" + Domain
        if (Protocol == HTTP && HTTPPort != "80") || (Protocol == HTTPS && HTTPPort != "443") {
diff --git a/routers/web/repo/http.go b/routers/web/repo/http.go
index 6a85bca16..82835cb62 100644
--- a/routers/web/repo/http.go
+++ b/routers/web/repo/http.go
@@ -474,11 +474,12 @@ func serviceRPC(ctx gocontext.Context, h serviceHandler, service string) {
        cmd := git.NewCommand(h.r.Context(), service, "--stateless-rpc", h.dir)
        cmd.SetDescription(fmt.Sprintf("%s %s %s [repo_path: %s]", git.GitExecutable, service, "--stateless-rpc", h.dir))
        if err := cmd.Run(&git.RunOpts{
-               Dir:    h.dir,
-               Env:    append(os.Environ(), h.environ...),
-               Stdout: h.w,
-               Stdin:  reqBody,
-               Stderr: &stderr,
+               Dir:     h.dir,
+               Env:     append(os.Environ(), h.environ...),
+               Stdout:  h.w,
+               Stdin:   reqBody,
+               Stderr:  &stderr,
+               Timeout: setting.ServiceRpcTimeout,
        }); err != nil {
                if err.Error() != "signal: killed" {
                        log.Error("Fail to serve RPC(%s) in %s: %v - %s", service, h.dir, err, stderr.String())

I can make a PR out of this if this is desired, but I'd kind of rather this go back to the old behavior instead of adding a new 6-minute timeout footgun where, even if targeted logging was added to point people to correcting the failure, admins would need to find this config variable and change it for any decently-large repo. 1.16.x's behavior was much preferred.

Gitea Version

1.17.0

Can you reproduce the bug on the Gitea demo site?

No

Log Gist

No response

Screenshots

No response

Git Version

2.37.1

Operating System

Ubuntu 20.04 aarch64

How are you running Gitea?

Linux arm64 release on an AWS instance

Database

PostgreSQL

parnic avatar Aug 05 '22 04:08 parnic

Yes, please. And I think you can give a default timeout value.

lunny avatar Aug 05 '22 04:08 lunny

IMO there is no need to introduce another config option.

There is a context for serviceRPC, the timeout should be controlled by it. It might be possible to make Git's Run respect the context.Deadline() or have a long-enough timeout.

wxiaoguang avatar Aug 05 '22 05:08 wxiaoguang

IMO there is no need to introduce another config option.

I agree.

There is a context for serviceRPC, the timeout should be controlled by it. It might be possible to make Git's Run respect the context.Deadline()

I added some logging to Run(), and, in my limited testing, only a few contexts come in with a deadline set on them (serviceRPC's context has no deadline). Would it be better here to fall back to the context deadline whenever the given Timeout is <= 0 and skip setting a timeout at all if both of those are unspecified? I guess that wouldn't work since then we'd never be using this fallback default value. I think no matter what, serviceRPC() should be able to indicate that it doesn't want to be cut off just because a timeout expired. Maybe a fourth Run() variant that is timeout-free is appropriate?

or have a long-enough timeout.

To me, the problem with this is, as always, "how long is long enough?" 6 minutes may have seemed long enough, but when the scenario is "however long upload-pack takes to deliver an arbitrarily-sized repository through an arbitrarily-fast network connection," I'm not sure there can ever be a "long enough." It's especially problematic in this case because even if you choose something that seems "long enough," if it fails for someone there is no way for the admin to make changes to allow it to complete. I've set our timeout here to 24 hours, for example, but I can tell you that if my clone fails 24 hours in for no other reason than "there was a timeout set" and I have to start over again, depending on what part of the clone it's at, I'd be less-than-thrilled as a result.

parnic avatar Aug 05 '22 13:08 parnic

Yup, at the moment, the git.Run will use default timeout if Timeout<=0.

IMO there might be 2 methods:

  • Method 1: If there is a deadline of the use serviceRPC context, use it. If no, use a long-enough timeout (for example, 1 year) timeout.
  • Method 2: Introduce a new const like UseContextTimeout = -2. Then Run(Timeout: UseContextTimeout), if there is no deadline in the context, then do not use timeout when running the git command.

wxiaoguang avatar Aug 05 '22 13:08 wxiaoguang

So... what do I do if I'm getting this error?

Clone: context deadline exceeded

ScottBeeson avatar Nov 10 '22 16:11 ScottBeeson

https://docs.gitea.io/en-us/config-cheat-sheet/#git---timeout-settings-gittimeout

Adjust/set [git.timeout] MIGRATE appropriately.

zeripath avatar Nov 10 '22 19:11 zeripath