
[ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)

Open aaronlewisblenchal opened this issue 1 year ago • 19 comments

I am currently using the following documentation for self-hosting Hatchet: https://docs.hatchet.run/self-hosting. I have followed the steps listed there, but I'm unable to proceed due to the following error:

[ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)

I have tried the config option that was pushed to the main branch last week, SERVER_GRPC_MAX_MSG_SIZE=2147483648, but the same error still persists.

@abelanger5 @grutt @steebchen

aaronlewisblenchal avatar Aug 07 '24 08:08 aaronlewisblenchal

Hi @aaronlewisblenchal, just to double-check, which version of the Hatchet Docker image are you running?

abelanger5 avatar Aug 07 '24 11:08 abelanger5

Hey @abelanger5, I was previously using v0.32.0 but realised your changes were merged into main in v0.40.0. I tried installing v0.40.0, but I'm getting the error below:

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "/hatchet/hatchet-engine": stat /hatchet/hatchet-engine: no such file or directory: unknown

aaronlewisblenchal avatar Aug 07 '24 12:08 aaronlewisblenchal

Ah yeah we introduced this recently. Could you share how you're installing? I've confirmed that /hatchet/hatchet-engine should exist in the containers.
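
One way to verify this locally, as a sketch using the image tag from this thread:

# list the image's /hatchet directory to confirm the engine binary is present
docker run --rm --entrypoint ls ghcr.io/hatchet-dev/hatchet/hatchet-engine:v0.40.0 /hatchet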

abelanger5 avatar Aug 07 '24 14:08 abelanger5

Yeah sure, I have followed the steps mentioned in the Kubernetes quickstart. I'm pasting the contents of the YAML file below:


api:
  enabled: true
  image:
    repository: ghcr.io/hatchet-dev/hatchet/hatchet-api
    tag: v0.40.0
    pullPolicy: Always
  env:
    SERVER_AUTH_COOKIE_SECRETS: "<secret>"
    SERVER_ENCRYPTION_MASTER_KEYSET: "<keyset>"
    SERVER_ENCRYPTION_JWT_PRIVATE_KEYSET: "<keyset>"
    SERVER_ENCRYPTION_JWT_PUBLIC_KEYSET: "<keyset>"
engine:
  enabled: true
  image:
    repository: ghcr.io/hatchet-dev/hatchet/hatchet-engine
    tag: v0.40.0
    pullPolicy: Always
  env:
    SERVER_AUTH_COOKIE_SECRETS: "<secret>"
    SERVER_ENCRYPTION_MASTER_KEYSET: "<keyset>"
    SERVER_ENCRYPTION_JWT_PRIVATE_KEYSET: "<keyset>"
    SERVER_ENCRYPTION_JWT_PUBLIC_KEYSET: "<keyset>"
    

aaronlewisblenchal avatar Aug 07 '24 14:08 aaronlewisblenchal

I purged the above volumes and created new ones; now I am getting the error below.

hatchet-engine

k logs hatchet-engine-7d4648dbbc-6dvzs
2024/08/07 15:33:27 Loading server config from
2024/08/07 15:33:27 Shared file path: server.yaml
2024-08-07T15:33:28.071Z DBG Fetching security alerts for version v0.40.0 service=server
2024-08-07T15:33:28.079Z DBG Error fetching security alerts: ERROR: relation "SecurityCheckIdent" does not exist (SQLSTATE 42P01) service=server
2024-08-07T15:33:28.08Z DBG subscribing to queue: event_processing_queue_v2 service=events-controller
2024/08/07 15:33:28 engine failure: could not run with config: could not create rebalance controller partitions job: could not create engine partition: ERROR: relation "ControllerPartition" does not exist (SQLSTATE 42P01)

hatchet-postgres-db

2024-08-07 15:52:46.901 GMT [1] LOG:  database system is ready to accept connections
2024-08-07 15:53:05.215 GMT [174] ERROR:  relation "_prisma_migrations" does not exist at character 28
2024-08-07 15:53:05.215 GMT [174] STATEMENT:  SELECT migration_name FROM _prisma_migrations ORDER BY started_at DESC LIMIT 1;
2024-08-07 15:55:29.814 GMT [384] ERROR:  relation "_prisma_migrations" does not exist at character 28
2024-08-07 15:55:29.814 GMT [384] STATEMENT:  SELECT migration_name FROM _prisma_migrations ORDER BY started_at DESC LIMIT 1;
2024-08-07 15:55:52.901 GMT [422] ERROR:  relation "SecurityCheckIdent" does not exist at character 52
2024-08-07 15:55:52.901 GMT [422] STATEMENT:  -- name: GetSecurityCheckIdent :one
        SELECT id FROM "SecurityCheckIdent" LIMIT 1
2024-08-07 15:55:52.914 GMT [445] ERROR:  relation "SecurityCheckIdent" does not exist at character 52
2024-08-07 15:55:52.914 GMT [445] STATEMENT:  -- name: GetSecurityCheckIdent :one
        SELECT id FROM "SecurityCheckIdent" LIMIT 1
2024-08-07 15:55:54.157 GMT [468] ERROR:  relation "SecurityCheckIdent" does not exist at character 52
2024-08-07 15:55:54.157 GMT [468] STATEMENT:  -- name: GetSecurityCheckIdent :one
        SELECT id FROM "SecurityCheckIdent" LIMIT 1
2024-08-07 15:55:54.160 GMT [468] ERROR:  relation "ControllerPartition" does not exist at character 53
2024-08-07 15:55:54.160 GMT [468] STATEMENT:  -- name: CreateControllerPartition :one
        INSERT INTO "ControllerPartition" ("id", "createdAt", "lastHeartbeat")
        VALUES ($1::text, NOW(), NOW())
        ON CONFLICT DO NOTHING
        RETURNING id, "createdAt", "updatedAt", "lastHeartbeat"
2024-08-07 15:55:56.117 GMT [512] ERROR:  relation "SecurityCheckIdent" does not exist at character 52
2024-08-07 15:55:56.117 GMT [512] STATEMENT:  -- name: GetSecurityCheckIdent :one
        SELECT id FROM "SecurityCheckIdent" LIMIT 1
2024-08-07 15:55:56.121 GMT [512] ERROR:  relation "ControllerPartition" does not exist at character 53
2024-08-07 15:55:56.121 GMT [512] STATEMENT:  -- name: CreateControllerPartition :one
        INSERT INTO "ControllerPartition" ("id", "createdAt", "lastHeartbeat")
        VALUES ($1::text, NOW(), NOW())
        ON CONFLICT DO NOTHING
        RETURNING id, "createdAt", "updatedAt", "lastHeartbeat"
2024-08-07 15:55:56.141 GMT [514] LOG:  could not receive data from client: Connection reset by peer

aaronlewisblenchal avatar Aug 07 '24 16:08 aaronlewisblenchal

@abelanger5, please provide a solution for the hatchet-postgres-db issue.

chaitanyakoodoo avatar Aug 08 '24 13:08 chaitanyakoodoo

Hey @aaronlewisblenchal and @chaitanyakoodoo, the issue is that the migration seems to have failed. There is a migration process that runs as part of the Helm upgrade. This can be tricky to catch because the default delete policy on the Helm hook removes the job almost immediately. If you set the following value in values.yaml, you'll be able to catch the migration failure:

debug: true

Once you're able to see this container, could you share the logs from the migration container? It will be called something like hatchet-migration-xxxxx
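
As a sketch of how to catch the failed migration once debug: true is set (the job name and namespace below are placeholders; the real job follows the hatchet-migration-xxxxx pattern mentioned above):

# values.yaml — keeps the Helm hook job around so its logs can be inspected
debug: true

# after re-running helm upgrade, locate the migration job and read its logs
kubectl get jobs -n <namespace> | grep migration
kubectl logs -n <namespace> job/<hatchet-migration-job-name>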

abelanger5 avatar Aug 08 '24 13:08 abelanger5

Hey @abelanger5, thanks for your help. We have now installed v0.40.0, details of which are listed below:

NAME                                     IMAGE
caddy-5bdcc8d6f6-vcfht                   caddy:2.7.6-alpine
hatchet-engine-c94bc6bfb-kcjms           ghcr.io/hatchet-dev/hatchet/hatchet-engine:v0.40.0
hatchet-stack-api-7799dd9748-4zmml       ghcr.io/hatchet-dev/hatchet/hatchet-api:v0.40.0
hatchet-stack-api-7799dd9748-9nfnd       ghcr.io/hatchet-dev/hatchet/hatchet-api:v0.40.0
hatchet-stack-frontend-f5f98fcf8-dpv8x   ghcr.io/hatchet-dev/hatchet/hatchet-frontend:v0.40.0
hatchet-stack-postgres-0                 docker.io/bitnami/postgresql:16.2.0-debian-12-r8
hatchet-stack-rabbitmq-0                 docker.io/bitnami/rabbitmq:3.12.13-debian-12-r2

Also, we have set SERVER_GRPC_MAX_MSG_SIZE=2147483648, but we are still receiving the error below when running npm run worker:

🪓 31866 | 08/09/24, 03:46:50 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 31866 | 08/09/24, 03:46:55 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 31866 | 08/09/24, 03:47:00 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 31866 | 08/09/24, 03:47:05 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 31866 | 08/09/24, 03:47:10 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)

Can you assist us with this?

aaronlewisblenchal avatar Aug 09 '24 10:08 aaronlewisblenchal

Hey @aaronlewisblenchal, apologies for that - I've just released v0.41.2 as latest, which has the env var. This was only available as a YAML config option in v0.40.0. Hopefully that fixes it!

abelanger5 avatar Aug 09 '24 11:08 abelanger5

Hey @abelanger5, no worries. We tried upgrading to v0.41.2 as you suggested, but we are still receiving the error when running npm run worker:

version: 1.0.0","time":"2024-08-09T13:48:32.305Z","v":0}
🪓 71136 | 08/09/24, 07:18:32 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:37 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:42 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:47 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:52 PM  [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)

Anything we might be missing here?

aaronlewisblenchal avatar Aug 09 '24 14:08 aaronlewisblenchal

I'm unable to reproduce this; I've just tested against larger payloads. Two things to confirm:

  1. You're running the engine container with v0.41.2? GRPC connects directly to the engine, not the API container.
  2. Do you have a proxy or ingress sitting in front of the engine service?

abelanger5 avatar Aug 09 '24 14:08 abelanger5

Hi @abelanger5, we have deployed all services on v0.41.2, and yes, we have enabled an ingress in front of the engine. Please let me know if you need anything else to review.

engine:
  enabled: true
  nameOverride: hatchet-engine
  fullnameOverride: hatchet-engine
  replicaCount: 1
  image:
    repository: "ghcr.io/hatchet-dev/hatchet/hatchet-engine"
    tag: "v0.41.2"
    pullPolicy: "Always"
  migrationJob:
    enabled: true
  service:
    externalPort: 7070
    internalPort: 7070
  commandline:
    command: ["/hatchet/hatchet-engine"]
  deployment:
    annotations:
      app.kubernetes.io/name: hatchet-engine
  serviceAccount:
    create: true
    name: hatchet-engine
  env:
    SERVER_AUTH_COOKIE_SECRETS: "secretvalue"
    SERVER_ENCRYPTION_MASTER_KEYSET: "secretvalue"
    SERVER_ENCRYPTION_JWT_PRIVATE_KEYSET: "secretvalue"
    SERVER_ENCRYPTION_JWT_PUBLIC_KEYSET: "secretvalue"
    SERVER_AUTH_COOKIE_INSECURE: "t"
    SERVER_AUTH_SET_EMAIL_VERIFIED: "t"
    SERVER_LOGGER_LEVEL: "debug"
    SERVER_LOGGER_FORMAT: "console"
    DATABASE_LOGGER_LEVEL: "debug"
    DATABASE_LOGGER_FORMAT: "console"
    SERVER_AUTH_GOOGLE_ENABLED: "f"
    SERVER_AUTH_BASIC_AUTH_ENABLED: "t"
    DATABASE_URL: "postgres://secretvalue:secretvalue@hatchet-stack-postgres:5432/hatchet?sslmode=disable"
    DATABASE_POSTGRES_HOST: "hatchet-stack-postgres"
    DATABASE_POSTGRES_PORT: "5432"
    DATABASE_POSTGRES_USERNAME: "secretvalue"
    DATABASE_POSTGRES_PASSWORD: "secretvalue"
    DATABASE_POSTGRES_DB_NAME: "hatchet"
    DATABASE_POSTGRES_SSL_MODE: "disable"
    SERVER_TASKQUEUE_RABBITMQ_URL: "amqp://hatchet:hatchet@hatchet-stack-rabbitmq:5672/"
    SERVER_AUTH_COOKIE_DOMAIN: "secretvalue.secretvalue.io"
    SERVER_URL: "https://secretvalue.secretvalue.io"
    SERVER_GRPC_BIND_ADDRESS: "0.0.0.0"
    SERVER_GRPC_INSECURE: "false"
    SERVER_GRPC_BROADCAST_ADDRESS: "secretvalue.secretvalue.io:443"
    SERVER_GRPC_MAX_MSG_SIZE: "2147483648"
  ingress:
    enabled: true
    ingressClassName: nginx
    annotations:
      nginx.ingress.kubernetes.io/ssl-redirect: "true"
      nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
      nginx.ingress.kubernetes.io/grpc-backend: "true"

    hosts:
      - host: secretvalue.secretvalue.io
        paths:
          - path: /
            backend:
              serviceName: hatchet-engine
              servicePort: 7070
    tls:
      - hosts:
          - secretvalue.secretvalue.io
        secretName: testcertificate

chaitanyakoodoo avatar Aug 12 '24 09:08 chaitanyakoodoo

@abelanger5, please provide a solution for this issue

version: 1.0.0","time":"2024-08-09T13:48:32.305Z","v":0} 🪓 71136 | 08/09/24, 07:18:32 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304) 🪓 71136 | 08/09/24, 07:18:37 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304) 🪓 71136 | 08/09/24, 07:18:42 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304) 🪓 71136 | 08/09/24, 07:18:47 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304) 🪓 71136 | 08/09/24, 07:18:52 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)

chaitanyakoodoo avatar Aug 12 '24 12:08 chaitanyakoodoo

The config looks correct. A few follow-up questions:

  1. Could you share the output of kubectl describe <hatchet-engine-pod>? (with sensitive values redacted)
  2. I am also wondering whether this is due to a max body size constraint on the NGINX gRPC proxy. Perhaps you can try increasing the NGINX max body size? (See the annotation sketch after this list.)
  3. The logs indicate that you are calling PutWorkflow with a nearly 2GB workflow. What is the use-case for why the workflow is such a large size?
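
On point 2, a sketch of the standard ingress-nginx annotation that raises (or disables) the proxy body-size limit, following the values layout shown earlier; whether this resolves the error here is unconfirmed:

engine:
  ingress:
    annotations:
      nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
      # "0" disables ingress-nginx's client body size check for this ingress
      nginx.ingress.kubernetes.io/proxy-body-size: "0"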

abelanger5 avatar Aug 12 '24 12:08 abelanger5

@abelanger5 I also noticed that the endpoint https://engine..io/ returns '403 Forbidden' when I try to access it from a browser.

  1. output of kubectl describe
Name:             hatchet-engine-69dc85c95d-dwdp9
Namespace:        hatchet
Priority:         0
Service Account:  hatchet-engine
Node:             <secretvalue>
Start Time:       Tue, 13 Aug 2024 10:38:24 +0530
Labels:           app.kubernetes.io/instance=hatchet-stack
                  app.kubernetes.io/name=hatchet-engine
                  pod-template-hash=69dc85c95d
Annotations:      app.kubernetes.io/name: hatchet-engine
                  cni.projectcalico.org/containerID: 7133ad58076748106af8dcce7318413e00235ab1f6f191022fd7b08840fe7720
                  cni.projectcalico.org/podIP: 10.0.130.86/32
                  cni.projectcalico.org/podIPs: 10.0.130.86/32
Status:           Running
IP:               10.0.130.86
IPs:
  IP:           10.0.130.86
Controlled By:  ReplicaSet/hatchet-engine-69dc85c95d
Containers:
  engine:
    Container ID:  containerd://54c7178b0a665754621d4e6c4bbb27e68a215d5a103c2c70ec955d7c83e6e143
    Image:         ghcr.io/hatchet-dev/hatchet/hatchet-engine:latest
    Image ID:      ghcr.io/hatchet-dev/hatchet/hatchet-engine@sha256:3bd98ea205d730b7435ed4426a7369a16548bc36d6df9bb758f745baa2281b52
    Port:          7070/TCP
    Host Port:     0/TCP
    Command:
      /hatchet/hatchet-engine
    State:          Running
      Started:      Tue, 13 Aug 2024 10:38:25 +0530
    Ready:          True
    Restart Count:  0
    Limits:
      memory:  1Gi
    Requests:
      cpu:      250m
      memory:   1Gi
    Liveness:   http-get http://:8733/live delay=60s timeout=1s period=5s #success=1 #failure=3
    Readiness:  http-get http://:8733/ready delay=20s timeout=1s period=5s #success=1 #failure=3
    Environment:
      DATABASE_LOGGER_FORMAT:                console
      DATABASE_LOGGER_LEVEL:                 debug
      DATABASE_POSTGRES_DB_NAME:             hatchet
      DATABASE_POSTGRES_HOST:                <secretvalue>
      DATABASE_POSTGRES_PASSWORD:            <secretvalue>
      DATABASE_POSTGRES_PORT:                5432
      DATABASE_POSTGRES_SSL_MODE:            disable
      DATABASE_POSTGRES_USERNAME:            <secretvalue>
      DATABASE_URL:                          postgres://<secretvalue>:<secretvalue>@hatchet-stack-postgres:5432/hatchet?sslmode=disable
      SERVER_AUTH_BASIC_AUTH_ENABLED:        t
      SERVER_AUTH_COOKIE_DOMAIN:             sandbox-hatchet.<secretvalue>.io
      SERVER_AUTH_COOKIE_INSECURE:           t
      SERVER_AUTH_COOKIE_SECRETS:            <secretvalue>
      SERVER_AUTH_GOOGLE_ENABLED:            f
      SERVER_AUTH_SET_EMAIL_VERIFIED:        t
      SERVER_ENCRYPTION_JWT_PRIVATE_KEYSET:  <secretvalue>
      SERVER_ENCRYPTION_JWT_PUBLIC_KEYSET:   <secretvalue>
      SERVER_ENCRYPTION_MASTER_KEYSET:       <secretvalue>
      SERVER_GRPC_BIND_ADDRESS:              0.0.0.0
      SERVER_GRPC_BROADCAST_ADDRESS:         engine.<secretvalue>.io:443
      SERVER_GRPC_INSECURE:                  false
      SERVER_GRPC_MAX_MSG_SIZE:              2147483648
      SERVER_LOGGER_FORMAT:                  console
      SERVER_LOGGER_LEVEL:                   debug
      SERVER_TASKQUEUE_RABBITMQ_URL:         amqp://<secretvalue>:<secretvalue>@hatchet-stack-rabbitmq:5672/
      SERVER_URL:                            https://sandbox-hatchet.<secretvalue>.io
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-nj562 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-nj562:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  86s   default-scheduler  Successfully assigned hatchet/hatchet-engine-69dc85c95d-dwdp9 to <secretvalue>
  Normal  Pulling    86s   kubelet            Pulling image "ghcr.io/hatchet-dev/hatchet/hatchet-engine:latest"
  Normal  Pulled     85s   kubelet            Successfully pulled image "ghcr.io/hatchet-dev/hatchet/hatchet-engine:latest" in 426.680289ms (426.700126ms including waiting)
  Normal  Created    85s   kubelet            Created container engine
  Normal  Started    85s   kubelet            Started container engine

chaitanyakoodoo avatar Aug 13 '24 05:08 chaitanyakoodoo

Hey @abelanger5, regarding point 3, the ~2 GB PutWorkflow payload is not something we are passing explicitly; it seems to happen when initialising Hatchet. However, the same setup works fine when we test it against the SaaS version of Hatchet.

aaronlewisblenchal avatar Aug 13 '24 07:08 aaronlewisblenchal

Thanks, I think I know what's happening - I'm pretty sure this limit is being set on the client side, not the server (I've been trying to recreate this with the Go client, but gRPC clients are not consistent with this type of configuration). I'll test the TypeScript SDK later today and make the value configurable there as well if it turns out to be the problem.

The very large payload is an issue; it would be good to track down why this is happening. Could you share a stubbed-out version of the workflow that you're defining?

abelanger5 avatar Aug 13 '24 10:08 abelanger5

Hi @abelanger5, I have tested the quickstart steps (https://docs.hatchet.run/self-hosting/kubernetes-quickstart). I am able to create a workflow using port forwarding, but when I try using the ingress I get the error below. Can you please send the correct configuration I need to add?

If possible, can we connect on a call to resolve the issue?

🪓 71136 | 08/09/24, 07:18:32 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:37 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:42 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:47 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
🪓 71136 | 08/09/24, 07:18:52 PM [ERROR/Admin] Error: /WorkflowService/PutWorkflow RESOURCE_EXHAUSTED: Received message larger than max (1752460652 vs 4194304)
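
For comparison, a sketch of the port-forwarding path that was reported to work (service name and namespace taken from output earlier in this thread):

# forward the engine's gRPC port locally and point the worker at localhost:7070
kubectl port-forward svc/hatchet-engine 7070:7070 -n hatchet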

chaitanyakoodoo avatar Aug 13 '24 11:08 chaitanyakoodoo

Absolutely, you can grab a time with one of us here: https://cal.com/team/hatchet/founders

abelanger5 avatar Aug 13 '24 12:08 abelanger5

Closing this issue as I believe we've tracked down all causes of this error. For reference, this error shows up for the following reasons:

  1. Payloads are larger than 4MB, or a step depends on parent step outputs whose combined payload size is larger than 4MB. To avoid this, you can set the env var SERVER_GRPC_MAX_MSG_SIZE on the server. Depending on the client, a corresponding limit may need to be set in the SDKs. In the Python SDK, this corresponds to HATCHET_GRPC_MAX_RECV_MESSAGE_LENGTH and HATCHET_GRPC_MAX_SEND_MESSAGE_LENGTH (see the sketch after this list). I've created https://github.com/hatchet-dev/hatchet-typescript/issues/367 for the Typescript SDK.

  2. There is an issue with the SSL configuration between the client and server which can sometimes manifest as a RESOURCE_EXHAUSTED error. The most common cause of this seems to be a Cloudflare proxy which requires SSL, and most users have had luck turning off the CF proxy to avoid this issue. Other causes seem to be setting HATCHET_CLIENT_TLS_STRATEGY=none when TLS is actually required.

  3. A connection reset occurs from a different proxy, such as nginx-ingress, caused by an idle timeout when the worker does not receive messages for the proxy's client read timeout. All clients will automatically reconnect, so this shouldn't be an issue, except for a warning log in the console.
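
On point 1, a sketch of where those limits would be set, using the env var names from this thread (the byte values are illustrative, not recommendations):

# server side (engine container env, as in the Helm values above)
SERVER_GRPC_MAX_MSG_SIZE=2147483648

# Python SDK worker environment (names from the comment above)
HATCHET_GRPC_MAX_RECV_MESSAGE_LENGTH=104857600   # 100 MiB
HATCHET_GRPC_MAX_SEND_MESSAGE_LENGTH=104857600

On point 2, it may be worth noting that the reported size in this thread, 1752460652, is 0x68746D6C, i.e. the ASCII bytes "html": the client appears to be parsing an HTML error page returned by a proxy as a gRPC length prefix, which is consistent with an SSL/proxy misconfiguration rather than a genuinely large payload.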

abelanger5 avatar Sep 23 '24 16:09 abelanger5

I am receiving the following error:

2024/12/05 06:04:59 [error] 8530#8530: *11782395 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 49.249.188.66, server: engine.hatchet.mydomain.com, request: "POST /WorkflowService/PutWorkflow HTTP/2.0", upstream: "grpc://10.244.1.222:7070", host: "engine.hatchet.mydomain.com"
49.249.188.66 - - [05/Dec/2024:06:04:59 +0000] "POST /WorkflowService/PutWorkflow HTTP/2.0" 502 150 "-" "grpc-python/1.67.1 grpc-c/44.0.0 (linux; chttp2)" 1555 0.003 [hatchet-hatchet-engine-7070] [] 10.244.1.222:7070 0 0.004 502 f9abc773a05054a71dccf219e13a293e

I have self-hosted using Helm, running nginx-ingress and Cloudflare DNS.

Is this issue related or similar?

aswanthkrishna avatar Dec 05 '24 06:12 aswanthkrishna