Orion stability degrades with a large number of task runs

Open BitTheByte opened this issue 1 year ago • 3 comments

First check

  • [X] I added a descriptive title to this issue.
  • [X] I used the GitHub search to find a similar issue and didn't find it.
  • [X] I searched the Prefect documentation for this issue.
  • [X] I checked that this issue is related to Prefect and not one of its dependencies.

Bug summary

Prefect Orion's stability seems to degrade when faced with a relatively moderate number of tasks (~30,000).

  1. The flow graph endpoint fails completely with a 500 error, apparently because it is unable to handle a graph of this size, as shown at:
https://app.prefect.cloud/account/a5aa4a76-f2a5-4e07-bb53-f0e7ff684f5b/workspace/f480ae44-ee5d-467a-ba06-85a65419eb76/flow-run/e4a20073-69f4-4a9f-8795-e0ca8a4529eb/radar
  2. For some reason, task submission and data retrieval times start to increase and become unstable, causing the workflow to fail.

As mapping is a core feature, it's very likely that a large number of task submissions will occur.

Reproduction

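from prefect import flow, task


@task
def dummy(number):
    # Trivial task so the load comes purely from the number of task runs
    return number + 1


@flow
def dev_test():
    items = list(range(500_000))  # 500,000 mapped task runs
    dummy.map(items)


if __name__ == "__main__":
    dev_test()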
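
To narrow down where the slowdown begins (the summary mentions roughly 30,000 task runs), the input could be split into smaller batches before mapping. The following is only a sketch with no Prefect dependency; `chunked` and the batch size are hypothetical names and values chosen for illustration, not part of Prefect:

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


# Hypothetical batch size, chosen for illustration only.
batches = list(chunked(range(100_000), 30_000))
print([len(b) for b in batches])  # [30000, 30000, 30000, 10000]
```

Inside the flow, each batch could then be passed to `dummy.map(batch)` in turn, which may make it easier to see at what size submission time starts to grow.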

Error

No response

Versions

Version:             2.3.2
API version:         0.8.0
Python version:      3.9.5
Git commit:          6e931ee9
Built:               Tue, Sep 6, 2022 12:36 PM
OS/Arch:             linux/x86_64
Profile:             default
Server type:         ephemeral
Server:
  Database:          sqlite
  SQLite version:    3.31.1

Additional context

No response

BitTheByte avatar Sep 11 '22 04:09 BitTheByte

Hi! Thanks for the issue.

  • Can you include your version output in your post?
  • Can you share a UTC timestamp for the 500 response?
  • What kind of workflow failures are you encountering? Can you share some tracebacks?

zanieb avatar Sep 12 '22 18:09 zanieb

@madkinsz This issue affects cloud, server, and ephemeral API

Can you include your version output in your post?

Description updated

Can you share a UTC timestamp for the 500 response?

Currently, I'm experiencing login issues with Prefect Cloud because GitHub login is not working. I'll update this section once I have the required information.

What kind of workflow failures are you encountering? Can you share some tracebacks?

Mostly I see "Flow run has encountered unknown error"; however, the errors range from "Event loop is closed" to 500 errors from the cloud (for which I can't provide a stack trace) and internal errors caused by messed-up internal state.

It can be easily reproduced with the code sample I provided, though it may require a run or two.

BitTheByte avatar Sep 12 '22 18:09 BitTheByte

If you're encountering this with a local server, it'd be helpful if you shared the server-side logs when a 500 is returned. Similarly, tracebacks for flow run crashes with debug level logs enabled will be helpful. Once someone starts working on this we'll definitely use your example to reproduce it and ensure it's fixed, but including tracebacks here helps us narrow down the scope of the issue for triage and ensure that we're fixing the same bug that you're encountering.
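
One rough way to collect a traceback together with a UTC timestamp is sketched below. It assumes that Prefect picks up the `PREFECT_LOGGING_LEVEL` setting from the environment when it is imported; `run_and_capture` is a hypothetical helper, and the failing lambda merely stands in for the real flow call:

```python
import os
import traceback
from datetime import datetime, timezone

# Assumption: Prefect reads PREFECT_LOGGING_LEVEL from the environment,
# so it is set here before the flow module would be imported.
os.environ["PREFECT_LOGGING_LEVEL"] = "DEBUG"


def run_and_capture(fn):
    """Call fn(); on failure return a UTC timestamp and the full traceback."""
    try:
        fn()
    except Exception:
        return {
            "utc": datetime.now(timezone.utc).isoformat(),
            "traceback": traceback.format_exc(),
        }
    return None


# Stand-in for the flow call, used only to show the shape of the report.
report = run_and_capture(lambda: 1 / 0)
print(report["utc"])
print(report["traceback"])
```

In practice the lambda would be replaced by the flow function (for example `dev_test`), and the printed report could be pasted into the issue alongside the server-side logs.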

zanieb avatar Sep 12 '22 18:09 zanieb