Metadata files generated with RayTaskRunner

Open dqueruel-fy opened this issue 1 year ago • 6 comments

Bug summary

Issue description

I don't know if it's a bug or desired behavior, but some metadata files are generated each time I run my flows locally. That's annoying because the files are generated in my source directory (or wherever I run the flows/tasks from). I'd like more information, please, on what these files are and whether we can generate them somewhere else or, ideally, not generate them at all.

It generates files with filenames like 89e55eaee58e8ce3567e87801196d9d5 in the folder from which I run the Python script (see below), with the following content:

{
    "metadata": {
        "storage_key": "/Users/<path to my local source dir>/89e55eaee58e8ce3567e87801196d9d5",
        "expiration": null,
        "serializer": {
            "type": "pickle",
            "picklelib": "cloudpickle",
            "picklelib_version": null
        },
        "prefect_version": "3.1.2",
        "storage_block_id": null
    },
    "result": "gAVLAS4=\n"
}
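For readers wondering what the "result" field holds: it looks like the base64-encoded pickle of the return value. Below is a minimal stdlib-only sketch, assuming the payload is plain base64 text wrapping a pickle stream (which the "serializer" metadata suggests). It does not use any Prefect API, and unpickling is only safe for data you trust.

```python
import base64
import pickle

# The "result" field copied from the metadata file above.
encoded = "gAVLAS4=\n"

# Assumption: the payload is base64 text wrapping a pickle stream, as the
# "serializer" metadata suggests. Only unpickle data from a trusted source.
raw = base64.b64decode(encoded)  # the trailing newline is ignored by default
value = pickle.loads(raw)

print(value)
```

Here the decoded value is 1, which matches the return value of taskA in the script below.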

The minimal reproducible python script is

from prefect import flow, task
from prefect_ray import RayTaskRunner

@task(log_prints=True, persist_result=True)
def taskA():
    print("Task A")
    return 1

@flow(log_prints=True, persist_result=True, task_runner=RayTaskRunner)
def myFlow():
    print("In my flow")
    taskA.submit().wait()
    return 0

myFlow()

Version info

Version:             3.1.2
API version:         0.8.4
Python version:      3.11.9
Git commit:          02b99f0a
Built:               Tue, Nov 12, 2024 1:38 PM
OS/Arch:             darwin/arm64
Profile:             local
Server type:         server
Pydantic version:    2.8.2
Integrations:
  prefect-ray:       0.4.2

Additional context

Some notes:

  • These files are not generated when I remove the RayTaskRunner or when I set persist_result to False.
  • I saw these files being generated when I upgraded prefect from 3.0.0rc14 to 3.1.1 in my code base, and I reproduced it in this minimal example.
  • I've tried to change the server config's PREFECT_LOCAL_STORAGE_PATH to /tmp/result but it didn't help.
  • Screenshot of the flow and task runs from the minimal Python code: [image]

dqueruel-fy, Nov 13 '24 22:11

Hey @dqueruel-fy - those files are a consequence of persisting task and flow results.

I've tried to change the server config's PREFECT_LOCAL_STORAGE_PATH to /tmp/result but it didn't help

This setting takes effect at workflow runtime, so setting it on the server has no effect (all server configuration is prefixed with PREFECT_SERVER_). If you set it within the process in which your workflows execute, you should see the desired behavior.

For more information, check out the documentation on results and settings:

cicdw, Nov 13 '24 22:11

hi @dqueruel-fy - yes this sounds like expected behavior, that metadata is your serialized result

» PREFECT_LOCAL_STORAGE_PATH=/tmp/result ipython

In [1]: from prefect import task

In [2]: @task(persist_result=True)
   ...: def f():
   ...:     return 42
   ...:

In [3]: f()
16:35:23.491 | INFO    | Task run 'f' - Finished in state Completed()
Out[3]: 42

In [4]: !ls /tmp/result
109c10d275731f842f4b08dd51b397aa

when you say

I've tried to change the server config's PREFECT_LOCAL_STORAGE_PATH to /tmp/result but it didn't help

... was about to type the same as @cicdw above, nevermind 🙂

zzstoatzz, Nov 13 '24 22:11

@zzstoatzz @cicdw thanks for your quick answers! I do understand that the files need to be generated, but I don't understand why they are generated in my code base. My mention of PREFECT_LOCAL_STORAGE_PATH was probably misleading: I meant that with either the default value or /tmp/result, I still see the same issue (files generated in my source directory).

These files are generated from where I call the python scripts.

Have you tried running my minimal Python script from, let's say, ~/Download? If you see the same behavior as I do, new files will be generated in ~/Download.

I guess the expected behavior is to have these files generated in PREFECT_LOCAL_STORAGE_PATH, not in the directory from which I call the script, right?

dqueruel-fy, Nov 14 '24 14:11

Ah, I think I have a suspicion about what's going on! A few details (some of which are repetitive, just for completeness' sake):

  • whenever PREFECT_LOCAL_STORAGE_PATH is not set (and when there is no default storage block either), the default storage location for results is the present working directory as you've seen
  • this setting must be set on the client that executes the workflow to take effect
  • Ray uses multiple processes (or machines, but it sounds like you are running Ray locally on one machine) for distributing work
  • setting this as an environment variable in one runtime but not in the runtime of the Ray workers will cause any tasks executed on the workers to not pick up the setting

If my suspicion is correct that you are only setting this setting on the "parent" process that executes the flow and not on the Ray workers, the easiest solution is probably to use a .env file or prefect.toml file to persist this setting across all processes started in that directory.
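The inheritance behavior described in the bullets above can be illustrated with a stdlib-only sketch. It models the described fallback only; it is not Prefect's or Ray's actual code. The stand-in worker (CHILD) and the helper worker_storage_dir are hypothetical names: the child process reports PREFECT_LOCAL_STORAGE_PATH if its environment contains it, and otherwise its own working directory.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for a worker process: it prints the directory it would
# write results to, falling back to its own working directory when the
# environment variable is absent (this models the behavior described above).
CHILD = (
    "import os; "
    "print(os.environ.get('PREFECT_LOCAL_STORAGE_PATH') or os.getcwd())"
)


def worker_storage_dir(env, cwd):
    """Run the stand-in worker with the given environment and working directory."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        cwd=cwd,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


# Environment without the setting, and a copy with the setting added.
base_env = {k: v for k, v in os.environ.items() if k != "PREFECT_LOCAL_STORAGE_PATH"}
with_var = dict(base_env, PREFECT_LOCAL_STORAGE_PATH="/tmp/result")

if __name__ == "__main__":
    here = os.getcwd()
    print("with variable:   ", worker_storage_dir(with_var, here))
    print("without variable:", worker_storage_dir(base_env, here))
```

With the variable present, the child reports /tmp/result; without it, the child reports whatever directory it was launched from, which mirrors result files showing up in the source directory.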

cicdw, Nov 14 '24 16:11

Thanks @cicdw for your insight!

So I've tested again using prefect config set PREFECT_LOCAL_STORAGE_PATH. The resulting server settings are:

% prefect config view
🚀 you are connected to:
http://127.0.0.1:4200
PREFECT_PROFILE=<profile>
PREFECT_API_URL='http://127.0.0.1:4200/api' (from profile)
PREFECT_LOCAL_STORAGE_PATH='/tmp/test' (from profile)

And when running my example script, I still get one file generated in /tmp/test (the flow's one?) and one in my current working directory (the task's one?).

I've also tried providing the env var to the RayTaskRunner like this, but that didn't help:

@flow(log_prints=True, persist_result=True, task_runner=RayTaskRunner(init_kwargs={"runtime_env": {"env_vars":{"PREFECT_LOCAL_STORAGE_PATH": "/tmp/test"}}}))
def myFlow():
   ...

Could you provide more information on how to use the .env or the prefect.toml files, please?

dqueruel-fy, Nov 14 '24 20:11

hi @dqueruel-fy - I am looking into this now (this seems like a bug).

it looks like when in ray, the task is unable to discover the parent context's result store and falls back to a default, relative path

will update hopefully soon!

I don't think prefect.toml helps in the context of this issue, but if you're generally curious I'd check this out.

zzstoatzz, Nov 14 '24 20:11