Metadata files generated with RayTaskRunner
Bug summary
Issue description
I don't know if it's a bug or desired behavior, but some metadata files are generated each time I run my flows locally. That's annoying because the files land in my source directory (or wherever I run the flows/tasks from). I'd like more information, please, on what these files are and whether we can generate them somewhere else or, ideally, not generate them at all.
It generates files with names like 89e55eaee58e8ce3567e87801196d9d5 in the same folder from which I call the Python script (see below), with the following content:
{
  "metadata": {
    "storage_key": "/Users/<path to my local source dir>/89e55eaee58e8ce3567e87801196d9d5",
    "expiration": null,
    "serializer": {
      "type": "pickle",
      "picklelib": "cloudpickle",
      "picklelib_version": null
    },
    "prefect_version": "3.1.2",
    "storage_block_id": null
  },
  "result": "gAVLAS4=\n"
}
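For what it's worth, the "result" field is just the task's return value: per the "serializer" block it is a base64-encoded cloudpickle payload. A minimal sketch (mine, not from the report) of how one of these files can be inspected, assuming the filename shown above:

import base64
import json
import pickle  # this particular payload is readable by the stdlib unpickler

# Hypothetical: open one of the generated files in the working directory.
with open("89e55eaee58e8ce3567e87801196d9d5") as f:
    record = json.load(f)

payload = base64.b64decode(record["result"])  # b64decode ignores the trailing "\n"
print(pickle.loads(payload))  # -> 1, the value returned by taskA below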
The minimal reproducible Python script is:
from prefect import flow, task
from prefect_ray import RayTaskRunner


@task(log_prints=True, persist_result=True)
def taskA():
    print("Task A")
    return 1


@flow(log_prints=True, persist_result=True, task_runner=RayTaskRunner)
def myFlow():
    print("In my flow")
    taskA.submit().wait()
    return 0


myFlow()
Version info
Version: 3.1.2
API version: 0.8.4
Python version: 3.11.9
Git commit: 02b99f0a
Built: Tue, Nov 12, 2024 1:38 PM
OS/Arch: darwin/arm64
Profile: local
Server type: server
Pydantic version: 2.8.2
Integrations:
prefect-ray: 0.4.2
Additional context
Some notes:
- These files are not generated when I remove the RayTaskRunner or when I set persist_result to False.
- I saw these files being generated when I upgraded prefect from 3.0.0rc14 to 3.1.1 in my code base, and I reproduced it in this minimal example.
- I've tried to change the server config's PREFECT_LOCAL_STORAGE_PATH to /tmp/result but it didn't help.
- Screenshot of the flow and task run from the minimal Python code
Hey @dqueruel-fy - those files are a consequence of persisting task and flow results.
I've tried to change the server config's PREFECT_LOCAL_STORAGE_PATH to /tmp/result but it didn't help
This setting takes effect at workflow runtime, so setting it on the server will have no effect (all server configuration is prefixed with PREFECT_SERVER_). If you set it within the process in which your workflows execute, you should see the desired behavior.
For more information, check out the documentation on results and settings.
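As a concrete sketch of "setting it within the process that executes the workflow" (my own illustration, not code from this thread; it assumes the environment variable is read by the workflow process at startup), the variable can be exported in the shell before launching the script, or set in Python before Prefect is imported:

import os

# Assumed approach: set the variable in the process that runs the flow, before
# Prefect loads its settings. Exporting it in the shell, e.g.
# PREFECT_LOCAL_STORAGE_PATH=/tmp/result python my_script.py, is equivalent.
os.environ["PREFECT_LOCAL_STORAGE_PATH"] = "/tmp/result"

from prefect import flow, task

@task(persist_result=True)
def taskA():
    return 1

@flow(persist_result=True)
def myFlow():
    return taskA()

myFlow()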
hi @dqueruel-fy - yes this sounds like expected behavior, that metadata is your serialized result
» PREFECT_LOCAL_STORAGE_PATH=/tmp/result ipython
In [1]: from prefect import task
In [2]: @task(persist_result=True)
   ...: def f():
   ...:     return 42
   ...:
In [3]: f()
16:35:23.491 | INFO | Task run 'f' - Finished in state Completed()
Out[3]: 42
In [4]: !ls /tmp/result
109c10d275731f842f4b08dd51b397aa
when you say
I've tried to change the server config's PREFECT_LOCAL_STORAGE_PATH to /tmp/result but it didn't help
... was about to type the same as @cicdw above, nevermind
@zzstoatzz @cicdw thanks for your quick answers! I do understand that the files need to be generated, but I don't understand why they are generated in my code base. My mention of PREFECT_LOCAL_STORAGE_PATH was probably misleading: I meant that both the default value and /tmp/result produce the same issue (files generated in my source code).
These files are generated in the directory from which I call the Python scripts.
Have you tried my minimal Python script? If you run it from, let's say, ~/Download and you get the same behavior as me, you'll have new files generated in ~/Download.
I guess that the expected behavior is to have these files generated in the PREFECT_LOCAL_STORAGE_PATH, not in the directory from which I call the script, right?
Ah, I think I have a suspicion for what's going on! A few details (some of which are repetitive just for completeness' sake):
- Whenever PREFECT_LOCAL_STORAGE_PATH is not set (and there is no default storage block either), the default storage location for results is the present working directory, as you've seen.
- This setting must be set on the client that executes the workflow to take effect.
- Ray uses multiple processes (or machines, but it sounds like you are running Ray locally on one machine) for distributing work.
- Setting this as an environment variable in one runtime but not in the runtime of the Ray workers will cause any tasks executed on the workers to not pick up the setting.
If my suspicion is correct and you are only setting this on the "parent" process that executes the flow and not on the Ray workers, the easiest solution is probably to use a .env file or a prefect.toml file to persist this setting across all processes started in that directory.
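One way to check which value each runtime actually resolves (a diagnostic sketch of mine, assuming the PREFECT_LOCAL_STORAGE_PATH setting object exposed by prefect.settings) is to print it from both the flow process and a Ray-executed task:

from prefect import flow, task
from prefect.settings import PREFECT_LOCAL_STORAGE_PATH
from prefect_ray import RayTaskRunner

@task
def show_storage_path():
    # Runs on a Ray worker, so this reflects what that worker's environment resolves.
    print("task sees:", PREFECT_LOCAL_STORAGE_PATH.value())

@flow(task_runner=RayTaskRunner)
def check_settings():
    # Runs in the "parent" process that launched the flow.
    print("flow sees:", PREFECT_LOCAL_STORAGE_PATH.value())
    show_storage_path.submit().wait()

check_settings()

If the two lines print different paths, the workers are not picking up the setting.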
Thanks @cicdw for your insight!
So I've tested again using prefect config set PREFECT_LOCAL_STORAGE_PATH.
The resulting profile settings are:
% prefect config view
🚀 you are connected to:
http://127.0.0.1:4200
PREFECT_PROFILE=<profile>
PREFECT_API_URL='http://127.0.0.1:4200/api' (from profile)
PREFECT_LOCAL_STORAGE_PATH='/tmp/test' (from profile)
And when running my example script, I still get one file generated in /tmp/test (the flow's result?) and one in my current working directory (the task's?).
I've also tried providing the env var to the RayTaskRunner like this, but that didn't help.
@flow(log_prints=True, persist_result=True, task_runner=RayTaskRunner(init_kwargs={"runtime_env": {"env_vars": {"PREFECT_LOCAL_STORAGE_PATH": "/tmp/test"}}}))
def myFlow():
    ...
Could you provide more information on how to use the .env or the prefect.toml files, please?
hi @dqueruel-fy - I am looking into this now (this seems like a bug).
it looks like when running in Ray, the task is unable to discover the parent context's result store and falls back to a default, relative path
will update hopefully soon!
I don't think prefect.toml helps in the context of this issue, but if you're generally curious I'd check this out.
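Until a fix lands, one possible workaround (my own sketch; not confirmed in this thread to sidestep the Ray discovery issue, and the block name and basepath are made up) is to attach an explicit result storage block to the task and flow so the location doesn't depend on what each worker process resolves:

from prefect import flow, task
from prefect.filesystems import LocalFileSystem
from prefect_ray import RayTaskRunner

# Save a storage block once so it can be loaded by name from any process;
# "ray-results" and the basepath are placeholder choices for this sketch.
LocalFileSystem(basepath="/tmp/prefect-results").save("ray-results", overwrite=True)

@task(persist_result=True, result_storage=LocalFileSystem.load("ray-results"))
def taskA():
    return 1

@flow(
    persist_result=True,
    result_storage=LocalFileSystem.load("ray-results"),
    task_runner=RayTaskRunner,
)
def myFlow():
    taskA.submit().wait()
    return 0

myFlow()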