[Data] Ray Data doesn't account for object store memory from object dtypes
What happened + What you expected to happen
I ran a batch inference workload where one of my UDFs returns rows with PIL Images. I observed spilling, and then my node failed.
The reason is that we use pd.DataFrame.memory_usage to compute the size of pandas blocks:
https://github.com/ray-project/ray/blob/842bbcf4236e41f58d25058b0482cd05bfe9e4da/python/ray/data/_internal/pandas_block.py#L271-L273
But memory_usage() doesn't include the memory of object dtypes like PIL images:
The + symbol indicates that the true memory usage could be higher, because pandas does not count the memory used by values in columns with dtype=object.
https://pandas.pydata.org/docs/user_guide/gotchas.html#df-memory-usage
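For illustration, here's a minimal sketch (not taken from the Ray codebase) showing the undercounting directly with pandas. Even memory_usage(deep=True) only calls sys.getsizeof on each object, which doesn't recurse into attributes such as a nested NumPy array:

import sys

import numpy as np
import pandas as pd


class Object:
    def __init__(self):
        # ~1 GiB payload hidden behind an object-dtype value.
        self.data = np.zeros(1024 * 1024 * 1024, dtype=np.uint8)


df = pd.DataFrame({"data": [Object()]})
# Shallow accounting only counts the 8-byte pointer per row (plus the index).
print(df.memory_usage())
# deep=True calls sys.getsizeof on each object; it doesn't recurse into
# attributes, so the ~1 GiB array is still invisible.
print(df.memory_usage(deep=True))
print(sys.getsizeof(Object()))  # A few dozen bytes, not ~1 GiB.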
Versions / Dependencies
842bbcf4236e41f58d25058b0482cd05bfe9e4da
Reproduction script
You'll see estimated object store memory is less than 32 KiB, but you'll observe substantial object spilling.
import numpy as np

import ray


class Object:
    def __init__(self):
        # Each `Object` occupies >1 GiB of object store memory.
        self.data = np.zeros((1024 * 1024 * 1024), dtype=np.uint8)


def generate_data(row):
    return {"data": Object()}


ds = ray.data.range(100).map(generate_data)
for _ in ds.iter_batches(batch_size=None, batch_format="pandas"):
    pass
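As a quicker way to see the underestimate (rather than waiting for spilling), you can print Ray Data's own per-block size estimates for the dataset defined above. This is a sketch that assumes a Ray version where Dataset.iter_internal_ref_bundles (the developer API used later in this thread) is available:

# Prints Ray Data's estimated size for each block of `ds` from the repro above.
# The reported size_bytes stays tiny even though each row holds ~1 GiB of data.
for bundle in ds.iter_internal_ref_bundles():
    print(f"num_rows={bundle.num_rows()} size_bytes={bundle.size_bytes()}")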
Issue Severity
High: It blocks me from completing my task.
@raulchen FYI
Seems potentially related to #44507.
Just dropping this here to remind me/us to follow up with Canva in this thread when this gets closed -- I got pinged that the prerequisite was closed on the Ray Core side: https://github.com/anyscale/product/issues/27659
Details: User: stefan-canva · Channel: external-canva-anyscale · Time: 2024-03-29 7:12:53
Message: Hi Anyscale team! I had this error for the first time just now:
tches(AestheticScorerWorker)))
At least one of the input arguments for this task could not be computed:
ray.exceptions.ObjectFetchTimedOutError: Failed to retrieve object 379ce838985ae5efffffffffffffffffffffffff1400000002000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
Basically the job running from this workspace (https://console.anyscale.com/o/canva-org/workspaces/expwrk_je8q4pmpmlpkz2bqqj33gxeiu7/ses_sb4axbylswdarsfv64wmsetla9) hung for a couple of minutes just before finishing, and then it threw this error. I used this compute config (https://console.anyscale.com/o/canva-org/workspaces/edit/expwrk_je8q4pmpmlpkz2bqqj33gxeiu7?config=compute-config); I'm now trying a new one with num_cpus=0 (https://console.anyscale.com/o/canva-org/workspaces/edit/expwrk_je8q4pmpmlpkz2bqqj33gxeiu7?config=compute-config) on the head node in the hopes that this might help.
Would be great to get some more info on how this can be prevented! Thanks
Now that https://github.com/ray-project/ray/pull/45071 is merged, Data can use it to get object sizes.
Verified this issue is now fixed by #45272: with the PyArrow object extension, we support saving objects directly in PyArrow tables, and size reporting is correct.
This issue isn't entirely fixed.
While it's true that this is no longer an issue if the blocks are Arrow tables, you'll still run into it if the blocks are pandas blocks. This can happen if you use the "pandas" batch format, or if you use APIs like drop_columns that use the "pandas" batch format under the hood.
Here's a simple repro:
import ray


def generate_data(batch):
    for _ in range(8):
        yield {"data": [[b"\x00" * 128 * 1024 * 1024]]}


ds = (
    ray.data.range(1, override_num_blocks=1)
    .map_batches(generate_data, batch_size=1)
    # Set batch_format to "pandas" or "pyarrow" to compare the two outputs below.
    .map_batches(lambda batch: batch, batch_format=...)
)

for bundle in ds.iter_internal_ref_bundles():
    print(f"num_rows={bundle.num_rows()} size_bytes={bundle.size_bytes()}")
Output with pandas:
num_rows=8 size_bytes=192
Output with PyArrow:
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
I'm going to create a new issue to track the pandas part in particular (and clean up this thread).