
[Data] Ray Data doesn't account for object store memory from object dtypes

Open bveeramani opened this issue 1 year ago • 3 comments

What happened + What you expected to happen

I ran a batch inference workload where one of my UDFs returns rows with PIL Images. I observed spilling, and then my node failed.

The reason is that we use pd.DataFrame.memory_usage to compute the size of pandas blocks: https://github.com/ray-project/ray/blob/842bbcf4236e41f58d25058b0482cd05bfe9e4da/python/ray/data/_internal/pandas_block.py#L271-L273

But memory_usage() doesn't include the memory of object dtypes like PIL images:

The + symbol indicates that the true memory usage could be higher, because pandas does not count the memory used by values in columns with dtype=object.

https://pandas.pydata.org/docs/user_guide/gotchas.html#df-memory-usage
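The behavior is easy to see directly in pandas, independent of Ray. This is a minimal sketch: `memory_usage()` counts only the 8-byte pointers in an object-dtype column, while `deep=True` asks each element for its own size.

```python
import numpy as np
import pandas as pd

# One object-dtype column holding a ~1 MiB numpy array.
df = pd.DataFrame({"data": [np.zeros(1024 * 1024, dtype=np.uint8)]})

# Default: only the pointer array (plus the index) is counted.
shallow = int(df.memory_usage().sum())

# deep=True: pandas asks each stored object for its size as well.
deep = int(df.memory_usage(deep=True).sum())

print(shallow)  # a few hundred bytes
print(deep)     # roughly 1 MiB larger
```

Note that `deep=True` works here because numpy arrays report their buffer size; it is not a general fix for arbitrary objects, as shown below.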

Versions / Dependencies

842bbcf4236e41f58d25058b0482cd05bfe9e4da

Reproduction script

You'll see estimated object store memory is less than 32 KiB, but you'll observe substantial object spilling.

import numpy as np

import ray


class Object:
    def __init__(self):
        # Each `Object` occupies >1 GiB of object store memory.
        self.data = np.zeros((1024 * 1024 * 1024), dtype=np.uint8)


def generate_data(row):
    return {"data": Object()}


ds = ray.data.range(100).map(generate_data)
for _ in ds.iter_batches(batch_size=None, batch_format="pandas"):
    pass
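For wrapper objects like the `Object` above (or PIL Images), even switching to `memory_usage(deep=True)` would not fully help: deep introspection only calls each element's `__sizeof__`, which for a plain Python instance does not include attributes like `self.data`. A small sketch of that gap, with the payload shrunk to 1 MiB to keep it quick:

```python
import numpy as np
import pandas as pd


class Object:
    def __init__(self):
        # Stand-in for the repro's 1 GiB buffer.
        self.data = np.zeros(1024 * 1024, dtype=np.uint8)


df = pd.DataFrame({"data": [Object()]})

# Shallow: just the pointer for the single object-dtype element.
shallow = int(df["data"].memory_usage(index=False))

# Deep: adds the instance's own size, but NOT the 1 MiB held by self.data.
deep = int(df["data"].memory_usage(index=False, deep=True))

print(shallow)  # 8: one 64-bit object pointer
print(deep)     # still tiny; the array behind self.data is invisible
```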

Issue Severity

High: It blocks me from completing my task.

bveeramani avatar Apr 08 '24 21:04 bveeramani

@raulchen FYI

bveeramani avatar Apr 08 '24 21:04 bveeramani

Seems potentially related to #44507.

omatthew98 avatar Apr 11 '24 23:04 omatthew98

Just dropping this here to remind me/us to follow up with Canva in this thread when this gets closed. I got pinged that the prerequisite was closed on the Ray Core side: https://github.com/anyscale/product/issues/27659

Details: User: stefan-canva · Channel: external-canva-anyscale · Time: 2024-03-29 7:12:53

Message: Hi Anyscale team! I had this error for the first time just now:

tches(AestheticScorerWorker)))
At least one of the input arguments for this task could not be computed:
ray.exceptions.ObjectFetchTimedOutError: Failed to retrieve object 379ce838985ae5efffffffffffffffffffffffff1400000002000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.

Basically the job running from this workspace (https://console.anyscale.com/o/canva-org/workspaces/expwrk_je8q4pmpmlpkz2bqqj33gxeiu7/ses_sb4axbylswdarsfv64wmsetla9) hanged for a couple of minutes just before finishing, and then it threw this error. I used this compute config (https://console.anyscale.com/o/canva-org/workspaces/edit/expwrk_je8q4pmpmlpkz2bqqj33gxeiu7?config=compute-config), and I'm now trying a new one with num_cpus=0 on the head node in the hopes that this might help.

Would be great to get some more info on how this can be prevented! Thanks

alexr-anyscale avatar May 15 '24 15:05 alexr-anyscale

Now that https://github.com/ray-project/ray/pull/45071 is merged, Data can use it to get object sizes.

rynewang avatar Jun 07 '24 21:06 rynewang

Verified this issue is now fixed by #45272. With the PyArrow object extension, we support saving objects directly in PyArrow tables, and size reporting is correct.

raulchen avatar Jul 25 '24 20:07 raulchen

This issue isn't entirely fixed.

While it's true that this is no longer an issue if the blocks are Arrow tables, you'll still run into it if the blocks are pandas blocks. This can happen if you use the "pandas" batch format, or if you use APIs like drop_columns that use the "pandas" batch format under the hood.

Here's a simple repro:

import ray


def generate_data(batch):
    for _ in range(8):
        yield {"data": [[b"\x00" * 128 * 1024 * 1024]]}


ds = (
    ray.data.range(1, override_num_blocks=1)
    .map_batches(generate_data, batch_size=1)
    .map_batches(lambda batch: batch, batch_format=...)  # set to "pandas" or "pyarrow"
)

for bundle in ds.iter_internal_ref_bundles():
    print(f"num_rows={bundle.num_rows()} size_bytes={bundle.size_bytes()}")

Output with pandas:

num_rows=8 size_bytes=192

Output with PyArrow:

num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
num_rows=1 size_bytes=134217748
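Sanity-checking the numbers above with plain arithmetic (the 128-byte figure for the pandas index overhead is an inference consistent with the output, not something the report itself breaks down):

```python
payload = 128 * 1024 * 1024  # bytes yielded per row in the repro above

# PyArrow's per-row figure is the full payload plus a small fixed
# metadata overhead.
print(134217748 - payload)  # 20

# pandas reports 192 bytes for all 8 rows combined, consistent with
# 8 object pointers (8 bytes each) plus a 128-byte index, and none of
# the ~1 GiB of actual payloads.
print(8 * 8 + 128)  # 192
```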

bveeramani avatar Oct 22 '24 01:10 bveeramani

I'm going to create a new issue just to track the Pandas part in particular (and clean up this thread)

richardliaw avatar Nov 02 '24 00:11 richardliaw