[Core] Function to determine in-object-store size of just-yielded object
Description
An API to determine the in-object-store size of an object that we just yielded.
This could be exposed as either a get_size_of_last_output API or a callback hook.
Use case
Ray Data accounts for the sizes of objects when making scheduling decisions.
Currently, we use pd.DataFrame.memory_usage to estimate the size of data "blocks." However, this estimate can be inaccurate, and as a result Ray Data can make bad scheduling decisions (see https://github.com/ray-project/ray/issues/44577).
Another approach is to serialize "blocks" to estimate their size, but this is inefficient since we'd serialize the data twice (once to determine the size, and again when we place it in the object store).
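For reference, the serialization-based estimate looks roughly like this. This is a sketch only: it uses stdlib `pickle` as a stand-in for Ray's actual serializer, and `estimate_block_size` is a hypothetical helper, not an existing Ray Data API. It illustrates the double-serialization cost: this pass is pure overhead on top of the serialization Ray already performs when storing the object.

```python
import pickle

import pandas as pd


def estimate_block_size(block) -> int:
    """Estimate a block's serialized size by serializing it once.

    Sketch only: stdlib pickle stands in for Ray's serializer, so the
    byte count is an approximation of the in-object-store size.
    """
    return len(pickle.dumps(block, protocol=pickle.HIGHEST_PROTOCOL))


# A block with string (object-dtype) data; distinct strings avoid pickle memoization.
df = pd.DataFrame({"text": ["x" * 1_000 + str(i) for i in range(100)]})
print(estimate_block_size(df))  # roughly 100 KB of string payload, plus overhead
```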
Having an API as described would enable Ray Data to make informed scheduling decisions with minimal performance overhead.
(Concretely, we'd use this API after line 425. b_out is the "block", and m_out is the associated metadata like size)
https://github.com/ray-project/ray/blob/9fb9d75208b1a2a36f48deff17ea9fa22d347c45/python/ray/data/_internal/execution/operators/map_operator.py#L419-L428
@jjyao let's review at next sprint planning
Hi @bveeramani could you explain why pd.DataFrame.memory_usage is inaccurate? e.g., is the actual memory usage higher than what the DataFrame reports? It sounds like a bug in its own right.
@rynewang it's because pandas doesn't count memory usage from object dtypes:
> The + symbol indicates that the true memory usage could be higher, because pandas does not count the memory used by values in columns with dtype=object.
For example, if your columns contain strings or lists, pandas doesn't count that data at all. This can lead pandas to think a DataFrame is only KBs when in reality it's MBs or GBs.
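To illustrate the undercount, here's a minimal example using the standard pandas API (the exact byte counts will vary by pandas version). The default `memory_usage()` counts only the 8-byte object pointers for an object-dtype column, while `deep=True` also walks the Python objects themselves:

```python
import pandas as pd

# One object-dtype column holding 100 strings of 10,000 characters each.
df = pd.DataFrame({"text": ["x" * 10_000] * 100})

shallow = df.memory_usage().sum()           # default: counts only object pointers
deep = df.memory_usage(deep=True).sum()     # also counts the string payloads

print(shallow, deep)  # shallow is ~1 KB; deep is ~1 MB
```

Note that `deep=True` closes the gap but has to introspect every element, which is itself too slow to run on every block in a hot path, hence the appeal of asking the object store for the size it already computed.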