
suggestions on handling out of memory matrix operation on large dataset

Open SiRumCz opened this issue 1 year ago • 11 comments

One of my tasks is to compute a matrix transformation on large feature vectors. For example, I have a 20K x 60K DataFrame `df` and a 60K x 60K square DataFrame `mat_df`, and I call `df.dot(mat_df)`. Because such a large matrix operation cannot run on a single machine, I was hoping that a distributed dataframe could help, and even provide better scalability if I want to work on even larger datasets.

I am running on Ray Clusters, and my system is configured as the following:

  1. Head: 16 cores, 64GB memory, >100GB disk space
     - `--object-store-memory 15000000000` # set object store to 15GB
     - `--system-config='{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/mnt/nas/tmp/spill\"}}"}'` # I have mounted an 8TB NAS and use it for spilled objects
     - `--num-cpus 2` # I tried to limit the cores/workers to avoid OOM issues
  2. Raylet: 16 cores, 64GB memory, >100GB disk space
     - `--object-store-memory 10000000000` # set object store to 10GB
     - `--num-cpus 4` # limited cores/workers to avoid OOM
  3. Modin env vars:
     - `MODIN_EXPERIMENTAL_GROUPBY=True` # I need to enable this to make my apply(func, axis=1) work; otherwise, my func cannot access all of the columns
     - `MODIN_ENGINE=ray` # Ray engine
     - `MODIN_RAY_CLUSTER=True` # indicates that the client code is running on a Ray Cluster
     - `MODIN_NPARTITIONS=2` # another attempt to avoid OOM
  4. Due to environment constraints, I am currently running on Python 3.8.10.
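For completeness, the `ray start` commands implied by the list above look roughly like this (this is my own sketch, with `<head_addr>` as a placeholder; note that the inner quotes of the spilling config value must be escaped, since it is a JSON string embedded inside a JSON object):

```shell
# Head node (assumed sketch; <head_addr> is a placeholder, not my real address).
ray start --head \
  --num-cpus=2 \
  --object-store-memory=15000000000 \
  --system-config='{"object_spilling_config":"{\"type\":\"filesystem\",\"params\":{\"directory_path\":\"/mnt/nas/tmp/spill\"}}"}'

# Worker (raylet) node, pointing at the head node.
ray start --address=<head_addr>:6379 \
  --num-cpus=4 \
  --object-store-memory=10000000000
```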
INSTALLED VERSIONS
------------------
commit           : b5545c686751f4eed6913f1785f9c68f41f4e51d
python           : 3.8.10.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.15.0-86-generic
Version          : #96~20.04.1-Ubuntu SMP Thu Sep 21 13:23:37 UTC 2023
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

Modin dependencies
------------------
modin            : 0.23.1
ray              : 2.7.1
dask             : 2023.5.0
distributed      : 2023.5.0
hdk              : None

pandas dependencies
-------------------
pandas           : 2.0.3
numpy            : 1.24.4
pytz             : 2023.3.post1
dateutil         : 2.8.2
setuptools       : 56.0.0
pip              : 23.3
Cython           : None
pytest           : 7.4.2
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : None
pandas_datareader: None
bs4              : 4.12.2
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : 2023.9.2
gcsfs            : None
matplotlib       : 3.7.3
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 13.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.10.1
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
zstandard        : None
tzdata           : 2023.3
qtpy             : None
pyqt5            : None

My running example is as follows:

import modin.pandas as pd
import numpy as np

n_features = 60000
n_samples = 20000
features = [f'f{i+1}' for i in range(n_features)]
samples = [f's{i+1}' for i in range(n_samples)]

df = pd.DataFrame(np.random.rand(n_samples, n_features), columns=features, index=samples)
mat_df = pd.DataFrame(np.random.rand(n_features, n_features), columns=features, index=features) 
res_df = df.dot(mat_df)

The Python program will likely be killed during dot():

>>> df = df.dot(mat_df)
(raylet) Spilled 66383 MiB, 28 objects, write throughput 64 MiB/s.
Killed

Some warning and error messages that I've seen. Object spilling:

(raylet) Spilled 9156 MiB, 5 objects, write throughput 75 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.

killing worker(s):

(raylet) [2023-10-24 09:56:57,431 E 390581 390581] (raylet) node_manager.cc:3007: 1 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: a343585c2e9b086071e513b714c99c79de4badb6303e52e3f44985d2, IP: <raylet_addr>) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip <raylet_addr>`

object spilling request failure:

(raylet) [2023-10-24 09:57:00,615 E 390581 390581] (raylet) local_object_manager.cc:360: Failed to send object spilling request: GrpcUnavailable: RPC Error message: Socket closed; RPC Error details:

worker not registering on time:

(raylet) [2023-10-24 10:04:30,463 E 390581 390581] (raylet) worker_pool.cc:548: Some workers of the worker process(393356) have not registered within the timeout. The process is still alive, probably it's hanging during start.

So far I have tried to 1) limit workers, 2) limit partitions, and 3) set a smaller object store memory to try to utilize object spilling, but I still could not resolve the OOM issue. At this point, I would like to hear your opinions and suggestions on how I could handle the OOM problem in the matrix operation, either by tweaking my Ray/Modin configurations or by improving my code. Is it feasible at all on my setup in the first place? If so, how do I determine the limits on the size of my feature vectors and the resources needed to do the 60K one?
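As a back-of-envelope check (my own estimate, not from the Modin docs), the float64 operands in the example above plus the result already need roughly 45 GiB before counting any intermediate copies, which is close to the memory of a single 64GB node:

```python
# Rough memory footprint of the operands in the running example (float64 = 8 bytes).
n_samples, n_features = 20_000, 60_000

bytes_per_elem = 8
gib = 1024 ** 3

df_bytes = n_samples * n_features * bytes_per_elem    # df: 20K x 60K
mat_bytes = n_features * n_features * bytes_per_elem  # mat_df: 60K x 60K
res_bytes = n_samples * n_features * bytes_per_elem   # result: 20K x 60K

print(f"df:     {df_bytes / gib:.1f} GiB")   # ~8.9 GiB
print(f"mat_df: {mat_bytes / gib:.1f} GiB")  # ~26.8 GiB
print(f"result: {res_bytes / gib:.1f} GiB")  # ~8.9 GiB
print(f"total:  {(df_bytes + mat_bytes + res_bytes) / gib:.1f} GiB")  # ~44.7 GiB
```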

SiRumCz avatar Oct 24 '23 14:10 SiRumCz

Hello @SiRumCz! Have you tried the opposite: increasing the object store? Could you try increasing it up to 60% of host RAM (as Modin does by default for one host), without setting MODIN_NPARTITIONS?

anmyachev avatar Oct 25 '23 14:10 anmyachev

Hello @SiRumCz! Have you tried the opposite: increasing the object store? Could you try increasing it up to 60% of host RAM (as Modin does by default for one host), without setting MODIN_NPARTITIONS?

do you mean this? storage – [Experimental] Specify a URI for persistent cluster-wide storage. This storage path must be accessible by all nodes of the cluster, otherwise an error will be raised. This option can also be specified as the RAY_STORAGE env var.

SiRumCz avatar Oct 25 '23 14:10 SiRumCz

No, I mean using --object-store-memory

anmyachev avatar Oct 25 '23 15:10 anmyachev

@anmyachev Yes, I have tried everything from 1GB all the way up to 60GB; they all seem to behave similarly, except that the program hangs at 1GB, with and without setting MODIN_NPARTITIONS.

SiRumCz avatar Oct 25 '23 15:10 SiRumCz

@SiRumCz consider this setup:

  • --object-store-memory 38000000000 (for both nodes)
  • --num-cpus 4 (for both nodes)
  • MODIN_NPARTITIONS=32

The main way to limit memory usage now is to reduce the number of CPUs (which reduces concurrency) and increase MODIN_NPARTITIONS (which makes each partition smaller).

Please also note: the current dot implementation is not the most efficient, since the right operand is materialized into a pandas DataFrame and passed in its entirety to each process. This is not so bad because it uses shared memory, but in memory-intensive operations such as dot it greatly increases the simultaneous use of RAM.
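To illustrate why the full materialization hurts (a plain NumPy sketch of a blocked product, not Modin's actual internals): the result can in principle be built one column block of the right operand at a time, so no single task ever needs the whole 60K x 60K matrix in memory at once:

```python
import numpy as np

def blocked_dot(left, right, block_cols=1024):
    """Compute left @ right one column block of `right` at a time.

    Peak extra memory is one block of `right` plus one block of the result,
    instead of the entire right operand.
    """
    n_out = right.shape[1]
    out = np.empty((left.shape[0], n_out), dtype=np.result_type(left, right))
    for start in range(0, n_out, block_cols):
        stop = min(start + block_cols, n_out)
        out[:, start:stop] = left @ right[:, start:stop]
    return out

# Sanity check on toy sizes.
rng = np.random.default_rng(0)
a = rng.random((6, 8))
b = rng.random((8, 10))
assert np.allclose(blocked_dot(a, b, block_cols=3), a @ b)
```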

anmyachev avatar Oct 25 '23 19:10 anmyachev

@anmyachev I have tried the configs as you suggested; unfortunately it still gets killed. I don't quite get why increasing MODIN_NPARTITIONS would help reduce concurrency; from the docs, I thought that increasing the value would also increase the number of tasks assigned to the workers.

This is not so bad because it uses shared memory

One thing I noticed is that during the dot operation, the process seems to put all the workload on one node: one of my nodes uses almost 100% of its memory, while the other just does some light work with lots of free memory. I wonder if there's a way to utilize all of my nodes and their memory.

SiRumCz avatar Oct 27 '23 02:10 SiRumCz

I don't quite get why increasing MODIN_NPARTITIONS would help reduce concurrency, from the doc, I thought if I increase the value, it would also increase how many tasks to be assigned to the workers.

@SiRumCz although this variable increases the number of partitions that are processed separately, the number of partitions that can be processed in parallel is limited by the number of CPUs.

An example of what it looks like to process 32 partitions on 2 CPUs (obtained using the `ray.timeline` function):

[image: ray.timeline trace of 32 partitions processed on 2 CPUs]
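In other words (my own summary of the explanation above), with P partitions and C CPUs the tasks execute in roughly ceil(P / C) sequential waves, so only about C partitions' worth of working memory is live at any moment:

```python
import math

def scheduling_waves(n_partitions, n_cpus):
    # Only n_cpus tasks run concurrently; the remaining partitions queue
    # behind them, so the work proceeds in sequential "waves".
    return math.ceil(n_partitions / n_cpus)

# 32 partitions on 2 CPUs, as in the ray.timeline trace above:
print(scheduling_waves(32, 2))  # 16 waves of 2 concurrent tasks
```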

anmyachev avatar Oct 30 '23 13:10 anmyachev

@anmyachev thanks for the explanation. What about memory usage with Ray clusters? Modin doesn't seem to utilize the memory of both nodes when doing the dot operation on DataFrames: from my observation, only one node is used heavily while the other node just idles with lots of available memory. Is it possible for me to utilize these as well?

SiRumCz avatar Oct 30 '23 18:10 SiRumCz

@anmyachev thanks for the explanation. What about memory usage with Ray clusters? Modin doesn't seem to utilize the memory of both nodes when doing the dot operation on DataFrames: from my observation, only one node is used heavily while the other node just idles with lots of available memory. Is it possible for me to utilize these as well?

Please also note: the current dot implementation is not the most efficient, since the right operand is materialized into a pandas dataframe and passed in its entirety to each process.

@SiRumCz now I am sure that the OOM error appears at the moment of materialization of the right operand (memory consumption reaches more than 100GB). If the code got past this point, the load of performing the dot operation would be divided equally between your nodes.

So this is not a fundamental problem with running code on a cluster, but a flaw in the implementation of the dot operation. We will see what can be done about this.

anmyachev avatar Oct 30 '23 19:10 anmyachev

@anmyachev Happy new year! Just checking in: any updates on this issue?

SiRumCz avatar Jan 02 '24 14:01 SiRumCz

@anmyachev Happy new year! Just checking in: any updates on this issue?

@SiRumCz Happy new year! :)

Unfortunately, there are no updates at the moment. I would like to try to get rid of the materialization of the right operand, which should reduce memory consumption and give a performance boost as well, but I haven't had the free time yet.

anmyachev avatar Jan 08 '24 16:01 anmyachev