Concat is slow
Hi, I want to concatenate 5100 dataframes into one big dataframe; together they use about 160 GB of memory. However, processing is really slow: I have been waiting for about 40 minutes and it still hasn't finished. Here is my code:
import modin.pandas as pd
import os
import ray
from tqdm import tqdm

ray.init()

def merge_feather_files(folder_path, suffix):
    # Collect all file names matching the suffix
    files = [f for f in os.listdir(folder_path) if f.endswith(suffix + '.feather')]
    # Read all files and concatenate them
    df_list = [pd.read_feather(os.path.join(folder_path, file)) for file in tqdm(files)]
    merged_df = pd.concat(df_list)
    # Sort by the date column
    # merged_df.sort_values(by='date', inplace=True)
    merged_df.to_feather('factors_date/factors' + suffix + '.feather')

# Usage example
folder_path = 'factors_batch'  # folder containing the .feather files
for i in range(1, 25):
    merge_feather_files(folder_path, f'_{i}')
My machine has 128 cores, and I am certain that not all of them are being utilized.
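To narrow down which step is actually slow, one could time the read, concat, and write phases separately. The sketch below assumes the same folder layout as the script above; note that Modin may execute work asynchronously, so the final write, which materializes the full frame, can absorb time from the earlier steps.

import os
import time

import modin.pandas as pd
import ray

ray.init()

folder_path = 'factors_batch'  # same assumed layout as the script above
suffix = '_1'

t0 = time.perf_counter()
files = [f for f in os.listdir(folder_path) if f.endswith(suffix + '.feather')]
df_list = [pd.read_feather(os.path.join(folder_path, f)) for f in files]
t1 = time.perf_counter()
print(f'read:   {t1 - t0:.1f} s')

merged_df = pd.concat(df_list)
t2 = time.perf_counter()
print(f'concat: {t2 - t1:.1f} s')

merged_df.to_feather('factors_date/factors' + suffix + '.feather')
t3 = time.perf_counter()
print(f'write:  {t3 - t2:.1f} s')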
Here is the warning and error output:
UserWarning: `DataFrame.to_feather` is not currently supported by PandasOnRay, defaulting to pandas implementation.
Please refer to https://modin.readthedocs.io/en/stable/supported_apis/defaulting_to_pandas.html for explanation.
(raylet) [2024-01-18 12:25:56,935 E 1383634 1383634] (raylet) node_manager.cc:3022: 6 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 695b81b72bd60e890b66e98b555a4048a90dbdd0909377d214beb80f, IP: 192.168.8.180) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 192.168.8.180`
(raylet)
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
(raylet) [2024-01-18 12:26:59,006 E 1383634 1383634] (raylet) node_manager.cc:3022: 8 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 695b81b72bd60e890b66e98b555a4048a90dbdd0909377d214beb80f, IP: 192.168.8.180) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 192.168.8.180`
(raylet)
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
(raylet) [2024-01-18 12:28:01,160 E 1383634 1383634] (raylet) node_manager.cc:3022: 8 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 695b81b72bd60e890b66e98b555a4048a90dbdd0909377d214beb80f, IP: 192.168.8.180) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 192.168.8.180`
(raylet)
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
(raylet) [2024-01-18 12:29:07,013 E 1383634 1383634] (raylet) node_manager.cc:3022: 5 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 695b81b72bd60e890b66e98b555a4048a90dbdd0909377d214beb80f, IP: 192.168.8.180) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 192.168.8.180`
(raylet)
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
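As the raylet message above suggests, the OOM kill threshold can be raised (or the memory monitor disabled) via environment variables. A sketch for a single-node setup follows; the threshold value is illustrative only, and on a multi-node cluster these variables would need to be set wherever the raylets are started.

import os

# Illustrative values only; set before Ray is started so the raylet inherits them.
os.environ['RAY_memory_usage_threshold'] = '0.98'    # raise the OOM kill threshold
# os.environ['RAY_memory_monitor_refresh_ms'] = '0'  # or disable worker killing entirely

import ray
ray.init()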
The reason seems to be the same as in #6865 (`to_feather` defaulting to pandas).
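Since the defaulting-to-pandas `to_feather` materializes the entire merged DataFrame in the driver process, which matches the OOM kills in the raylet log, one possible workaround (a sketch, not a confirmed fix) is to write the merged frames to a format whose writer does not default to pandas, e.g. Parquet. Whether `to_parquet` stays distributed depends on the Modin version; the absence of the "defaulting to pandas" warning at runtime would confirm it.

import modin.pandas as pd
import os
import ray
from tqdm import tqdm

ray.init()

def merge_feather_files(folder_path, suffix):
    files = [f for f in os.listdir(folder_path) if f.endswith(suffix + '.feather')]
    df_list = [pd.read_feather(os.path.join(folder_path, file)) for file in tqdm(files)]
    merged_df = pd.concat(df_list)
    # Write Parquet instead of Feather; check the console for a
    # "defaulting to pandas" warning to see whether this path is distributed.
    merged_df.to_parquet('factors_date/factors' + suffix + '.parquet')

folder_path = 'factors_batch'
for i in range(1, 25):
    merge_feather_files(folder_path, f'_{i}')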