Concat is slow
Hi, I want to concatenate 5100 dataframes into one big dataframe; together they use about 160 GB of memory. However, processing is really slow: I have been waiting for about 40 minutes and it still hasn't finished. Here is my code:
import modin.pandas as pd
import os
import ray
from tqdm import tqdm

ray.init()

def merge_feather_files(folder_path, suffix):
    # Collect all file names matching the suffix
    files = [f for f in os.listdir(folder_path) if f.endswith(suffix + '.feather')]
    # Read all files and concatenate them
    df_list = [pd.read_feather(os.path.join(folder_path, file)) for file in tqdm(files)]
    merged_df = pd.concat(df_list)
    # Sort by the date column
    # merged_df.sort_values(by='date', inplace=True)
    merged_df.to_feather('factors_date/factors' + suffix + '.feather')

# Usage example
folder_path = 'factors_batch'  # folder containing the .feather files
for i in range(1, 25):
    merge_feather_files(folder_path, f'_{i}')
My machine has 128 cores, and I am certain that not all of them are being utilized.
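To narrow down which step is actually slow, one could time the read, concat, and write phases separately. The sketch below assumes the same folder layout as the script above; note that Modin may execute work asynchronously, so the final write, which materializes the full frame, can absorb time from the earlier steps.

import os
import time

import modin.pandas as pd
import ray

ray.init()

folder_path = 'factors_batch'  # same assumed layout as the script above
suffix = '_1'

t0 = time.perf_counter()
files = [f for f in os.listdir(folder_path) if f.endswith(suffix + '.feather')]
df_list = [pd.read_feather(os.path.join(folder_path, f)) for f in files]
t1 = time.perf_counter()
print(f'read:   {t1 - t0:.1f} s')

merged_df = pd.concat(df_list)
t2 = time.perf_counter()
print(f'concat: {t2 - t1:.1f} s')

merged_df.to_feather('factors_date/factors' + suffix + '.feather')
t3 = time.perf_counter()
print(f'write:  {t3 - t2:.1f} s')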
Here is the warning and error output:
UserWarning: `DataFrame.to_feather` is not currently supported by PandasOnRay, defaulting to pandas implementation.
Please refer to https://modin.readthedocs.io/en/stable/supported_apis/defaulting_to_pandas.html for explanation.
(raylet) [2024-01-18 12:25:56,935 E 1383634 1383634] (raylet) node_manager.cc:3022: 6 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 695b81b72bd60e890b66e98b555a4048a90dbdd0909377d214beb80f, IP: 192.168.8.180) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 192.168.8.180`
(raylet)
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
(raylet) [2024-01-18 12:26:59,006 E 1383634 1383634] (raylet) node_manager.cc:3022: 8 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 695b81b72bd60e890b66e98b555a4048a90dbdd0909377d214beb80f, IP: 192.168.8.180) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 192.168.8.180`
(raylet)
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
(raylet) [2024-01-18 12:28:01,160 E 1383634 1383634] (raylet) node_manager.cc:3022: 8 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 695b81b72bd60e890b66e98b555a4048a90dbdd0909377d214beb80f, IP: 192.168.8.180) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 192.168.8.180`
(raylet)
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
(raylet) [2024-01-18 12:29:07,013 E 1383634 1383634] (raylet) node_manager.cc:3022: 5 Workers (tasks / actors) killed due to memory pressure (OOM), 0 Workers crashed due to other reasons at node (ID: 695b81b72bd60e890b66e98b555a4048a90dbdd0909377d214beb80f, IP: 192.168.8.180) over the last time period. To see more information about the Workers killed on this node, use `ray logs raylet.out -ip 192.168.8.180`
(raylet)
(raylet) Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
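As the raylet message above suggests, the OOM kill threshold can be raised (or the memory monitor disabled) via environment variables. A sketch for a single-node setup follows; the threshold value is illustrative only, and on a multi-node cluster these variables would need to be set wherever the raylets are started.

import os

# Illustrative values only; set before Ray is started so the raylet inherits them.
os.environ['RAY_memory_usage_threshold'] = '0.98'    # raise the OOM kill threshold
# os.environ['RAY_memory_monitor_refresh_ms'] = '0'  # or disable worker killing entirely

import ray
ray.init()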
The reason seems to be the same as in #6865 (`to_feather` defaulting to pandas).
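Since the defaulting-to-pandas `to_feather` materializes the entire merged DataFrame in the driver process, which matches the OOM kills in the raylet log, one possible workaround (a sketch, not a confirmed fix) is to write the merged frames to a format whose writer does not default to pandas, e.g. Parquet. Whether `to_parquet` stays distributed depends on the Modin version; the absence of the "defaulting to pandas" warning at runtime would confirm it.

import modin.pandas as pd
import os
import ray
from tqdm import tqdm

ray.init()

def merge_feather_files(folder_path, suffix):
    files = [f for f in os.listdir(folder_path) if f.endswith(suffix + '.feather')]
    df_list = [pd.read_feather(os.path.join(folder_path, file)) for file in tqdm(files)]
    merged_df = pd.concat(df_list)
    # Write Parquet instead of Feather; check the console for a
    # "defaulting to pandas" warning to see whether this path is distributed.
    merged_df.to_parquet('factors_date/factors' + suffix + '.parquet')

folder_path = 'factors_batch'
for i in range(1, 25):
    merge_feather_files(folder_path, f'_{i}')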