modin
`read_csv` defaults to pandas in case of reading from buffer
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
- Modin version (modin.__version__): 7a36071c0b00e0392615a0dd9d5c2ddd5f7c0d27
- Python version: 3.8.13
- Code we can use to reproduce:
import modin.pandas as pd
import pandas
df = pd.DataFrame({"col1": [1,2,3,4,5], "col2": [2,3,4,5,6]})
unique_filename = "test_read_csv_buffer.csv"
df.to_csv(unique_filename)
with open(unique_filename) as buffer:
    df_pandas = pandas.read_csv(buffer)
    buffer.seek(0)
    df_modin = pd.read_csv(buffer)
print(df_pandas, "\n")
print(df_modin)
Describe the problem
Code was added some time ago to speed up this case, but it is currently not used correctly: the original variable (buffer) is passed on instead of the filename associated with the buffer, which our function computes (name in filepath_or_buffer_md). As a result, read_csv defaults to pandas.
Wrong place: https://github.com/modin-project/modin/blob/7a36071c0b00e0392615a0dd9d5c2ddd5f7c0d27/modin/core/io/text/text_file_dispatcher.py#L990
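To illustrate the idea of using the filename instead of the handle, here is a minimal sketch. It is not Modin's actual code: `path_from_handle` is a hypothetical helper name, and the rule it applies (use the handle's `name` attribute only when it points at a regular file on disk) is an assumption made for illustration.

```python
import io
import os
import tempfile


def path_from_handle(filepath_or_buffer):
    """Hypothetical helper: return an on-disk path for the input, else None."""
    if isinstance(filepath_or_buffer, (str, os.PathLike)):
        return os.fspath(filepath_or_buffer)
    name = getattr(filepath_or_buffer, "name", None)
    # Handles opened from a file descriptor expose an int as .name, and
    # in-memory buffers such as io.StringIO have no .name attribute.
    if isinstance(name, str) and os.path.isfile(name):
        return name
    return None


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "test_read_csv_buffer.csv")
    with open(csv_path, "w") as f:
        f.write(",col1,col2\n0,1,2\n")

    # A real file handle resolves to the path it was opened from.
    with open(csv_path) as buffer:
        resolved = path_from_handle(buffer)

    # An in-memory buffer has no path, so the caller must keep using it.
    in_memory = path_from_handle(io.StringIO("a,b\n1,2\n"))
```

The point of the sketch is that whatever the helper returns has to be the value handed to the reader; forwarding the original `buffer` discards the resolved name.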
Source code / logs
UserWarning: Ray execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:
import ray
ray.init(runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}})
UserWarning: Distributing <class 'dict'> object. This may take some time.
UserWarning: For performance reasons, the filepath will be used in place of the file handle passed in to load the data
UserWarning: Parameters provided defaulting to pandas implementation.
Please refer to https://modin.readthedocs.io/en/stable/supported_apis/defaulting_to_pandas.html for explanation.
   Unnamed: 0  col1  col2
0           0     1     2
1           1     2     3
2           2     3     4
3           3     4     5
4           4     5     6

   Unnamed: 0  col1  col2
0           0     1     2
1           1     2     3
2           2     3     4
3           3     4     5
4           4     5     6
Once the problem above is fixed, another one needs to be fixed as well. It was initially found in https://github.com/modin-project/modin/pull/4283#discussion_r919048273 (a buffer with a non-zero starting position).
Reproducer:
import modin.pandas as pd
import pandas
df = pd.DataFrame({"col1": [1,2,3,4,5], "col2": [2,3,4,5,6]})
unique_filename = "test_read_csv_buffer.csv"
df.to_csv(unique_filename)
with open(unique_filename) as buffer:
    buffer.readlines(2)
    df_pandas = pandas.read_csv(buffer)
    buffer.seek(0)
    buffer.readlines(2)
    df_modin = pd.read_csv(buffer)
print(df_pandas, "\n")
print(df_modin)
Output:
   0  1  2
0  1  2  3
1  2  3  4
2  3  4  5
3  4  5  6

   Unnamed: 0  col1  col2
0           0     1     2
1           1     2     3
2           2     3     4
3           3     4     5
4           4     5     6
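One possible way to handle the non-zero starting position is to substitute the file path only when the handle is still at the beginning of the file, and otherwise keep reading from the handle. The sketch below uses only the standard library; `can_use_path` is a hypothetical name, and this guard is an assumption about how a fix could look, not the fix that was actually merged.

```python
import os
import tempfile


def can_use_path(buffer):
    """Hypothetical guard: True only if the handle is still at offset 0."""
    try:
        return buffer.tell() == 0
    except (AttributeError, OSError):
        # No tell() at all, a non-seekable stream, or a text handle whose
        # tell() was disabled by iteration: keep reading from the handle.
        return False


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "test_read_csv_buffer.csv")
    with open(csv_path, "wb") as f:
        f.write(b",col1,col2\n0,1,2\n1,2,3\n")

    with open(csv_path, "rb") as buffer:
        at_start = can_use_path(buffer)      # nothing consumed yet
        buffer.readline()                    # consume the header line
        after_header = can_use_path(buffer)  # position is now non-zero
        position = buffer.tell()
        rest_from_handle = buffer.read()     # what a reader of the handle sees

    # Reopening by path always starts from the first byte again.
    with open(csv_path, "rb") as f:
        whole_file = f.read()
```

Here the handle is opened in binary mode so that `tell()` returns a plain byte offset. Reading by path returns the header line again, which corresponds to the mismatch between the two outputs above.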