modin `read_csv` defaults to pandas in case of reading from buffer

Open anmyachev opened this issue 3 years ago • 1 comment

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
Modin version (modin.__version__): 7a36071c0b00e0392615a0dd9d5c2ddd5f7c0d27
Python version: 3.8.13
Code we can use to reproduce:

import modin.pandas as pd
import pandas


df = pd.DataFrame({"col1": [1,2,3,4,5], "col2": [2,3,4,5,6]})

unique_filename = "test_read_csv_buffer.csv"
df.to_csv(unique_filename)

with open(unique_filename) as buffer:
    df_pandas = pandas.read_csv(buffer)
    buffer.seek(0)
    df_modin = pd.read_csv(buffer)
    print(df_pandas, "\n")
    print(df_modin)

Describe the problem

Code was added some time ago to speed up this case by loading the data through the file path instead of the file handle; however, that code is not currently used correctly. The original variable (`buffer`) is passed on, rather than the filename that our function derives from the buffer (held in `filepath_or_buffer_md`), so `read_csv` still defaults to pandas.

Wrong place: https://github.com/modin-project/modin/blob/7a36071c0b00e0392615a0dd9d5c2ddd5f7c0d27/modin/core/io/text/text_file_dispatcher.py#L990
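
To make the mistake described above concrete, here is a minimal sketch of the pattern. It is not Modin's actual code: `resolve_local_path` and `read_with_fast_path` are hypothetical names, and the only assumption is that a file handle exposes the path it was opened from through its `name` attribute.

```python
import os


def resolve_local_path(filepath_or_buffer):
    """Return the backing file path of an open handle, else the input unchanged.

    Hypothetical helper for illustration only; not part of Modin.
    """
    name = getattr(filepath_or_buffer, "name", None)
    # Only trust `name` when it is a string that points at a real file;
    # in-memory buffers and streams such as stdin fall through unchanged.
    if isinstance(name, str) and os.path.isfile(name):
        return name
    return filepath_or_buffer


def read_with_fast_path(filepath_or_buffer, reader):
    """Call `reader` with the resolved path so the path-based route is taken."""
    resolved = resolve_local_path(filepath_or_buffer)
    # The issue describes the equivalent of `reader(filepath_or_buffer)` here:
    # the path is computed, but the original handle is what gets passed on.
    return reader(resolved)
```

With this shape, a caller that passes an open file object ends up on the same code path as a caller that passes the filename directly.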

Source code / logs

UserWarning: Ray execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:

    import ray
    ray.init(runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}})

UserWarning: Distributing <class 'dict'> object. This may take some time.
UserWarning: For performance reasons, the filepath will be used in place of the file handle passed in to load the data
UserWarning: Parameters provided defaulting to pandas implementation.
Please refer to https://modin.readthedocs.io/en/stable/supported_apis/defaulting_to_pandas.html for explanation.
   Unnamed: 0  col1  col2
0           0     1     2
1           1     2     3
2           2     3     4
3           3     4     5
4           4     5     6

   Unnamed: 0  col1  col2
0           0     1     2
1           1     2     3
2           2     3     4
3           3     4     5
4           4     5     6

Jul 18 '22 12:07 anmyachev

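Once the problem above is fixed, a second one needs fixing as well. It was initially found in https://github.com/modin-project/modin/pull/4283#discussion_r919048273 and concerns a buffer with a non-zero starting position: pandas continues reading from the buffer's current position, while Modin reads the file from the beginning.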

Reproducer:

import modin.pandas as pd
import pandas


df = pd.DataFrame({"col1": [1,2,3,4,5], "col2": [2,3,4,5,6]})

unique_filename = "test_read_csv_buffer.csv"
df.to_csv(unique_filename)

with open(unique_filename) as buffer:
    buffer.readlines(2)
    df_pandas = pandas.read_csv(buffer)
    buffer.seek(0)
    buffer.readlines(2)
    df_modin = pd.read_csv(buffer)
    print(df_pandas, "\n")
    print(df_modin)

Output:

   0  1  2
0  1  2  3
1  2  3  4
2  3  4  5
3  4  5  6

   Unnamed: 0  col1  col2
0           0     1     2
1           1     2     3
2           2     3     4
3           3     4     5
4           4     5     6
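
One way to keep the path-based read consistent with the handle is to record the handle's offset and seek to it after reopening the file. The following is a sketch under stated assumptions, not Modin's implementation: `read_rows_via_path` is a hypothetical name, the handle is assumed to be opened in binary mode (on a text-mode handle, `tell()` can raise `OSError` after line iteration), the file is assumed to be UTF-8, and the standard-library `csv` module stands in for the real parser.

```python
import csv


def read_rows_via_path(handle):
    """Reopen the file behind a binary handle and resume at the handle's offset.

    Hypothetical helper for illustration only; not part of Modin.
    """
    offset = handle.tell()  # position the caller left the handle at
    with open(handle.name, "rb") as fresh:
        # Without this seek, lines the caller already consumed are read again,
        # which is the mismatch shown in the output above.
        fresh.seek(offset)
        text = fresh.read().decode("utf-8")
    # csv.reader accepts any iterable of strings, one per line.
    return list(csv.reader(text.splitlines()))
```

A handle at position 0 yields every row, header included; a handle that has already consumed the header line yields only the data rows.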

Jul 18 '22 13:07 anmyachev
