
`read_csv` defaults to pandas when reading from a buffer

Open anmyachev opened this issue 3 years ago • 1 comment

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • Modin version (modin.__version__): 7a36071c0b00e0392615a0dd9d5c2ddd5f7c0d27
  • Python version: 3.8.13
  • Code we can use to reproduce:
import modin.pandas as pd
import pandas


df = pd.DataFrame({"col1": [1,2,3,4,5], "col2": [2,3,4,5,6]})

unique_filename = "test_read_csv_buffer.csv"
df.to_csv(unique_filename)

with open(unique_filename) as buffer:
    df_pandas = pandas.read_csv(buffer)
    buffer.seek(0)
    df_modin = pd.read_csv(buffer)
    print(df_pandas, "\n")
    print(df_modin)

Describe the problem

Code was added some time ago to speed up this case, but it is currently not used correctly: the original variable (buffer) is passed on instead of the filename that is associated with the buffer and computed by our function (name in filepath_or_buffer_md). As a result, `read_csv` defaults to the pandas implementation.

Faulty line: https://github.com/modin-project/modin/blob/7a36071c0b00e0392615a0dd9d5c2ddd5f7c0d27/modin/core/io/text/text_file_dispatcher.py#L990
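
To make the mix-up concrete, here is a minimal sketch of the intended behavior. It is not Modin's actual code: `resolve_path_from_buffer` and `read_with_path` are hypothetical names introduced for illustration, and the sketch assumes the buffer is a regular open file whose `.name` attribute holds its path.

```python
import os


def resolve_path_from_buffer(filepath_or_buffer):
    """Return the on-disk path behind an open file handle, if there is one.

    Hypothetical helper for illustration only; it is not Modin's code.
    Anything that is not backed by a regular file is returned unchanged.
    """
    name = getattr(filepath_or_buffer, "name", None)
    # Open file objects expose the path they were opened with as `.name`;
    # in-memory buffers such as io.StringIO have no such attribute.
    if isinstance(name, str) and os.path.isfile(name):
        return name
    return filepath_or_buffer


def read_with_path(reader, filepath_or_buffer, **kwargs):
    """Call `reader` with the resolved path rather than the original handle."""
    resolved = resolve_path_from_buffer(filepath_or_buffer)
    # The bug described above amounts to passing `filepath_or_buffer` here
    # instead of `resolved`.
    return reader(resolved, **kwargs)
```

With a helper like this, the reader receives a plain path whenever one can be recovered, so the path-based fast path stays available. In-memory buffers are passed through untouched.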

Source code / logs

UserWarning: Ray execution environment not yet initialized. Initializing...
To remove this warning, run the following python code before doing dataframe operations:

    import ray
    ray.init(runtime_env={'env_vars': {'__MODIN_AUTOIMPORT_PANDAS__': '1'}})

UserWarning: Distributing <class 'dict'> object. This may take some time.
UserWarning: For performance reasons, the filepath will be used in place of the file handle passed in to load the data
UserWarning: Parameters provided defaulting to pandas implementation.
Please refer to https://modin.readthedocs.io/en/stable/supported_apis/defaulting_to_pandas.html for explanation.
   Unnamed: 0  col1  col2
0           0     1     2
1           1     2     3
2           2     3     4
3           3     4     5
4           4     5     6

   Unnamed: 0  col1  col2
0           0     1     2
1           1     2     3
2           2     3     4
3           3     4     5
4           4     5     6

anmyachev, Jul 18 '22 12:07

Once the problem above is fixed, another one needs fixing as well. It was initially found in https://github.com/modin-project/modin/pull/4283#discussion_r919048273 (a buffer with a non-zero starting position).

Reproducer:

import modin.pandas as pd
import pandas


df = pd.DataFrame({"col1": [1,2,3,4,5], "col2": [2,3,4,5,6]})

unique_filename = "test_read_csv_buffer.csv"
df.to_csv(unique_filename)

with open(unique_filename) as buffer:
    buffer.readlines(2)
    df_pandas = pandas.read_csv(buffer)
    buffer.seek(0)
    buffer.readlines(2)
    df_modin = pd.read_csv(buffer)
    print(df_pandas, "\n")
    print(df_modin)

Output:

   0  1  2
0  1  2  3
1  2  3  4
2  3  4  5
3  4  5  6

   Unnamed: 0  col1  col2
0           0     1     2
1           1     2     3
2           2     3     4
3           3     4     5
4           4     5     6
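
In the output above, pandas continues from the buffer's current position, while the second frame starts from the beginning of the file, which is what reading by filename would produce. One possible guard, sketched here under assumptions and not necessarily the fix Modin adopted, is to swap the buffer for its filename only when the buffer is still at position 0. `buffer_is_at_start` and `can_swap_buffer_for_path` are hypothetical names.

```python
import os


def buffer_is_at_start(buffer):
    """Return True only if the buffer reports that it is at position 0."""
    try:
        return buffer.tell() == 0
    except (AttributeError, OSError, ValueError):
        # tell() may be missing or unavailable (for example on a closed or
        # non-seekable stream); conservatively report "not at start".
        return False


def can_swap_buffer_for_path(buffer):
    """Decide whether reading by filename would see the same data.

    Hypothetical check: the handle must point at a regular file and must
    not have been advanced past the beginning.
    """
    name = getattr(buffer, "name", None)
    return (
        isinstance(name, str)
        and os.path.isfile(name)
        and buffer_is_at_start(buffer)
    )
```

When the check returns False, the caller would keep reading from the buffer itself, so the rows already consumed are not read again.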

anmyachev, Jul 18 '22 13:07