
Deadlock involving pandas.read_excel and AWS S3

Open tvoipio opened this issue 3 years ago • 2 comments

This may be related to https://github.com/fsspec/filesystem_spec/issues/517, but the workaround suggested there does not seem to work...

# From https://github.com/fsspec/filesystem_spec/issues/517
# Adapted for s3fs

import multiprocessing
import time

import pandas as pd
import s3fs

use_multiprocessing = True


def read_file(path):
    t0 = time.time()

    def elapsed():
        return time.time() - t0

    print(f"{elapsed():.3f} Before read_excel")
    fs = s3fs.S3FileSystem()

    # Workaround?
    fs.clear_instance_cache()

    with fs.open(path, "rb") as f:
        print(f"{elapsed():.3f} Entered context manager for {path}")
        df = pd.read_excel(f)
    print(f"{elapsed():.3f} After read_excel")
    return df


try:
    df = pd.read_csv('s3://<nonexistent CSV file>')
except FileNotFoundError:
    pass

filepath = 's3://<existing Excel file>'

files = [filepath]

if use_multiprocessing:
    with multiprocessing.Pool(1) as pool:
        dfs = pool.map(read_file, files)
else:
    dfs = [read_file(p) for p in files]

Running with use_multiprocessing = True results in the program printing 0.000 Before read_excel and then hanging. With use_multiprocessing = False, the file is read successfully in about 1 second.

I appreciate that asyncio may be difficult to combine with multiprocessing, but in this case I am rather stymied by the fact that there should be no shared state anywhere between the processes, since a completely fresh S3FileSystem instance is created in each subprocess. The suggested workaround of invoking clear_instance_cache() did not work either.

In my specific case, I ended up removing the s3fs dependency completely and using boto3 to download the object into a BytesIO, which pandas then ingests happily (the files in this project are rather small, less than 1 MB). However, since pandas uses s3fs and thus fsspec under the hood, I would like to confirm that this issue indeed persists and, if so, ask the pandas maintainers to add a note about it to their documentation.

fsspec version 2021.11.1, Python 3.8.10, pandas 1.3.5

tvoipio avatar Jan 19 '22 15:01 tvoipio

The linked issue suggests you should avoid using fork. Did you try this? Your code does not show it.

I am rather stymied by the fact that there should be no shared state anywhere between the processes, as a completely fresh S3FileSystem instance is created in each subprocess.

When you first access S3, it not only creates the filesystem instance, it also creates a thread and an asyncio event loop in that thread. These are then copied into the forked processes, but can no longer run, causing the hang. Removing the S3 filesystem instance is not enough. You could try adding

fsspec.asyn.iothread[0] = None
fsspec.asyn.loop[0] = None

but you would be better off switching your multiprocess launcher to spawn.
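A minimal, self-contained sketch of the spawn-based launcher (the worker here is a trivial stand-in for the read_file function from the original report):

```python
import multiprocessing


def work(x):
    # Stand-in for read_file from the report. With the "spawn" start
    # method, each child begins in a fresh interpreter, so no stale
    # fsspec thread or event loop is inherited from the parent.
    return x * 2


if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(1) as pool:
        results = pool.map(work, [1, 2, 3])
    print(results)  # [2, 4, 6]
```

Note that spawn requires the worker function to be importable at module top level (and the pool creation guarded by `if __name__ == "__main__":`), which fork does not.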

martindurant avatar Jan 20 '22 18:01 martindurant

Thanks for your reply @martindurant. I will try spawn and see how it affects the rest of the application components. My main point, however, is whether this deadlock behavior, the use of threads under the hood (invisible to the developer unless one digs deep into the dependencies' source code), and the need for a specific multiprocessing start method are made clear enough in the current documentation for s3fs and the packages that use it.

I can freely say that I am not a professional computer or software engineer, so to me, for example, this (from https://s3fs.readthedocs.io/en/latest/#async ):

Concurrent async operations are also used internally for bulk operations such as pipe/cat, get/put, cp/mv/rm. The async calls are hidden behind a synchronisation layer, so are designed to be called from normal code. If you are not using async-style programming, you do not need to know about how this works, but you might find the implementation interesting.

does not really say "but if you do happen to use multiprocessing, you do need to know how this implementation of async I/O in fsspec works, and therefore be aware that multiprocessing only works with a very specific, non-default child-process start method".

This effect compounds when s3fs is used by packages aimed at other communities, such as pandas.

Based on this discussion and the explanation of the mechanics behind the issue, I will file documentation enhancement requests with s3fs and pandas to help others who might run into the same problem.

tvoipio avatar Jan 24 '22 08:01 tvoipio