arrow icon indicating copy to clipboard operation
arrow copied to clipboard

[Python] Exceptions swallowed in ParquetManifest._visit_directories

Open asfimport opened this issue 6 years ago • 10 comments

ParquetManifest._visit_directories uses a ThreadPoolExecutor to visit partitioned parquet datasets concurrently, it waits for them to finish but doesn't check if the respective futures have failed or not. This is quite tricky to detect and debug as an exception is either raised later as a a side-effect or (perhaps worse) it passes silently.

Observed on 0.12.1 but appears to be on latest master too.

Reporter: George Sakkis / @gsakkis

Note: This issue was originally created as ARROW-5825. Please see the migration documentation for further details.

asfimport avatar Jul 02 '19 19:07 asfimport

Joris Van den Bossche / @jorisvandenbossche: @gsakkis do you have a reproducible example? You need a parquet dataset with eg an invalid file that raises an error when reading the metadata?

asfimport avatar Jul 04 '19 15:07 asfimport

George Sakkis / @gsakkis: Yes, in my case it was "Found files in an intermediate directory" two or three levels deep in the partitioned directory tree.

 

asfimport avatar Jul 05 '19 06:07 asfimport

Wes McKinney / @wesm: PRs welcome – note that the C++ Datasets project is getting going so we should limit the amount of time invested in maintaining this existing code.

asfimport avatar Aug 19 '19 19:08 asfimport

Antoine Pitrou / @pitrou: The offending code is still there. It should also be pretty easy to fix.

asfimport avatar Aug 05 '21 14:08 asfimport

Antoine Pitrou / @pitrou: cc @jorisvandenbossche

asfimport avatar Aug 05 '21 14:08 asfimport

Joris Van den Bossche / @jorisvandenbossche: Since this is in the legacy implementation, I would say it's no priority to fix. But if it's easy to fix, maybe we should just do it. I am not super familiar with concurrent.futures, but is the first answer in https://stackoverflow.com/questions/35711160/detect-failed-tasks-in-concurrent-futures the way to go? (replacing the wait call with for fut in futures.as_completed(fs): fut.result())

asfimport avatar Aug 05 '21 15:08 asfimport

Antoine Pitrou / @pitrou: It depends what the intent is. If we still want to wait for all futures (instead of exiting on the first error), you would use:


        if futures_list:
            futures.wait(futures_list)
            for fut in futures:
                fut.result()

asfimport avatar Aug 05 '21 15:08 asfimport

Joris Van den Bossche / @jorisvandenbossche: I suppose erroring on first exception is fine, as there should be no errors at all in a normal (successful) call to this.

asfimport avatar Aug 05 '21 15:08 asfimport

Antoine Pitrou / @pitrou: The distinction I was making was between waiting for all futures to finish or not.

asfimport avatar Aug 05 '21 15:08 asfimport

This issue hasn't had activity in a long time. If it's still being worked on, please leave a comment. Otherwise, it will be closed on 23rd June.

Labelled Status: Stale-Warning for tracking.

thisisnic avatar Jun 21 '25 08:06 thisisnic