[Python] Exceptions swallowed in ParquetManifest._visit_directories
ParquetManifest._visit_directories uses a ThreadPoolExecutor to visit partitioned parquet datasets concurrently, it waits for them to finish but doesn't check if the respective futures have failed or not. This is quite tricky to detect and debug as an exception is either raised later as a a side-effect or (perhaps worse) it passes silently.
Observed on 0.12.1 but appears to be on latest master too.
Reporter: George Sakkis / @gsakkis
Note: This issue was originally created as ARROW-5825. Please see the migration documentation for further details.
Joris Van den Bossche / @jorisvandenbossche: @gsakkis do you have a reproducible example? You need a parquet dataset with eg an invalid file that raises an error when reading the metadata?
George Sakkis / @gsakkis: Yes, in my case it was "Found files in an intermediate directory" two or three levels deep in the partitioned directory tree.
Wes McKinney / @wesm: PRs welcome – note that the C++ Datasets project is getting going so we should limit the amount of time invested in maintaining this existing code.
Antoine Pitrou / @pitrou: The offending code is still there. It should also be pretty easy to fix.
Antoine Pitrou / @pitrou: cc @jorisvandenbossche
Joris Van den Bossche / @jorisvandenbossche:
Since this is in the legacy implementation, I would say it's no priority to fix. But if it's easy to fix, maybe we should just do it. I am not super familiar with concurrent.futures, but is the first answer in https://stackoverflow.com/questions/35711160/detect-failed-tasks-in-concurrent-futures the way to go? (replacing the wait call with for fut in futures.as_completed(fs): fut.result())
Antoine Pitrou / @pitrou: It depends what the intent is. If we still want to wait for all futures (instead of exiting on the first error), you would use:
if futures_list:
futures.wait(futures_list)
for fut in futures:
fut.result()
Joris Van den Bossche / @jorisvandenbossche: I suppose erroring on first exception is fine, as there should be no errors at all in a normal (successful) call to this.
Antoine Pitrou / @pitrou: The distinction I was making was between waiting for all futures to finish or not.
This issue hasn't had activity in a long time. If it's still being worked on, please leave a comment. Otherwise, it will be closed on 23rd June.
Labelled Status: Stale-Warning for tracking.