When the target directory does not exist, distcp fails during file listing

Open gquintana opened this issue 3 years ago • 3 comments

 ./spark-distcp.sh hdfs://sourcehost.company.fr:8020/source/dir hdfs://targethost.company.fr:8020/target/dir
 
2022-01-14 12:29:36.561 [main] ERROR com.coxautodata.utils.FileListUtils - Exception during file listing
java.io.FileNotFoundException: File hdfs://targethost.company.fr:8020/target/dir does not exist.
        at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1281)
        at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1255)
        at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1200)
        at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1196)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1214)
        at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2162)

I have to create the target directory manually beforehand.

gquintana, Jan 14 '22 11:01

I do like the idea of having the tool create the target directory, but I think it would be best to gate this behind a new config flag. Perhaps something like --create-target-dir would be appropriate. Will have a look at this.

jamesfielder, Jan 16 '22 19:01

It might also be useful for the tool, instead of throwing a FileNotFoundException, to tell you that you either need to create the target directory yourself or pass the create-target flag so that the tool creates it for you.

jamesfielder, Jan 16 '22 19:01
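
The two suggestions above (an opt-in flag and a friendlier message) could be combined roughly as follows. This is only a sketch of the idea, not code from spark-distcp: `TargetFs`, `TargetDirCheck`, `prepareTarget` and `InMemoryFs` are hypothetical names, and `--create-target-dir` is the flag proposed in this thread, not an existing option. In the real tool the `TargetFs` role would presumably be filled by Hadoop's `FileSystem`, whose `exists` and `mkdirs` methods have the same shape.

```scala
// Hypothetical minimal view of a filesystem. In the real tool this role would
// be played by org.apache.hadoop.fs.FileSystem (exists / mkdirs on a Path).
trait TargetFs {
  def exists(path: String): Boolean
  def mkdirs(path: String): Boolean
}

object TargetDirCheck {
  // Returns Right(()) when the target is ready for listing, or Left(message)
  // with a hint for the user instead of a raw FileNotFoundException.
  def prepareTarget(fs: TargetFs, target: String, createTargetDir: Boolean): Either[String, Unit] =
    if (fs.exists(target)) Right(())
    else if (!createTargetDir)
      // Flag not set: do not touch the target, just explain the two options.
      Left(s"Target directory $target does not exist. " +
        "Create it first or pass --create-target-dir (proposed flag).")
    else if (fs.mkdirs(target)) Right(())
    else Left(s"Could not create target directory $target.")
}

// In-memory stand-in, used only to exercise the logic without a cluster.
final class InMemoryFs extends TargetFs {
  private val dirs = scala.collection.mutable.Set.empty[String]
  def exists(path: String): Boolean = dirs.contains(path)
  def mkdirs(path: String): Boolean = { dirs += path; true }
}
```

With a fresh `InMemoryFs`, calling `TargetDirCheck.prepareTarget(fs, "/target/dir", createTargetDir = false)` yields a `Left` carrying the hint and leaves the filesystem untouched, while the same call with `createTargetDir = true` creates the directory and returns `Right(())`. Running this check before the listing step would keep the default behaviour unchanged unless the flag is passed.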

The listing also fails when running spark-distcp on S3 (and other object stores), where directories do not really exist:

2022-02-04 16:55:51,354 [main] ERROR com.coxautodata.utils.FileListUtils - Exception during file listing
java.io.FileNotFoundException: No such file or directory: s3a://s3bucket/foo/bar
        at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3356)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3053)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$listLocatedStatus$23(S3AFileSystem.java:4564)
        at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:115)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.listLocatedStatus(S3AFileSystem.java:4553)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.listLocatedStatus(S3AFileSystem.java:4528)
        at com.coxautodata.utils.FileListUtils$FileLister$1.run(FileListUtils.scala:97)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

gquintana, Feb 04 '22 15:02