spark-distcp
When the target directory does not exist, distcp fails during file listing
./spark-distcp.sh hdfs://sourcehost.company.fr:8020/source/dir hdfs://targethost.company.fr:8020/target/dir
2022-01-14 12:29:36.561 [main] ERROR com.coxautodata.utils.FileListUtils - Exception during file listing
java.io.FileNotFoundException: File hdfs://targethost.company.fr:8020/target/dir does not exist.
at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1281)
at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1255)
at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1200)
at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1196)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1214)
at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2162)
I have to create the target directory manually beforehand.
I do like the idea of having the tool create the target directory, but I think it would be best to gate this behind a new config flag. Perhaps something like --create-target-dir would be appropriate. Will have a look at this.
There might also be some utility in having the tool suggest that you either create the target directory or pass the create-target flag to have the tool do it, instead of throwing a FileNotFoundException.
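To make the proposal concrete, here is a minimal sketch of the gated behaviour. It is not spark-distcp code: it uses java.nio.file on the local filesystem as a stand-in for the Hadoop FileSystem API so that it runs on its own. The class name TargetDirCheck and the method ensureTargetDir are hypothetical, and --create-target-dir is only the flag name floated above, not an existing option.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of spark-distcp.
class TargetDirCheck {
    // Flag name as suggested in this thread; it does not exist in the tool today.
    static final String CREATE_FLAG = "--create-target-dir";

    // Returns normally if the target directory exists or was created.
    // Without the flag, fails with a message that tells the user what to do.
    static void ensureTargetDir(Path target, boolean createTargetDir) throws IOException {
        if (Files.isDirectory(target)) {
            return; // nothing to do
        }
        if (createTargetDir) {
            // Creates the directory and any missing parents.
            Files.createDirectories(target);
            return;
        }
        throw new FileNotFoundException(
            "Target directory " + target + " does not exist. Create it first, or pass "
                + CREATE_FLAG + " to have it created.");
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("distcp-demo");
        Path target = base.resolve("target").resolve("dir");

        // Flag not set: the missing target is reported with an actionable message.
        try {
            ensureTargetDir(target, false);
        } catch (FileNotFoundException e) {
            System.out.println(e.getMessage());
        }

        // Flag set: the target is created before any listing would happen.
        ensureTargetDir(target, true);
        System.out.println("Target exists: " + Files.isDirectory(target));
    }
}
```

In the real tool the same check would presumably run against the Hadoop FileSystem (for example with its exists and mkdirs methods) before the target listing starts. Where exactly it would hook in is an assumption.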
When running spark-distcp on S3 (and other object stores), where directories do not really exist, the same error occurs:
2022-02-04 16:55:51,354 [main] ERROR com.coxautodata.utils.FileListUtils - Exception during file listing
java.io.FileNotFoundException: No such file or directory: s3a://s3bucket/foo/bar
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3356)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:3053)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$listLocatedStatus$23(S3AFileSystem.java:4564)
at org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:115)
at org.apache.hadoop.fs.s3a.S3AFileSystem.listLocatedStatus(S3AFileSystem.java:4553)
at org.apache.hadoop.fs.s3a.S3AFileSystem.listLocatedStatus(S3AFileSystem.java:4528)
at com.coxautodata.utils.FileListUtils$FileLister$1.run(FileListUtils.scala:97)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)