spark-distcp

A re-implementation of Hadoop DistCP in Apache Spark

Results 10 spark-distcp issues

https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html#Command_Line_Options

> `-direct` | Write directly to destination paths | Useful for avoiding potentially very expensive temporary file rename operations when the destination is an object store

https://github.com/CoxAutomotiveDataSolutions/spark-distcp/blob/v0.2.5/src/main/scala/com/coxautodata/utils/CopyUtils.scala#L380
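
To make the trade-off behind `-direct` concrete, here is a minimal Python sketch that uses the local filesystem as a stand-in for a real destination. The function `copy_file` and its `direct` parameter are hypothetical names chosen for illustration; they are not part of spark-distcp or Hadoop DistCp. The sketch only contrasts staging a copy in a temporary file and renaming it with writing straight to the destination path.

```python
import os
import shutil
import tempfile


def copy_file(src: str, dest: str, direct: bool = False) -> str:
    """Copy src to dest and return the path that was written to first.

    Hypothetical helper: with direct=False the data is staged in a
    temporary file next to dest and then renamed into place; with
    direct=True it is written straight to dest.
    """
    if direct:
        shutil.copyfile(src, dest)
        return dest

    dest_dir = os.path.dirname(os.path.abspath(dest))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-")
    os.close(fd)
    try:
        shutil.copyfile(src, tmp_path)
        # Cheap on a local disk or HDFS; on an object store a rename is
        # generally implemented as a copy followed by a delete.
        os.replace(tmp_path, dest)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return tmp_path


# Usage: copy the same file both ways into a scratch directory.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "src.txt")
with open(src, "w") as f:
    f.write("hello")

direct_dest = os.path.join(workdir, "direct.txt")
staged_dest = os.path.join(workdir, "staged.txt")
first_written_direct = copy_file(src, direct_dest, direct=True)
first_written_staged = copy_file(src, staged_dest, direct=False)
```

The staged variant never exposes a half-written file under the final name, which is why it is a common default. The direct variant gives that up in exchange for skipping the rename.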

Hi, when I tried to use this API to copy files from S3 to S3, it gave an error. I noticed it is due to the file rename operation. Is...

Trying to fix #10. I introduced a configurable "missing directory" strategy:

* "fail" is the default behaviour (HDFS)
* "create" automatically creates the target directory if it does not exist (HDFS)
* "log" doesn't...
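
The idea of a configurable strategy can be sketched as follows. This is a Python illustration against the local filesystem, not the pull request's actual Scala code. The function name `resolve_target_dir` and its parameters are assumptions made for the example. Only the "fail" and "create" strategies are modelled, because the description of "log" is truncated in the listing.

```python
import os
import tempfile


def resolve_target_dir(path: str, strategy: str = "fail") -> str:
    """Return path if it can be used as a target directory.

    Hypothetical helper. Supported strategies:
      "fail"   - raise if the directory does not exist (the default)
      "create" - create the directory, including parents, if missing
    """
    if strategy not in ("fail", "create"):
        raise ValueError(f"Unsupported strategy: {strategy!r}")
    if os.path.isdir(path):
        return path
    if strategy == "fail":
        raise FileNotFoundError(f"File {path} does not exist.")
    os.makedirs(path, exist_ok=True)
    return path


# Usage: the same missing directory under both strategies.
base = tempfile.mkdtemp()
target = os.path.join(base, "target", "dir")

try:
    resolve_target_dir(target, "fail")
    failed = False
except FileNotFoundError:
    failed = True

created = resolve_target_dir(target, "create")
```

Keeping "fail" as the default preserves the existing behaviour, so callers have to opt in to directory creation explicitly.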

```
./spark-distcp.sh hdfs://sourcehost.company.fr:8020/source/dir hdfs://targethost.company.fr:8020/target/dir
2022-01-14 12:29:36.561 [main] ERROR com.coxautodata.utils.FileListUtils - Exception during file listing
java.io.FileNotFoundException: File hdfs://targethost.company.fr:8020/target/dir does not exist.
	at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1281)
	at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1255)
	at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1200)
	at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1196)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)...
```

When source and destination are files, not directories, Spark DistCP fails:

```
2022-02-04 15:58:43,344 [main] INFO org.apache.spark.scheduler.DAGScheduler - Job 1 finished: isEmpty at FileListUtils.scala:256, took 0.040579 s
Exception in thread...
```

We would like to use spark-distcp to copy files between HDFS and S3. As the filesystem implementations are not the same, `sourceFS.getFileChecksum(definition.source.getPath)` and `destFS.getFileChecksum(destPath)` might not be the same. We...
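
What the issue goes on to propose is cut off in the listing, so the following is only one possible approach, sketched in Python with hypothetical types. `Checksum`, `FileInfo` and `files_match` are illustrative names, not spark-distcp or Hadoop APIs. The sketch compares checksums only when both sides report one computed with the same algorithm, and otherwise falls back to comparing file lengths.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Checksum:
    algorithm: str  # placeholder label for the checksum algorithm
    value: bytes


@dataclass(frozen=True)
class FileInfo:
    length: int
    checksum: Optional[Checksum]  # None when the filesystem reports none


def files_match(source: FileInfo, dest: FileInfo) -> bool:
    """Decide whether dest can be treated as an up-to-date copy of source."""
    if source.length != dest.length:
        return False
    s, d = source.checksum, dest.checksum
    if s is None or d is None or s.algorithm != d.algorithm:
        # The checksums are not comparable, so only the length check applies.
        return True
    return s.value == d.value


# Usage: "algo-a" and "algo-b" are made-up algorithm labels.
hdfs_like = FileInfo(length=5, checksum=Checksum("algo-a", b"\x01\x02"))
object_store_like = FileInfo(length=5, checksum=None)
same_algo_other_value = FileInfo(length=5, checksum=Checksum("algo-a", b"\xff"))
other_algo = FileInfo(length=5, checksum=Checksum("algo-b", b"\xff"))
```

The length-only fallback is weaker than a real checksum comparison, so a stricter variant could hash the file contents itself on both sides, at the cost of reading every byte.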

### Why are the changes needed?

In version 3.3, Spark migrated from Log4J 1.x to Log4J 2.x; see details in [SPARK-37814](https://issues.apache.org/jira/browse/SPARK-37814). Currently, `Logging` has a hard dependency on Log4J1, which...

I have a requirement to synchronize data from a Hadoop cluster with Kerberos authentication to a Hadoop cluster without Kerberos authentication.

## Summary

This pull request aims to add a `--compression` option that applies before copying on hdfs://