hadoop-connectors
Implement Hadoop MultipartUploader interface to support parallel file uploads
As per the GCP documentation, there is a compose API that merges small files into a larger one. Does the connector support breaking a big file into smaller chunks, uploading the chunks in parallel, and finally merging them with the compose API, similar to a multipart upload?
I tried uploading a 1GB and a 5GB file, and it looks like the upload is done sequentially.
I tried setting fs.gs.outputstream.type to SYNCABLE_COMPOSITE, but I don't see any parallelism, nor any improvement in runtime for a 5GB file upload.
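For context, this is how I set the property (a config sketch in core-site.xml; if I understand the option correctly, SYNCABLE_COMPOSITE changes how hflush()/hsync() are implemented rather than parallelizing uploads, which may explain why I saw no speedup):

```xml
<!-- core-site.xml: the property mentioned above -->
<property>
  <name>fs.gs.outputstream.type</name>
  <value>SYNCABLE_COMPOSITE</value>
</property>
```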
No, it's not supported in the GCS connector, because the connector uploads streams, not files. Parallel upload usually applies to CLI tools that have access to a local file and can split it into chunks, upload those chunks in parallel, and compose them afterwards.
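The split / parallel-upload / compose pattern described above can be sketched roughly like this. This is a toy illustration only: the "upload" is a local file write, all names are invented, and a real tool would call the GCS upload and compose APIs instead.

```python
# Toy sketch of the chunk-split / parallel-upload / compose pattern used by
# CLI tools that have the whole local file available. All names are invented;
# the "bucket" is a temp directory and "upload" is a local file write.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # tiny chunks for the demo; real tools use multi-MB chunks

def split(path, chunk_size=CHUNK_SIZE):
    """Yield (index, bytes) chunks of the local file."""
    with open(path, "rb") as f:
        index = 0
        while True:
            data = f.read(chunk_size)
            if not data:
                return
            yield index, data
            index += 1

def upload_chunk(bucket_dir, index, data):
    """Stand-in for a real per-chunk upload; writes a temp object."""
    part = os.path.join(bucket_dir, f"part-{index:05d}")
    with open(part, "wb") as f:
        f.write(data)
    return index, part

def compose(bucket_dir, parts, dest_name):
    """Stand-in for the GCS compose API: concatenate parts in order."""
    dest = os.path.join(bucket_dir, dest_name)
    with open(dest, "wb") as out:
        for _, part in sorted(parts):
            with open(part, "rb") as f:
                out.write(f.read())
            os.remove(part)  # delete the temporary chunk objects
    return dest

with tempfile.TemporaryDirectory() as bucket:
    src = os.path.join(bucket, "src.bin")
    with open(src, "wb") as f:
        f.write(b"0123456789abcdef")
    # Upload chunks concurrently; this is where the parallelism comes from.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(upload_chunk, bucket, i, d)
                   for i, d in split(src)]
        parts = [fut.result() for fut in futures]
    dest = compose(bucket, parts, "dest.bin")
    with open(dest, "rb") as f:
        composed = f.read()
```

This only works when the source is a seekable local file, which is exactly why a stream-oriented connector cannot do it transparently.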
How did you upload the 1GB and 5GB files using the GCS connector?
Okay. Thanks for the information.
I used hadoop fs client to upload the files.
Command: hadoop fs -put <file_name>
In this case, hadoop fs would need to support parallel upload and concatenation. If it already does, then we can implement this functionality in the GCS connector, but AFAIK this functionality is not available in the HCFS/HDFS interface.
We have a multipart upload interface in Hadoop, and it has already been implemented for S3A: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/multipartuploader.md
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/MultipartUploader.java
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/impl/S3AMultipartUploader.java
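The lifecycle that spec describes (start an upload, put parts in any order from any worker, then complete with the collected part handles) can be modeled abstractly. The toy Python class below is not the Hadoop Java interface; the names only loosely mirror the spec, to show the shape of the contract an implementation has to satisfy.

```python
# Toy in-memory model of the multipart-upload lifecycle from the Hadoop spec
# linked above: startUpload -> putPart xN -> complete (or abort).
# Names loosely mirror the spec; this is NOT the real Java API.
import uuid

class ToyMultipartUploader:
    def __init__(self):
        self._uploads = {}   # upload handle -> {part number: bytes}
        self.store = {}      # finished path -> bytes

    def start_upload(self, path):
        """Begin an upload and return an opaque upload handle."""
        handle = str(uuid.uuid4())
        self._uploads[handle] = {}
        return handle

    def put_part(self, handle, part_number, data):
        """Parts may arrive out of order and from different workers."""
        self._uploads[handle][part_number] = data
        return (part_number, len(data))  # a stand-in "part handle"

    def complete(self, handle, path, part_handles):
        """Stitch parts together in part-number order, as compose would."""
        parts = self._uploads.pop(handle)
        self.store[path] = b"".join(parts[n] for n, _ in sorted(part_handles))
        return path

    def abort(self, handle):
        """Discard an in-progress upload and its parts."""
        self._uploads.pop(handle, None)

uploader = ToyMultipartUploader()
h = uploader.start_upload("/dest/file")
# Parts uploaded out of order, e.g. by two different workers:
handles = [uploader.put_part(h, 2, b"world"),
           uploader.put_part(h, 1, b"hello ")]
uploader.complete(h, "/dest/file", handles)
```

The key property is that complete() orders parts by part number regardless of upload order, which is what lets independent workers upload parts in parallel.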
CC @steveloughran
We have an explicit API for multipart uploads, with conformance tests: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/multipartuploader.md This is sufficient for us to reimplement the S3A commit algorithm through a public API, which would give GCS a zero-rename option. I know GCS file rename is less expensive than on S3, so it is less critical, but it may still be beneficial.
Thank you for the information. I will re-purpose this bug for implementing MultipartUploader in the GCS connector.
Looking forward to it. FWIW, that multipart API could also be used for a high-performance parallelized distcp tool, where blocks are uploaded in parallel from multiple source workers and coalesced at the end. DistCp does something like this if the FS supports concat (HDFS), but not for anything else.