dataproc-templates
[python][java] [enhancement] Update GCS to GCS to handle sequence files
@anuyogamlab Tagging as invalid. Please add a reasonable description of what is needed from this template; it is not clear what is being asked here. You also need to label whether this is a new template, an enhancement, etc.
Hi Shashank,
I would like to contribute to these templates. I will DM you to check if you need more details.
Hadoop HDFS deployments often hold many sequence files, since the format is used extensively with MapReduce. These files cannot be loaded into BQ directly without parsing, and there is no ingestion plugin or built-in support for parsing them. I have tested PySpark code that reads sequence files, parses them at the RDD level, and converts them to Parquet, which is compatible with BQ. I would like to contribute this code to Dataproc templates. Please let me know if you need any other information.
Thanks, Anu.
On Sat, Jul 30, 2022 at 2:00 PM Shashank Agarwal wrote:
@anuyogamlab Tagging as invalid. Also please add reasonable description to identify what is needed from such template. It is not clear what is being asked here. Also you need to label if this is a new-template or enhancement etc.
--
Anuyogam Venkataraman
Cloud Data Engineer
Toronto
As discussed, update the GCSToGCS Python and Java templates to handle sequence files.
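For reference, the PySpark approach described earlier in this thread (read sequence files as an RDD, parse each record, write Parquet) might look roughly like the sketch below. This is not the tested code mentioned above and not the template's implementation. The column names in `FIELD_NAMES`, the comma delimiter, and the function names `parse_record` and `sequence_to_parquet` are all hypothetical, and the sketch assumes the sequence-file values are Hadoop `Text` writables holding delimited lines.

```python
from typing import Dict, Optional, Tuple

# Assumed column names, for illustration only; a real job would take these as a parameter.
FIELD_NAMES = ("id", "name", "amount")


def parse_record(pair: Tuple[str, str], delimiter: str = ",") -> Dict[str, Optional[str]]:
    """Turn one (key, value) pair from a sequence file into a flat dict of strings."""
    _key, value = pair  # the key is ignored in this sketch
    parts = [p.strip() for p in value.split(delimiter)]
    # Pad short rows with None so every record has the same columns;
    # zip() below silently drops any extra trailing fields.
    parts += [None] * (len(FIELD_NAMES) - len(parts))
    return dict(zip(FIELD_NAMES, parts))


def sequence_to_parquet(input_path: str, output_path: str) -> None:
    """Read sequence files, parse them at the RDD level, and write Parquet."""
    # Imported lazily so parse_record can be unit-tested without a Spark install.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType, StructField, StructType

    spark = SparkSession.builder.appName("sequence-to-parquet").getOrCreate()

    # sequenceFile() yields (key, value) pairs; Text writables arrive as Python str.
    pairs = spark.sparkContext.sequenceFile(input_path)

    # An explicit all-string schema avoids type-inference failures on None values.
    schema = StructType([StructField(name, StringType(), True) for name in FIELD_NAMES])
    rows = pairs.map(parse_record).map(lambda rec: tuple(rec[name] for name in FIELD_NAMES))

    spark.createDataFrame(rows, schema).write.mode("overwrite").parquet(output_path)
```

On a Dataproc cluster this would be invoked as `sequence_to_parquet(<input GCS path>, <output GCS path>)`. Keeping the parsing in a plain function separate from the Spark wiring means the record logic can be tested locally, and values stored as other writable types (for example `BytesWritable`) would need a different `parse_record`.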
I don't have write permission to push the file under the util/ folder. Please give me (GitHub: anuyogamlab) access to push the file.
done
Hi Anu, I am unassigning you for now since there have been no updates for a long time. If you want to contribute, please come back and self-assign.
@nj1973 Can you please confirm whether this was implemented as part of https://github.com/GoogleCloudPlatform/dataproc-templates/pull/482?
No, I don't believe it is.