[python][java] [enhancement] Update GCS to GCS to handle sequence files

Open anuyogamlab opened this issue 2 years ago • 8 comments

anuyogamlab avatar Jul 28 '22 15:07 anuyogamlab

@anuyogamlab Tagging as invalid. Please add a reasonable description of what is needed from such a template; it is not clear what is being asked here. You also need to label whether this is a new-template or an enhancement, etc.

shashank-google avatar Jul 30 '22 18:07 shashank-google

Hi Shashank,

I would like to contribute to these templates. I will DM you to check if you need more details.

Hadoop HDFS often holds many sequence-format files, since they are used extensively with MapReduce. They cannot be loaded into BQ directly without parsing, and there is no ingestion plugin or support for parsing sequence files. I have tested PySpark code that reads sequence files, parses them at the RDD level, and converts them to Parquet, which is compatible with BQ. I would like to contribute this code to Dataproc templates. Please let me know if you need any other information.
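
A minimal sketch of the approach described above, not the code that was tested for this issue. It assumes each sequence file value is a comma-delimited Text payload; the delimiter, the column names in `FIELD_NAMES`, and the helper names `parse_record` and `sequence_to_parquet` are all hypothetical and would need to match the real data.

```python
from typing import List, Optional, Tuple

# Hypothetical record layout: neither the delimiter nor the column names
# come from the issue; adjust them to the actual sequence file contents.
FIELD_NAMES: List[str] = ["id", "name", "amount"]
DELIMITER = ","


def parse_record(pair: Tuple[object, str]) -> Tuple[Optional[str], ...]:
    """Turn one (key, value) pair into a flat row: the key, then the fields."""
    key, value = pair
    fields: List[Optional[str]] = list(value.split(DELIMITER))
    # Pad with None or truncate so every row has the same number of columns.
    fields = (fields + [None] * len(FIELD_NAMES))[: len(FIELD_NAMES)]
    return (str(key), *fields)


def sequence_to_parquet(spark, input_path: str, output_path: str) -> None:
    """Read sequence files as an RDD, parse each record, and write Parquet."""
    # Imported here so parse_record can be used and tested without Spark.
    from pyspark.sql.types import StringType, StructField, StructType

    schema = StructType(
        [StructField("key", StringType(), True)]
        + [StructField(name, StringType(), True) for name in FIELD_NAMES]
    )
    # sequenceFile() yields (key, value) pairs; Text values arrive as str.
    rows = spark.sparkContext.sequenceFile(input_path).map(parse_record)
    spark.createDataFrame(rows, schema).write.mode("overwrite").parquet(output_path)
```

With an active `SparkSession`, this could be called as `sequence_to_parquet(spark, "gs://<bucket>/input/", "gs://<bucket>/output/")`, where both paths are placeholders. All columns are kept as strings here for simplicity; casting to typed columns would happen before the write.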

Thanks, Anu.

anuyogamlab avatar Aug 02 '22 00:08 anuyogamlab

As discussed, update the GCSToGCS Python and Java templates to handle sequence files.

shashank-google avatar Sep 20 '22 16:09 shashank-google

I don't have write permission to push the file under the util/ folder. Please give me (GitHub: anuyogamlab) access to push the file.

anuyogamlab avatar Oct 07 '22 14:10 anuyogamlab

done

shashank-google avatar Oct 07 '22 17:10 shashank-google

Hi Anu, I am unassigning you for now since there have been no updates for a long time. If you want to contribute, please come back and self-assign.

ClementineJoe avatar Mar 31 '23 04:03 ClementineJoe

@nj1973 Can you please confirm whether this is implemented as part of https://github.com/GoogleCloudPlatform/dataproc-templates/pull/482?

vanshaj-bhatia avatar Feb 23 '24 07:02 vanshaj-bhatia

@nj1973 Can you please confirm whether this is implemented as part of #482?

No, I don't believe it is.

nj1973 avatar Feb 23 '24 15:02 nj1973