Add usage documentation for the Java library

Open asfimport opened this issue 1 year ago • 10 comments

The Java Parquet library has no usage documentation besides the sparse information available in the README. The only things I could find were a few old (roughly 10-year-old) third-party tutorials scattered around the internet that use the hadoop module. I spent a work day sifting through the API docs and searching online to try to piece something together. Ultimately, I gave up on working with Parquet files in Java because there are alternative file formats that are better documented, and I felt that using parquet-mr would be a huge hassle to maintain in the future. This library seems reasonably maintained and comprehensive, but there is a huge barrier to using it, which I think turns off a lot of developers like me.

I kindly request usage documentation be written to cover all the major aspects of the library, and for the more nitty gritty use cases, pointers to what API classes/methods could be looked at further.

I may be misunderstanding the purpose of this library, and if so, is there a different Java Parquet library that is recommended for higher level parquet file IO?

Reporter: Isaac Nygaard

Note: This issue was originally created as PARQUET-2490. Please see the migration documentation for further details.

asfimport avatar Jun 05 '24 18:06 asfimport

Hi @asfimport, can you briefly describe what should be added to the README? If the issue is still open, can you assign it to me so that I can work on it?

VarshaUN avatar Aug 07 '24 16:08 VarshaUN

@VarshaUN I just assigned it to you. Please feel free to add any documentation as you see fit. Thanks for your interest!

wgtmac avatar Aug 08 '24 07:08 wgtmac

Hey, thanks for assigning. Are there any prerequisites for contributing, or do I just clone the repository and add the docs?

I have some suggestions from my side on writing the docs:

INTRODUCTION: an overview of Parquet, its benefits, and the library's purpose.

GETTING STARTED: I will try to cover installation, setup, and basic usage.

TUTORIALS AND GUIDES: I'll write step-by-step tutorials for specific use cases (I need your help with this).

API REFERENCE: I'll document classes, methods, and parameters.

EXAMPLES AND CODE SNIPPETS: I'll include examples.

TROUBLESHOOTING AND FAQS: addressing common issues and questions.

VarshaUN avatar Aug 08 '24 11:08 VarshaUN

There is no prerequisite. I'm not sure whether your proposal is too broad to complete. What I have in mind is a set of code examples like the ones in the Apache Arrow cookbook: https://arrow.apache.org/cookbook/java/. Parquet-java contains a lot of modules, of which I think parquet-avro and parquet-hadoop are the most important from the perspective of end users.

wgtmac avatar Aug 08 '24 12:08 wgtmac

Got it. So I only need to add docs for parquet-avro and parquet-hadoop, since they are the important and useful ones?

VarshaUN avatar Aug 08 '24 12:08 VarshaUN

No, I don't mean that other components (e.g. parquet-column and parquet-encoding) are not useful. They are widely used by query engines to implement Parquet I/O, which is transparent to end users.

wgtmac avatar Aug 08 '24 12:08 wgtmac

OK, I will work on it. If I have suggestions, how can I contact you?

VarshaUN avatar Aug 08 '24 12:08 VarshaUN

Feel free to open a pull request and we can discuss it there directly.

wgtmac avatar Aug 08 '24 12:08 wgtmac

@wgtmac I have made a PR; please review it.

VarshaUN avatar Aug 09 '24 14:08 VarshaUN

Any progress?

HQidea avatar Aug 05 '25 10:08 HQidea