
New parquet tools commands

Open swapnilushinde opened this issue 9 years ago • 24 comments

Parquet files contain metadata about row count and file size. We should have new commands to get row counts and sizes. These commands help us avoid parsing job logs or loading data once again just to find the number of rows in a dataset. This comes in very handy in complex process chaining and in post-processing steps like stats generation, QA, etc.

These commands can be added to parquet-tools:

  • rowcount : This command gives the row count of the parquet input. It adds up the row counts of all files matching the Hadoop glob pattern. Use option 'd' to get a detailed row count for each file matching the input pattern. Examples with possible combinations:
-- Row count without pattern
 hadoop jar ./parquet-tools-1.6.0rc4.jar rowcount /abc/xyz/week_id_partition=748 
Total RowCount: 2425763803

-- Row count with pattern (non detailed)
 hadoop jar ./parquet-tools-1.6.0rc4.jar rowcount /abc/xyz/week_id_partition=74*
Total RowCount: 4781060690

-- Row count detailed for pattern
 hadoop jar ./parquet-tools-1.6.0rc4.jar rowcount -d /abc/xyz/week_id_partition=74*
week_id_partition=748 row count: 2425763803
week_id_partition=749 row count: 2355296887
Total RowCount: 4781060690
  • size : This command gives the size of parquet data, with multiple options:
  • pretty, p : Human readable size
  • uncompressed, u : Get uncompressed size
  • detailed, d : Detailed sizes for each matching parquet file with summary
-- compressed bytes without pattern
hadoop jar ./parquet-tools-1.6.0rc4.jar size /abc/xyz/week_id_partition=748
Total Size: 18452348360 bytes

-- compressed human readable size without pattern
hadoop jar ./parquet-tools-1.6.0rc4.jar size -p /abc/xyz/week_id_partition=748
Total Size: 17.355 GB

-- uncompressed human readable size without pattern
 hadoop jar ./parquet-tools-1.6.0rc4.jar size -uncompressed -pretty /abc/xyz/week_id_partition=748
Total Size: 102.505 GB



-- uncompressed detailed human readable size
hadoop jar ./parquet-tools-1.6.0rc4.jar size -uncompressed -pretty -d /abc/xyz/week_id_partition=74*
week_id_partition=748: 102.505 GB
week_id_partition=749: 99.167 GB
Total Size: 201.671 GB

-- compressed human readable size summary
hadoop jar ./parquet-tools-1.6.0rc4.jar size -pretty /abc/xyz/week_id_partition=74*
Total Size: 34.169 GB

-- uncompressed bytes in detailed
hadoop jar ./parquet-tools-1.6.0rc4.jar size -uncompressed -d  /abc/xyz/week_id_partition=74*
week_id_partition=748: 108988759585 bytes
week_id_partition=749: 105439653433 bytes
Total Size: 214428413018 bytes

Jira ticket- https://issues.apache.org/jira/browse/PARQUET-196

swapnilushinde avatar Mar 04 '15 17:03 swapnilushinde

The size command still expects a glob path. I feel it is helpful, but let me know if you find otherwise.

Initially, I thought that these should work on a single file like the other commands, but it sounds like you have a use case I'm not thinking of and intended for the commands to work that way. I can see the value of getting the total row count for a directory, since it would otherwise require adding up all of the individual counts from meta or dump. What I'm not sure is useful is the size command -- why is that needed?

rdblue avatar Mar 06 '15 16:03 rdblue

@rdblue, as you said, I built these two commands with the goal of getting row counts and sizes of directories/globs containing parquet data assets. We have partitioned parquet data for Hive tables. It is helpful to see the total row count and size of the complete dataset along with its breakdown by partition; I can easily check whether my parquet data is evenly distributed. Having the option to get detailed row counts and sizes with a glob helps in QA steps, exporting data to other DBs, etc. I wanted to build commands that give all of the above but by default work like the existing commands (on a single parquet file). Row count and size differ from the existing commands in that they can be expected to work on more than one file.
Let me know if this aligns with project goals. I will rewrite based on your comments.

swapnilushinde avatar Mar 06 '15 17:03 swapnilushinde

@swapnilushinde, that use case sounds reasonable so let's add them back.

Are there other commands that make sense to have a glob also? Someone is adding it to the schema command, see #136.

rdblue avatar Mar 11 '15 22:03 rdblue

@rdblue I have added back those changes. I think the other commands don't need a glob, except #136, which is already done.

swapnilushinde avatar Mar 16 '15 03:03 swapnilushinde

@swapnilushinde thanks! I'll take a look soon-ish.

rdblue avatar Mar 23 '15 23:03 rdblue

Hey @swapnilushinde, @rdblue: A few comments about this approach -

a) I really like the glob idea. I think it solves a definite use case and should stay in there.

b) For both rowcount & size, one additional feature I'd like is the ability to specify the entity for which to list stats. Note that -d doesn't list the summary per file; it lists details per directory or file matching the glob pattern. The default behavior (without -d) makes sense to me. With -d, I'd like the ability to specify whether we want the summary per glob pattern, per file in the glob pattern, or, going further, per row group within each file in the glob pattern. The intended use case for stats on the individual row groups would be to understand how the data is aligned versus the HDFS block size; essentially, diagnosing the issues mentioned here - 1. As for implementation, we could stick to -d [<detail-depth>] or go -d vs -dd vs -ddd. Either works and feels unix-y enough for my taste.

c) How about adding a flag to compute a statistical summary of the raw results displayed by the commands, operating at the same level of detail as specified by the -d proposed above? Even a simple min/max/avg/std.dev would go a long way toward understanding the distribution.

d) I really like the -u flag you have in size, but I'd propose the same functionality slightly differently. Instead of forcing the user to pick between displaying compressed and uncompressed sizes, we should take a list of args as input and display each requested stat. I'm thinking of the analog of ps -o, where you can specify which columns you'd like to output per pid. In our case, I imagine it to be something like:

-o [<**c**ompressed>|<**u**ncompressed>],[...]

e) In fact, with the approach mentioned above we could simplify the implementation a bit: both rowcount and size could be aliases for an alternative command, let's say stats, which would take the analogous -o flag along with another input type to indicate rowcount.

f) A minor nit: I found this implementation of making a byte size human readable on SO - 2; it's nifty.

What do you think? I'm happy to help with the work in implementing ideas we deem useful in a follow up PR if we don't do it all here.
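To make (c) concrete, here is a minimal Java sketch of such a distribution summary over per-file values such as row counts or sizes. The class and method names are hypothetical, not part of parquet-tools:

```java
import java.util.Arrays;

public class StatsSummary {
    // Simple min/max/avg/std.dev (population) over a set of per-file
    // values, at whatever detail level -d selects.
    static String summarize(long[] values) {
        long min = Arrays.stream(values).min().getAsLong();
        long max = Arrays.stream(values).max().getAsLong();
        double avg = Arrays.stream(values).average().getAsDouble();
        double variance = Arrays.stream(values)
            .mapToDouble(v -> (v - avg) * (v - avg))
            .average().getAsDouble();
        return String.format(java.util.Locale.ROOT,
            "min: %d, max: %d, avg: %.1f, stddev: %.1f",
            min, max, avg, Math.sqrt(variance));
    }
}
```

For example, per-partition row counts of 2, 4, 4, 4, 5, 5, 7, 9 would summarize to min 2, max 9, avg 5.0, stddev 2.0, which is enough to spot skewed partitions at a glance.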

prateek avatar Apr 10 '15 14:04 prateek

@prateek Thanks for your reply. I agree with you. Please find my comments below:

b) We can add options or extra arguments to get row counts and summaries at the file and/or row-group level. I am assuming you just want to see row counts or sizes per file or row group.

c) I can see the advantages of having a statistical summary of numerical columns. Not sure how it can be done with parquet metadata, but it will be interesting :)

d) Yes. The current size command gives either the compressed or the uncompressed size.

f) We can change the size implementation to use log and exponent instead of if/else clauses. No performance gain, but it looks nifty!

Overall, I am thinking of opening another PR to work on this. Let's keep this PR as it is so we can get it merged, and open another PR to implement all of the above features, plus some more commands, after brainstorming. @rdblue, what are your thoughts? I prefer to get this PR merged and work on the above features, and a few more, in a new PR.
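As a sketch of the log/exponent approach for (f), assuming base-1024 units (the class and method names are hypothetical):

```java
public class HumanReadableBytes {
    // Human-readable byte size via log/exponent instead of chained
    // if/else branches: the exponent picks the unit (K, M, G, ...),
    // and a single division scales the value.
    static String humanReadable(long bytes) {
        if (bytes < 1024) {
            return bytes + " bytes";
        }
        int exp = (int) (Math.log(bytes) / Math.log(1024));
        char unit = "KMGTPE".charAt(exp - 1);
        return String.format(java.util.Locale.ROOT,
            "%.3f %cB", bytes / Math.pow(1024, exp), unit);
    }
}
```

Locale.ROOT keeps the decimal separator consistent regardless of the JVM's default locale, which matters for output meant to be parsed by scripts.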

swapnilushinde avatar Apr 10 '15 21:04 swapnilushinde

@swapnilushinde Ideally, I'd say we shouldn't commit any of the stuff where we know we are going to change the CLI interface. I think that means we would make the changes for (d) and then do the rest in another ticket. That said, I don't know how big a deal it is to change the interface. @rdblue, your call.

I can pick up the parts we leave off here in this PR: 1


prateek avatar Apr 13 '15 19:04 prateek

@prateek @rdblue Hello guys, sorry, I was busy with other stuff the last few weeks so I couldn't pay attention to this. @rdblue, could you please let us know what new changes/features you want in #132 so we can commit sooner? As @prateek mentioned, we can work on the more advanced features in a separate PR.

swapnilushinde avatar Apr 28 '15 14:04 swapnilushinde

Why hasn't this been merged?

kadwanev avatar Apr 28 '16 14:04 kadwanev

@rdblue - It's been a long time since I worked on this. Let me know if you need any further changes or if it can be merged directly.

swapnilushinde avatar Apr 28 '16 18:04 swapnilushinde

Hi. This looks really useful! Could it be merged, please?

Lucas-C avatar Feb 16 '17 15:02 Lucas-C

@rdblue this looks good to go. Any other comments?

julienledem avatar Feb 16 '17 16:02 julienledem

Looks fine to me.

rdblue avatar Feb 16 '17 16:02 rdblue

@rdblue Thank you. Please let me know if I need to do something; I've been wanting to get this merged for a long time.

swapnilushinde avatar Feb 16 '17 19:02 swapnilushinde

@swapnil: please create a PARQUET jira for this and prefix the description with the id: PARQUET-X: ... Also rebase your branch. Thank you. When this is done, I'll merge.

julienledem avatar Feb 16 '17 19:02 julienledem

@swapnilushinde gentle nagging on PRs is always fine :). Sometimes if your comment shows up at a busy time it falls through the cracks. Thank you for your contribution.

julienledem avatar Feb 16 '17 19:02 julienledem

@rdblue @julienledem I have created another PR with rebase. Here is the jira ticket- https://issues.apache.org/jira/browse/PARQUET-196 PR- https://github.com/Parquet/parquet-mr/pull/460

swapnilushinde avatar Feb 17 '17 16:02 swapnilushinde

@swapnilushinde sorry, your new PR is on the old repo; use apache/parquet-mr, not Parquet/parquet-mr. (Merging master into your branch is fine too, since we'll squash in the end.)

julienledem avatar Feb 18 '17 01:02 julienledem

@julienledem Sorry about that. Please find the new PR based on the apache/parquet-mr repo. PR: https://github.com/apache/parquet-mr/pull/406 Here is the jira ticket- https://issues.apache.org/jira/browse/PARQUET-196

swapnilushinde avatar Feb 23 '17 19:02 swapnilushinde

@julienledem @rdblue: can you please take a look at the above PR?

swapnilushinde avatar Mar 01 '17 21:03 swapnilushinde

this is useful, can we rebase and merge this in?

ghost avatar Apr 25 '18 17:04 ghost

@jokomo This has been rebased and merged in a different PR - https://github.com/apache/parquet-mr/pull/406

swapnilushinde avatar Apr 29 '18 17:04 swapnilushinde

Looks useful, why not merge after resolving conflicts?

meetchandan avatar Mar 29 '19 19:03 meetchandan