ParquetViewer icon indicating copy to clipboard operation
ParquetViewer copied to clipboard

Show dict entries in quick peek view

Open Zylence opened this issue 1 year ago • 12 comments

Hey I just found your repo and quite like it. Just noticed that dictionaries (maps) in the quick peek view never load entirely. As I personally have lots of dicts I figured to make the changes. If you are interested in pulling the changes, this PR implements loading the dictionary entries.

Before: image After: image

Zylence avatar Aug 10 '24 18:08 Zylence

Made a mistake in the implementation (reading whole column for dict entries). I ll attempt to fix it.

Zylence avatar Aug 12 '24 00:08 Zylence

Thanks for opening this PR. Can you also include a unit test please? That would allow myself and others to test the implementation. I can also help out with implementing if you could share a test file.

mukunku avatar Aug 12 '24 12:08 mukunku

Thanks for opening this PR. Can you also include a unit test please?

Sure I can do that.

I can also help out with implementing if you could share a test file.

I'll attempt to solve my parsing issue, if I can't get it done in time, I'll get back to you, thanks. 

if you could share a test file.

I wanted to share my test file with variable length dics here in case you wanted to poke at it yourself too, but github won't let me upload it.

Also figured that parsing of the dicts into the table is somewhat broken (also in the version I did not modify). The ReadMapValue method treats every row as record, whereas in a dict multiple rows can make up a record, which leads to dictionary entries ending up in the wrong row, see image. I'll attempt to fix that in the process.

Dict entries are associated to the wrong rows: image

Zylence avatar Aug 12 '24 19:08 Zylence

Also figured that parsing of the dicts into the table is somewhat broken (also in the version I did not modify). The ReadMapValue method treats every row as record, whereas in a dict multiple rows can make up a record, which leads to dictionary entries ending up in the wrong row, see image. I'll attempt to fix that in the process.

My understanding was that a Map consisted of a single key/value pair. And a dictionary would be a list of maps. But after looking at the Map type definition I might be incorrect: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps

However I can't help much without a test file unfortunately. If you find a way to share one let me know. Since I wrote the code with the assumption of Map = 1 row it might be hard to rewrite it yourself.

mukunku avatar Aug 12 '24 19:08 mukunku

My understanding was that a Map consisted of a single key/value pair.

You may be correct - I not very familiar with the parquet file format - but I read that the repetition level, when is 0 for a map value, marks the beginning of a new record. This should enable one to read rows until hitting the next one with repetition level 0 and hence correctly grouping rows to map records. 

I also took a look at another table viewer, they group the dict entries like I thought they should be grouped: image

However I can't help much without a test file unfortunately. If you find a way to share one let me know. Since I wrote the code with the assumption of Map = 1 row it might be hard to rewrite it yourself.

I could send it to you via mail, share via drive, maybe we find a pastebin, whatever would suit you best. 

Zylence avatar Aug 12 '24 20:08 Zylence

Sure, if you could share the link here that would be great. I've had success zipping my parquet files and uploading them here. Example: MAP_TYPE_TEST1.zip

mukunku avatar Aug 12 '24 20:08 mukunku

Sure, if you could share the link here that would be great. I've had success zipping my parquet files and uploading them here. Example: MAP_TYPE_TEST1.zip

Totally forgot zipping was an option.  sample_data.zip

Zylence avatar Aug 12 '24 20:08 Zylence

Made it parse and display values at their correct positions (or at least where I think they should be) now fixing quick peek and doing some cleanup on my side, will push right after:

image

Zylence avatar Aug 12 '24 20:08 Zylence

Made the display of the values work but completely ignored the enigne and, in the process, broke the map related tests. 😅 Thought I'd simply introduce a new type, but maybe just MapValue should have been adjusted. I ll take some time to read the sources about parquet you posted earlier. I'll get back to it over the course of the next days.

Zylence avatar Aug 12 '24 22:08 Zylence

I read the definitions of the MAP type you posted earlier, and finally came to the conclusion that adding a new type just for the convenience of grouping map values is likely not the way to go. 

Those changes have been reverted. The MapValue was adjusted to be able to evaluate its "neighbour" upon request. All MapValues now have the ability to evaluate the MapValues from the same map that follow after it (row-wise).


Let's assume there is a Map spanning rows [4, 7]. The user parses row 4 as MapValue. The map value he receives contains an Iterator that parses rows [4, 7] (as MapValues), thus storing information about the logical grouping of the MapValues. 

The benefits of this are:

  • All rows read by ReadRowsAsync are still MapValue types.
  • Records are no longer offset to the wrong rows in the MainView.
  • For display, we only need to evaluate the first element of each map record, only if the user clicks the map record can we can lazily evaluate all the other associated map entries.

The requested test was added, it showcases with the parquet file shared here earlier, how the newly added functionality works. 

Also out of curiosity. I did some performance tests with large_maps.zip (steps of 50k map entries per row) and compared results with parquet floor. Your viewer's time to interactive on large maps is faster than theirs, and subsequent scrolling is smoother. :D And the QuickPeekViews rows are not even lazy loading yet. If, at some point, you decide to implement virtual scrolling or similar, it should experience a performance uplift again. 

Did you already implement SQL queries that can include MapValues? If so, I'd like to try them out, to ensure the changes did not break anything in this area of the project. 

Zylence avatar Aug 18 '24 23:08 Zylence

@Zylence Thanks for your amazing work. I'll make sure to take a look at your changes the first chance I get!

Did you already implement SQL queries that can include MapValues? If so, I'd like to try them out, to ensure the changes did not break anything in this area of the project.

Complex types including maps are just being handled by casting them to their string representation: https://github.com/mukunku/ParquetViewer/blob/50c8ad9f1b582154feec28ae7f44d2e868ffc84d/src/ParquetViewer/MainForm.cs#L420-L428 It's not possible to query it in a structured way at the moment unfortunately. So best you can do is use the LIKE operator to search for a substring inside the MapValue.

mukunku avatar Aug 19 '24 12:08 mukunku

I'll make sure to take a look at your changes the first chance I get!

Take your time. :)

Thanks for the info regarding the qurying.

I tested querying (exploratory). It seems to work like before. The evaluated entries that are displayed can be compared. The majority of map entries, that are not evaluated, can not be included in the comparison. But it seems to have been like that before. 

image

Zylence avatar Aug 20 '24 19:08 Zylence

Sorry for the delay. Merging this in now. Thanks so much for the contribution! I'll try to get a release out soon.

mukunku avatar Nov 20 '24 16:11 mukunku

Hello, no worries; I was not in a hurry, since I could use the changes locally ^^.  Just want to let you know, your tool helped me out a lot the past few weeks. Keep it up! :) And thanks for merging!

Zylence avatar Nov 24 '24 18:11 Zylence