Show dict entries in quick peek view
Hey I just found your repo and quite like it. Just noticed that dictionaries (maps) in the quick peek view never load entirely. As I personally have lots of dicts I figured to make the changes. If you are interested in pulling the changes, this PR implements loading the dictionary entries.
Before:
After:
Made a mistake in the implementation (reading whole column for dict entries). I ll attempt to fix it.
Thanks for opening this PR. Can you also include a unit test please? That would allow myself and others to test the implementation. I can also help out with implementing if you could share a test file.
Thanks for opening this PR. Can you also include a unit test please?
Sure I can do that.
I can also help out with implementing if you could share a test file.
I'll attempt to solve my parsing issue, if I can't get it done in time, I'll get back to you, thanks.
if you could share a test file.
I wanted to share my test file with variable length dics here in case you wanted to poke at it yourself too, but github won't let me upload it.
Also figured that parsing of the dicts into the table is somewhat broken (also in the version I did not modify). The ReadMapValue method treats every row as record, whereas in a dict multiple rows can make up a record, which leads to dictionary entries ending up in the wrong row, see image. I'll attempt to fix that in the process.
Dict entries are associated to the wrong rows:
Also figured that parsing of the dicts into the table is somewhat broken (also in the version I did not modify). The ReadMapValue method treats every row as record, whereas in a dict multiple rows can make up a record, which leads to dictionary entries ending up in the wrong row, see image. I'll attempt to fix that in the process.
My understanding was that a Map consisted of a single key/value pair. And a dictionary would be a list of maps. But after looking at the Map type definition I might be incorrect: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps
However I can't help much without a test file unfortunately. If you find a way to share one let me know. Since I wrote the code with the assumption of Map = 1 row it might be hard to rewrite it yourself.
My understanding was that a Map consisted of a single key/value pair.
You may be correct - I not very familiar with the parquet file format - but I read that the repetition level, when is 0 for a map value, marks the beginning of a new record. This should enable one to read rows until hitting the next one with repetition level 0 and hence correctly grouping rows to map records.
I also took a look at another table viewer, they group the dict entries like I thought they should be grouped:
However I can't help much without a test file unfortunately. If you find a way to share one let me know. Since I wrote the code with the assumption of Map = 1 row it might be hard to rewrite it yourself.
I could send it to you via mail, share via drive, maybe we find a pastebin, whatever would suit you best.
Sure, if you could share the link here that would be great. I've had success zipping my parquet files and uploading them here. Example: MAP_TYPE_TEST1.zip
Sure, if you could share the link here that would be great. I've had success zipping my parquet files and uploading them here. Example: MAP_TYPE_TEST1.zip
Totally forgot zipping was an option. sample_data.zip
Made it parse and display values at their correct positions (or at least where I think they should be) now fixing quick peek and doing some cleanup on my side, will push right after:
Made the display of the values work but completely ignored the enigne and, in the process, broke the map related tests. 😅 Thought I'd simply introduce a new type, but maybe just MapValue should have been adjusted. I ll take some time to read the sources about parquet you posted earlier. I'll get back to it over the course of the next days.
I read the definitions of the MAP type you posted earlier, and finally came to the conclusion that adding a new type just for the convenience of grouping map values is likely not the way to go.
Those changes have been reverted. The MapValue was adjusted to be able to evaluate its "neighbour" upon request. All MapValues now have the ability to evaluate the MapValues from the same map that follow after it (row-wise).
Let's assume there is a Map spanning rows [4, 7]. The user parses row 4 as MapValue. The map value he receives contains an Iterator that parses rows [4, 7] (as MapValues), thus storing information about the logical grouping of the MapValues.
The benefits of this are:
- All rows read by ReadRowsAsync are still MapValue types.
- Records are no longer offset to the wrong rows in the MainView.
- For display, we only need to evaluate the first element of each map record, only if the user clicks the map record can we can lazily evaluate all the other associated map entries.
The requested test was added, it showcases with the parquet file shared here earlier, how the newly added functionality works.
Also out of curiosity. I did some performance tests with large_maps.zip (steps of 50k map entries per row) and compared results with parquet floor. Your viewer's time to interactive on large maps is faster than theirs, and subsequent scrolling is smoother. :D And the QuickPeekViews rows are not even lazy loading yet. If, at some point, you decide to implement virtual scrolling or similar, it should experience a performance uplift again.
Did you already implement SQL queries that can include MapValues? If so, I'd like to try them out, to ensure the changes did not break anything in this area of the project.
@Zylence Thanks for your amazing work. I'll make sure to take a look at your changes the first chance I get!
Did you already implement SQL queries that can include MapValues? If so, I'd like to try them out, to ensure the changes did not break anything in this area of the project.
Complex types including maps are just being handled by casting them to their string representation: https://github.com/mukunku/ParquetViewer/blob/50c8ad9f1b582154feec28ae7f44d2e868ffc84d/src/ParquetViewer/MainForm.cs#L420-L428
It's not possible to query it in a structured way at the moment unfortunately. So best you can do is use the LIKE operator to search for a substring inside the MapValue.
I'll make sure to take a look at your changes the first chance I get!
Take your time. :)
Thanks for the info regarding the qurying.
I tested querying (exploratory). It seems to work like before. The evaluated entries that are displayed can be compared. The majority of map entries, that are not evaluated, can not be included in the comparison. But it seems to have been like that before.
Sorry for the delay. Merging this in now. Thanks so much for the contribution! I'll try to get a release out soon.
Hello, no worries; I was not in a hurry, since I could use the changes locally ^^. Just want to let you know, your tool helped me out a lot the past few weeks. Keep it up! :) And thanks for merging!