DRILL-6820: Msgpack format reader

Open jcmcote opened this issue 6 years ago • 11 comments

Implementation of a msgpack format reader

  • schema learning
  • skip over malformed records
  • skip over invalid field names
  • skip over records not matching schema
  • writing msgpack has not yet been implemented
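
The thread does not show the reader's code, so the following is only a minimal sketch of the behaviours listed above (schema learning, skipping malformed records, skipping invalid field names, skipping records that do not match the schema). It works on already-decoded records represented as plain Python dicts rather than on real msgpack bytes. The names `learn_schema`, `read_records`, and the `VALID_NAME` rule are hypothetical and are not taken from the pull request.

```python
import re

# Hypothetical rule for a valid field name; the PR's actual rule is not stated in this thread.
VALID_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def learn_schema(records):
    """Infer {field name: Python type} from decoded records.

    The first non-null type seen for a field wins. Records that are not
    maps and fields with invalid names are ignored.
    """
    schema = {}
    for record in records:
        if not isinstance(record, dict):
            continue  # malformed record: not a map
        for name, value in record.items():
            if not isinstance(name, str) or not VALID_NAME.match(name):
                continue  # invalid field name
            if value is None:
                continue  # a null value carries no type information
            schema.setdefault(name, type(value))
    return schema


def read_records(records, schema):
    """Yield only the rows that fit the schema, dropping invalid field names."""
    for record in records:
        if not isinstance(record, dict):
            continue  # skip malformed record
        row = {}
        matches = True
        for name, value in record.items():
            if not isinstance(name, str) or not VALID_NAME.match(name):
                continue  # skip the invalid field, keep the rest of the record
            expected = schema.get(name)
            if expected is None or (value is not None and type(value) is not expected):
                matches = False  # unknown field or type mismatch
                break
            row[name] = value
        if matches:
            yield row


sample = [
    {"id": 1, "name": "a"},                      # well-formed
    "not a map",                                 # malformed record
    {"id": 2, "name": "b", "bad name!": True},   # one invalid field name
    {"id": "three", "name": "c"},                # type does not match schema
]
schema = learn_schema(sample[:1])
rows = list(read_records(sample, schema))
```

In this sketch the schema is learned from the first record only, and `rows` keeps the first and third records (the third without its invalid field). Whether the actual reader drops a whole record or only the offending field is an assumption here.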

Implementation of a Zstandard codec

  • only decompression is implemented

jcmcote avatar Oct 11 '18 12:10 jcmcote

@jcmcote could you add the corresponding JIRA as a prefix in the title of the pull request? Refer to the format of other pull requests here: https://github.com/apache/drill/pulls

vdiravka avatar Oct 30 '18 10:10 vdiravka

@jcmcote, ZStandard compression was added to the Hadoop library in HADOOP-13578. I think it would be better to build on the existing, well-tested implementation instead of introducing a custom one.

vvysotskyi avatar Nov 05 '18 09:11 vvysotskyi

Agreed. When will Drill pick up the new version of Hadoop? Is it a big deal to upgrade the version of Hadoop used?

jcmcote avatar Nov 06 '18 21:11 jcmcote

@jcmcote There is a Jira ticket for the Hadoop libs version update: DRILL-6540. There is an issue related to commons-logging; see details. There is also my "work in progress" branch in the ticket.

vdiravka avatar Nov 07 '18 12:11 vdiravka

@jcmcote, is it possible to split this pull request into two parts: leave here only the changes related to the Msgpack format reader, and continue work on the compression codecs in a separate Jira after the Hadoop library upgrade is done?

vvysotskyi avatar Nov 07 '18 12:11 vvysotskyi

@vvysotskyi Sure, I can split them up. Should be easy to do.

jcmcote avatar Nov 07 '18 16:11 jcmcote

Hey @paul-rogers, I've made many code review fixes and improvements to the msgpack reader. Could you have another look at it? I would very much like to have it approved and made part of the main code base. Thanks!

jcmcote avatar Jan 10 '19 14:01 jcmcote

@jcmcote, there is ongoing work to provide a schema using a file (https://issues.apache.org/jira/browse/DRILL-6835), so you might consider waiting for those changes to be published and then using the common approach to reading and writing schema files.

arina-ielchiieva avatar Jan 10 '19 14:01 arina-ielchiieva

okay sounds good

jcmcote avatar Jan 10 '19 15:01 jcmcote

Hi @jcmcote, are you still interested in completing this PR? The Enhanced Vector Framework (EVF) PRs were committed recently and could make this better and easier.

If you haven't seen this, here's a link to the tutorial by @paul-rogers https://github.com/paul-rogers/drill/wiki/EVF-Tutorial-Row-Batch-Reader.

cgivre avatar Jul 21 '19 20:07 cgivre

Hi @jcmcote, are you still interested in completing this PR?

cgivre avatar Sep 17 '19 12:09 cgivre