xml-avro

Documentation around Splitting files

Open meticulo3366 opened this issue 4 years ago • 1 comments

Hi, I would like to know if there are any examples of splitting large files. I would like to implement it somehow.

  • Converts any XSD to a proper usable Avro schema (Avsc)
  • Converts any XML to avro using the provided schema. What can it do? See the list below.
    • Handle XML of any size (even gigabytes), as it streams the XML
    • Read xml from stdin and output to stdout
    • Validate the XML with XSD
    • Split the data at any specified element (any number of splits is allowed)
    • Handle multiple documents in single file (useful when streaming continuous data)
    • Write out failed documents without killing the whole process
    • Completely configurable

meticulo3366 avatar Feb 04 '21 19:02 meticulo3366

See https://github.com/GeethanadhP/xml-avro/blob/master/example/config.yml for a sample config, though it doesn't include the config for splitting.

Below is the config section for splitting the data

split:                        # Split the avro records based on the specified list
    -
      by: "bookName"            # Split tag name
      avscFile: "name.avsc"     # Avsc File for the split part
      avroFile: "name.avro"     # Avro file name to save to
    -
      by: "bookPublisher"
      avscFile: "publisher.avsc"
      avroFile: "publisher.avro"

Assume a file containing:

<bookName>
    <bunch_of_data/>
</bookName>
<bookPublisher>
    <bunch_of_data/>
</bookPublisher>

So the first bunch goes into name.avro and the second bunch goes into publisher.avro, but you might have to struggle with the avsc part. Frankly, I don't remember much; it's been around 3 years since I used the tool.
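To make the idea of splitting at an element more concrete, here is a minimal Python sketch of the concept. It is not how xml-avro is implemented (the tool itself is not written in Python); it only illustrates streaming an XML file and routing each completed subtree by its tag name. The `SPLITS` mapping, the `split_stream` function, the wrapping `<book>` root, and the sample values are all assumptions made up for this illustration, and plain lists stand in for the Avro output files.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical mapping that mirrors the `split` section above:
# split tag name -> output name. Lists stand in for the Avro files.
SPLITS = {"bookName": "name.avro", "bookPublisher": "publisher.avro"}


def split_stream(source, splits):
    """Stream `source` and group the subtree of each split tag by output name."""
    outputs = {target: [] for target in splits.values()}
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        target = splits.get(elem.tag)
        if target is not None:
            # Serialize the completed subtree, then drop its contents
            # so they do not stay in memory.
            outputs[target].append(ET.tostring(elem, encoding="unicode").strip())
            elem.clear()
    return outputs


# Made-up sample input with one element per split tag.
sample = io.BytesIO(
    b"<book>"
    b"<bookName><title>Mistborn</title></bookName>"
    b"<bookPublisher><name>Example Press</name></bookPublisher>"
    b"</book>"
)
result = split_stream(sample, SPLITS)
print(result)
```

In a real conversion each bucket would be written out using its own avsc schema, which is the part the config's `avscFile` entries point at.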

  1. It handles gigabytes of data very easily because it streams the data tag by tag instead of loading the whole XML at once.
  2. "Multiple documents" means the following: in the example below, assume book is your root tag and the file has 2 messages (2 books), so each book will be stored as a record in the output avro. Generally you would only get one root tag per file; I added this option for use with Flume, where it combines a bunch of messages and saves them in a single file.
    <book id="b001">
        <author>Brandon Sanderson</author>
        <title>Mistborn</title>
        <genre>Fantasy</genre>
        <price>50</price>
        <pub_date>2006-12-17T09:30:47.0Z</pub_date>
        <review>
            <title>Wonderful</title>
            <content>I love the plot twist and the new magic</content>
        </review>
        <review>
            <title>Unbelievable twist</title>
            <content>The best book i ever read</content>
        </review>
        <sold>10</sold>
    </book>
    <book id="b002">
        <author>Brandon Sanderson</author>
        <title>Way of Kings</title>
        <genre>Fantasy</genre>
        <price>50</price>
        <pub_date>2006-12-17T09:30:47.0Z</pub_date>
        <!--<alias>-->
            <!--<title>Way of the kings</title>-->
        <!--</alias>-->
        <!--<website>-->
            <!--<url></url>-->
        <!--</website>-->
        <sold>10</sold>
    </book>
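
For anyone who wants to experiment with the "multiple documents in a single file" idea outside the tool, below is a rough Python sketch. It is not xml-avro's code; it just shows one way to read several root-level elements from one stream by feeding a synthetic wrapper root to an incremental parser. The `iter_documents` function and the `<wrapper>` tag are names invented for this example, and the approach assumes the individual documents carry no XML declaration or DOCTYPE of their own.

```python
import xml.etree.ElementTree as ET


def iter_documents(chunks, record_tag):
    """Yield each top-level `record_tag` element from concatenated XML documents."""
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0

    def drain():
        nonlocal depth
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
            else:
                depth -= 1
                # depth == 1 means a direct child of the synthetic root just closed.
                if depth == 1 and elem.tag == record_tag:
                    yield elem

    # Synthetic root so that several root-level documents parse as one tree.
    parser.feed("<wrapper>")
    for chunk in chunks:
        parser.feed(chunk)
        yield from drain()
    parser.feed("</wrapper>")
    parser.close()
    yield from drain()


# Two shortened books, deliberately cut in the middle of a tag to mimic
# data arriving in pieces.
chunks = [
    '<book id="b001"><title>Mistborn</title><sold>10</sold></book>\n<book id="b0',
    '02"><title>Way of Kings</title><sold>10</sold></book>',
]
records = [(b.get("id"), b.findtext("title")) for b in iter_documents(chunks, "book")]
print(records)
```

Each yielded element corresponds to one message, which is the unit that would become one record in the output.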

GeethanadhP avatar Feb 06 '21 00:02 GeethanadhP