xml-avro
Documentation around Splitting files
Hi, I would like to know if there are any examples of splitting large files. I would like to implement it somehow.
- Converts any XSD to a proper usable Avro schema (Avsc)
- Converts any XML to avro using the provided schema. What can it do? See the list below.
- Handle XML of any size (even gigabytes), as it streams the XML
- Read xml from stdin and output to stdout
- Validate the XML with XSD
- Split the data at any specified element (can have any number of splits)
- Handle multiple documents in single file (useful when streaming continuous data)
- Write out failed documents without killing the whole process
- Completely configurable
See https://github.com/GeethanadhP/xml-avro/blob/master/example/config.yml
for a sample config, but it doesn't include the config for splitting.
Below is the config section for splitting the data:
split: # Split the avro records based on the specified list
  -
    by: "bookName" # Split tag name
    avscFile: "name.avsc" # Avsc file for the split part
    avroFile: "name.avro" # Avro file name to save to
  -
    by: "bookPublisher"
    avscFile: "publisher.avsc"
    avroFile: "publisher.avro"
Assuming a file containing:
<bookName>
  <bunch_of_data/>
</bookName>
<bookPublisher>
  <bunch_of_data/>
</bookPublisher>
the first bunch goes into name.avro
and the second bunch goes into publisher.avro.
You might have to struggle with the avsc part, though. Frankly, I don't remember much; it's been around 3 years since I used the tool.
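To make the routing idea concrete, here is a minimal Python sketch of splitting a stream by tag name. It is not how xml-avro is implemented: the function name `split_by_tags`, the `<root>` wrapper and the sample data are made up for illustration, and plain lists of XML strings stand in for the Avro files that `avroFile` would name.

```python
import io
import xml.etree.ElementTree as ET


def split_by_tags(stream, split_tags):
    """Stream XML and group serialized elements by split tag name.

    Hypothetical helper: each list stands in for one output file.
    """
    buckets = {tag: [] for tag in split_tags}
    # iterparse reads incrementally and reports each element once its
    # closing tag has been seen, so the whole file is never parsed up front.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag in buckets:
            buckets[elem.tag].append(ET.tostring(elem, encoding="unicode").strip())
            elem.clear()  # drop the element's children and text once handled
    return buckets


# Made-up sample data; the <root> wrapper is an assumption.
sample = io.BytesIO(
    b"<root>"
    b"<bookName><title>Mistborn</title></bookName>"
    b"<bookPublisher><name>example</name></bookPublisher>"
    b"</root>"
)
parts = split_by_tags(sample, ["bookName", "bookPublisher"])
```

Here `parts["bookName"]` plays the role of name.avro and `parts["bookPublisher"]` the role of publisher.avro. A real run would also need a matching avsc schema for each split.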
- It handles gigabytes of data easily because it streams the data tag by tag instead of loading the whole XML at once.
- "Multiple documents" means the following: in the example below, assume <book> is your root tag and the file has 2 messages (2 books), so each book will be stored as a record in the output Avro. In general you would only get one root tag per file; I added this option for use with Flume, which combines a bunch of messages and saves them in a single file.
<book id="b001">
  <author>Brandon Sanderson</author>
  <title>Mistborn</title>
  <genre>Fantasy</genre>
  <price>50</price>
  <pub_date>2006-12-17T09:30:47.0Z</pub_date>
  <review>
    <title>Wonderful</title>
    <content>I love the plot twist and the new magic</content>
  </review>
  <review>
    <title>Unbelievable twist</title>
    <content>The best book i ever read</content>
  </review>
  <sold>10</sold>
</book>
<book id="b002">
  <author>Brandon Sanderson</author>
  <title>Way of Kings</title>
  <genre>Fantasy</genre>
  <price>50</price>
  <pub_date>2006-12-17T09:30:47.0Z</pub_date>
  <!--<alias>-->
  <!--<title>Way of the kings</title>-->
  <!--</alias>-->
  <!--<website>-->
  <!--<url></url>-->
  <!--</website>-->
  <sold>10</sold>
</book>
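A file like this has more than one root element, so a standard XML parser would reject it as a single document. One way to read it record by record is sketched below in Python. This is only an illustration under stated assumptions, not xml-avro's actual code: `iter_documents` and the synthetic `<wrapper>` element are hypothetical, and the sketch assumes the file has no XML declaration between documents.

```python
import itertools
import xml.etree.ElementTree as ET


def iter_documents(chunks, root_tag):
    """Yield each top-level root_tag element from a stream of text chunks.

    Hypothetical helper: the input is wrapped in a synthetic <wrapper>
    element so that several documents parse as one well-formed stream.
    """
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0
    for chunk in itertools.chain(["<wrapper>"], chunks, ["</wrapper>"]):
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
            else:
                depth -= 1
                # After closing, depth 1 means a direct child of <wrapper>.
                if depth == 1 and elem.tag == root_tag:
                    yield elem
                    elem.clear()  # the consumer must finish with elem before this


# Shortened version of the two-book example, fed in small chunks
# to mimic reading a large file piece by piece.
data = (
    '<book id="b001"><title>Mistborn</title><sold>10</sold></book>\n'
    '<book id="b002"><title>Way of Kings</title><sold>10</sold></book>\n'
)
chunks = (data[i:i + 16] for i in range(0, len(data), 16))
records = [(b.get("id"), b.findtext("title")) for b in iter_documents(chunks, "book")]
```

Each yielded `book` element corresponds to one record in the output, which matches the "each book will be stored as a record" behaviour described above. The chunks can break anywhere, including in the middle of a tag, because the pull parser buffers incomplete input.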