stackexchange-xml-converter
Feature: Create json file line by line and filter using tags
This MR proposes two features. The first adds an option to write the JSON output as one JSON object per line, one line per post. The second introduces a filter that can be used to narrow down the dataset while converting it. Suppose you only care about a few tags related to testing. With the new flags you can run something like:
./stackexchange-xml-converter \
  -result-format=json \
  -source-path=../data/Posts.xml \
  -store-to-dir "../data" \
  -filter-by-tag-id "tdd testing testcase testing-library unit-testing" \
  -json-one-line
You will get a filtered dataset containing only the posts that have at least one of those tags assigned. Each post is written as one JSON object per line in the resulting Posts.json file.
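A file with one JSON object per line can be read as a stream instead of being parsed in one piece. The following is a minimal Python sketch of a consumer, not part of this MR. The helper name `iter_posts` is hypothetical, and the sketch assumes the file is UTF-8 encoded with one complete object per line, as described above.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_posts(path: Path) -> Iterator[dict]:
    """Yield one parsed post per non-empty line of a one-object-per-line JSON file.

    Hypothetical helper for illustration; assumes UTF-8 and one object per line.
    """
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline at end of file
                yield json.loads(line)
```

Calling `iter_posts(Path("../data/Posts.json"))` in a `for` loop would then handle one post at a time, so memory use stays roughly constant even for a large dump.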
I also added a flag, -filter-no-exact-match, that switches the filter from exact matching to substring matching: a post is included if any of its tags contains one of the given words.
The following gives you all posts with tags that contain the word 'testing' (e.g. unit-testing, testing-library).
./stackexchange-xml-converter \
  -result-format=json \
  -source-path=../data/Posts.xml \
  -store-to-dir "../data" \
  -filter-by-tag-id "testing" \
  -json-one-line \
  -filter-no-exact-match
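The difference between the two matching modes can be sketched in a few lines of Python. This only illustrates the semantics described above and is not the converter's actual implementation. The function name `post_matches` and its parameters are hypothetical.

```python
from typing import Iterable


def post_matches(post_tags: Iterable[str], wanted: Iterable[str], exact: bool = True) -> bool:
    """Return True if the post should be kept (illustrative sketch only).

    exact=True  : one of the post's tags must equal one of the wanted tags.
    exact=False : a wanted word only has to occur inside one of the post's tags,
                  which is the behaviour described for -filter-no-exact-match.
    """
    wanted = list(wanted)
    if exact:
        return any(tag in wanted for tag in post_tags)
    return any(word in tag for tag in post_tags for word in wanted)
```

With the filter word `testing`, a post tagged only `unit-testing` is dropped in exact mode and kept in substring mode.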
If you approve of the changes, I'd be happy to rebase and refactor.