stackexchange-xml-converter

Feature: Create json file line by line and filter using tags

Open mswillus opened this issue 2 years ago • 0 comments

I propose two features with this MR. The first adds an option to write the JSON output as one JSON object per line, one line per post. The second adds a filter that restricts the dataset to selected tags while it is being converted. For example, if you only care about a few testing-related tags, you can run:

./stackexchange-xml-converter \
    -result-format=json \
    -source-path=../data/Posts.xml \
    -store-to-dir "../data" \
    -filter-by-tag-id "\
        tdd\
        testing\
        testcase testing-library\
        unit-testing"\
    -json-one-line

This produces a filtered dataset containing only the posts that have at least one of those tags assigned. The resulting Posts.json file holds one JSON object per line, one for each post. I also added a flag, -filter-no-exact-match, that switches from exact matching to substring matching, so a post is included when any of its tags contains the given word. The following would give you all posts with tags that contain the word 'testing' (e.g. unit-testing, testing-library):

./stackexchange-xml-converter \
    -result-format=json \
    -source-path=../data/Posts.xml \
    -store-to-dir "../data" \
    -filter-by-tag-id "testing" \
    -json-one-line \
    -filter-no-exact-match
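
To make the difference between the two modes concrete, here is a minimal Python sketch of the matching semantics described above. It is not the converter's actual code, which this issue does not show. `post_matches` and its parameters are hypothetical names, and the sketch assumes each post's tags are already available as a list of strings.

```python
def post_matches(post_tags, wanted, exact=True):
    """Return True if the post should be kept.

    post_tags: list of tag strings on one post (assumed representation).
    wanted:    set of tag words given to the filter.
    exact:     True  -> a post tag must equal one of the wanted words;
               False -> a wanted word only has to occur inside a post tag.
    """
    if exact:
        return any(tag in wanted for tag in post_tags)
    return any(word in tag for tag in post_tags for word in wanted)


wanted = {"testing"}
print(post_matches(["unit-testing", "python"], wanted))               # False
print(post_matches(["unit-testing", "python"], wanted, exact=False))  # True
```

With exact matching, 'testing' only selects posts tagged exactly 'testing'. With substring matching it also selects 'unit-testing' and 'testing-library', which can pull in more posts than intended for short words.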

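One benefit of -json-one-line is that the output can be consumed one post at a time instead of parsing the whole file at once. A minimal Python sketch, assuming the file contains exactly one JSON object per non-empty line as described above. `read_posts` is a hypothetical helper, and no particular field names inside each object are assumed.

```python
import json


def read_posts(path):
    """Yield one decoded post per line of a line-delimited JSON file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)


# Example usage (path taken from the commands above):
# for post in read_posts("../data/Posts.json"):
#     print(post)
```

Because the function is a generator, memory use stays roughly constant regardless of how many posts pass the filter.
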
If you approve of the changes, I'd be happy to rebase and refactor.

mswillus • May 24 '22 16:05