SkillsExtractorCognitiveSearch

Remove absolute duplicates from skill_patterns.jsonl

Open TheOnlyWayUp opened this issue 2 years ago • 0 comments

While going through the data, I noticed a few duplicated patterns.

I wrote a quick Python script to remove absolute duplicates (lines whose contents are exactly identical):

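import json

seen_lines = set()     # raw lines already encountered (set gives O(1) lookups)
unique_objects = []    # parsed objects, in first-seen order

with open("skill_patterns.jsonl", encoding="utf-8") as h:
    for line in h:
        line = line.strip()
        # Skip blank lines and exact repeats of an earlier line.
        if not line or line in seen_lines:
            continue
        seen_lines.add(line)
        unique_objects.append(json.loads(line))

# Rewrite the file in place, one compact JSON object per line.
with open("skill_patterns.jsonl", "w", encoding="utf-8") as h:
    for item in unique_objects:
        h.write(json.dumps(item, separators=(",", ":")) + "\n")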
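The script above compares raw lines, so two entries that hold the same object but differ in key order or spacing would both survive. If that matters, one option is to compare a canonical serialization of each parsed object instead. The following is only a rough sketch of that idea, not something tested against the real file; the helper name `dedupe_jsonl_lines` and the `label`/`pattern` fields in the sample are assumptions made for illustration.

```python
import json


def dedupe_jsonl_lines(lines):
    """Return compact JSON strings for the unique objects in `lines`.

    First-seen order is kept. Hypothetical helper, not part of the repo.
    """
    seen = set()
    unique = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        obj = json.loads(line)
        # Canonical key: sorted keys and no extra whitespace, so key order
        # and spacing differences do not hide a duplicate.
        key = json.dumps(obj, sort_keys=True, separators=(",", ":"))
        if key not in seen:
            seen.add(key)
            unique.append(json.dumps(obj, separators=(",", ":")))
    return unique


# Illustrative sample only; the real file's fields may differ.
sample = [
    '{"label":"SKILL","pattern":"python"}\n',
    '{"pattern": "python", "label": "SKILL"}\n',
    '{"label":"SKILL","pattern":"java"}\n',
]
for row in dedupe_jsonl_lines(sample):
    print(row)
```

To apply it to the file, read the lines first, then write the returned strings back with a trailing newline each. Whether key-order-insensitive matching is wanted here is a judgment call; the raw-line version is the stricter definition of "absolute duplicate".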

TheOnlyWayUp, Sep 27 '22 08:09