Scribe-Data
Fix/parsing logic error for sax
Contributor checklist
- [x] This pull request is on a separate branch and not the main branch
- [x] I have tested my code with the `pytest` command as directed in the testing section of the contributing guide
Description
Fix:
The enwiki `.ndjson` output was always produced without any data because the parser never read the text inside the XML elements. For example, in `<page> <title> Heading For The Article </title> </page>`, the text was not being read and passed back to build the array of articles/words that we need. I have added the missing callback triggers for reading the text inside those tags.
I have also added filtering to omit redirect links to various sources and namespace pages.
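For reviewers who are less familiar with SAX, here is a minimal sketch of the idea behind the fix. It is not the code in this pull request: the names `WikiPageHandler` and `parse_pages` are hypothetical, and it assumes the dump marks redirects with a `<redirect>` element and uses `<ns>0</ns>` for regular articles. It only uses Python's standard `xml.sax` module, where text inside a tag is delivered through the `characters` callback.

```python
from xml.sax import ContentHandler, parseString


class WikiPageHandler(ContentHandler):
    """Hypothetical handler: collects (title, text) for non-redirect articles."""

    CAPTURED_TAGS = {"title", "ns", "text"}

    def __init__(self):
        super().__init__()
        self.pages = []
        self._tag = None  # tag whose text is currently being buffered
        self._buffer = []
        self._fields = {}
        self._is_redirect = False

    def startElement(self, name, attrs):
        if name == "page":
            # Reset per-page state.
            self._fields = {}
            self._is_redirect = False
        elif name == "redirect":
            # Assumption: redirect pages carry a <redirect> element.
            self._is_redirect = True
        elif name in self.CAPTURED_TAGS:
            self._tag = name
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so buffer them.
        if self._tag is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self._tag:
            self._fields[name] = "".join(self._buffer)
            self._tag = None
        elif name == "page":
            # Assumption: a missing <ns> is treated as the main namespace.
            is_article = self._fields.get("ns", "0").strip() == "0"
            if is_article and not self._is_redirect:
                title = self._fields.get("title", "").strip()
                self.pages.append((title, self._fields.get("text", "")))


def parse_pages(xml_string):
    """Parse an XML string and return the kept (title, text) pairs."""
    handler = WikiPageHandler()
    parseString(xml_string.encode("utf-8"), handler)
    return handler.pages
```

Without the `characters` method, `startElement` and `endElement` still fire, but no text ever reaches the handler, which matches the empty output described above.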
I have also verified it by downloading 2 enwiki_dump files and trying to parse them into words 👇
@axif0 @andrewtavis @DeleMike Please have a look and let me know if any corrections are needed or if my understanding of the issue is wrong.
#641
Thank you for the pull request! ❤️
The Scribe-Data team will do our best to address your contribution as soon as we can. If you're not already a member of our public Matrix community, please consider joining! We'd suggest that you use the Element client as well as Element X for a mobile app, and definitely join the General and Data rooms once you're in. Also consider attending our bi-weekly Saturday dev syncs. It'd be great to meet you 😊
Maintainer Checklist
The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)
Let's review this in the call we have later, all :)
I will review today. Well done! ✨
All looks okay @priyankaforu!
I would love to test it locally. I did not see the steps to reproduce the issue you had in #641, so it is hard for me to have the complete experience like you describe in the issue.
Could you add this, please?
Hey @DeleMike, did you try downloading autosuggestions from wiki data? For autosuggestions, the issue is that the parser is not parsing the characters inside the
Okay, thanks. I will connect w/ you in Matrix.