
Fix/parsing logic error for sax

Open priyankaforu opened this issue 2 months ago • 6 comments

Contributor checklist


Description

Fix:

The enwiki ".ndjson" was always coming out of parsing without any data. This is because the parser was unable to read the text inside the file's elements. For example, in "<page> <title> Heading For The Article </title> </page>" the title text was not being read and passed back, so the array of articles/words that we need was never filled. I have added the missing callback triggers for reading the text inside those tags.
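For anyone unfamiliar with SAX, here is a minimal sketch of the idea, not the actual code in this PR. It uses Python's standard `xml.sax` module, and the names `TitleHandler` and `parse_titles` are made up for illustration. Without a `characters` callback, a handler only sees the tags opening and closing and never receives the text between them.

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Hypothetical handler that collects the text of every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so we accumulate.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer).strip())
            self._in_title = False


def parse_titles(xml_text):
    """Return the list of <title> texts found in an XML string."""
    handler = TitleHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.titles
```

Calling `parse_titles` on the example above returns the single title with the surrounding whitespace stripped. If the `characters` method is removed, the same call returns a list containing one empty string, which matches the "parsed without any data" symptom.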

I also added filtering to omit the redirect links to various sources and the namespace pages.
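A rough sketch of what such filtering can look like, again with `xml.sax` and not the exact code in this PR. It assumes the usual MediaWiki export layout, where each `<page>` has an `<ns>` element and redirect pages carry a `<redirect />` element. It also assumes that only the main namespace (`ns` equal to 0) should be kept, which may differ from the criteria used here. `PageHandler` and `article_titles` are hypothetical names.

```python
import xml.sax


class PageHandler(xml.sax.ContentHandler):
    """Hypothetical handler: keep titles of main-namespace, non-redirect pages."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._page = None  # data for the <page> currently being read
        self._tag = None   # "title" or "ns" while inside that element

    def startElement(self, name, attrs):
        if name == "page":
            self._page = {"title": "", "ns": "", "redirect": False}
        elif self._page is not None:
            if name == "redirect":
                self._page["redirect"] = True
            elif name in ("title", "ns"):
                self._tag = name

    def characters(self, content):
        # Text can arrive in several chunks, so append instead of assigning.
        if self._page is not None and self._tag is not None:
            self._page[self._tag] += content

    def endElement(self, name):
        if name in ("title", "ns"):
            self._tag = None
        elif name == "page" and self._page is not None:
            page, self._page = self._page, None
            # Assumption: ns "0" is the article namespace we want to keep.
            if page["ns"].strip() == "0" and not page["redirect"]:
                self.titles.append(page["title"].strip())


def article_titles(xml_text):
    """Return titles of pages that are neither redirects nor namespace pages."""
    handler = PageHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.titles
```

The check happens once per page in `endElement`, after the `<ns>` and `<redirect />` elements have both been seen, so the order of those elements inside `<page>` does not matter.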

I have also verified it by downloading 2 enwiki_dump files and trying to parse them into words 👇

@axif0 @andrewtavis @DeleMike Please have a look and lemme know if there are any corrections to be made or if my approach in understanding the issue is wrong

image

#641

priyankaforu avatar Oct 10 '25 10:10 priyankaforu

Thank you for the pull request! ❤️

The Scribe-Data team will do our best to address your contribution as soon as we can. If you're not already a member of our public Matrix community, please consider joining! We'd suggest that you use the Element client as well as Element X for a mobile app, and definitely join the General and Data rooms once you're in. Also consider attending our bi-weekly Saturday dev syncs. It'd be great to meet you 😊

github-actions[bot] avatar Oct 10 '25 10:10 github-actions[bot]

Maintainer Checklist

The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

  • [ ] The linting and formatting workflow within the PR checks do not indicate new errors in the files changed

  • [ ] The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

github-actions[bot] avatar Oct 10 '25 10:10 github-actions[bot]

Let's review this in the call we have later, all :)

andrewtavis avatar Oct 11 '25 10:10 andrewtavis

I will review today. Well done! ✨

DeleMike avatar Oct 15 '25 04:10 DeleMike

All looks okay @priyankaforu!

I would love to test it locally. I did not see the steps to reproduce the issue you had in #641, so it is hard for me to have the complete experience like you describe in the issue.

Could you add this, please?

Hey @DeleMike, did you try downloading autosuggestions from wiki data? For autosuggestions, the issue is that the parser is not parsing the characters inside the tags. To fix that, we need to let the SAX parser read the characters between the tags and store them in the array. I just added the missing methods there :) If this is still unclear, I can explain it better by connecting with you whenever you have time.

priyankaforu avatar Oct 15 '25 22:10 priyankaforu

Okay, thanks. I will connect w/ you in Matrix.

DeleMike avatar Oct 16 '25 07:10 DeleMike