Scribe-Data
Autosuggestions data type isn't accessible
Terms
- [X] I have searched all open bug reports
- [X] I agree to follow Scribe-Data's Code of Conduct
Behavior
Currently in `get.py`, there is no logic for autosuggestions.
In `process_wiki.py` there is already a function called `gen_autosuggestions` defined. Can we call that in `get.py`? If yes, I want to work on this issue.
Yes, this would be great, @axif0! But let's finish the current issues first :)
Discussing this a bit with @axif0 right now. We don't really think that the way that we're doing autosuggestions is something that would be applicable to other communities. Also, the long term plan is that we would actually not be using Wikipedia based calculated autosuggestions for this, but rather include a small LLM model in the language packs that would provide context based autosuggestions. Because of this, the work for this issue could potentially be to simply remove autosuggestions as an option for the CLI, and we will then use the notebook as needed in the Scribe community.
CC @wkyoshida and @mhmohona
Hello @andrewtavis 👋🏼, I hope you are doing well! I am eager to work on this bug. Can you assign me to it please 🙏🏽?
Hello @andrewtavis 👋🏼, I just made a PR about this issue here: https://github.com/scribe-org/Scribe-Data/pull/462 Please review it and let me know if I should modify something 😊.
For this one I think we're going to need to do some work to make it so that when someone passes `-dt autosuggestions` it will run the Wikipedia based processes. This would be great and would make the update of Scribe-iOS much easier. Basically:
- We should get rid of `gen_autosuggestions.ipynb` and move all of its functionality into a function that the CLI calls
- We should only accept one language with `-dt autosuggestions`, as we'll need to download a Wikipedia dump
- In a similar way that we download the Wikidata dump and parse it for the Lexeme data, we'd then also get the autosuggestions for this
- We'll also need to remake the words to ignore/skip to make sure that these are not being included in the end autosuggestions
  - These are in an older version of the language metadata file, and maybe we can just have one large list somewhere that's not in the metadata file 🤔
Do you have any further thoughts on the suggestions above, @axif0? This should just be hooking the wikipedia directory processes into the CLI, but might need some tweaks along the way.
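Not a final design, but a minimal sketch of the kind of branch `get.py` could grow: the helper `generate_autosuggestions` is a placeholder name, and the single-language check is just one way to enforce the dump constraint.

```python
# Hypothetical sketch of dispatching -dt autosuggestions in get.py (names are placeholders).
def get_data(languages: list[str], data_type: str, output_dir: str):
    if data_type == "autosuggestions":
        # A full Wikipedia dump is downloaded, so only one language is allowed per run.
        if len(languages) != 1:
            raise ValueError(
                "-dt autosuggestions accepts exactly one language, as a Wikipedia dump must be downloaded."
            )

        return generate_autosuggestions(language=languages[0], output_dir=output_dir)

    # ... existing Wikidata/Lexeme based handling for nouns, verbs, etc. continues here.
```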
Hey @andrewtavis, I have some questions:
- What does `gen_autosuggestions.ipynb` do, and what's it initially for?
- What is the autosuggestions data type? Is it generally different from other data types?
- To clarify, is the decision for this issue to go along with the initial plan of using `gen_autosuggestions` from `process_wiki.py`, and not to remove it?
Hey @catreedle 👋 Answers to your questions:
- The notebook as of now just kind of puts everything together and allows us to run the needed functions, but ideally we'd just control the functions via the CLI
- Yes, autosuggestions are different as we're generating them ourselves :) It's basically a word in the main column, and then we have three columns after this for the words that we'd suggest to the user in the end applications (there's a small sketch below).
- Yes let's definitely use the functions we have, and have them be controlled by the CLI 😊
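To make the shape above concrete, the output is essentially a mapping from a word to the (up to) three follow-up words shown to the user. The example values here are made up, not real data:

```python
# Illustrative autosuggestions output (example words are invented, not real data):
# each key word maps to up to three words suggested to the user in end applications.
autosuggestions = {
    "thank": ["you", "goodness", "them"],
    "good": ["morning", "luck", "night"],
}
```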
Let us know if you have further questions!
Thank you, @andrewtavis. For now I haven't been able to run the notebook.
The import of display likely changed in a new version. You can also just not display things and print them instead :)
So for the CLI autosuggestions, is the process step by step similar to that of the notebook, where we need to:
- download and parse the wiki,
- process and clean,
- generate autosuggestions?

And how long does downloading a Wikipedia dump usually take?
I would say we can do process/clean and generating the suggestions in one go, as there's nothing else we can do with a cleaned dump except generate the suggestions. And downloading the dump can take an hour or two depending on the size. You could use a dump that's not English Wikipedia so it's quicker?
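Put differently, the flow could reuse the existing functions roughly like this. The import paths, the `clean` function name, and the signatures are assumptions based on the current wikipedia directory, not verified against the repo:

```python
# Hypothetical end-to-end flow behind -dt autosuggestions (paths/signatures assumed).
from scribe_data.wikipedia.extract_wiki import download_wiki, parse_to_ndjson
from scribe_data.wikipedia.process_wiki import clean, gen_autosuggestions


def generate_autosuggestions(language: str, output_dir: str):
    # 1. Download the Wikipedia dump for the language (can take an hour or two).
    dump_files = download_wiki(language=language, target_dir=f"{output_dir}/dumps")

    # 2. Parse the dump's articles into a workable format.
    texts = parse_to_ndjson(input_paths=dump_files, output_path=f"{output_dir}/parsed")

    # 3. Clean and generate the suggestions in one go, since a cleaned dump has no
    #    other use here than generating autosuggestions.
    cleaned_texts = clean(texts=texts, language=language)
    gen_autosuggestions(cleaned_texts, language=language, update_local_data=True)
```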
Quick note for us: We should do a check of the docs after this as we'll need to fix some of it given the new/deleted files.
Ok. Thank you :)
Assigning you given the conversation above, @catreedle, but let us know if you need further help! Feel free to open up a PR early for this as well and we can check the progress 😊
Hi @andrewtavis, @axif0, I opened a PR here. I'm not really sure where to go from here and would need some guidance 😊 I noticed that `download_wiki` and `parse_to_ndjson` already have built-in checks to see if downloaded dumps and parsed Wikipedia articles exist, so these steps will be skipped.
Some questions:
- Will it be necessary to let the user force a re-download?
- Also, do users need to be able to specify a `dump_id`?

Any feedback or guidance would be helpful :)
Thanks for getting to this, @catreedle! Really is such a help 😊 I think that mirroring the Wikidata dump download process for this would be good. So yes, let's give them the option to download a specific dump, and if they already have the latest, let's give them the option to re-download.
Let's add those in and address the license header errors and unit testing errors on the PR. Maybe the unit tests would be more complex, so we can help you with that 😊
Thanks again!
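For the dump handling itself, here is a rough sketch of the check being described, mirroring how the Wikidata dump download behaves. The file naming and the download call are illustrative only:

```python
from pathlib import Path


def resolve_wikipedia_dump(dump_dir: str, dump_id: str | None = None, force_download: bool = False) -> Path:
    """Decide whether a Wikipedia dump needs to be (re)downloaded (sketch only)."""
    target_id = dump_id or "latest"
    dump_path = Path(dump_dir) / f"wikipedia-{target_id}.xml.bz2"  # illustrative naming

    if dump_path.exists() and not force_download:
        print(f"Using existing dump at {dump_path}.")
        return dump_path

    # The dump is missing or the user explicitly asked to re-download it.
    print(f"Downloading Wikipedia dump '{target_id}' to {dump_path} ...")
    # download_wiki(...) or an equivalent download call would run here.
    return dump_path
```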
Hey @andrewtavis @axif0,
Sorry for the delay. I've just added options for `dump_id` and `force_download` to the PR, along with a new interactive command, `get autosuggestions`.
Please take a look and let me know if you have any feedback. I'll continue working on the license header errors and unit testing later 😊
Closed by #595 🚀 Thanks for the great work here, @catreedle! Sorry the review took a bit longer than expected :) Looking forward to the next issue over here (or Scribe-Android, by the looks of it 😊)!
Thank you for the wrap-up, @andrewtavis. Looking forward to the next one 😊