Scribe-Data
Autosuggestions data type isn't accessible
Terms
- [X] I have searched all open bug reports
- [X] I agree to follow Scribe-Data's Code of Conduct
Behavior
Currently in `get.py`, there is no logic for autosuggestions.
In `process_wiki.py` there is already a function called `gen_autosuggestions` defined. Can we call that in `get.py`? If yes, I want to work on this issue.
Yes, this would be great, @axif0! But let's finish the current issues first :)
Discussing this a bit with @axif0 right now. We don't really think that the way that we're doing autosuggestions is something that would be applicable to other communities. Also, the long term plan is that we would actually not be using Wikipedia based calculated autosuggestions for this, but rather include a small LLM model in the language packs that would provide context based autosuggestions. Because of this, the work for this issue could potentially be to simply remove autosuggestions as an option for the CLI, and we will then use the notebook as needed in the Scribe community.
CC @wkyoshida and @mhmohona
Hello @andrewtavis 👋🏼, I hope you are doing well! I am eager to work on this bug. Can you assign me to it please 🙏🏽?
Hello @andrewtavis 👋🏼, I just made a PR about this issue here: https://github.com/scribe-org/Scribe-Data/pull/462 Please review it and let me know if I should modify something 😊.
For this one I think we're going to need to do some work to make it so that when someone passes `-dt autosuggestions` it will run the Wikipedia based processes. This would be great and would make the update of Scribe-iOS much easier. Basically:
- We should get rid of `gen_autosuggestions.ipynb` and move all of its functionality into a function that the CLI calls
- We should only accept one language with `-dt autosuggestions`, as we'll need to download a Wikipedia dump
- In a similar way that we download the Wikidata dump and parse it for the Lexeme data, we'd then also get the autosuggestions for this
- We'll also need to remake the words to ignore/skip to make sure that these are not being included in the end autosuggestions
  - These are in an older version of the language metadata file, and maybe we can just have one large list somewhere that's not in the metadata file 🤔
Do you have any further thoughts on the suggestions above, @axif0? This should just be hooking the wikipedia directory processes into the CLI, but might need some tweaks along the way.
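Not a final design, but a minimal sketch of the kind of branch `get.py` could grow: the helper `generate_autosuggestions` is a placeholder name, and the single-language check is just one way to enforce the dump constraint.

```python
# Hypothetical sketch of dispatching -dt autosuggestions in get.py (names are placeholders).
def get_data(languages: list[str], data_type: str, output_dir: str):
    if data_type == "autosuggestions":
        # A full Wikipedia dump is downloaded, so only one language is allowed per run.
        if len(languages) != 1:
            raise ValueError(
                "-dt autosuggestions accepts exactly one language, as a Wikipedia dump must be downloaded."
            )

        return generate_autosuggestions(language=languages[0], output_dir=output_dir)

    # ... existing Wikidata/Lexeme based handling for nouns, verbs, etc. continues here.
```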
Hey @andrewtavis, I have some questions:
- What does `gen_autosuggestions.ipynb` do, and what's it initially for?
- What is the autosuggestions data type? Is it generally different from other data types?
- To clarify, is the decision for this issue to go along with the initial plan of using `gen_autosuggestions` from `process_wiki.py`, and not to remove it?
Hey @catreedle 👋 Answers to your questions:
- The notebook as of now just kind of puts everything together and allows us to run the needed functions, but ideally we'd just control the functions via the CLI
- Yes, autosuggestions are different as we're generating them ourselves :) It's basically a word in the main column, and then we have three columns after this for the words that we'd suggest to the user in the end applications (there's a small sketch below).
- Yes let's definitely use the functions we have, and have them be controlled by the CLI 😊
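To make the shape above concrete, the output is essentially a mapping from a word to the (up to) three follow-up words shown to the user. The example values here are made up, not real data:

```python
# Illustrative autosuggestions output (example words are invented, not real data):
# each key word maps to up to three words suggested to the user in end applications.
autosuggestions = {
    "thank": ["you", "goodness", "them"],
    "good": ["morning", "luck", "night"],
}
```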
Let us know if you have further questions!
Thank you, @andrewtavis. For now I haven't been able to run the notebook.
The import of display likely changed in a new version. You can also just not display things and print them instead :)
So for the CLI autosuggestions, is the process step by step similar to that of the notebook, where we need to:
- download and parse the wiki,
- process and clean,
- generate autosuggestions?

And how long does downloading a Wikipedia dump usually take?
I would say we can do process/clean and generating the suggestions in one go, as there's nothing else we can do with a cleaned dump except generate the suggestions. And downloading the dump can take an hour or two depending on the size. You could use a dump that's not English Wikipedia so it's quicker?
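Put differently, the flow could reuse the existing functions roughly like this. The import paths, the `clean` function name, and the signatures are assumptions based on the current wikipedia directory, not verified against the repo:

```python
# Hypothetical end-to-end flow behind -dt autosuggestions (paths/signatures assumed).
from scribe_data.wikipedia.extract_wiki import download_wiki, parse_to_ndjson
from scribe_data.wikipedia.process_wiki import clean, gen_autosuggestions


def generate_autosuggestions(language: str, output_dir: str):
    # 1. Download the Wikipedia dump for the language (can take an hour or two).
    dump_files = download_wiki(language=language, target_dir=f"{output_dir}/dumps")

    # 2. Parse the dump's articles into a workable format.
    texts = parse_to_ndjson(input_paths=dump_files, output_path=f"{output_dir}/parsed")

    # 3. Clean and generate the suggestions in one go, since a cleaned dump has no
    #    other use here than generating autosuggestions.
    cleaned_texts = clean(texts=texts, language=language)
    gen_autosuggestions(cleaned_texts, language=language, update_local_data=True)
```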
Quick note for us: We should do a check of the docs after this as we'll need to fix some of it given the new/deleted files.
Ok. Thank you :)
Assigning you given the conversation above, @catreedle, but let us know if you need further help! Feel free to open up a PR early for this as well and we can check the progress 😊
Hi @andrewtavis, @axif0, I opened a PR here. I'm not really sure where to go from here and would need some guidance 😊 I noticed that `download_wiki` and `parse_to_ndjson` already have built-in checks to see if downloaded dumps and parsed Wikipedia articles exist, so these steps will be skipped.
Some questions:
- Will it be necessary to let the user force a re-download?
- Also, do users need to be able to specify a `dump_id`?

Any feedback or guidance would be helpful :)
Thanks for getting to this, @catreedle! Really is such a help 😊 I think that mirroring the Wikidata dump download process for this would be good. So yes, let's give them the option to download a specific dump, and if they already have the latest, let's give them the option to re-download.
Let's add those in and address the license header errors and unit testing errors on the PR. Maybe the unit tests would be more complex, so we can help you with that 😊
Thanks again!
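For the dump handling itself, here is a rough sketch of the check being described, mirroring how the Wikidata dump download behaves. The file naming and the download call are illustrative only:

```python
from pathlib import Path


def resolve_wikipedia_dump(dump_dir: str, dump_id: str | None = None, force_download: bool = False) -> Path:
    """Decide whether a Wikipedia dump needs to be (re)downloaded (sketch only)."""
    target_id = dump_id or "latest"
    dump_path = Path(dump_dir) / f"wikipedia-{target_id}.xml.bz2"  # illustrative naming

    if dump_path.exists() and not force_download:
        print(f"Using existing dump at {dump_path}.")
        return dump_path

    # The dump is missing or the user explicitly asked to re-download it.
    print(f"Downloading Wikipedia dump '{target_id}' to {dump_path} ...")
    # download_wiki(...) or an equivalent download call would run here.
    return dump_path
```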
Hey @andrewtavis @axif0,
Sorry for the delay. I've just added options for `dump_id` and `force_download` to the PR, along with a new interactive command, `get autosuggestions`.
Please take a look and let me know if you have any feedback. I'll continue working on the license header errors and unit testing later 😊
Closed by #595 🚀 Thanks for the great work here, @catreedle! Sorry the review took a bit longer than expected :) Looking forward to the next issue over here (or Scribe-Android, by the looks of it 😊)!
Thank you for the wrap-up, @andrewtavis. Looking forward to the next one 😊