Include option to additionally retrieve external IDs for data
Terms
- [X] I have searched open and closed data issues
- [X] I agree to follow Scribe-Data's Code of Conduct
Languages
ALL
Description
This issue is to discuss an option (perhaps a flag) to also retrieve external IDs for data when running the data process. This would be opt-in, i.e. not the default behavior. On the Scribe-Server side, this information could later be useful for tracking when specific data points are new or have been updated in the external sources Scribe references, e.g. Wikidata. For those interested, it could also be useful to see the IDs themselves.
- For nouns, verbs, and prepositions, this is likely the Wikidata lexemes (see the sketch below).
- For translations, autosuggestions, and emoji keywords, the sources for these data points are from elsewhere, e.g. Wikipedia, Unicode CLDR, translation models. I believe these wouldn't really have IDs tied to them.

Considerations for Scribe-Server:
- I wonder if it could make sense to attempt to tie them to a matching Wikidata lexeme, but I'm still unsure, as this could likely get messy.
- Is there anything else we could use that makes sense?

Also, would doing this even make sense?
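For illustration, here's a minimal sketch of how lexeme IDs could come back alongside the data, assuming a SPARQL query against the Wikidata endpoint via SPARQLWrapper. The query, language (Q188, German), and lexical category (Q1084, noun) are example values, not Scribe-Data's actual queries:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Example only: German (Q188) nouns (Q1084) with their lexeme IDs.
QUERY = """
SELECT ?lexeme ?lemma WHERE {
  ?lexeme dct:language wd:Q188 ;
          wikibase:lexicalCategory wd:Q1084 ;
          wikibase:lemma ?lemma .
}
LIMIT 5
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

for result in sparql.query().convert()["results"]["bindings"]:
    # The lexeme URI ends in the ID we'd store, e.g. ".../entity/L42" -> "L42".
    lexeme_id = result["lexeme"]["value"].rsplit("/", 1)[-1]
    print(lexeme_id, result["lemma"]["value"])
```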
Open for discussion! :blush::eyes:
Hey @wkyoshida 👋 FYI I made a new issue in iOS that speaks to this even being something that we could include in the app data files 😊 See https://github.com/scribe-org/Scribe-iOS/issues/400. What that's saying is that when we have a verb conjugation not showing up, this could actually be a link to the Wikidata page for the given lexeme, such that the person could then enter the conjugation and have it show up in the next data download :)
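For what it's worth, building that link would only need the stored lexeme ID; a tiny sketch (the function name is hypothetical, the URL pattern is Wikidata's standard one):

```python
def lexeme_url(lexeme_id: str) -> str:
    """Build the Wikidata page URL for a lexeme ID such as 'L42'."""
    return f"https://www.wikidata.org/wiki/Lexeme:{lexeme_id}"
```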
It was decided in the dev sync to go ahead and implement at least the first idea proposed in this issue:
- For nouns, verbs, and prepositions, this is likely the Wikidata lexemes.
I've created a separate issue, #101, to track the work for this, and decided to leave this issue open to continue the discussion on potential ideas for the second point:
- For translations, autosuggestions, and emoji keywords...
Grabbing the lexemes though will already be a useful addition :grin:
Noting down some points here with long-term architecture in mind:
- Translations will eventually come from Wikidata and will thus have LIDs
- Autosuggestions will eventually come from included LLMs in the end applications
- Emojis being CLDR based makes it hard to actually put IDs on them
The real interest here is `lastModified`, which will be present for translations. Maybe the solution here is to add some kind of field to the emoji data for when the emoji data was last updated as a whole, so we know when to include it in data transfers: if the local `lastModified` on the emojis table is earlier than the one on Scribe-Server's version of the table, then we send the whole thing over. Or we could have a different `lastModified` for each emoji, where the current timestamp is set whenever an emoji is added or its keywords change.
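To make the two options concrete, a rough sketch in Python (helper names and the timestamp format are assumptions, not actual Scribe-Server code):

```python
from datetime import datetime, timezone

def whole_table_needs_sync(local_last_modified: str, server_last_modified: str) -> bool:
    """Option 1: a single lastModified for the emojis table as a whole.
    ISO 8601 strings like '2025-03-16T12:00:00Z' compare correctly as text,
    so if the local table is older than the server's, send the whole thing over."""
    return local_last_modified < server_last_modified

def touch_emoji(emoji_entry: dict) -> dict:
    """Option 2: a lastModified per emoji, set to the current timestamp
    whenever the emoji is added or its keywords change."""
    emoji_entry["lastModified"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return emoji_entry
```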
CC @axif0: What do you think on the above? :)
Big thing, let's not focus on this for translations and autosuggestions as hopefully a year and a half from now it won't even be needed :)
> we could have different `lastModified` for each emoji where if a change is made to add the emoji or change its keywords then the current timestamp is set?

I think the second approach, having a `lastModified` timestamp for each emoji, is the better option, as we'll then have a precise history of changes for each emoji:
```json
{
  "cheerful": [
    {
      "emoji": "😀",
      "is_base": false,
      "rank": 61,
      "lastModified": "2025-03-16T12:00:00Z"
    }
  ],
  "cheery": [
    {
      "emoji": "😀",
      "is_base": false,
      "rank": 61,
      "lastModified": "2025-03-16T12:00:00Z"
    }
  ]
}
```
- Do we need to convert `emoji_keywords.json` into `emoji_keywords.sqlite`?
- When uploading in Scribe-Server, it should check the keys like `cheerful` or `cheery` (question: are the keys unique?). If those keys are found, then we skip the data import; if no keys match, then we upload the key into the table with the last Scribe-Server data update time.

Does this make sense?
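A minimal sketch of that check, assuming Python's standard sqlite3 module and a hypothetical `emoji_keywords` table holding the keyword, an upload timestamp, and the keyword's emoji list stored as JSON:

```python
import json
import sqlite3
from datetime import datetime, timezone

def upload_keywords(db_path: str, keywords: dict) -> None:
    """Insert only keywords not yet present, stamped with the current
    Scribe-Server upload time; keys that already exist are skipped."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    con = sqlite3.connect(db_path)
    with con:  # commits on success
        for keyword, emojis in keywords.items():
            found = con.execute(
                "SELECT 1 FROM emoji_keywords WHERE keyword = ?", (keyword,)
            ).fetchone()
            if found:
                continue  # key found: skip the import
            con.execute(
                "INSERT INTO emoji_keywords (keyword, last_modified, emojis)"
                " VALUES (?, ?, ?)",
                (keyword, now, json.dumps(emojis)),
            )
    con.close()
```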
> Do we need to convert `emoji_keywords.json` into `emoji_keywords.sqlite`?
No, it's just an emojis table within the language SQLite DB. Because of this, I think that a `lastModified` for each keyword would be good, so that the final columns can be `keyword`, `last_modified`, `emoji_1`, `emoji_2` and `emoji_3` (btw these are renamed, as I'm realizing that the current versions don't make much sense). Maybe we can also do `emoji_4` just in case we ever want to do four emojis for tablets?
Your points in the second one make sense. We'll check the keyword to see if it doesn't exist or if the `lastModified` time is earlier than the current one, and if so we send along the data.
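For concreteness, a sketch of what that table and check could look like (SQL embedded in Python; the column names follow the proposal above, everything else is an assumption):

```python
from typing import Optional

# Sketch of the proposed emojis table inside a language SQLite DB;
# emoji_4 is reserved in case we ever want four emojis for tablets.
EMOJIS_SCHEMA = """
CREATE TABLE IF NOT EXISTS emojis (
    keyword       TEXT PRIMARY KEY,
    last_modified TEXT NOT NULL,  -- ISO 8601, e.g. '2025-03-16T12:00:00Z'
    emoji_1       TEXT,
    emoji_2       TEXT,
    emoji_3       TEXT,
    emoji_4       TEXT
);
"""

def should_send(local_row: Optional[dict], server_last_modified: str) -> bool:
    """Send the keyword's data along if it doesn't exist locally
    or its last_modified is earlier than the server's."""
    return local_row is None or local_row["last_modified"] < server_last_modified
```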
Let me know on the above! Maybe it makes sense for us to close this and make a new issue for the work we're describing?
> (btw these are renamed as I'm realizing that the current versions don't make much sense). Maybe we can also do `emoji_4` just in case we ever want to do four emojis for tablets?
German `emoji_keywords`:

```json
"fröhlich": [
  {
    "emoji": "😂",
    "is_base": false,
    "rank": 1,
    "lastModified": "last-server-upload_date"
  },
  {
    "emoji": "😁",
    "is_base": false,
    "rank": 12,
    "lastModified": "last-server-upload_date"
  },
  {
    "emoji": "🥳",
    "is_base": false,
    "rank": 30,
    "lastModified": "last-server-upload_date"
  }
],
```
In the emoji SQLite file, do we want something like this?
| Keyword | Last Modified | Emoji_1 | Rank_1 | Is_Base_1 | Emoji_2 | Rank_2 | Is_Base_2 | Emoji_3 | Rank_3 | Is_Base_3 | Emoji_4 | Rank_4 | Is_Base_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fröhlich | YYYY-MM-DD | 😂 | 1 | False | 😁 | 12 | False | 🥳 | 30 | False | (NULL) | (NULL) | (NULL) |
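If we go that way, flattening the JSON into those wide rows could look something like this sketch (the function name is hypothetical; it assumes each keyword maps to a rank-ordered emoji list, as in the example above):

```python
def flatten_keyword(keyword, entries, last_modified, width=4):
    """Build one wide row per keyword:
    (keyword, last_modified, emoji_1, rank_1, is_base_1, ..., emoji_4, rank_4, is_base_4).
    Unused slots are padded with None, which SQLite stores as NULL."""
    row = [keyword, last_modified]
    kept = entries[:width]
    for entry in kept:
        row += [entry["emoji"], entry["rank"], entry["is_base"]]
    row += [None] * (3 * (width - len(kept)))
    return tuple(row)

# The German example above yields a 14-column row with the emoji_4 slots NULL:
flatten_keyword(
    "fröhlich",
    [
        {"emoji": "😂", "is_base": False, "rank": 1},
        {"emoji": "😁", "is_base": False, "rank": 12},
        {"emoji": "🥳", "is_base": False, "rank": 30},
    ],
    "YYYY-MM-DD",
)
```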
What do you think?
- After fixing `emoji_keywords`, I'll finalize the whole export cycle, including interactive mode. Also, I see that converting to SQLite only works with the `--all` command: `scribe-data c -a -ot sqlite`
- Also, when using a query for a sub-language, the exported JSON files save as `Hindustani_urdu`. Shouldn't they save as `Hindustani/urdu/`? Dump and convert are following the `Hindustani/urdu/` convention.
I'd say we should lowercase all the column names, @axif0, but aside from that we're good :)