Include option to additionally retrieve external IDs for data
Terms
- [X] I have searched open and closed data issues
- [X] I agree to follow Scribe-Data's Code of Conduct
Languages
ALL
Description
This issue is to discuss an option (perhaps a flag) to also retrieve external IDs for data when running the data process. This would be opt-in, i.e. not the default behavior. On the Scribe-Server side, this information could later be useful for tracking when specific data points are new or have been updated in the external sources Scribe references, e.g. Wikidata. For those interested, it could also be useful to see the IDs themselves.
- For nouns, verbs, and prepositions, this is likely the Wikidata lexemes (see the sketch below).
- For translations, autosuggestions, and emoji keywords, the sources for these data points are from elsewhere, e.g. Wikipedia, Unicode CLDR, translation models. I believe these wouldn't really have IDs tied to them.

Considerations for Scribe-Server:
- I wonder if it could make sense to attempt to tie them to a matching Wikidata lexeme, but I'm still unsure, as this could likely get messy.
- Is there anything else we could use that makes sense?

Also, would doing this even make sense?
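For illustration, here's a minimal sketch of how lexeme IDs could come back alongside the data, assuming a SPARQL query against the Wikidata endpoint via SPARQLWrapper. The query, language (Q188, German), and lexical category (Q1084, noun) are example values, not Scribe-Data's actual queries:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Example only: German (Q188) nouns (Q1084) with their lexeme IDs.
QUERY = """
SELECT ?lexeme ?lemma WHERE {
  ?lexeme dct:language wd:Q188 ;
          wikibase:lexicalCategory wd:Q1084 ;
          wikibase:lemma ?lemma .
}
LIMIT 5
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

for result in sparql.query().convert()["results"]["bindings"]:
    # The lexeme URI ends in the ID we'd store, e.g. ".../entity/L42" -> "L42".
    lexeme_id = result["lexeme"]["value"].rsplit("/", 1)[-1]
    print(lexeme_id, result["lemma"]["value"])
```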
Open for discussion! :blush::eyes:
Hey @wkyoshida 👋 FYI I made a new issue in iOS that speaks to this even being something that we could include in the app data files 😊 See https://github.com/scribe-org/Scribe-iOS/issues/400. What that's saying is that when we have a verb conjugation not showing up, this could actually be a link to the Wikidata page for the given lexeme, such that the person could then enter the conjugation and have it show up in the next data download :)
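For what it's worth, building that link would only need the stored lexeme ID; a tiny sketch (the function name is hypothetical, the URL pattern is Wikidata's standard one):

```python
def lexeme_url(lexeme_id: str) -> str:
    """Build the Wikidata page URL for a lexeme ID such as 'L42'."""
    return f"https://www.wikidata.org/wiki/Lexeme:{lexeme_id}"
```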
It was decided in the dev sync to go ahead and implement at least the first idea proposed in this issue:
- For nouns, verbs, and prepositions, this is likely the Wikidata lexemes.
I've created a separate issue, #101, to track the work for this, and decided to leave this issue open to continue the discussion on potential ideas for the second point:
- For translations, autosuggestions, and emoji keywords...
Grabbing the lexemes though will already be a useful addition :grin:
Noting down some points here with long-term architecture in mind:
- Translations will eventually come from Wikidata and will thus have LIDs
- Autosuggestions will eventually come from included LLMs in the end applications
- Emojis being CLDR based makes it hard to actually put IDs on them
The real interest here is `lastModified`, which will be present for translations. Maybe the solution here is to add some kind of field to the emoji data for when the emoji data was last updated as a whole, so we know when to include it in data transfers: if the local `lastModified` on the emojis table is earlier than the one on Scribe-Server's version of the table, then we send the whole thing over. Or we could have a different `lastModified` for each emoji, where the current timestamp is set whenever an emoji is added or its keywords change.
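To make the two options concrete, a rough sketch in Python (helper names and the timestamp format are assumptions, not actual Scribe-Server code):

```python
from datetime import datetime, timezone

def whole_table_needs_sync(local_last_modified: str, server_last_modified: str) -> bool:
    """Option 1: a single lastModified for the emojis table as a whole.
    ISO 8601 strings like '2025-03-16T12:00:00Z' compare correctly as text,
    so if the local table is older than the server's, send the whole thing over."""
    return local_last_modified < server_last_modified

def touch_emoji(emoji_entry: dict) -> dict:
    """Option 2: a lastModified per emoji, set to the current timestamp
    whenever the emoji is added or its keywords change."""
    emoji_entry["lastModified"] = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return emoji_entry
```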
CC @axif0: What do you think on the above? :)
Big thing, let's not focus on this for translations and autosuggestions as hopefully a year and a half from now it won't even be needed :)
> we could have different `lastModified` for each emoji where if a change is made to add the emoji or change its keywords then the current timestamp is set?

I think the second approach, having a `lastModified` timestamp for each emoji, is the better option, as we'll then have a precise history of changes for each emoji:
```json
{
  "cheerful": [
    {
      "emoji": "😀",
      "is_base": false,
      "rank": 61,
      "lastModified": "2025-03-16T12:00:00Z"
    }
  ],
  "cheery": [
    {
      "emoji": "😀",
      "is_base": false,
      "rank": 61,
      "lastModified": "2025-03-16T12:00:00Z"
    }
  ]
}
```
- Do we need to convert `emoji_keywords.json` into `emoji_keywords.sqlite`?
- When uploading in Scribe-Server, it should check the keys like `cheerful` or `cheery` (question: are the keys unique?). If those keys are found, then we skip the data import; if no keys match, then we upload the key into the table with the last Scribe-Server data update time.

Does this make sense?
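A minimal sketch of that check, assuming Python's standard sqlite3 module and a hypothetical `emoji_keywords` table holding the keyword, an upload timestamp, and the keyword's emoji list stored as JSON:

```python
import json
import sqlite3
from datetime import datetime, timezone

def upload_keywords(db_path: str, keywords: dict) -> None:
    """Insert only keywords not yet present, stamped with the current
    Scribe-Server upload time; keys that already exist are skipped."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    con = sqlite3.connect(db_path)
    with con:  # commits on success
        for keyword, emojis in keywords.items():
            found = con.execute(
                "SELECT 1 FROM emoji_keywords WHERE keyword = ?", (keyword,)
            ).fetchone()
            if found:
                continue  # key found: skip the import
            con.execute(
                "INSERT INTO emoji_keywords (keyword, last_modified, emojis)"
                " VALUES (?, ?, ?)",
                (keyword, now, json.dumps(emojis)),
            )
    con.close()
```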
> Do we need to convert `emoji_keywords.json` into `emoji_keywords.sqlite`?
No, it's just an emojis table within the language SQLite DB. Because of this, I think that a `lastModified` for each keyword would be good, so that the final columns can be `keyword`, `last_modified`, `emoji_1`, `emoji_2` and `emoji_3` (btw these are renamed, as I'm realizing that the current versions don't make much sense). Maybe we can also do `emoji_4` just in case we ever want to do four emojis for tablets?
Your points in the second one make sense. We'll check the keyword to see if it doesn't exist or if the `lastModified` time is earlier than the current one, and if so we send along the data.
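For concreteness, a sketch of what that table and check could look like (SQL embedded in Python; the column names follow the proposal above, everything else is an assumption):

```python
from typing import Optional

# Sketch of the proposed emojis table inside a language SQLite DB;
# emoji_4 is reserved in case we ever want four emojis for tablets.
EMOJIS_SCHEMA = """
CREATE TABLE IF NOT EXISTS emojis (
    keyword       TEXT PRIMARY KEY,
    last_modified TEXT NOT NULL,  -- ISO 8601, e.g. '2025-03-16T12:00:00Z'
    emoji_1       TEXT,
    emoji_2       TEXT,
    emoji_3       TEXT,
    emoji_4       TEXT
);
"""

def should_send(local_row: Optional[dict], server_last_modified: str) -> bool:
    """Send the keyword's data along if it doesn't exist locally
    or its last_modified is earlier than the server's."""
    return local_row is None or local_row["last_modified"] < server_last_modified
```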
Let me know on the above! Maybe it makes sense for us to close this and make a new issue for the work we're describing?
> (btw these are renamed as I'm realizing that the current versions don't make much sense). Maybe we can also do `emoji_4` just in case we ever want to do four emojis for tablets?
German `emoji_keywords`:

```json
"fröhlich": [
  {
    "emoji": "😂",
    "is_base": false,
    "rank": 1,
    "lastModified": "last-server-upload_date"
  },
  {
    "emoji": "😁",
    "is_base": false,
    "rank": 12,
    "lastModified": "last-server-upload_date"
  },
  {
    "emoji": "🥳",
    "is_base": false,
    "rank": 30,
    "lastModified": "last-server-upload_date"
  }
],
```
In the emoji SQLite file, do we want something like this?
| Keyword | Last Modified | Emoji_1 | Rank_1 | Is_Base_1 | Emoji_2 | Rank_2 | Is_Base_2 | Emoji_3 | Rank_3 | Is_Base_3 | Emoji_4 | Rank_4 | Is_Base_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| fröhlich | YYYY-MM-DD | 😂 | 1 | False | 😁 | 12 | False | 🥳 | 30 | False | (NULL) | (NULL) | (NULL) |
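If we go that way, flattening the JSON into those wide rows could look something like this sketch (the function name is hypothetical; it assumes each keyword maps to a rank-ordered emoji list, as in the example above):

```python
def flatten_keyword(keyword, entries, last_modified, width=4):
    """Build one wide row per keyword:
    (keyword, last_modified, emoji_1, rank_1, is_base_1, ..., emoji_4, rank_4, is_base_4).
    Unused slots are padded with None, which SQLite stores as NULL."""
    row = [keyword, last_modified]
    kept = entries[:width]
    for entry in kept:
        row += [entry["emoji"], entry["rank"], entry["is_base"]]
    row += [None] * (3 * (width - len(kept)))
    return tuple(row)

# The German example above yields a 14-column row with the emoji_4 slots NULL:
flatten_keyword(
    "fröhlich",
    [
        {"emoji": "😂", "is_base": False, "rank": 1},
        {"emoji": "😁", "is_base": False, "rank": 12},
        {"emoji": "🥳", "is_base": False, "rank": 30},
    ],
    "YYYY-MM-DD",
)
```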
What do you think?
- After fixing `emoji_keywords`, I'll finalize the whole export cycle, including interactive mode. Also, I see that converting to SQLite only works with the `--all` command: `scribe-data c -a -ot sqlite`
- Also, when using a query for a sub-language, the exported JSON files save as `Hindustani_urdu`. Shouldn't they save as `Hindustani/urdu/`? Dump and convert are following the `Hindustani/urdu/` convention.
I'd say we should lowercase all the column names, @axif0, but aside from that we're good :)