Support Wikidata lexemes

Open so9q opened this issue 5 years ago • 16 comments

To my knowledge, this feature is missing. OpenRefine should be able to do the same as LexData: https://github.com/Nudin/LexData

so9q avatar Dec 08 '19 19:12 so9q

Indeed! Lexemes are not supported at the moment. We would be very interested in adding support for this, but as far as I am aware nobody has planned to spend time working on it. If anyone is keen, I would be happy to discuss potential architectures. A first step towards this is happening in Wikidata-Toolkit: https://github.com/Wikidata/Wikidata-Toolkit/issues/437

It would be useful if you could give a few examples of the datasets you would import with OpenRefine if that feature were available. As a user, how do you expect this integration to work?

wetneb avatar Dec 09 '19 20:12 wetneb

Hi,

I would love to have a "simple" tool to edit Lexemes! (I tried LexData but my Python-fu is low).

Here is a simple example of something I have wanted to do for a long time:

  • run a query to get Lexemes with no form (since all lexemes should have at least one form): https://w.wiki/Dg7
  • maybe some cleaning and mumbo-jumbo (optional)
  • paste these results as forms (voilà \o/)

belett avatar Dec 11 '19 10:12 belett

Thanks @belett! Just to make sure I understand correctly, you would take the lemma for each lexeme without a form, and add it as a form to that lexeme? Would you add any grammatical features to them?

wetneb avatar Dec 11 '19 10:12 wetneb

> Thanks @belett! Just to make sure I understand correctly, you would take the lemma for each lexeme without a form, and add it as a form to that lexeme?

Yes.

> Would you add any grammatical features to them?

Ideally yes, but that's the mumbo-jumbo part, I'm not sure how to do it :P (probably some guessing - lemmata for nouns are mostly singular - and a lot of manual checking). Anyway, at some point, I definitely need to add the grammatical features!
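
The workflow above (take the lemma of a form-less lexeme, add it as a form, guess the grammatical features) can be sketched in Python. This is only an illustration under stated assumptions, not something OpenRefine or LexData provides: the payload shape follows the Wikibase Lexeme JSON accepted by `wbeditentity` as far as it is understood here and should be checked against the API documentation; `build_form_from_lemma` is a hypothetical helper; Q1084 and Q110786 are assumed to be the Wikidata items for "noun" and "singular"; and the lemma used is example data.

```python
import json

NOUN = "Q1084"        # assumed Wikidata item for "noun"
SINGULAR = "Q110786"  # assumed Wikidata item for "singular"


def build_form_from_lemma(lemmas, lexical_category):
    """Return a wbeditentity-style payload adding the lemma(s) as one new form.

    `lemmas` maps a language code to the lemma spelling, e.g. {"br": "ki"}.
    """
    representations = {
        lang: {"language": lang, "value": value}
        for lang, value in lemmas.items()
    }
    # Guess: noun lemmata are mostly singular. Anything else gets no
    # grammatical features and has to be checked by hand.
    features = [SINGULAR] if lexical_category == NOUN else []
    return {
        "forms": [
            {
                "add": "",  # assumed marker for "create a new form"
                "representations": representations,
                "grammaticalFeatures": features,
                "claims": [],
            }
        ]
    }


payload = build_form_from_lemma({"br": "ki"}, NOUN)
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

The guessed features would still need the manual checking mentioned above before any upload.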

belett avatar Dec 11 '19 10:12 belett

To anyone interested in lexemes: you are welcome to contact me with tool ideas. Today I wrote a new one because OpenRefine could not do what I wanted :)

dpriskorn avatar Apr 04 '21 19:04 dpriskorn

Hello there, I just wanted to let you know that we fixed the issue that was preventing Senses and statements from being edited through wbeditentity (T199896), which we hope will help tool maintainers support Lexemes. We would of course love to see OpenRefine supporting Lexemes, as it would be helpful for many Wikidata editors :)

If you have questions, issues or requests, feel free to contact me (not on this account as it's my personal one, rather at [email protected]) Thanks!

Auregann avatar Apr 26 '21 14:04 Auregann

@wetneb Are we still blocked upstream on this, or can we remove that label? It would be nice to have a paragraph on the current status, or better yet, an update to the original comment with the current state of things as we know it so far.

thadguidry avatar Jun 30 '21 15:06 thadguidry

Yes, we still need https://github.com/Wikidata/Wikidata-Toolkit/issues/437 first. And also we need some progress on #3210, which should happen as part of the Wikimedia Commons project (#2144).

wetneb avatar Jun 30 '21 16:06 wetneb

Removing the "blocked upstream" label since Wikidata-Toolkit does support lexeme editing now.

wetneb avatar Oct 06 '21 07:10 wetneb

I just received a question via Telegram asking whether this issue is up to date. I assume it is?

trnstlntk avatar Dec 01 '22 17:12 trnstlntk

Yes, it is current. To summarize, a number of roadblocks have been lifted:

  • Wikidata-Toolkit supports lexeme editing
  • OpenRefine has the required architecture to edit different types of Wikibase entities, as demonstrated by the Wikimedia Commons integration

But lexeme support itself has not been started at all. What would be useful - especially coming from users requesting this integration - would be some descriptions of:

  • the sorts of imports you would want to do with this feature (creating new lexemes? adding statements to senses? and so on), and what your initial dataset would look like for that
  • for people with an inclination for design, a proposal (such as a visual mock-up, or textual description) of what lexeme editing would look like in OpenRefine, in particular given their nested structure (the ability to have forms and senses inside lexemes). Just describe the user experience you would expect as precisely as you can. What would it look like to create a new lexeme with three forms and four senses via OpenRefine? To add a sense to a lexeme? To add a statement to an existing sense in a lexeme? And so on.

I am not planning to work on this personally, but I believe such descriptions would help prospective contributors wrap their heads around the issue. Feel free to add those in this GitHub issue or on our forum.

wetneb avatar Dec 01 '22 18:12 wetneb

This seems like a good thing for the Wikidata team to implement. It's well outside the core of OpenRefine.

Having said that, I'd much rather see them fix some basic stuff like their Search API first. That would be a much bigger benefit to the OpenRefine community by enabling a more useful and performant reconciliation service (they could support a production reconciliation service for that matter).

tfmorris avatar Dec 01 '22 19:12 tfmorris

I am not sure why you see it like that? I would say this is perfectly in scope for the Wikibase extension - it's about adding support for an entity type that is supported by Wikidata, the flagship Wikibase instance, so I don't see why it would be out of scope.

On top of implementing a new reconciliation service (or even three of them, if lexemes, forms and senses have separate endpoints), such an integration would involve changes to the Wikibase extension itself. I would say the latter is clearly outside of the remit of the Wikidata team.

That being said, it is true that this would inflate the size of the Wikibase extension quite a bit, and make it even more worth migrating to its own repository (#5282), potentially maintained by another team.

Ideally, the OpenRefine Wikibase extension itself would be extensible, and let other extensions define support for other entity types, mirroring the extensibility of Wikibase itself. We are currently quite far from that.

wetneb avatar Dec 01 '22 19:12 wetneb

> What would be useful - especially coming from users requesting this integration - would be some descriptions of:
>
>   • the sorts of imports you would want to do with this feature (creating new lexemes? adding statements to senses? and so on), and what your initial dataset would look like for that
>   • for people with an inclination for design, a proposal (such as a visual mock-up, or textual description) of what lexeme editing would look like in OpenRefine, in particular given their nested structure (the ability to have forms and senses inside lexemes). Just describe the user experience you would expect as precisely as you can. What would it look like to create a new lexeme with three forms and four senses via OpenRefine? To add a sense to a lexeme? To add a statement to an existing sense in a lexeme? And so on.

To encourage broader participation I have also posted these questions on the forum: https://forum.openrefine.org/t/openrefine-support-for-lexemes-in-wikidata-how-would-you-use-this/216

trnstlntk avatar Dec 01 '22 19:12 trnstlntk

> To anyone interested in lexemes. You are welcome to contact me with tool-ideas. Today I wrote an new one because OpenRefine could not do what I wanted :)

Hi. It would be very nice to have lexeme/sense/form reconciliation, i.e. something like https://ordia.toolforge.org/text-to-lexemes but with an added function for manually selecting the matching form (with the grammatical features that apply to the form in the text to link) and sense, and returning the result as annotated text (a table with one text token per row and the manually validated form/sense matches in the second column). That is, a tool for supervised text linking.

@dpriskorn, that would be something like your subtitle linker, but for any TXT in a certain language that one submits to the tool, plus the manual selection function for choosing the correct form/sense match.
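
To make the requested output concrete, here is a minimal Python sketch of the "one token per row" table, before any manual validation. Everything in it is an assumption for illustration: `FORM_INDEX` is a hypothetical in-memory lookup standing in for a real form reconciliation service, the form IDs are placeholders rather than real Wikidata forms, and the tokenizer is a naive regular expression.

```python
import re

# Hypothetical index: surface form -> candidate (form ID, grammatical features).
# The IDs are placeholders for illustration, not real Wikidata forms.
FORM_INDEX = {
    "dogs": [("L-dog-F2", ["plural"])],
    "bark": [("L-bark-noun-F1", ["singular"]), ("L-bark-verb-F1", ["infinitive"])],
}


def annotate(text, index):
    """Return one row per token: (token, list of candidate form matches)."""
    tokens = re.findall(r"\w+", text.lower())
    return [(token, index.get(token, [])) for token in tokens]


rows = annotate("Dogs bark loudly", FORM_INDEX)
for token, candidates in rows:
    print(token, candidates)
```

Tokens with several candidates (like "bark" here) are exactly the rows where the manual selection step would be needed; tokens with none would stay unlinked.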

dlindem avatar Nov 14 '23 09:11 dlindem

From the OpenRefine meetup at Wikimania:

Statement support on its own would be very valuable to editors, as they could work on things like external identifiers, which are highly valuable for Lexeme discovery purposes.

Given the support for non wikibase#Entities types, we could probably put such support together with rather minimal effort.

Abbe98 avatar Aug 09 '24 16:08 Abbe98

Since there has been discussion in the OpenRefine-Wikimedia channel about what it would take to implement this, here is a rough overview of the files that would need changing to enable basic lexeme editing (just editing statements on lexemes, ignoring forms and senses entirely). All paths are relative to extensions/wikibase. There might be more economical ways to add support for lexemes; they would probably be hackier, but might suit folks who just want something that works for them without submitting it as a PR.

Create Lexeme variants of the following files (replacing MediaInfo with Lexeme and making the necessary adjustments to remove the file path, file name, wikitext and terms fields):

./src/org/openrefine/wikibase/schema/entityvalues/ReconMediaInfoIdValue.java
./src/org/openrefine/wikibase/schema/entityvalues/SuggestedMediaInfoIdValue.java
./src/org/openrefine/wikibase/schema/WbMediaInfoEditExpr.java
./src/org/openrefine/wikibase/updates/MediaInfoEdit.java
./src/org/openrefine/wikibase/updates/MediaInfoEditBuilder.java

and write the corresponding tests.

Wire up the newly created ReconLexemeIdValue (imitating the existing structure and adding a case for lexemes) in:

./src/org/openrefine/wikibase/editing/ReconEntityRewriter.java
./src/org/openrefine/wikibase/schema/WbEntityVariable.java

and write the corresponding tests.

Make any necessary small adaptations to ./src/org/openrefine/wikibase/editing/EditBatchProcessor.java to handle the editing itself, and write the corresponding tests.

Make changes to ./module/scripts/schema-alignment.js to let the user add a new lexeme to the schema. This will primarily amount to imitating the SchemaAlignment._addItem / SchemaAlignment._itemToJSON functions, adapting them for lexemes (removing the terms box for instance), and wiring them up in SchemaAlignment._addEntity and SchemaAlignment._entityToJSON. Add localization strings introduced by the changes to schema-alignment ("add lexeme" button, for instance).

Make changes to the Wikidata manifest for it to announce that it supports lexemes.
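
As a hedged sketch of what that manifest change might amount to, assuming the manifest declares supported entity types in an `entity_types` map keyed by type name: the `lexeme` key, the abridged manifest shape, and the `with_lexeme_support` helper below are all assumptions for illustration and should be checked against the actual manifest format.

```python
import copy

# Abridged, assumed shape of the relevant part of a Wikibase manifest.
manifest = {
    "version": "2.0",
    "entity_types": {
        "item": {"site_iri": "http://www.wikidata.org/entity/"},
        "property": {"site_iri": "http://www.wikidata.org/entity/"},
    },
}


def with_lexeme_support(manifest, site_iri):
    """Return a copy of the manifest that also announces lexemes."""
    updated = copy.deepcopy(manifest)
    # "lexeme" is a hypothetical key. No reconciliation endpoint is declared,
    # since no lexeme recon service is assumed to exist in this scenario.
    updated["entity_types"]["lexeme"] = {"site_iri": site_iri}
    return updated


updated = with_lexeme_support(manifest, "http://www.wikidata.org/entity/")
print(sorted(updated["entity_types"]))
```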

Optionally, add scrutinizers to generate constraint checks for lexemes if needed. The existing ones may need some adaptations if they mistakenly fire on lexemes.

To use this integration, one would then need to use a Wikidata recon service with the "Use values as identifiers" operation to create reconciled cells from L-ids, that could then be used in the schema. Making a recon service for lexemes would of course make that more usable.
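
Before running "Use values as identifiers" on a column, it can help to check that the cells really contain L-ids. The following Python sketch is one way to do that outside OpenRefine; it assumes lexeme identifiers consist of "L" followed by a positive integer, and `split_lexeme_ids` is a hypothetical helper, not part of any existing tool.

```python
import re

# Assumed L-id syntax: "L" followed by a positive integer (no leading zero).
LEXEME_ID = re.compile(r"L[1-9][0-9]*")


def split_lexeme_ids(cells):
    """Separate cell values that look like L-ids from those that do not."""
    valid, invalid = [], []
    for cell in cells:
        value = cell.strip()
        if LEXEME_ID.fullmatch(value):
            valid.append(value)
        else:
            invalid.append(value)
    return valid, invalid


valid, invalid = split_lexeme_ids(["L42", " L7 ", "Q42", "L42-F1", "l42", ""])
print(valid)
print(invalid)
```

Form and sense identifiers such as "L42-F1" are deliberately rejected here, since the overview above only covers statements on lexemes themselves.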

wetneb avatar Jan 10 '25 15:01 wetneb

@wetneb Thank you for writing this up! THIS is the kind of stuff that I love seeing from you! Not actually doing the work, but performing enough of the analysis and high-level design that others can come to the table and contribute lower-level design and implementations to OpenRefine, with more confidence about which code areas to begin exploring. OpenRefine is a fairly large project structurally now, and you doing more writeups in this style makes it so much easier to grow our contributor base!

Kudos! Keep going with this style in more issues!

thadguidry avatar Jan 11 '25 04:01 thadguidry