BUG: Fix behaviour of MERGE_REFS_OR_APPEND when datavalue is a blank node

Open lubianat opened this issue 10 months ago • 10 comments

As a bot developer for Structured Data on Commons, I want to be able to use ActionIfExists.MERGE_REFS_OR_APPEND for cases where the value of the property is a blank node.

Some files on Wikimedia Commons use a modelling in which the value of a statement is the special "some value" (snaktype somevalue) rather than a concrete value, e.g.

image

From https://commons.wikimedia.org/wiki/File:Beitrag_zur_Flora_Brasiliens_(Pl.12)(8227161802).jpg

In this case, the claim exists, but has no "datavalue", which leads to an error on this line: https://github.com/LeMyst/WikibaseIntegrator/blob/c69b84a9623430431040f612b51c17795b16f137/wikibaseintegrator/models/claims.py#L108

Here is what the JSON for the datavalue-less claim looks like:

{'mainsnak': {'snaktype': 'somevalue', 'property': 'P170'},
 'type': 'statement',
 'id': 'M42778810$04DF04D0-6234-4915-8A96-EE352E6EF350',
 'rank': 'normal',
 'qualifiers': {'P3267': [{'snaktype': 'value', 'property': 'P3267', 'datavalue': {'value': '61021753@N02', 'type': 'string'}}],
                'P2093': [{'snaktype': 'value', 'property': 'P2093', 'datavalue': {'value': 'Biodiversity Heritage Library', 'type': 'string'}}],
                'P2699': [{'snaktype': 'value', 'property': 'P2699', 'datavalue': {'value': 'https://www.flickr.com/people/biodivlibrary/', 'type': 'string'}}]},
 'qualifiers-order': ['P3267', 'P2093', 'P2699']}

In this case, it is compared to:

{'mainsnak': {'snaktype': 'value',
              'property': 'P170',
              'datatype': 'wikibase-item',
              'datavalue': {'value': {'entity-type': 'item', 'numeric-id': 131760409, 'id': 'Q131760409'}, 'type': 'wikibase-entityid'}},
 'type': 'statement',
 'rank': 'normal',
 'qualifiers': {'P518': [{'snaktype': 'value', 'property': 'P518', 'datatype': 'wikibase-item', 'datavalue': {'value': {'entity-type': 'item', 'numeric-id': 112134971, 'id': 'Q112134971'}, 'type': 'wikibase-entityid'}}],
                'P3831': [{'snaktype': 'value', 'property': 'P3831', 'datatype': 'wikibase-item', 'datavalue': {'value': {'entity-type': 'item', 'numeric-id': 644687, 'id': 'Q644687'}, 'type': 'wikibase-entityid'}}]},
 'qualifiers-order': []}
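
For illustration, here is a minimal sketch of a main-value comparison that does not raise when one side has no "datavalue". It works on plain claim JSON dicts like the two above; `main_value` and `same_main_value` are hypothetical helper names, not part of WikibaseIntegrator, and treating two "somevalue" snaks on the same property as equal is an assumption that may not suit every bot.

```python
from typing import Any, Optional


def main_value(claim: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the mainsnak datavalue, or None for somevalue/novalue snaks."""
    mainsnak = claim.get("mainsnak", {})
    if mainsnak.get("snaktype") != "value":
        return None
    return mainsnak.get("datavalue")


def same_main_value(existing: dict[str, Any], new: dict[str, Any]) -> bool:
    """Compare property, snaktype and datavalue without a KeyError on blank nodes."""
    existing_snak = existing.get("mainsnak", {})
    new_snak = new.get("mainsnak", {})
    if existing_snak.get("property") != new_snak.get("property"):
        return False
    if existing_snak.get("snaktype") != new_snak.get("snaktype"):
        return False
    return main_value(existing) == main_value(new)


# Trimmed versions of the two claims shown above.
blank = {"mainsnak": {"snaktype": "somevalue", "property": "P170"}}
known = {
    "mainsnak": {
        "snaktype": "value",
        "property": "P170",
        "datatype": "wikibase-item",
        "datavalue": {
            "value": {"entity-type": "item", "numeric-id": 131760409, "id": "Q131760409"},
            "type": "wikibase-entityid",
        },
    }
}

print(same_main_value(blank, known))  # False, and no exception is raised
```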

I will try and prototype a solution here. Cheers!

By the way, this is the current code, still in development: https://github.com/lubianat/bhl_sdc_exploration/tree/main/reconciliation_bot

lubianat avatar Jan 16 '25 11:01 lubianat

Indeed, there is some issue with the comparison of qualifiers. See:

<Snak @fdbd40 _Snak__snaktype=<WikibaseSnakType.KNOWN_VALUE: 'value'> _Snak__property_number='P518' _Snak__hash=None _Snak__datavalue={'value': {'entity-type': 'item', 'numeric-id': 112134971, 'id': 'Q112134971'}, 'type': 'wikibase-entityid'} _Snak__datatype='wikibase-item'>

is different from

<Snak @515c70 _Snak__snaktype=<WikibaseSnakType.KNOWN_VALUE: 'value'> _Snak__property_number='P518' _Snak__hash='288716b1efb9e21850a034325ebeeb0089b4e2c2' _Snak__datavalue={'value': {'entity-type': 'item', 'numeric-id': 112134971, 'id': 'Q112134971'}, 'type': 'wikibase-entityid'} _Snak__datatype=None>

I am not exactly certain why, though, because the original statement in this case was written to Commons using the same code.

My current hypothesis is that `_Snak__datatype` is being retrieved from the Wikibase as "None".
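
To make the difference easier to see, here is a small diagnostic sketch. It uses plain dicts as stand-ins for the two Snak objects printed above (this is not how WikibaseIntegrator compares snaks internally), and `differing_fields` is a hypothetical helper name.

```python
def differing_fields(a: dict, b: dict) -> set:
    """Return the keys whose values differ, treating a missing key as None."""
    return {key for key in a.keys() | b.keys() if a.get(key) != b.get(key)}


value = {
    "value": {"entity-type": "item", "numeric-id": 112134971, "id": "Q112134971"},
    "type": "wikibase-entityid",
}

# Stand-in for the locally built snak (no hash yet, datatype set).
new_snak = {
    "snaktype": "value",
    "property": "P518",
    "hash": None,
    "datavalue": value,
    "datatype": "wikibase-item",
}

# Stand-in for the snak retrieved from Commons (hash set, datatype None).
retrieved_snak = {
    "snaktype": "value",
    "property": "P518",
    "hash": "288716b1efb9e21850a034325ebeeb0089b4e2c2",
    "datavalue": value,
    "datatype": None,
}

print(sorted(differing_fields(new_snak, retrieved_snak)))  # ['datatype', 'hash']
```

Only the hash and the datatype differ; the snaktype, property and datavalue are identical.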

lubianat avatar Jan 16 '25 13:01 lubianat

So, I was able to work around the main issue, but I am running into a few other bugs. I am sorry, I am not very knowledgeable about Wikibase internals.

In some part of the code, I test:

claim.quals_equal(claim, existing_claim):

This yields False for the following pair:

{'mainsnak': {'snaktype': 'value',
              'property': 'P1433',
              'datavalue': {'value': {'entity-type': 'item', 'numeric-id': 51446243, 'id': 'Q51446243'}, 'type': 'wikibase-entityid'}},
 'type': 'statement',
 'id': 'M42778917$DDA05289-CA80-471C-97FB-BC2E073F2B28',
 'rank': 'normal',
 'qualifiers': {'P518': [{'snaktype': 'value', 'property': 'P518',
                          'datavalue': {'value': {'entity-type': 'item', 'numeric-id': 112134971, 'id': 'Q112134971'}, 'type': 'wikibase-entityid'}}]},
 **'qualifiers-order': ['P518']**}

{'mainsnak': {'snaktype': 'value',
              'property': 'P1433',
              'datatype': 'wikibase-item',
              'datavalue': {'value': {'entity-type': 'item', 'numeric-id': 51446243, 'id': 'Q51446243'}, 'type': 'wikibase-entityid'}},
 'type': 'statement',
 'rank': 'normal',
 'qualifiers': {'P518': [{'snaktype': 'value', 'property': 'P518',
                          **'datatype': 'wikibase-item',**
                          'datavalue': {'value': {'entity-type': 'item', 'numeric-id': 112134971, 'id': 'Q112134971'}, 'type': 'wikibase-entityid'}}]},
 **'qualifiers-order': []**}

I could spot these two differences. It seems like the lack of the correct datatype was breaking things.

lubianat avatar Jan 16 '25 13:01 lubianat

I am trying some workarounds just to test whether I am thinking in the right direction. It seems so: I modified the qualifier check to ignore the datatype and compare only values (not great, but seemingly needed).
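
Roughly, the idea looks like the sketch below. It is a simplified illustration on plain claim JSON dicts, not the actual patch; `qualifier_values` and `quals_equal_ignoring_datatype` are hypothetical names, and duplicate qualifier snaks collapse because a set is used.

```python
import json


def qualifier_values(claim: dict) -> set:
    """Collect (property, snaktype, serialized datavalue) for every qualifier snak."""
    values = set()
    for prop, snaks in claim.get("qualifiers", {}).items():
        for snak in snaks:
            # Serialize with sorted keys so dict ordering does not matter.
            datavalue = json.dumps(snak.get("datavalue"), sort_keys=True)
            values.add((prop, snak.get("snaktype"), datavalue))
    return values


def quals_equal_ignoring_datatype(a: dict, b: dict) -> bool:
    """Compare qualifiers by value only; 'datatype' and 'qualifiers-order' are ignored."""
    return qualifier_values(a) == qualifier_values(b)


p518_value = {
    "value": {"entity-type": "item", "numeric-id": 112134971, "id": "Q112134971"},
    "type": "wikibase-entityid",
}

# Qualifier part of the claim retrieved from Commons (no datatype).
existing_claim = {
    "qualifiers": {"P518": [{"snaktype": "value", "property": "P518", "datavalue": p518_value}]},
    "qualifiers-order": ["P518"],
}

# Qualifier part of the locally built claim (datatype set, empty order).
new_claim = {
    "qualifiers": {
        "P518": [
            {
                "snaktype": "value",
                "property": "P518",
                "datatype": "wikibase-item",
                "datavalue": p518_value,
            }
        ]
    },
    "qualifiers-order": [],
}

print(quals_equal_ignoring_datatype(new_claim, existing_claim))  # True
```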

The same issue is happening with the references: the ones retrieved from Commons come without datatype.

Old item references:

{'snaks': {'P854': [{'snaktype': 'value', 'property': 'P854', 'datavalue': {'value': 'https://www.biodiversitylibrary.org/bibliography/909', 'type': 'string'}}]}, 'snaks-order': ['P854']}

New item references:

{'snaks': {'P854': [{'snaktype': 'value', 'property': 'P854', 'datatype': 'url', 'datavalue': {'value': 'https://www.biodiversitylibrary.org/bibliography/909', 'type': 'string'}}]}, 'snaks-order': []}

It might be something in the way MediaInfo is representing snaks, not sure.
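
One way to compare the two reference blocks despite this is to normalize both sides first. The following is only a sketch under the assumption that "datatype", "hash" and the order lists carry no information needed for the comparison; `IGNORED_KEYS` and `normalize` are hypothetical names, not WikibaseIntegrator API.

```python
# Keys on which the retrieved and the locally built JSON disagree (assumption).
IGNORED_KEYS = {"datatype", "hash", "snaks-order", "qualifiers-order"}


def normalize(node):
    """Recursively drop the ignored keys from nested dicts and lists."""
    if isinstance(node, dict):
        return {k: normalize(v) for k, v in node.items() if k not in IGNORED_KEYS}
    if isinstance(node, list):
        return [normalize(item) for item in node]
    return node


url_value = {"value": "https://www.biodiversitylibrary.org/bibliography/909", "type": "string"}

old_ref = {
    "snaks": {"P854": [{"snaktype": "value", "property": "P854", "datavalue": url_value}]},
    "snaks-order": ["P854"],
}
new_ref = {
    "snaks": {
        "P854": [
            {"snaktype": "value", "property": "P854", "datatype": "url", "datavalue": url_value}
        ]
    },
    "snaks-order": [],
}

print(old_ref == new_ref)                        # False
print(normalize(old_ref) == normalize(new_ref))  # True
```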

lubianat avatar Jan 16 '25 13:01 lubianat

Hello @lubianat, thank you for the issue, the analysis and the merge request. I will need some time to review this and merge everything into the main branch.

LeMyst avatar Jan 16 '25 18:01 LeMyst

@LeMyst Thank you! Do take your time; I am still working on figuring out the details here. Please do not merge any of my code: it is mostly garbage at this point, as I needed to quickly fix a bug. I will try to clean up the contributions.

lubianat avatar Jan 16 '25 19:01 lubianat

I tried to reproduce my own errors and fixes after a few hours and could not.

For some reason, I am unable to get the claims from the API, even for the test you shared:

Image

media.claims is empty.

I am going to take a break and retry tomorrow.

lubianat avatar Jan 16 '25 19:01 lubianat

I am investigating more. I am not sure why it was getting the claims before but not now (probably because I changed the version of WikibaseIntegrator without properly keeping track).

The bug I am seeing now is related to this: https://phabricator.wikimedia.org/T149410

lubianat avatar Jan 16 '25 20:01 lubianat

@lubianat do you have a test that exposes this particular issue on the latest release?

dpriskorn avatar Mar 09 '25 09:03 dpriskorn

@dpriskorn sorry, I did not do my homework well and did not document the details.

I kind of got one patch working for me and did not have the time to properly fix it upstream.

Are you also touching mediainfo?

Maybe this issue can be closed: it could very well have been a bug in my code, and not in WBI.

I know that I am using my fork and it works for me (and installing from pip does not).

This one, btw: https://github.com/lubianat/WikibaseIntegrator

lubianat avatar Mar 09 '25 12:03 lubianat

No, and I'm considering whether it would be a good idea to deprecate support for MediaInfo until it gets covered by a stable interface policy. See #840

dpriskorn avatar Mar 09 '25 12:03 dpriskorn