
Very long property names in infobox-properties dataset

Open LorenzBuehmann opened this issue 5 years ago • 2 comments

Hi, not sure if this is intended, but some properties in the infobox-properties dataset are quite long. And by long I mean very long ...

Dataset: http://dbpedia-generic.tib.eu/release/generic/infobox-properties/2019.10.01/infobox-properties_lang=en.ttl.bz2

bzcat infobox-properties_lang=en.ttl.bz2 | awk -F " " '{ print $2 }' | sort -u | awk '{ print length, $0 }' | sort -n | cut -d" " -f2- > preds_sorted.txt

The longest properties, shown with tail preds_sorted.txt, are:

<http://dbpedia.org/property/fernandoCoronilAndIStudiedInTheSameElementarySchoolInCaracas,Venezuela.ThisWas%22colegioAmérica%22InTheSectionSanBernardinoInCaracas.ThisSchoolDoesn'tExistAnymoreSinceSeveralDecades>
<http://dbpedia.org/property/theTrueSelfIsItselfJustThatPureConsciousness,WithoutWhichNothingCanBeKnownInAnyWay.(...)AndThatSameTrueSelf,PureConsciousness,IsNotDifferentFromTheUltimateWorldPrinciple,Brahman&nbsp;(...)Brahman(%3Cnowiki%3E_>
<http://dbpedia.org/property/%22Cis2%5Ctimes2/3%7BB8(AGis)%7DCis(E)Cis4%5Ctimes2/3%7BC8(DA)%7D%5Ctimes2/3%7BB(Cis%3FGis)%7D%5Ctimes2/3%7BA%5Cdim(Bis%5C!Dis%3F%7D%5Ctimes2/3%7BEisFisA)%7DGis4(Fis)%7D%3C/score%3E;excerpt11(violin)%3CscoreVorbis>
<http://dbpedia.org/property/''borderBreak''*FiscalYearEnded31March2010¥3.3&nbsp;billion*FiscalYearEnded31March2011¥2.5&nbsp;billion*FiscalYearEnded31March2012¥2.3&nbsp;billion*1stQuarterEnded30June2012¥0.5&nbsp;billion*CurrencyConversion**¥3.3Billion>
<http://dbpedia.org/property/''worldClubChampionFootballIntercontinentalClubs''*FiscalYearEnded31March2010¥4.2Billion*FiscalYearEnded31March2011¥3.8&nbsp;billion*FiscalYearEnded31March2012¥3.6&nbsp;billion*1stQuarterEnded30June2012¥0.5&nbsp;billion*CurrencyConversion**¥4.2Billion>
<http://dbpedia.org/property/vagueUseOfTermsLeadsToMistakes.TheTypeSite,Tul.Gh.,ShowsContinuityBetweenItsOwnLateNeolithicAndEarlyChalcolithicPhases.ThisDoesNotMeanThatThePhase/culture%22ghassulian%22,NamedAfterTheSite,IsIdenticalWithTheEntiretyOfTheLevantineChalcolithic,SoItsDatesShouldBeBasedOnAllGhassulianSites,NotJustT.Gh.ResultAStartingDateOf%22mid5m%22_>
<http://dbpedia.org/property/gomez&Silk%22thisSamadhiIsAtTheSameTimeTheCognitiveExperienceOfEmptiness,TheAttainmentOfTheAttributesOfBuddhahood,AndThePerformanceOfAVarietyOfPracticesOrDailyActivitiesOfABodhisattva—includingServiceAndAdorationAtTheFeetOfAllBuddhas.TheWordSamadhiIsAlsoUsedToMeanTheSūtraItself.Consequently,WeCanSpeakOfAnEquation,Sūtra%3Cnowiki%3E_>
<http://dbpedia.org/property/*''sukherAsukh''(2008)*''samudrajol''(2009)*''karoKonoNeetiNai''(2009)*''premomoyMriyoman''(2010)*''maanabJamin''(2010)*''achenaManush''(2010)*''sabujNakshotro''(2010)*''rumali''(2011)*''rongBerong''(2011)*''noProblem''(2011)*''dulchhePendulum''(2011)*''aamarBariTomarBar''(2011)*''ekPoloke''(2012),Ridom*''swapnoguloIchchemoto''(2012)*''phul+Pori%3Cnowiki%3E_>
<http://dbpedia.org/property/sparham%22tsongkhapaDoesNotAcceptSvātantra(“autonomous”)Reasoning(theFourthPoint).HeAssertsThatItIsEnough,WhenProvingThatAnyGivenSubjectIsEmptyOfIntrinsicExistence,ToLeadTheInterlocutor,ThroughReasoning,ToTheUnwelcomeConsequences(prasaṅga)InTheirOwnUntenablePosition;ItIsNotNecessaryToDemonstrateTheThesisBasedOnReasoningThatPresupposesAnySortOfIntrinsic(%3Cnowiki%3E_>
<http://dbpedia.org/property/nevertheless,AccordingToBasuEtAl.(2016),TheAaaWereEarlySettlersInIndia,RelatedToTheAsi%22theAbsenceOfSignificantResemblanceWithAnyOfTheNeighboringPopulationsIsIndicativeOfTheAsiAndTheAaaBeingEarlySettlersInIndia,PossiblyArrivingOnThe“southernExit”WaveOutOfAfrica.DifferentiationBetweenTheAsiAndTheAaaPossiblyTookPlaceAfterTheirArrivalInIndia(admixtureAnalysisWithK%3Cnowiki%3E_>

I also tried with the latest (cleaned?) dataset available from the DBpedia account: https://databus.dbpedia.org/dbpedia/generic/infobox-properties/2019.08.30 and the result is the same.

So, is this intended?

LorenzBuehmann, Nov 05 '19 09:11

@LorenzBuehmann I'm not sure about this bug. In general, it is too much work to fix everything in generic; that is why there are mappings. I'm not sure we would prioritise fixing this, because in principle all kinds of anomalies can produce some junk.

How prevalent is this?
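One way to put a number on that is to count how many distinct predicates exceed a length threshold and how many triples use them. The following is a minimal sketch, not part of the extraction framework: the function names, the 255-byte threshold, and the simple whitespace split (the same assumption the awk command above makes, namely one triple per line with the predicate as the second field) are all assumptions for illustration.

```python
import bz2
from collections import Counter
from typing import Dict, Iterable, Iterator


def predicates(lines: Iterable[str]) -> Iterator[str]:
    """Yield the predicate URI (second whitespace-separated field) of each triple line."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment lines
        parts = line.split(None, 2)
        if len(parts) == 3:
            # Drop the enclosing angle brackets of the N-Triples term.
            yield parts[1].strip("<>")


def long_predicate_stats(lines: Iterable[str], threshold: int = 255) -> Dict[str, int]:
    """Count predicates whose UTF-8 encoded URI is longer than `threshold` bytes."""
    usage = Counter(predicates(lines))
    long_usage = {p: n for p, n in usage.items() if len(p.encode("utf-8")) > threshold}
    return {
        "distinct_predicates": len(usage),
        "distinct_long": len(long_usage),
        "triples": sum(usage.values()),
        "triples_with_long": sum(long_usage.values()),
    }


def stats_for_dump(path: str, threshold: int = 255) -> Dict[str, int]:
    """Stream a bz2-compressed dump from disk and compute the same statistics."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return long_predicate_stats(handle, threshold)
```

Calling stats_for_dump("infobox-properties_lang=en.ttl.bz2") on the downloaded file would stream it once. The Counter holds every distinct predicate in memory, which should be acceptable for a predicate vocabulary but is worth keeping in mind.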

kurzum, Nov 05 '19 15:11

Not really important for me; I just came across this when I vertically partitioned the triples by predicate into Apache Parquet files on HDFS, which has a default maximum file name length of 255. I was just surprised by the error, because I never expected such a long property URI.
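For anyone hitting the same limit, one possible workaround is to derive the per-predicate file name from the URI and fall back to a truncated name plus a hash when it would be too long. This is only a sketch under assumptions: partition_name and MAX_NAME_BYTES are hypothetical names, the naming scheme is not what my job actually used, and the 255 limit is taken from the default mentioned above.

```python
import hashlib
from urllib.parse import quote

MAX_NAME_BYTES = 255  # assumed limit, taken from the default mentioned in this thread


def partition_name(predicate_uri: str, max_bytes: int = MAX_NAME_BYTES) -> str:
    """Map a predicate URI to a file name of at most `max_bytes` bytes.

    Hypothetical helper: short URIs are percent-encoded as-is, long ones are
    truncated and suffixed with a hash of the full URI.
    """
    # Percent-encode everything except letters, digits and "_.-~", so the
    # result is pure ASCII (one byte per character) and contains no "/" or ":".
    safe = quote(predicate_uri, safe="")
    if len(safe) <= max_bytes:
        return safe
    # Too long: keep a readable prefix and append 16 hex digits of SHA-256.
    digest = hashlib.sha256(predicate_uri.encode("utf-8")).hexdigest()[:16]
    keep = max_bytes - len(digest) - 1  # one byte for the "-" separator
    return f"{safe[:keep]}-{digest}"
```

For long URIs the mapping is one-way, since the truncation can cut through a percent escape, so a separate lookup table from file name to full URI would be needed to get the predicate back. The shortened hash also makes collisions very unlikely rather than impossible.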

So you can mark it as "minor bug" or even "won't fix". But at least you could track such cases; I'm not sure what others might do with property URIs in general.

LorenzBuehmann, Nov 06 '19 07:11