extraction-framework
Very long property names in infobox-properties dataset
Hi, I'm not sure if this is intended, but some property names in the infobox-properties dataset are very long.
Dataset: http://dbpedia-generic.tib.eu/release/generic/infobox-properties/2019.10.01/infobox-properties_lang=en.ttl.bz2
bzcat infobox-properties_lang=en.ttl.bz2 | awk -F " " '{ print $2 }' | sort -u | awk '{ print length, $0 }' | sort -n | cut -d" " -f2- > preds_sorted.txt
The longest properties, as shown by tail preds_sorted.txt, are:
<http://dbpedia.org/property/fernandoCoronilAndIStudiedInTheSameElementarySchoolInCaracas,Venezuela.ThisWas%22colegioAmérica%22InTheSectionSanBernardinoInCaracas.ThisSchoolDoesn'tExistAnymoreSinceSeveralDecades>
<http://dbpedia.org/property/theTrueSelfIsItselfJustThatPureConsciousness,WithoutWhichNothingCanBeKnownInAnyWay.(...)AndThatSameTrueSelf,PureConsciousness,IsNotDifferentFromTheUltimateWorldPrinciple,Brahman (...)Brahman(%3Cnowiki%3E_>
<http://dbpedia.org/property/%22Cis2%5Ctimes2/3%7BB8(AGis)%7DCis(E)Cis4%5Ctimes2/3%7BC8(DA)%7D%5Ctimes2/3%7BB(Cis%3FGis)%7D%5Ctimes2/3%7BA%5Cdim(Bis%5C!Dis%3F%7D%5Ctimes2/3%7BEisFisA)%7DGis4(Fis)%7D%3C/score%3E;excerpt11(violin)%3CscoreVorbis>
<http://dbpedia.org/property/''borderBreak''*FiscalYearEnded31March2010¥3.3 billion*FiscalYearEnded31March2011¥2.5 billion*FiscalYearEnded31March2012¥2.3 billion*1stQuarterEnded30June2012¥0.5 billion*CurrencyConversion**¥3.3Billion>
<http://dbpedia.org/property/''worldClubChampionFootballIntercontinentalClubs''*FiscalYearEnded31March2010¥4.2Billion*FiscalYearEnded31March2011¥3.8 billion*FiscalYearEnded31March2012¥3.6 billion*1stQuarterEnded30June2012¥0.5 billion*CurrencyConversion**¥4.2Billion>
<http://dbpedia.org/property/vagueUseOfTermsLeadsToMistakes.TheTypeSite,Tul.Gh.,ShowsContinuityBetweenItsOwnLateNeolithicAndEarlyChalcolithicPhases.ThisDoesNotMeanThatThePhase/culture%22ghassulian%22,NamedAfterTheSite,IsIdenticalWithTheEntiretyOfTheLevantineChalcolithic,SoItsDatesShouldBeBasedOnAllGhassulianSites,NotJustT.Gh.ResultAStartingDateOf%22mid5m%22_>
<http://dbpedia.org/property/gomez&Silk%22thisSamadhiIsAtTheSameTimeTheCognitiveExperienceOfEmptiness,TheAttainmentOfTheAttributesOfBuddhahood,AndThePerformanceOfAVarietyOfPracticesOrDailyActivitiesOfABodhisattva—includingServiceAndAdorationAtTheFeetOfAllBuddhas.TheWordSamadhiIsAlsoUsedToMeanTheSūtraItself.Consequently,WeCanSpeakOfAnEquation,Sūtra%3Cnowiki%3E_>
<http://dbpedia.org/property/*''sukherAsukh''(2008)*''samudrajol''(2009)*''karoKonoNeetiNai''(2009)*''premomoyMriyoman''(2010)*''maanabJamin''(2010)*''achenaManush''(2010)*''sabujNakshotro''(2010)*''rumali''(2011)*''rongBerong''(2011)*''noProblem''(2011)*''dulchhePendulum''(2011)*''aamarBariTomarBar''(2011)*''ekPoloke''(2012),Ridom*''swapnoguloIchchemoto''(2012)*''phul+Pori%3Cnowiki%3E_>
<http://dbpedia.org/property/sparham%22tsongkhapaDoesNotAcceptSvātantra(“autonomous”)Reasoning(theFourthPoint).HeAssertsThatItIsEnough,WhenProvingThatAnyGivenSubjectIsEmptyOfIntrinsicExistence,ToLeadTheInterlocutor,ThroughReasoning,ToTheUnwelcomeConsequences(prasaṅga)InTheirOwnUntenablePosition;ItIsNotNecessaryToDemonstrateTheThesisBasedOnReasoningThatPresupposesAnySortOfIntrinsic(%3Cnowiki%3E_>
<http://dbpedia.org/property/nevertheless,AccordingToBasuEtAl.(2016),TheAaaWereEarlySettlersInIndia,RelatedToTheAsi%22theAbsenceOfSignificantResemblanceWithAnyOfTheNeighboringPopulationsIsIndicativeOfTheAsiAndTheAaaBeingEarlySettlersInIndia,PossiblyArrivingOnThe“southernExit”WaveOutOfAfrica.DifferentiationBetweenTheAsiAndTheAaaPossiblyTookPlaceAfterTheirArrivalInIndia(admixtureAnalysisWithK%3Cnowiki%3E_>
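For anyone who wants to reproduce the listing without the shell pipeline, here is a minimal Python sketch of the same idea: read the bzip2-compressed dump, take the second space-separated term of each line as the predicate, deduplicate, and report the longest ones. The function names are made up for illustration, and the sketch assumes one triple per line with terms separated by a single ASCII space.

```python
import bz2
import heapq


def predicate_of(line):
    """Return the second term of an N-Triples-style line, or None."""
    if line.startswith("#"):
        return None  # comment line, not a triple
    # Split on the ASCII space only; str.split() without arguments would
    # also split on other Unicode whitespace inside a URI.
    parts = line.split(" ", 2)
    if len(parts) < 3:
        return None
    return parts[1]


def longest_predicates(lines, n=10):
    """Return the n longest distinct predicates, longest first."""
    unique = {p for p in map(predicate_of, lines) if p}
    return heapq.nlargest(n, unique, key=len)


def longest_predicates_in_dump(path, n=10):
    """Stream a .ttl.bz2 dump and return its n longest predicates."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        return longest_predicates(fh, n)
```

Calling longest_predicates_in_dump("infobox-properties_lang=en.ttl.bz2") should give roughly what tail preds_sorted.txt shows, in reverse order. Note that len counts characters here, which may rank URIs with non-ASCII characters differently from a byte-based length.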
I also tried the latest (cleaned?) dataset available from the DBpedia account: https://databus.dbpedia.org/dbpedia/generic/infobox-properties/2019.08.30 The result is the same.
So, is this intended?
@LorenzBuehmann I'm not sure this counts as a bug. In general, it is too much work to fix everything in the generic extraction; that is why there are mappings. I'm not sure we would prioritise fixing this.
Reason:
In principle, any anomaly in the source can produce some junk.
How prevalent is this?
It's not really important for me. I just came across it while vertically partitioning the triples by predicate into Apache Parquet files on HDFS, which has a default maximum file name length of 255. I was surprised by the error because I never expected such long property URIs.
So you can mark it as "minor bug" or even "won't fix". But you could at least track these things; I'm not sure what others might do with property URIs in general.
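For others who hit the same file name limit when partitioning by predicate, one possible workaround on the consumer side is to derive a bounded file name from each property URI. The following is only a sketch: partition_name and MAX_NAME_BYTES are hypothetical names, the limit of 255 is taken from the default mentioned above, and the use of percent-encoding plus a SHA-1 suffix is my own choice, not something DBpedia or HDFS prescribes.

```python
import hashlib
from urllib.parse import quote

# Assumed limit, taken from the default file name length mentioned above.
MAX_NAME_BYTES = 255


def partition_name(predicate_uri, max_bytes=MAX_NAME_BYTES):
    """Map a predicate URI to a file name of at most max_bytes bytes."""
    # Percent-encode everything except unreserved characters, so the
    # result is pure ASCII and contains no "/" or ":".
    encoded = quote(predicate_uri, safe="")
    if len(encoded) <= max_bytes:
        return encoded
    # Too long: keep a readable prefix and append a hash of the full URI
    # so that distinct long URIs still map to distinct names.
    digest = hashlib.sha1(predicate_uri.encode("utf-8")).hexdigest()
    keep = max_bytes - len(digest) - 1
    return encoded[:keep] + "_" + digest
```

Short URIs remain reversible via percent-decoding; truncated ones do not, so a separate lookup table from file name to original URI would be needed for those.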