
nodefile containing labels only for subjects

Open valecarriero opened this issue 2 years ago • 3 comments

Describe the bug When I use import-wikidata with a subdump of Wikidata, nodefile.tsv contains only the Qnodes in subject position. Pnodes and Qnodes in object position are missing, so I can't use the add-labels command for them. I am not sure whether the bug is on my side (e.g. in how I generate the subdump), so I include all the files needed to reproduce this.

To Reproduce Steps to reproduce the behavior:

kgtk import-wikidata -i wikidata_object_types.bz2 --node objecttypes_nodefile.tsv --edge objecttypes_edgefile.tsv --qual objecttypes_qualfile.tsv --proc 64

files:

  1. subdump of wikidata (wikidata_object_types.bz2) I import: https://drive.google.com/file/d/16QxOVuReq3TGcwm7vZ4FsyyG2-X12xQo/view?usp=sharing
  2. generated nodefile: https://drive.google.com/file/d/1eUTWm8XtgUZJh6STAXy3cElt4bqOA0ew/view?usp=sharing
  3. examples of nodes (object position) in my wikidata subdump for which I cannot add the labels with the previous nodefile: https://drive.google.com/file/d/1UPbJDouwFQMbz8YCm5zA6EpoUUoyK1Ml/view?usp=sharing
  4. examples of nodes (predicate position) in my wikidata subdump for which I cannot add the labels with the previous nodefile: https://drive.google.com/file/d/1o1FDrMZaOvDKz4HXYbPsf7bVCp9oMzKv/view?usp=sharing
  5. examples of nodes (subject position) in my wikidata subdump for which I can add the labels with the previous nodefile: https://drive.google.com/file/d/1C6XEXmdVYk2idO8ZinUA6tgrvaEXrz6Z/view?usp=sharing

Expected behavior A nodefile containing all Qnodes and Pnodes in my subdump of Wikidata.

Additional context I'm using Python 3.9

conda create -n kgtk-env39 python=3.9
conda activate kgtk-env39
conda install -c conda-forge graph-tool
pip install etk==2.2.8
pip --no-cache install -U kgtk
python -m spacy download en_core_web_sm

valecarriero avatar Mar 15 '22 10:03 valecarriero

Hi @valecarriero ,

The nodefile will contain only those Qnodes/Pnodes for which there is a JSON object in the input file. I looked at the sample nodes you provided:

"Q3624078",
"Q43702",
"Q6256",
"Q20181813",
"Q185441",
"Q1250464",
"Q5107",
"Q82794",
"Q312461",
"Q11224256",
"P10",
"P1000",
"P10001",
"P10006",
"P10007",
"P10008",
"P1001",
"P10012",
"P10013",
"P10017"

There are no JSON objects for these nodes. Please check how you created the subdump, and include the JSON objects for the predicates and objects of the Qnodes in the dump.
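One way to run such a check before importing is sketched below. It is not part of kgtk: the function name `missing_entities` is hypothetical, and the code assumes the subdump follows the standard Wikidata JSON dump layout (a JSON array with one entity object per line, each entity carrying an `id` and a `claims` map keyed by property ID, and entity-valued statements exposing an `id` in their `datavalue`). It only looks at main statement values, not at qualifiers or references.

```python
import json


def missing_entities(lines):
    """Return IDs used as predicate or object that have no JSON object of their own.

    `lines` is any iterable of text lines in the (assumed) Wikidata JSON dump layout.
    """
    defined, referenced = set(), set()
    for raw in lines:
        line = raw.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # skip the array brackets and blank lines
        entity = json.loads(line)
        defined.add(entity["id"])
        for prop, statements in entity.get("claims", {}).items():
            referenced.add(prop)  # the predicate (Pnode)
            for statement in statements:
                value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
                # Entity-valued statements carry a dict with an "id" key;
                # strings, quantities, times, etc. are ignored.
                if isinstance(value, dict) and "id" in value:
                    referenced.add(value["id"])
    return sorted(referenced - defined)
```

For a bz2-compressed subdump, the file can be streamed with `bz2.open("wikidata_object_types.bz2", "rt", encoding="utf-8")` and passed straight to `missing_entities`. Any ID it returns would be absent from the generated nodefile, so no label could be added for it.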

saggu avatar Mar 21 '22 18:03 saggu

Hi @saggu, thank you for the clarification! I realized only yesterday that this could be the explanation, and your answer confirms it. At first I didn't realize that the statement at https://kgtk.readthedocs.io/en/latest/import/import_wikidata/, "A nodes file containing all Qnodes and Pnodes in Wikidata", refers to import-wikidata applied to the whole of Wikidata, or to subdumps that include the JSON objects for all subjects, predicates and objects. It's clear now!

Thanks a lot, Valentina

valecarriero avatar Mar 22 '22 08:03 valecarriero

@valecarriero, if you think we should add clarifications to the documentation, please let us know!

dgarijo avatar Mar 22 '22 09:03 dgarijo