kgtk
nodefile containing labels only for subjects
Describe the bug
When I use import-wikidata with a subdump of Wikidata, nodefile.tsv contains only the Qnodes in subject position. Pnodes and Qnodes in object position are missing, so I can't use the add-labels command for them.
I am not sure whether the problem is on my side (e.g. in how I generate the subdump), so I am including all the files needed to reproduce it.
To Reproduce
Steps to reproduce the behavior:
kgtk import-wikidata -i wikidata_object_types.bz2 --node objecttypes_nodefile.tsv --edge objecttypes_edgefile.tsv --qual objecttypes_qualfile.tsv --proc 64
files:
- subdump of wikidata (wikidata_object_types.bz2) I import: https://drive.google.com/file/d/16QxOVuReq3TGcwm7vZ4FsyyG2-X12xQo/view?usp=sharing
- generated nodefile: https://drive.google.com/file/d/1eUTWm8XtgUZJh6STAXy3cElt4bqOA0ew/view?usp=sharing
- examples of nodes (object position) in my wikidata subdump for which I cannot add the labels with the previous nodefile: https://drive.google.com/file/d/1UPbJDouwFQMbz8YCm5zA6EpoUUoyK1Ml/view?usp=sharing
- examples of nodes (predicate position) in my wikidata subdump for which I cannot add the labels with the previous nodefile: https://drive.google.com/file/d/1o1FDrMZaOvDKz4HXYbPsf7bVCp9oMzKv/view?usp=sharing
- examples of nodes (subject position) in my wikidata subdump for which I can add the labels with the previous nodefile: https://drive.google.com/file/d/1C6XEXmdVYk2idO8ZinUA6tgrvaEXrz6Z/view?usp=sharing
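The split between nodes that can and cannot be labeled can be checked directly against the generated nodefile. The following is a minimal sketch, assuming the nodefile is a tab-separated file with a header row whose node identifier column is named id; the helper name split_by_presence is hypothetical and not part of kgtk.

```python
import csv


def split_by_presence(nodefile_lines, wanted_ids, id_column="id"):
    """Split wanted_ids into (found, missing) according to the nodefile rows.

    nodefile_lines: any iterable of text lines, e.g. an open file object.
    id_column: assumed name of the identifier column in the header row.
    """
    # QUOTE_NONE keeps quote characters inside label values untouched.
    reader = csv.DictReader(nodefile_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    present = {row[id_column] for row in reader}
    found = [node_id for node_id in wanted_ids if node_id in present]
    missing = [node_id for node_id in wanted_ids if node_id not in present]
    return found, missing
```

It can be called with an open file, for example open("objecttypes_nodefile.tsv", encoding="utf-8", newline=""), and the list of node IDs to check.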
Expected behavior
A nodefile containing all Qnodes and Pnodes in my subdump of Wikidata.
Additional context
I'm using Python 3.9:
conda create -n kgtk-env39 python=3.9
conda activate kgtk-env39
conda install -c conda-forge graph-tool
pip install etk==2.2.8
pip --no-cache install -U kgtk
python -m spacy download en_core_web_sm
Hi @valecarriero ,
The nodefile will contain only those Qnodes/Pnodes for which there is a JSON object in the input file. I looked at the sample nodes you provided:
"Q3624078",
"Q43702",
"Q6256",
"Q20181813",
"Q185441",
"Q1250464",
"Q5107",
"Q82794",
"Q312461",
"Q11224256",
"P10",
"P1000",
"P10001",
"P10006",
"P10007",
"P10008",
"P1001",
"P10012",
"P10013",
"P10017"
There are no JSON objects for these nodes in the input file. Please check how you created the subdump, and include the JSON objects for the predicates and objects of the Qnodes in the dump.
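One way to check a subdump before importing it is to compare the entity IDs that have their own JSON object with the IDs that are only referenced as properties or as item values. The sketch below assumes the standard Wikidata JSON dump layout (a JSON array with one entity per line, each entity carrying id and claims fields). The function names are hypothetical and not part of kgtk, and only main statement values are inspected, not qualifiers or references.

```python
import bz2
import json


def find_missing_entities(lines):
    """Return sorted IDs that are referenced in claims but have no JSON object."""
    defined = set()
    referenced = set()
    for raw in lines:
        # Each entity line in the dump ends with a comma; the array brackets
        # sit on their own lines.
        line = raw.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue
        entity = json.loads(line)
        defined.add(entity["id"])
        for prop, statements in entity.get("claims", {}).items():
            referenced.add(prop)  # predicate position
            for statement in statements:
                datavalue = statement.get("mainsnak", {}).get("datavalue", {})
                value = datavalue.get("value")
                if (
                    datavalue.get("type") == "wikibase-entityid"
                    and isinstance(value, dict)
                    and "id" in value
                ):
                    referenced.add(value["id"])  # object position
    return sorted(referenced - defined)


def find_missing_in_dump(path):
    """Run the same check on a bz2-compressed dump file."""
    with bz2.open(path, "rt", encoding="utf-8") as dump:
        return find_missing_entities(dump)
```

Every ID this returns would need its own JSON object in the subdump in order to appear in the nodefile.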
Hi @saggu, thank you for the clarification! I realized only yesterday that this could be the explanation, and your answer confirms it. I didn't realize at first that what you say here https://kgtk.readthedocs.io/en/latest/import/import_wikidata/ "A nodes file containing all Qnodes and Pnodes in Wikidata" refers to import-wikidata applied to the whole of Wikidata, or to subdumps that include the JSON objects for all subjects, predicates and objects. It's clear now!
Thanks a lot, Valentina
@valecarriero, if you think we should add clarifications to the documentation, please let us know!