kgx

Ontology.json => ontology_nodes.tsv ; ontology_edges.tsv

Open hrshdhgd opened this issue 4 years ago • 5 comments

We're trying to update package dependencies from v0.5.0 to v1.1.0 (kg-covid-19; kg-microbe). For knowledge graph generation from JSON, as shown in the title, the old version used PandasTransformer and ObographJsonTransformer. I went through the KGX code and assumed the change involves switching to a generic Transformer class and using the methods in that class. Here's the old code snippet.

Here's the new code snippet I'm trying to develop, which could be entirely incorrect. Any help understanding what we are missing would be greatly appreciated. Thanks!

hrshdhgd avatar Jun 07 '21 19:06 hrshdhgd

UPDATE: I changed the code to use CLI utilities in KGX and this did the trick!


import os

from kgx.cli.cli_utils import transform

transform(inputs=[data_file], input_format='obojson', output=os.path.join(self.output_dir, name), output_format='tsv')

Was my previous approach incorrect? What would be the correct replacement of code if we decide to take the Transformer() approach?

hrshdhgd avatar Jun 07 '21 21:06 hrshdhgd
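
For the Transformer() question above, here is a minimal sketch of what the equivalent call might look like. It reuses the Transformer(stream=False) and t.transform(input_args=..., output_args=...) calls that appear elsewhere in this thread. The 'filename' and 'format' dictionary keys are assumptions based on the KGX 1.x documentation and should be checked against the installed version. The helper names build_transform_args and obojson_to_tsv are hypothetical.

```python
import os


def build_transform_args(data_file, output_dir, name):
    """Build the argument dicts passed to Transformer.transform().

    Assumption: the 'filename' and 'format' keys follow the KGX 1.x docs;
    verify them against the installed KGX version.
    """
    input_args = {"filename": [data_file], "format": "obojson"}
    output_args = {"filename": os.path.join(output_dir, name), "format": "tsv"}
    return input_args, output_args


def obojson_to_tsv(data_file, output_dir, name):
    # Imported lazily so build_transform_args() works without KGX installed.
    from kgx.transformer import Transformer

    input_args, output_args = build_transform_args(data_file, output_dir, name)
    # stream=False reads the whole graph into memory before writing it out.
    t = Transformer(stream=False)
    t.transform(input_args=input_args, output_args=output_args)
```

Keeping the argument construction separate from the KGX call makes it easy to print and inspect the dictionaries when a transform runs without producing output.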

I'm having this problem too - when I try to transform rdf/xml to TSV here:

        t = Transformer(stream=False)
        t.transform(input_args=input_args, output_args=output_args)

the transform runs without errors, but there is no output. (If I switch to stream=True I get just the headers, but no other output.)

Adding: possibly my problem is at least partly because rdf/xml is no longer supported?

justaddcoffee avatar Jun 08 '21 22:06 justaddcoffee

@hrshdhgd Is this problem still there? I see you found a way to parse by using the cli.transform method.

@justaddcoffee RDF/XML was dropped because in KGX 0.x we parsed RDF/XML via RdfTransformer, which would read the entire file into memory and then translate the RDF graph to a NetworkX graph.

In the KGX 1.x refactor, the architecture changed to support streaming. It is not feasible to add streaming support for parsing RDF/XML (unless we rely on XML event parsers, which is less than ideal :) ).

There are two approaches to take here:

  • KGX supports parsing of all RDF serializations as before, with the caveat that this parsing happens via rdflib and is thus memory-intensive.
  • Alternatively, you could convert the RDF/XML to RDF NT, which can then be supplied to KGX just like the other files.

# snippet for parsing RDF/XML and writing RDF NT
import rdflib
g = rdflib.Graph()
g.parse('/Users/unni/Downloads/lifted-go-cams-20200619.xml')
g.serialize(destination='/Users/unni/Downloads/lifted-go-cams-20200619.nt', format='nt')
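
The second approach can be sketched end to end: convert with rdflib, then hand the N-Triples file to the same kgx.cli.cli_utils.transform function used earlier in this thread. This is only a sketch under assumptions: 'nt' is assumed to be the input_format name KGX uses for N-Triples, and the helper names nt_path_for and rdfxml_to_kgx_tsv are hypothetical.

```python
from pathlib import Path


def nt_path_for(xml_path):
    """Return the path of the N-Triples file written next to the RDF/XML input."""
    return Path(xml_path).with_suffix(".nt")


def rdfxml_to_kgx_tsv(xml_path, output):
    # Lazy imports: rdflib and kgx are third-party packages.
    import rdflib
    from kgx.cli.cli_utils import transform

    nt_path = nt_path_for(xml_path)
    g = rdflib.Graph()
    # rdflib loads the whole RDF/XML file into memory here.
    g.parse(str(xml_path), format="xml")
    g.serialize(destination=str(nt_path), format="nt")

    # Assumption: 'nt' is the format name KGX expects for N-Triples input.
    transform(inputs=[str(nt_path)], input_format="nt",
              output=output, output_format="tsv")
```

The conversion step still needs enough memory to hold the full RDF graph; only the KGX step reads the resulting line-based NT file.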

deepakunni3 avatar Jul 25 '21 07:07 deepakunni3

If it has been fixed, I am not aware of it.

hrshdhgd avatar Jul 26 '21 13:07 hrshdhgd

Alternatively, you could convert the RDF/XML to RDF NT which then can be supplied to KGX just like the other files.

Thanks @deepakunni3 - that's just what I did, so it's good to know this was sensible. I did the conversion from xml -> nt manually via the KGX CLI and put the result in our S3 bucket, and we are ingesting it from there:

https://github.com/Knowledge-Graph-Hub/kg-covid-19/blob/8c9150f6353841b9479fe2eac11673d4b117a291/download.yaml#L164

justaddcoffee avatar Jul 26 '21 17:07 justaddcoffee