kglab Refactor KnowledgeGraph class

Currently KnowledgeGraph class is implemented in ~1000 lines of code. To improve design, readability and accessibility we should consider refactoring. I would suggest considering composition by grouping the different methods in components classes like:

class Loaders:
    ...
    def load_rdf(...):
        ...
    def load_jsonld(...):
       ...

class KnowledgeGraph:
    def __init__(...):
        self.loaders = Loaders()
    ...

So that the methods can be moved to different files like kglab/loaders.py and have a clearer layout of the code.

After refactoring the calls would look like:

import kglab

kg = KnowledgeGraph()
kg.loaders.load_rdf(...)

Mar 11 '22 10:03 Mec-iS

see tests modules in #238 for a proposal how to break-down the class into: "loaders", "savers", "skos_owl", "utils", "querying"

Mar 11 '22 10:03 Mec-iS

Currently the KnowledgeGraph class implements the following methods:

__init__
rdf_graph
add_ns
get_ns
get_ns_dict
describe_ns
get_context
encode_date
add
remove
load_rdf
load_rdf_text
save_rdf
save_rdf_text
load_jsonld
save_jsonld
load_parquet
save_parquet
load_csv
materialize
import_roam
query
query_as_df
visualize_query
n3fy
n3fy_row
validate
infer_owlrl_closure
infer_rdfs_closure
infer_rdfs_properties
infer_rdfs_classes
infer_skos_related
infer_skos_concept
infer_skos_hierarchical
infer_skos_transitive
infer_skos_symmetric_mappings
infer_skos_hierarchical_mappings

I agree with @Mec-iS that this can be refactored somewhat. However, I don't see the benefit in grouping e.g. load_rdf and load_jsonld into the same Loaders class. Personally, I would go for something along the lines of:

### In kglab/io/base.py
from abc import ABC, abstractmethod

class IO(ABC):
    @classmethod
    @abstractmethod
    def load(self, *args, **kwargs) -> "KnowledgeGraph":
        pass

    @classmethod
    @abstractmethod
    def save(self, kg, *args, **kwargs) -> None:
        pass

### In kglab/io/rdf.py
class RDF_IO(IO):
    @classmethod
    def load (
        path: IOPathLike,
        *,
        format: str = "ttl",
        base: str = None,
        **args: typing.Any,
        ) -> "KnowledgeGraph":
        ...

    @classmethod
    def save (
        kg,
        path: IOPathLike,
        *,
        format: str = "ttl",
        base: str = None,
        encoding: str = "utf-8",
        **args: typing.Any,
        ) -> None:

### In kglab/io/jsonld.py
class JSON_LD_IO(IO):
    @classmethod
    def load (
        path: IOPathLike,
        *,
        encoding: str = "utf-8",
        **args: typing.Any,
        ) -> "KnowledgeGraph":
        ...

    @classmethod
    def save (
        kg,
        path: IOPathLike,
        *,
        encoding: str = "utf-8",
        **args: typing.Any,
        ) -> None:
...

### In kglab/kglab.py
from kglab.io import RDF_IO, JSON_LD_IO

class KnowledgeGraph:
    def __init__(self, ...):
        ...

    load_rdf = RDF_IO.load
    loadf_jsonld = JSON_LD_IO.load
    ...
    save_rdf = RDF_IO.save
    save_jsonld = JSON_LD_IO.save
    ...

Then you can easily open up this work to open-source developers, as you can make it very clear that all it takes for someone to implement a new datatype is to implement the two class methods in the IO abstract class. This means the developer doesn't need to spit through and understand the 1500 lines in kglab.py to understand what to add/edit in order to get a new datatype implemented. Beyond that, the io classes are separated in a folder specifically for reading and writing, and we don't have the the issue that kglab.py is completetly filled with docstrings for these read/write methods.

However, if Lorenzo's method is preferred, then I would at least add backwards compatibility so that kg.load_rdf(...) is still possible, and simply becomes kg.loaders.load_rdf() under the hood.

Beyond that, I would consider that moving the implementation of a method elsewhere does not necessarily make the file more readable. For example, the SPARQL query methods make up ~140 lines or so, but only ~30 or so are the implementation. The rest is newlines, the function signature and the docstrings. If you only move the implementation (i.e. move it to another file, and import some function so the new implementation is only one line in kglab.py), then you won't actually gain any readability. Plus, you will have to duplicate the signature and docstring in that case. However, you can use the "hack" from my code snippets above to actually gain readability, as that also moves the signature and docstring as well.

Mar 11 '22 11:03 tomaarsen

@Mec-iS, @tomaarsen – this is really excellent. I've been worried about that module growing and growing...

A few points for consideration based on what's coming up:

We're preparing for a 1.0.x release and need to keep backwards compatibility until then at least
- Let's pursue the test structure that Lorenzo has started, using what Tom suggested above leveraging @abstractmethod then assigning to the base class
We shouldn't divide these methods based on functionality; instead based on integration
- to Tom's point, this way developers who come from any particular community don't have to "hunt" (like in, say, RDFlib) to find relevant integration points
- in lieu of loaders.* and storers.* subpaths, instead have an io subpath with modules w3c, parquet, jsonld
The parquet support will become a focal point for integration work in 1.0.x, so those two methods should be handled very gently :)
- I'm under NDA at NVIDIA plus some other companies involved, while our discussions here are public, so it's probably best to just preview by saying we'll soon have much tighter integration with Dask Distributed, Apache Arrow, NumPy / cuNumeric, and NetworkX / cuGraph
We'll soon have a "dual" mode for support of W3C technologies (e.g., RDF, SPARQL, SHACL, SKOS, RML, etc.) and for labeled property graphs (e.g., openCypher, APOC, etc.)
- much of this involves a newgraph.py module which is an RDFlib Storage plugin (Lorenzo: pure Py this time, not Cython)
PSL is about ready to become much more closely coupled with RDF predicates and Cypher relations

We also need to approach refactoring both subg and topo modules in a similar way.

Mar 11 '22 22:03 ceteri

Here's how I propose to divide these public methods into subpaths:

kglab core:

rdf_graph
add_ns
get_ns
get_ns_dict
describe_ns
get_context
encode_date
add
remove

io serialization (each abstract class has load and store):

parquet
- load_parquet
- save_parquet
w3c
- load_rdf
- load_rdf_text
- save_rdf
- save_rdf_text
jsonld TODO: can be deprecated since RDFlib absorbed rdflib-jsonld
- load_jsonld
- save_jsonld

import practices (which are either semantic data integration or weird, special cases – definitely, let's keep these out of io):

import_roam becomes roam
external_import.import_from_neo4j becomes neo4j TODO: refactor to move over here
materialize becomes rml
load_csv becomes csvw

query:

sparql
- query TODO rename as sparql_query
- query_as_df TODO rename as sparql_query_as_df
- n3fy
- n3fy_row
- visualize_query

w3c standards:

validate becomes validate_shacl (SHACL)
infer_owlrl_closure (OWL-RL)
infer_rdfs_closure (OWL-RL)
infer_rdfs_properties (OWL-RL)
infer_rdfs_classes (OWL-RL)
infer_skos_related (SKOS)
infer_skos_concept (SKOS)
infer_skos_hierarchical (SKOS)
infer_skos_transitive (SKOS)
infer_skos_symmetric_mappings (SKOS)
infer_skos_hierarchical_mappings (SKOS)

plus:

subgraph
topology

Mar 11 '22 22:03 ceteri

What great detail! I love it. I'm a tad busy, but if I have some time then I'll see if there's some room for my help - whether with commits or reviews.

Mar 13 '22 21:03 tomaarsen

@tomaarsen and @Mec-iS, one quick question:

In the @abstractmethod examples above these are @classmethod definitions. What's a good approach for method which will require instance member, e.g., access to kglib.KnowledgeGraph._g and others?

I've started on the query portion to see how this would look, and the WIP is in the base_class_refactor branch:

https://github.com/DerwenAI/kglab/blob/1f83702a6f369352a3c85e720a512a0913ee550a/kglab/kglab.py#L143
https://github.com/DerwenAI/kglab/blob/base_class_refactor/kglab/query/base.py
https://github.com/DerwenAI/kglab/blob/base_class_refactor/kglab/query/sparql.py

Mar 28 '22 02:03 ceteri

sorry I missed this comment @ceteri. Great work as usual!

Do you have ideas on how to carry this on? Are we done with this or what is the next milestone?

Aug 30 '22 16:08 Mec-iS

Use of Mixins #262

Aug 31 '22 15:08 Mec-iS

#263

Sep 01 '22 17:09 Mec-iS

#265 and we should be ok for now.

Sep 01 '22 18:09 Mec-iS