Refactor KnowledgeGraph class
Currently KnowledgeGraph class is implemented in ~1000 lines of code. To improve design, readability and accessibility we should consider refactoring. I would suggest considering composition by grouping the different methods in components classes like:
class Loaders:
...
def load_rdf(...):
...
def load_jsonld(...):
...
class KnowledgeGraph:
def __init__(...):
self.loaders = Loaders()
...
So that the methods can be moved to different files like kglab/loaders.py and have a clearer layout of the code.
After refactoring the calls would look like:
import kglab
kg = KnowledgeGraph()
kg.loaders.load_rdf(...)
see tests modules in #238 for a proposal how to break-down the class into: "loaders", "savers", "skos_owl", "utils", "querying"
Currently the KnowledgeGraph class implements the following methods:
- __init__
- rdf_graph
- add_ns
- get_ns
- get_ns_dict
- describe_ns
- get_context
- encode_date
- add
- remove
- load_rdf
- load_rdf_text
- save_rdf
- save_rdf_text
- load_jsonld
- save_jsonld
- load_parquet
- save_parquet
- load_csv
- materialize
- import_roam
- query
- query_as_df
- visualize_query
- n3fy
- n3fy_row
- validate
- infer_owlrl_closure
- infer_rdfs_closure
- infer_rdfs_properties
- infer_rdfs_classes
- infer_skos_related
- infer_skos_concept
- infer_skos_hierarchical
- infer_skos_transitive
- infer_skos_symmetric_mappings
- infer_skos_hierarchical_mappings
I agree with @Mec-iS that this can be refactored somewhat. However, I don't see the benefit in grouping e.g. load_rdf and load_jsonld into the same Loaders class. Personally, I would go for something along the lines of:
### In kglab/io/base.py
from abc import ABC, abstractmethod
class IO(ABC):
@classmethod
@abstractmethod
def load(self, *args, **kwargs) -> "KnowledgeGraph":
pass
@classmethod
@abstractmethod
def save(self, kg, *args, **kwargs) -> None:
pass
### In kglab/io/rdf.py
class RDF_IO(IO):
@classmethod
def load (
path: IOPathLike,
*,
format: str = "ttl",
base: str = None,
**args: typing.Any,
) -> "KnowledgeGraph":
...
@classmethod
def save (
kg,
path: IOPathLike,
*,
format: str = "ttl",
base: str = None,
encoding: str = "utf-8",
**args: typing.Any,
) -> None:
### In kglab/io/jsonld.py
class JSON_LD_IO(IO):
@classmethod
def load (
path: IOPathLike,
*,
encoding: str = "utf-8",
**args: typing.Any,
) -> "KnowledgeGraph":
...
@classmethod
def save (
kg,
path: IOPathLike,
*,
encoding: str = "utf-8",
**args: typing.Any,
) -> None:
...
### In kglab/kglab.py
from kglab.io import RDF_IO, JSON_LD_IO
class KnowledgeGraph:
def __init__(self, ...):
...
load_rdf = RDF_IO.load
loadf_jsonld = JSON_LD_IO.load
...
save_rdf = RDF_IO.save
save_jsonld = JSON_LD_IO.save
...
Then you can easily open up this work to open-source developers, as you can make it very clear that all it takes for someone to implement a new datatype is to implement the two class methods in the IO abstract class. This means the developer doesn't need to spit through and understand the 1500 lines in kglab.py to understand what to add/edit in order to get a new datatype implemented.
Beyond that, the io classes are separated in a folder specifically for reading and writing, and we don't have the the issue that kglab.py is completetly filled with docstrings for these read/write methods.
However, if Lorenzo's method is preferred, then I would at least add backwards compatibility so that kg.load_rdf(...) is still possible, and simply becomes kg.loaders.load_rdf() under the hood.
Beyond that, I would consider that moving the implementation of a method elsewhere does not necessarily make the file more readable. For example, the SPARQL query methods make up ~140 lines or so, but only ~30 or so are the implementation. The rest is newlines, the function signature and the docstrings. If you only move the implementation (i.e. move it to another file, and import some function so the new implementation is only one line in kglab.py), then you won't actually gain any readability. Plus, you will have to duplicate the signature and docstring in that case. However, you can use the "hack" from my code snippets above to actually gain readability, as that also moves the signature and docstring as well.
@Mec-iS, @tomaarsen – this is really excellent. I've been worried about that module growing and growing...
A few points for consideration based on what's coming up:
- We're preparing for a
1.0.xrelease and need to keep backwards compatibility until then at least- Let's pursue the test structure that Lorenzo has started, using what Tom suggested above leveraging
@abstractmethodthen assigning to the base class
- Let's pursue the test structure that Lorenzo has started, using what Tom suggested above leveraging
- We shouldn't divide these methods based on functionality; instead based on integration
- to Tom's point, this way developers who come from any particular community don't have to "hunt" (like in, say,
RDFlib) to find relevant integration points - in lieu of
loaders.*andstorers.*subpaths, instead have aniosubpath with modulesw3c,parquet,jsonld
- to Tom's point, this way developers who come from any particular community don't have to "hunt" (like in, say,
- The
parquetsupport will become a focal point for integration work in1.0.x, so those two methods should be handled very gently :)- I'm under NDA at NVIDIA plus some other companies involved, while our discussions here are public, so it's probably best to just preview by saying we'll soon have much tighter integration with Dask Distributed, Apache Arrow,
NumPy/cuNumeric, andNetworkX/cuGraph
- I'm under NDA at NVIDIA plus some other companies involved, while our discussions here are public, so it's probably best to just preview by saying we'll soon have much tighter integration with Dask Distributed, Apache Arrow,
- We'll soon have a "dual" mode for support of W3C technologies (e.g., RDF, SPARQL, SHACL, SKOS, RML, etc.) and for labeled property graphs (e.g.,
openCypher, APOC, etc.)- much of this involves a new
graph.pymodule which is an RDFlib Storage plugin (Lorenzo: pure Py this time, not Cython)
- much of this involves a new
- PSL is about ready to become much more closely coupled with RDF predicates and Cypher relations
We also need to approach refactoring both subg and topo modules in a similar way.
Here's how I propose to divide these public methods into subpaths:
kglab core:
-
rdf_graph -
add_ns -
get_ns -
get_ns_dict -
describe_ns -
get_context -
encode_date -
add -
remove
io serialization (each abstract class has load and store):
-
parquet-
load_parquet -
save_parquet
-
-
w3c-
load_rdf -
load_rdf_text -
save_rdf -
save_rdf_text
-
-
jsonldTODO: can be deprecated sinceRDFlibabsorbedrdflib-jsonld-
load_jsonld -
save_jsonld
-
import practices (which are either semantic data integration or weird, special cases – definitely, let's keep these out of io):
-
import_roambecomesroam -
external_import.import_from_neo4jbecomesneo4jTODO: refactor to move over here -
materializebecomesrml -
load_csvbecomescsvw
query:
-
sparql-
queryTODO rename assparql_query -
query_as_dfTODO rename assparql_query_as_df -
n3fy -
n3fy_row -
visualize_query
-
w3c standards:
-
validatebecomesvalidate_shacl(SHACL) -
infer_owlrl_closure(OWL-RL) -
infer_rdfs_closure(OWL-RL) -
infer_rdfs_properties(OWL-RL) -
infer_rdfs_classes(OWL-RL) -
infer_skos_related(SKOS) -
infer_skos_concept(SKOS) -
infer_skos_hierarchical(SKOS) -
infer_skos_transitive(SKOS) -
infer_skos_symmetric_mappings(SKOS) -
infer_skos_hierarchical_mappings(SKOS)
plus:
-
subgraph -
topology
What great detail! I love it. I'm a tad busy, but if I have some time then I'll see if there's some room for my help - whether with commits or reviews.
@tomaarsen and @Mec-iS, one quick question:
In the @abstractmethod examples above these are @classmethod definitions.
What's a good approach for method which will require instance member, e.g., access to kglib.KnowledgeGraph._g and others?
I've started on the query portion to see how this would look, and the WIP is in the base_class_refactor branch:
- https://github.com/DerwenAI/kglab/blob/1f83702a6f369352a3c85e720a512a0913ee550a/kglab/kglab.py#L143
- https://github.com/DerwenAI/kglab/blob/base_class_refactor/kglab/query/base.py
- https://github.com/DerwenAI/kglab/blob/base_class_refactor/kglab/query/sparql.py
sorry I missed this comment @ceteri. Great work as usual!
Do you have ideas on how to carry this on? Are we done with this or what is the next milestone?
Use of Mixins #262
#263
#265 and we should be ok for now.