rdflib
add chunk serializer & tests
Summary of changes
This file provides a single function, serialize_in_chunks(), which can serialize a
Graph into a number of N-Triples (NT) files, each limited to a maximum number of triples or a maximum file size.
There is an option to preserve any prefixes declared for the original graph in the first file, which will then be a Turtle file.
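The chunking idea described above can be sketched in plain Python. This is an illustrative stand-in, not the actual rdflib implementation or its API: the triples here are just pre-formatted N-Triples lines, and `chunked_write`, its parameters, and the file-naming scheme are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def chunked_write(lines: Iterable[str], output_dir: Path,
                  file_name_stem: str = "chunk", max_lines: int = 2) -> list[Path]:
    """Write lines (e.g. N-Triples statements) into numbered files,
    starting a new file whenever max_lines is reached."""
    output_dir.mkdir(parents=True, exist_ok=True)
    files: list[Path] = []
    buffer: list[str] = []

    def flush() -> None:
        # Write the buffered lines out as the next numbered chunk file.
        if buffer:
            path = output_dir / f"{file_name_stem}_{len(files):06d}.nt"
            path.write_text("".join(buffer), encoding="utf-8")
            files.append(path)
            buffer.clear()

    for line in lines:
        buffer.append(line)
        if len(buffer) >= max_lines:
            flush()
    flush()  # write any remainder as a final, smaller chunk
    return files


triples = [
    '<http://example.org/s> <http://example.org/p> "one" .\n',
    '<http://example.org/s> <http://example.org/p> "two" .\n',
    '<http://example.org/s> <http://example.org/p> "three" .\n',
]
paths = chunked_write(triples, Path(tempfile.mkdtemp()), max_lines=2)
# three triples with max_lines=2 yields two files: 2 triples + 1 triple
```

A size-based limit works the same way, except the flush decision counts bytes rather than lines.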
Checklist
- [x] Checked that there aren't other open pull requests for the same change.
- [x] Added tests for any changes that have a runtime impact.
- [x] Checked that all tests and type checking passes.
- For changes that have a potential impact on users of this project:
  - [x] Updated relevant documentation to avoid inaccuracies.
  - [ ] Considered adding additional documentation.
  - [ ] Considered adding an example in `./examples` for new features.
  - [x] Considered updating our changelog (`CHANGELOG.md`).
- [x] Considered granting push permissions to the PR branch, so maintainers can fix minor issues and keep your PR up to date.
I will have a look at the Windows test failures on Monday (CET).
@gjhiggins @ashleysommer @aucampia what do you think of the approach here?
The need to chunk serialize files is a small one - a project I'm working on needs it - and I thought it interesting enough to make an RDFLib tool for, rather than just keeping the code within the project.
I've tried to be efficient in terms of memory usage - no duplicate graph objects etc. - and to faithfully serialize the graph but there may be smarter approaches.
pre-commit.ci autofix
> I've tried to be efficient in terms of memory usage - no duplicate graph objects etc. - and to faithfully serialize the graph but there may be smarter approaches.
I think it is a fairly reasonable approach. It would have been nice if we had some way to do stream serialization to some text sink, but that is quite a big change and should be done with an abundance of caution, so this seems like a reasonable approach in the interim.
I may add some more tests on your branch, if that is okay with you. I'm also happy to address the comments I made - just let me know on each comment whether you agree or disagree.
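The stream serialization mentioned above could look roughly like this. This is a hypothetical sketch, not an rdflib API: `nt_lines` and `serialize_to_sink`, and the simplifying assumption that objects are plain literals, are all invented here. The point is that a generator yields one serialized statement at a time and the caller drains it into any text sink, so the full serialization is never held in memory.

```python
import io
from typing import Iterable, Iterator, TextIO


def nt_lines(triples: Iterable[tuple[str, str, str]]) -> Iterator[str]:
    """Lazily format (subject IRI, predicate IRI, object literal) tuples
    as N-Triples lines, one at a time."""
    for s, p, o in triples:
        yield f'<{s}> <{p}> "{o}" .\n'


def serialize_to_sink(triples: Iterable[tuple[str, str, str]],
                      sink: TextIO) -> int:
    """Drain the line generator into any text sink; returns lines written."""
    n = 0
    for line in nt_lines(triples):
        sink.write(line)
        n += 1
    return n


# Any text sink works: an open file, a socket wrapper, or an in-memory buffer.
sink = io.StringIO()
count = serialize_to_sink(
    [("http://example.org/a", "http://purl.org/dc/terms/title", "Alice")],
    sink,
)
```

Because the sink is just a `TextIO`, the same generator could feed the chunked writer or a network stream without changes.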
> @gjhiggins @ashleysommer @aucampia what do you think of the approach here?
Minimally, it should indicate clearly that it is restricted to Graph serialization because, as the test below shows, context information is not preserved:
```python
@pytest.mark.xfail(reason="Context information not preserved")
def test_chunking_of_conjunctivegraph():
    nquads = """\
<http://example.org/alice> <http://purl.org/dc/terms/publisher> "Alice" .
<http://example.org/bob> <http://purl.org/dc/terms/publisher> "Bob" .
_:harry <http://purl.org/dc/terms/publisher> "Harry" .
_:harry <http://xmlns.com/foaf/0.1/name> "Harry" _:harry .
_:harry <http://xmlns.com/foaf/0.1/mbox> <mailto:[email protected]> _:harry .
_:alice <http://xmlns.com/foaf/0.1/name> "Alice" <http://example.org/alice> .
_:alice <http://xmlns.com/foaf/0.1/mbox> <mailto:[email protected]> <http://example.org/alice> .
_:bob <http://xmlns.com/foaf/0.1/name> "Bob" <http://example.org/bob> .
_:bob <http://xmlns.com/foaf/0.1/mbox> <mailto:[email protected]> <http://example.org/bob> .
_:bob <http://xmlns.com/foaf/0.1/knows> _:alice <http://example.org/bob> ."""
    g = ConjunctiveGraph()
    g.parse(data=nquads, format="nquads")
    # make a temp dir to work with
    temp_dir_path = Path(tempfile.TemporaryDirectory().name)
    temp_dir_path.mkdir()
    # serialize into chunk files with 100 triples each
    serialize_in_chunks(
        g, max_triples=100, file_name_stem="chunk_100", output_dir=temp_dir_path
    )
    # check, when a graph is made from the chunk files, it's isomorphic with the original
    g2 = ConjunctiveGraph()
    for f in temp_dir_path.glob("*.nt"):
        g2.parse(f, format="nt")
    assert len(list(g.contexts())) == len(list(g2.contexts()))
```
> The need to chunk serialize files is a small one - a project I'm working on needs it - and I thought it interesting enough to make an RDFLib tool for, rather than just keeping the code within the project.
RDFLib has traditionally been ambivalent about what's perceived as core vs non-core. Additional functionality appears to inevitably accrete, up to a point where it gets migrated out en masse into a separate package, the contents of which gradually become obsolete as they either fall out of use or are subsequently integrated into core library functionality.
Additional non-core functionality does have a regrettable tendency to languish in an untended and unkempt state. For instance, there's `tools/graphisomorphism.py`, which is:
- currently broken (and has been since 2018)
- long obsolete, referring as it does to `RDFa` as a supported format and being based on Sean B. Palmer's 2004 `rdfdiff.py` implementation
- subject to the same triples-only limitation
- obsoleted in functionality by both `rdflib.compare.isomorphic` and the weaker `Graph.isomorphic()`
Is it even worth bothering with a relatively trivial fix/update ...
```diff
diff --git a/rdflib/tools/graphisomorphism.py b/rdflib/tools/graphisomorphism.py
index 004b567b..75462eb9 100644
--- a/rdflib/tools/graphisomorphism.py
+++ b/rdflib/tools/graphisomorphism.py
@@ -27,6 +27,10 @@ class IsomorphicTestableGraph(Graph):
         """
         return hash(tuple(sorted(self.hashtriples())))

+    def __hash__(self):
+        # return hash(tuple(sorted(self.hashtriples())))
+        return self.internal_hash()
+
     def hashtriples(self):
         for triple in self:
             g = ((isinstance(t, BNode) and self.vhash(t)) or t for t in triple)
@@ -49,19 +53,19 @@ class IsomorphicTestableGraph(Graph):
             else:
                 yield self.vhash(triple[p], done=True)

-    def __eq__(self, G):
+    def __eq__(self, g):
         """Graph isomorphism testing."""
-        if not isinstance(G, IsomorphicTestableGraph):
+        if not isinstance(g, IsomorphicTestableGraph):
             return False
-        elif len(self) != len(G):
+        elif len(self) != len(g):
             return False
-        elif list.__eq__(list(self), list(G)):
+        elif list.__eq__(list(self), list(g)):
             return True  # @@
-        return self.internal_hash() == G.internal_hash()
+        return self.internal_hash() == g.internal_hash()

-    def __ne__(self, G):
+    def __ne__(self, g):
         """Negative graph isomorphism testing."""
-        return not self.__eq__(G)
+        return not self.__eq__(g)


 def main():
@@ -82,10 +86,10 @@ def main():
         default="xml",
         dest="inputFormat",
         metavar="RDF_FORMAT",
-        choices=["xml", "trix", "n3", "nt", "rdfa"],
+        choices=["xml", "n3", "nt", "turtle", "trix", "trig", "nquads", "json-ld", "hext"],
         help="The format of the RDF document(s) to compare"
-        + "One of 'xml','n3','trix', 'nt', "
-        + "or 'rdfa'. The default is %default",
+        + "One of 'xml', 'turtle', 'n3', 'nt', 'trix', 'trig', 'nquads', 'json-ld'"
+        + "or 'hext'. The default is %default",
     )
     (options, args) = op.parse_args()
```
when its appearance in tools is unlikely to persist for much longer?
There are a few non-core contributions in the closed PRs which I'm recruiting for preservation in the cookbook. I'm guessing that a command-line version of `graphisomorphism` will ultimately end up there.
NOTE: I still have not had time to look at the Windows issue; will try tomorrow.
@nicholascar made some changes to your branch to get the tests to pass on Windows:
- Use bytes written as the size instead of using `os.path.getsize()`. The latter is very dependent on OS behaviour and on what has actually reached the disk.
- Add all open files to an exit stack so they are closed by the time the function returns.
- Set the encoding explicitly to utf-8 on opened files.
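The file-handling fixes in that list can be illustrated with stdlib pieces. This is a sketch of the pattern, not the PR's actual code (`write_chunks` and its parameters are invented here): `contextlib.ExitStack` guarantees every chunk file opened inside the function is closed on return, and sizes are tracked from the bytes we ourselves wrote rather than by asking the filesystem.

```python
import tempfile
from contextlib import ExitStack
from pathlib import Path


def write_chunks(lines: list[str], output_dir: Path, max_bytes: int) -> list[Path]:
    """Write text lines into chunk files, rolling over to a new file once the
    current one reaches max_bytes. Sizes come from counting the bytes we wrote,
    not from os.path.getsize(), which depends on flush/OS behaviour."""
    paths: list[Path] = []
    with ExitStack() as stack:
        current, written = None, 0
        for line in lines:
            if current is None:
                path = output_dir / f"part_{len(paths):03d}.nt"
                # encoding set explicitly, as in the fix described above;
                # ExitStack will close this file when the function returns
                current = stack.enter_context(open(path, "w", encoding="utf-8"))
                paths.append(path)
                written = 0
            written += len(line.encode("utf-8"))  # count bytes, not characters
            current.write(line)
            if written >= max_bytes:
                current = None  # start a new chunk for the next line
    # every file is closed (and therefore flushed) here
    return paths


out = Path(tempfile.mkdtemp())
paths = write_chunks(["x" * 10 + "\n"] * 3, out, max_bytes=15)
# each line is 11 bytes: two lines land in the first file, one in the second
```

Encoding each line to count its bytes is clumsy in text mode, which is exactly why the later commit moves to binary IO.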
Coverage increased (+0.01%) to 90.458% when pulling 8647eb06542a188d12b676e70d3d26190b459bed on chunk_serializer into 131d9e66e8515aa81d776969d42f58c72bc68f86 on master.
Another commit:
- Verify that writing a triple won't exceed the max file size *before* writing, instead of after. This also necessitates using binary mode for file IO so that an accurate byte count can be obtained.
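The pre-write check from that commit can be sketched like so (illustrative only; `append_with_limit` and its naming scheme are invented here): the file is opened in binary mode, so `len()` of the encoded triple is exactly the number of bytes `write()` will add, and the rollover decision is made before writing rather than after, which is what keeps every file under the limit.

```python
import tempfile
from pathlib import Path


def append_with_limit(lines: list[bytes], output_dir: Path,
                      max_bytes: int) -> list[Path]:
    """Start a new file *before* a write would push the current one past max_bytes."""
    paths: list[Path] = []
    current, written = None, 0
    try:
        for line in lines:
            # decide before writing: would this line overflow the current file?
            if current is None or written + len(line) > max_bytes:
                if current is not None:
                    current.close()
                path = output_dir / f"chunk_{len(paths):03d}.nt"
                current = open(path, "wb")  # binary mode: byte counts are exact
                paths.append(path)
                written = 0
            written += current.write(line)
    finally:
        if current is not None:
            current.close()
    return paths


out = Path(tempfile.mkdtemp())
paths = append_with_limit([b"0123456789\n"] * 3, out, max_bytes=25)
# 11-byte lines: two fit in the first file (22 bytes), the third would
# overflow it (33 > 25), so it opens a second file
```

Checking after writing would instead produce files that had already overshot the limit by up to one triple.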
@nicholascar will finish this up in W32
pre-commit.ci autofix
I think this is good to merge now.
I will merge this later this week if there is no further feedback.