rdflib icon indicating copy to clipboard operation
rdflib copied to clipboard

How can I check for values to be valid URIRefs?

Open sabinem opened this issue 3 years ago • 11 comments

When I try to parse a Graph such as this here below, I get an error about a not well formed URIRef:

<?xml version="1.0" encoding="utf-8" ?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcat="http://www.w3.org/ns/dcat#"
         xmlns:dc="http://purl.org/dc/terms/">

  <dcat:Catalog rdf:about="https://swisstopo/catalog-endpoint.rdf">
    <dcat:dataset>
      <dcat:Dataset rdf:about="https://swisstopo/data/123">
        <dc:title xml:lang="en">some title</dc:title>
        <dc:description xml:lang="en">some dataset</dc:description>
        <dcat:accrualPeriodicity about="Hello World"/>
      </dcat:Dataset>
    </dcat:dataset>

    <dcat:dataset>
      <dcat:Dataset rdf:about="https://swisstopo/data/345">
        <dc:title xml:lang="en">some title</dc:title>
        <dc:description xml:lang="en">some dataset</dc:description>
      </dcat:Dataset>
    </dcat:dataset>

  </dcat:Catalog>

</rdf:RDF>
>g.parse('catalog,rdf')
ParserError: file:///home/sabine/Work/ckan-docker/catalog.rdf:11:8: Invalid property attribute URI: http://www.w3.org/1999/02/22-rdf-syntax-ns#about

So far so good. But when I build that catalog: how can I check that "Hello World" is not a valid URIRef?

>from rdflib import URIRef
>In [19]: x = URIRef("Hello world")                                                                                   
Hello world does not look like a valid URI, trying to serialize this will break.

I get this warning, but I what I would like to have would be an error, so that I can react on it. Would you have any advice for me on that? What does rdflib offer in that regard on validating URIRefs?

sabinem avatar Nov 16 '21 12:11 sabinem

Unfortunately, looks like when you create a URIRef, it only logs a warning and does not raise an exception until you try to serialize the graph.

You can do something similar with the code below.

from urllib.parse import quote_plus

from rdflib import URIRef, Graph, RDF, SDO
from rdflib.term import _is_valid_uri


class InvalidURIRef(Exception):
    """Raise when URIs are invalid."""
    pass


def create_uriref(uri):
    """Create a URIRef with the same validation func used by URIRef

    Except here an InvalidURIRef exception is raised.

    The RDFLib _is_valid_uri only checks for invalid characters. You can also add your
    own validation checks.
    """
    if _is_valid_uri(uri):
        return URIRef(uri)
    raise InvalidURIRef(f'"{uri}" is not a valid URI.')


# Note the use of @base. Therefore, local names are valid URIRef in RDFLib as long
# as you don't use the 'ntriples' serializer.
g = Graph(base=URIRef("https://example.com/"))

uri_value = "Hello world"

try:
    uri = create_uriref(uri_value)
except InvalidURIRef:
    # Handle the "invalid" URI.
    uri = create_uriref(quote_plus(uri_value))

g.add((uri, RDF.type, SDO.Thing))

g.print()
"""
@base <https://example.com/> .
@prefix schema: <https://schema.org/> .

<Hello+world> a schema:Thing .
"""

I would have suggested creating a subclass of URIRef where in the constructor you can perform your own validation but there seems to be some issues in RDFLib related to subclassing URIRef. See https://github.com/RDFLib/rdflib/issues/1332.

edmondchuc avatar Dec 09 '21 23:12 edmondchuc

@aucampia @gjhiggins Is this issue still open? In case yes, could you please assign this issue to me.

nandikajain avatar May 08 '22 19:05 nandikajain

A simple solution to this would just be to raise an error instead of a warning. This can be done by replacing logger.warning( "%s does not look like a valid URI, trying to serialize this will break." % value with raise ValueError("The URI %s is not valid." % value) in the term.py file.

I agree that there needs to be more than a warning in this case. An error is raised when parsing an RDF graph, so it makes sense to raise an error when using the URIRef function as well.

The datatype 'string' entered is correct (in the test case mentioned above), however the format is invalid. This is why I have chosen to raise a ValueError (instead of a ParserError as shown in the case of parsing the graph).

aishaaijazahmad avatar May 13 '22 15:05 aishaaijazahmad

A simple solution to this would just be to raise an error instead of a warning.

Simple but ill-conceived IMO in that it would break the current usage and no-one's actually calling for an error to be raised but for guidance on how to check for values to be valid URIRefs.

An error is raised when parsing an RDF graph

Ah , no it isn't ...

def test_invalid_uri():
    import rdflib

    data = """@base <https://example.com/> .
@prefix schema: <https://schema.org/> .

<Hello+world> a schema:Thing .
"""
    g = rdflib.Graph(base=rdflib.URIRef("https://example.com/"))
    g.parse(data=data)
    gres = g.serialize(format="ttl")
    assert (
        gres
        == """@base <https://example.com/> .
@prefix schema: <https://schema.org/> .

<Hello+world> a schema:Thing .

"""
    )

@edmondchuc's comment: “See https://github.com/RDFLib/rdflib/issues/1332” identifies the crux of the issue, resolving that issue would also effectively resolve this issue in that it would provide a convenient programmatic solution.

ghost avatar May 13 '22 16:05 ghost

Thanks for the update! @sabinem has asked for there to be an error raised instead of a warning so she can react to it before serializing the graph. I have responded to her initial concern.

I'll keep a tab on issue #1332 and check to see if the subclassing problem of URIRef can be resolved.

aishaaijazahmad avatar May 13 '22 17:05 aishaaijazahmad

As regards @sabinem's request, it has the potential to break things for everyone else and in practice would introduce a (generally-undesired) dependency on a dedicated package such as rcf3987. As a programming issue, it is solvable in a relatively straightforward manner by running a preliminary check on all the URIRefs in the graph (RDFLib is primarily a library for working with RDF)

for triple in graph:
    for term in triple:
        if isinstance(term, rdflib.term.URIRef):
            rfc3987.parse(term)

That's why #1332 is the crux, it would enable the implementation of such a check to be fairly straightforward. In the interim, I'll use the monkeypatch technique to illustrate:

from typing import Optional
from urllib.parse import urljoin

import pytest
import rfc3987
import rdflib

# Re-implement __new__ with optional use_rfc3987 kwarg
def __strictchecknew__(
    cls, value: str, base: Optional[str] = None, use_rfc3987: Optional[bool] = False
) -> "URIRef":
    if base is not None:
        ends_in_hash = value.endswith("#")
        value = urljoin(base, value, allow_fragments=True)
        if ends_in_hash:
            if not value.endswith("#"):
                value += "#"
    # Check with rcf3987 if specified, otherwise behave as normal
    if use_rfc3987:
        rfc3987.parse(value)

    if not rdflib.term._is_valid_uri(value):
        rdflib.logger.warning(
            f"{value} does not look like a valid URI, trying to serialize this will break."
        )

    try:
        rt = str.__new__(cls, value)
    except UnicodeDecodeError:
        rt = str.__new__(cls, value, "utf-8")  # type: ignore[call-overload]
    return rt

# Monkeypatch the URIRef.__new__() method
setattr(rdflib.term.URIRef, "__new__", __strictchecknew__)

# Copy in some invalid URIs
invalid_uris = [
    ("Hello world"),
    ("<123>"),
    (" 1 2 3 "),
    ('-"-'),
    ("ba\tz"),
    ("{}"),
    ("a|b"),
    ("1\\2"),
    ("^"),
    ("\U00100000"),
]

# Logged warning
@pytest.mark.parametrize(
    "invalid_uri",  invalid_uris,
)
def test_invalid_uri(invalid_uri: str) -> None:
    s = rdflib.URIRef(invalid_uri)

# Raise exception
@pytest.mark.parametrize(
    "invalid_uri", invalid_uris,
)
def test_invalid_uri_strict(invalid_uri: str) -> None:
    with pytest.raises(Exception):
        s = rdflib.URIRef(invalid_uri, use_rfc3987=True)

It's not clear from the included code above but as regards the change between __new__ and __strictnew__, the changes to rdflib.term.URIRef.__new__() are:

--- term.py	2022-05-08 18:19:38.545044964 +0100
+++ newterm.py	2022-05-13 18:51:56.526425677 +0100
@@ -262,7 +262,7 @@
     __neg__: Callable[["URIRef"], "NegatedPath"]
     __truediv__: Callable[["URIRef", Union["URIRef", "Path"]], "SequencePath"]
 
-    def __new__(cls, value: str, base: Optional[str] = None) -> "URIRef":
+    def __new__(cls, value: str, base: Optional[str] = None, use_rfc3987: Optional[bool] = False) -> "URIRef":
         if base is not None:
             ends_in_hash = value.endswith("#")
             # type error: Argument "allow_fragments" to "urljoin" has incompatible type "int"; expected "bool"
@@ -271,6 +271,8 @@
                 if not value.endswith("#"):
                     value += "#"
 
+        if use_rfc3987:
+            rfc3987.parse(value)
         if not _is_valid_uri(value):
             logger.warning(
                 "%s does not look like a valid URI, trying to serialize this will break."

This kind of thing is likely to be more useful when it is addressed in the forthcoming RDFLib "Cookbook".

ghost avatar May 13 '22 18:05 ghost

I look forward to the RDFLib Cookbook.

I can now see why simply raising an error will not work. The strictchecknew provides the check needed with the rfc3987 package and is an efficient way to check if the term is an instance of the URI class, but the subclassing issue with IRIs remains unsolved.

Thank you for illustrating how it could work in runtime with the use_rfc3987 kwarg. I will look into issue #1332, which you have referred to. Resolving that could close both issues.

aishaaijazahmad avatar May 14 '22 03:05 aishaaijazahmad

I will look into issue https://github.com/RDFLib/rdflib/issues/1332, which you have referred to. Resolving that could close both issues.

That would be useful.

I omitted to mention that any solution that uses rfc3987 will add significant time to the parsing and performance issues can be crucially important when dealing with large graphs.

ghost avatar May 14 '22 09:05 ghost

An option, either as a parameter or a global value, for URIRef to raise an error instead of a warning, with the default being warning, would provide much-needed flexibility to handle this at the source of the problem.

pdpark avatar May 26 '22 16:05 pdpark

See also https://github.com/RDFLib/rdflib/issues/625#issuecomment-1530276945 and https://github.com/RDFLib/rdflib/issues/625#issuecomment-1597893581

handrews avatar Jun 20 '23 00:06 handrews

Wow RDFLib doesn't cease to amaze (in a bad way)...

namedgraph avatar Apr 03 '24 13:04 namedgraph