rdflib icon indicating copy to clipboard operation
rdflib copied to clipboard

URI / IRI checking should complain about ASCII control chars and maybe properly check URIs

Open joernhees opened this issue 8 years ago • 19 comments

currently the _is_valid_URI method only checks if a URI includes any of the not allowed printable ASCII chars (see #620)

ASCII control chars like backspace, tab, ... won't lead to an early error / warning.

maybe we should really try to parse URIs with a python builtin lib or maybe https://pypi.python.org/pypi/rfc3987 ?

joernhees avatar May 10 '16 13:05 joernhees

There's an additional problem that IRIs aren't being handled properly either, at least in the SPARQL store:

XMLSyntaxError: xmlns:ns30: 'http://ja.dbpedia.org/resource/セントラル・' is not a valid URI, line 45, column 52 (line 45)

jpmccu avatar Oct 31 '17 18:10 jpmccu

@joernhees me, @guvi007 and @prince17080 would like to work on this issue, can we go forward with this?

reeshabhranjan avatar May 25 '20 18:05 reeshabhranjan

@joernhees Hey, we (me, @reeshabhkumarranjan, @guvi007) want to solve this issue. We came up with the solution that a valid URI can only contain characters among 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;='. If you think this is correct, I (@prince17080) can submit a pull request for the same.

prince17080 avatar May 25 '20 18:05 prince17080

This approach will exclude IRIs, which I think we are trying to support.

On Mon, May 25, 2020 at 2:30 PM prince17080 [email protected] wrote:

@joernhees https://github.com/joernhees Hey, we (me, @reeshabhkumarranjan https://github.com/reeshabhkumarranjan, @guvi007 https://github.com/guvi007) want to solve this issue. We came up with the solution that a valid URI can only contain characters among 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;='. If you think this is correct, I (@prince17080 https://github.com/prince17080) can submit a pull request for the same.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/RDFLib/rdflib/issues/625#issuecomment-633678329, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAETCEIVQBR36HREDD2IBP3RTK2MJANCNFSM4CDMMQCA .

-- Jim McCusker

Director, Data Operations Tetherless World Constellation Rensselaer Polytechnic Institute [email protected] [email protected] http://tw.rpi.edu

jpmccu avatar May 25 '20 18:05 jpmccu

Last I checked is_valid_uri is one of the most time consuming checks that is run by rdflib. I second @joernhees's original suggestion to use urlparse for this, but whatever implementation is ultimately used we will need performance numbers.

tgbugs avatar May 25 '20 20:05 tgbugs

Hi @joernhees @jpmccu @tgbugs, I, @UditArora2000 and @Ritvik1712 would like to address this issue. We are planning to work on some of the issues of the library as part of our course on Semantic Web. We think that instead of urlparse function present in inbuilt python urllib, the method mentioned by @joernhees in the initial issue would work better. We propose to use the match function as part of https://pypi.python.org/pypi/rfc3987 and then compare the possible matches based on the format type URI_reference or IRI_referecne https://github.com/dgerber/rfc3987/blob/master/rfc3987.py would be able to do the work. As for the last comment on performance number we would be glad to run the analysis and perform the needful.

CrypticGuy avatar Apr 23 '22 17:04 CrypticGuy

Hm, not quite as straightforward as it might first appear. fwiw, I decanted rfc3987 into its own file in rdflib/extras, added a small switch to rdflib/term.py:

diff --git a/rdflib/term.py b/rdflib/term.py
index c0ea4a28..44a919ff 100644
--- a/rdflib/term.py
+++ b/rdflib/term.py
@@ -70,6 +70,7 @@ from isodate import (
 
 import rdflib
 from rdflib.compat import long_type
+from rdflib.extras import rfc3987
 
 if TYPE_CHECKING:
     from .namespace import NamespaceManager
@@ -81,15 +82,26 @@ rdflib_skolem_genid = "/.well-known/genid/rdflib/"
 skolems: Dict[str, "BNode"] = {}
 
 
+def _is_valid_uri(uri: str) -> bool:
+    full_iri_check = True
+    return __is_valid_iri(uri) if full_iri_check else __is_valid_uri(uri)
+
+
 _invalid_uri_chars = '<>" {}|\\^`'
 
 
-def _is_valid_uri(uri: str) -> bool:
+def __is_valid_uri(uri: str) -> bool:
     for c in _invalid_uri_chars:
         if c in uri:
             return False
     return True
 
+def __is_valid_iri(iri: str) -> bool:
+    if not isinstance(iri, str):
+        iri = str(iri)
+    res = rfc3987.parse(iri)
+    return True if res is not None else False
+
 
 _lang_tag_regex = compile("^[a-zA-Z]+(?:-[a-zA-Z0-9]+)*$")
 

with the following rather interesting results:

Standard, full_iri_check=False

5406 passed, 24 skipped, 117 xfailed, 1 xpassed, 46 warnings in 54.56s

Experimental, full_iri_check=True

8 failed, 5398 passed, 24 skipped, 118 xfailed, 46 warnings in 58.01s

FAILED test/test_misc/test_parse_file_guess_format.py::TestFileParserGuessFormat::test_warning - ValueError: 'file:///tmp/tmppei8l7iv/?xml version="1.0"?' is not a valid 'IRI_reference'.
FAILED test/test_parsers/test_trig_w3c.py::test_manifest[file:///home/gjh/PyBench/rdflib-github-rfc3987/test/data/suites/w3c/trig/manifest.ttl#localName_with_assigned_nfc_PN_CHARS_BASE_character_boundaries-http://www.w3.org/ns/rdftest#TestTrigEval-rdf_test31]
FAILED test/test_parsers/test_turtle_w3c.py::test_manifest[file:///home/gjh/PyBench/rdflib-github-rfc3987/test/data/suites/w3c/turtle/manifest.ttl#localName_with_assigned_nfc_PN_CHARS_BASE_character_boundaries-http://www.w3.org/ns/rdftest#TestTurtleEval-rdf_test20]
FAILED test/test_serializers/test_serializer_turtle.py::testUnicodeEscaping - ValueError: 'http://example.com/zzz\U00100000zzz' is not a valid 'IRI_reference'.
FAILED test/test_sparql/test_dawg.py::test_dawg_data_sparql11[http://www.w3.org/2009/sparql/docs/tests/data-sparql11/syntax-query/manifest#test_pn_05-http://www.w3.org/2001/sw/DataAccess/tests/test-manifest#PositiveSyntaxTest11-rdf_test417]
FAILED test/test_sparql/test_dawg.py::test_dawg_data_sparql11[http://www.w3.org/2009/sparql/docs/tests/data-sparql11/syntax-query/manifest#test_pn_06-http://www.w3.org/2001/sw/DataAccess/tests/test-manifest#PositiveSyntaxTest11-rdf_test418]
FAILED test/test_sparql/test_dawg.py::test_dawg_data_sparql11[http://www.w3.org/2009/sparql/docs/tests/data-sparql11/syntax-query/manifest#test_pn_07-http://www.w3.org/2001/sw/DataAccess/tests/test-manifest#PositiveSyntaxTest11-rdf_test419]
FAILED test/test_sparql/test_service.py::test_service_with_implicit_select_and_base - ValueError: 'BASE <http://example.org/' is not a valid 'IRI_reference'.

ghost avatar Apr 24 '22 12:04 ghost

@gjhiggins we tried to fix something in a similar manner and faced the same issues. Specifically the tests test/test_parsers/test_trig_w3c.py::test_manifest[file:///home/gjh/PyBench/rdflib-github-rfc3987/test/data/suites/w3c/trig/manifest.ttl#localName_with_assigned_nfc_PN_CHARS_BASE_character_boundaries-http://www.w3.org/ns/rdftest#TestTrigEval-rdf_test31 are failing badly. We tried to analyze it and were stuck there. Is there any other approach you might recommend that could be worth trying?

CrypticGuy avatar May 07 '22 16:05 CrypticGuy

@gjhiggins we tried to fix something in a similar manner and faced the same issues. Specifically the tests test/test_parsers/test_trig_w3c.py::test_manifest[file:///home/gjh/PyBench/rdflib-github-rfc3987/test/data/suites/w3c/trig/manifest.ttl#localName_with_assigned_nfc_PN_CHARS_BASE_character_boundaries-http://www.w3.org/ns/rdftest#TestTrigEval-rdf_test31 are failing badly. We tried to analyze it and were stuck there. Is there any other approach you might recommend that could be worth trying?

I'm a bit puzzled, tbh. I don't see anything in the URIs of that particular test other then ordinary ASCII. The value of the object may well be rich in unicode codepoint examples but as it's a Literal, it would be inappropriate to treat and test it as a potentially valid URI. If you had included the test failure output, maybe I'd be less puzzled.

Exception: "http://a.example/AZazÀÖØöø˿Ͱͽ΄῾‌‍⁰↉Ⰰ⿕、ퟻ﨎ﷇﷰ￯��" does not look like a valid  URI, I cannot serialize this as N3/Turtle. Perhaps you wanted to urlencode it?

This is the error that we are getting. So, I'm not sure if it's being encoded properly.

@prefix p: <http://a.example/> .
<http://a.example/s> <http://a.example/p> p:AZazÀÖØöø˿ͰͽͿ῿‌‍⁰↏Ⰰ⿯、퟿﨎﷏ﷰ￯𐀀󯿽 .

The test case if I am not wrong is trying to check the above input for a valid URI.

CrypticGuy avatar May 07 '22 18:05 CrypticGuy

I remembered I'd created a branch for this so was able to return to it. Yeah, I misread the trig, it is supposed to be a URI as indicated by the prefixing p:

ghost avatar May 07 '22 18:05 ghost

I remembered I'd created a branch for this so was able to return to it. Yeah, I misread the trig, it is supposed to be a URI as indicated by the prefixing p:

I was also going through some of the unicode codes manually and all of them do exist in the rfc3987 unicode specifications. So, found it strange that the errors were thrown. Despite that the corrent check doesn't seem to be accurate as it is doing the bare minimum checks.

CrypticGuy avatar May 07 '22 20:05 CrypticGuy

As regards the original issue report, the topic of validating input and using rfc3987 as a supporting package was discussed in https://github.com/RDFLib/rdflib/issues/266 and gromgull's comment on using the rfc3987 module was:

I tested the rfc module, it is a bit slow - instead I just check for some invalid characters: <>" {}|^`

In my opinion this is a good tradeoff between speed, correctness and not allowing you to shoot yourself TOO much in the foot.

AIUI, the general advice for library code is "try to be tolerant of input but strict in output" so there isn't much of a role for rfc3987 as regards input-checking. However, if a user has a need to be strict about output, rfc3987 is a reasonable choice of tool to check the statements prior to serialization.

ghost avatar May 08 '22 16:05 ghost

I mean, in that scenario the presence of a simple check is offering better tradeoffs. And in case someone needs to implement stricter conditions they could do it manually from there end. So, in that regard should the issue be marked closed?

CrypticGuy avatar May 09 '22 12:05 CrypticGuy

So, in that regard should the issue be marked closed?

I'm disinclined to mark the issue as "closed", at least not until the discussion has been transcribed and edited into the documentation because it's a not-unreasonable expectation and it's appropriate that the implications of satisfying that expectation should be addressed as part of an explanation of why such a feature isn't present.

fwiw, adding rdf3987 checking to URIRef.__new__() adds significant time to parsing. With a trivial change such as:

diff --git a/rdflib/term.py b/rdflib/term.py
index c0ea4a28..200c7d62 100644
--- a/rdflib/term.py
+++ b/rdflib/term.py
@@ -67,6 +67,7 @@ from isodate import (
     parse_duration,
     parse_time,
 )
+import rfc3987
 
 import rdflib
 from rdflib.compat import long_type
@@ -271,11 +272,10 @@ class URIRef(IdentifiedNode):
                 if not value.endswith("#"):
                     value += "#"
 
-        if not _is_valid_uri(value):
-            logger.warning(
-                "%s does not look like a valid URI, trying to serialize this will break."
-                % value
-            )
+        # Handle DefinedNamespaceMeta namespaces (i.e. core namespaces)
+        value = str(value) if not isinstance(value, str) else value
+
+        rfc3987.parse(value)
 
         try:
             rt = str.__new__(cls, value)

On my i7 laptop, the time taken to parse 250K triples generated from the sp2b generator goes from:

[success] 100.00% test/test_wip.py::test_read_250ktriples: 14.8532s

to

[success] 100.00% test/test_wip.py::test_read_250ktriples: 24.1934s

Nevertheless, it is undeniably a useful feature when it's needed so it would seem helpful to provide some documented guidelines of how to implement this feature for oneself as and when necessary.

ghost avatar May 09 '22 17:05 ghost

Would it make sense to add a hook for URI/IRI validation? That way users could determine their own performance vs correctness tradeoffs ranging from no validation at all, to checking for forbidden characters, to rfc3987's regular expressions, or even using the abnf package. It would also be friendly to systems that already validate IRIs for other reasons prior to interacting with rdflib.

A hook would also make for very natural place to documentation expectations, defaults, and options.

handrews avatar May 01 '23 21:05 handrews

I like this approach!

On Mon, May 1, 2023 at 5:09 PM Henry Andrews @.***> wrote:

Would it make sense to add a hook for URI/IRI validation? That way users could determine their own performance vs correctness tradeoffs ranging from no validation at all, to checking for forbidden characters, to rfc3987's regular expressions, or even using the abnf package. It would also be friendly to systems that already validate IRIs for other reasons prior to interacting with rdflib.

A hook would also make for very natural place to documentation expectations, defaults, and options.

— Reply to this email directly, view it on GitHub https://github.com/RDFLib/rdflib/issues/625#issuecomment-1530276945, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAETCELWAOQAIQEUSHNNHUTXEAQ77ANCNFSM4CDMMQCA . You are receiving this because you were mentioned.Message ID: @.***>

-- Jamie McCusker (she/her/hers)

Director, Data Operations Tetherless World Constellation Rensselaer Polytechnic Institute @.*** @.***> http://tw.rpi.edu

jpmccu avatar May 01 '23 21:05 jpmccu

I took a shot at implementing the hook approach from my previous comment, and noticed some things:

There are three places where validation can occur, with variations in behavior

  1. On URIRef construction, where it will issue a logger.warning() on failure
  2. On serialization (URIRef.n3()), where it will raise an instance of Exception on failure
  3. During NamespaceManager.compute_qname(), where it will raise an instance of ValueError on failure

The wording of the warning/error has minor variations regarding serializations (sometimes specific serializations, sometimes serialization in general) and/or urlencoding (which is reasonable but occasionally misleading guidance, and depending on the user-supplied validation option, might not make any sense at all)

It would seem like a good idea to both consolidate the default wording and allow flexibility in that wording when changing the validation behavior.

The UX and performance of warnings, errors, and multiple checks

I personally strongly dislike the warning, in part because it's not clear to me how to do anything about it if my usage of the library is not done in a context where warnings are monitored (as libraries can get layered before being incorporated into an application, I may or may not have any control over what the actual application does). The use of a warning also requires that the check be done a second time during serialization, which seems to go against the performance concerns expressed by others.

I much prefer failures to happen as soon as possible, so I tried out a strict constructor option (which defaulted to off, to maintain compatibility) that turned that first warning into an error, and skipped the check in n3() if it had been strictly constructed to save time on serialization. To me, this results in more reliable code and more performant serialization, although I do not know how well it fits the design goals of this project.

Unexpected encapsulation violation

I tend to expect single-leading-underscore functions to not be imported into other modules, but _is_valid_uri() gets imported. This might just be my lack of familiarity with rdflib's conventions- importing such functions could be fine if that's the expected usage. But it makes fixing this more difficult, which is where I've paused working on it.

My inclination would be to have a class property that defaults to the current check, and can be changed. compute_qname() would access the validator through the class property. I have not investigated whether the performance is so sensitive that accessing the function through a property would cause problems.

If more flexibility is needed, the check could be overridden with a constructor parameter (the approach I implemented in the first place before realizing the encapsulation issue), but that would not help the compute_qname() case. A more thorough fix would probably be to offer some sort of context manager to temporarily modify the validation behavior.

If a context manager approach were to be implemented, it would be flexible enough that it could be the only mechanism involved. I'm not sure if/when I'll have time to rework what I did before, but I'll monitor this issue to see what the reaction and interest levels are.

handrews avatar Jun 20 '23 00:06 handrews

Only a partial response, will respond to the rest when I have more time:

I tend to expect single-leading-underscore functions to not be imported into other modules, but _is_valid_uri() gets imported. This might just be my lack of familiarity with rdflib's conventions- importing such functions could be fine if that's the expected usage. But it makes fixing this more difficult, which is where I've paused working on it.

We try to follow PEP-8, and the most critical reasoning for using leading underscores is to signal that a function is not part of our public API. This is in accordance with my understanding of PEP-8:

https://peps.python.org/pep-0008/#public-and-internal-interfaces

Even with all set appropriately, internal interfaces (packages, modules, classes, functions, attributes or other names) should still be prefixed with a single leading underscore.

https://peps.python.org/pep-0008/#descriptive-naming-styles

_single_leading_underscore: weak “internal use” indicator. E.g. from M import * does not import objects whose names start with an underscore.

Even with all that said, Python tooling does seem to assume single underscore identifiers are not used from outside modules. To us, the most critical thing is keeping our public API small, and even reducing its size, there is a lot of code that should not be used that is not clearly internal only.

aucampia avatar Jun 20 '23 09:06 aucampia

I was just reminded of this- I hope to get back to it, but in the meantime I should note that rfc3987 is licensed under GPL which makes it unsuitable for many uses. I'm working on an Apache2-licensed alternative (none of the many other libraries I've found handle IRIs correctly).

handrews avatar Apr 03 '24 19:04 handrews