Inconsistent string typing in SPARQL JSON results

Open cecemel opened this issue 5 months ago • 6 comments

Hello,

When results are returned as application/sparql-results+json, the datatype of strings is inconsistent depending on character content. If the string contains only ASCII characters, no datatype is added. If it includes non-ASCII characters, the datatype http://www.w3.org/2001/XMLSchema#string is attached.

This inconsistency causes issues further down the line (e.g., in DELETE operations), as Virtuoso does not treat the two forms as equivalent.
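To illustrate why the differing result shape matters downstream, here is a minimal Python sketch of a client that turns a JSON result binding back into a SPARQL term, for example to build a DELETE DATA statement. The helper name `binding_to_sparql_term` is hypothetical and not part of Virtuoso or any library; it only shows that the two bindings serialize to different literal syntax.

```python
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def binding_to_sparql_term(binding: dict) -> str:
    """Serialize one SPARQL JSON results binding as a SPARQL term.

    Hypothetical helper for illustration; handles only URIs and literals.
    """
    value = binding["value"]
    if binding["type"] == "uri":
        return f"<{value}>"
    # Escape backslashes, double quotes and newlines for a quoted literal.
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    if "xml:lang" in binding:
        return f'"{escaped}"@{binding["xml:lang"]}'
    if "datatype" in binding:
        return f'"{escaped}"^^<{binding["datatype"]}>'
    return f'"{escaped}"'


# The two shapes reported in this issue.
ascii_term = binding_to_sparql_term(
    {"type": "literal", "value": "Belgie spelled incorrectly"}
)
non_ascii_term = binding_to_sparql_term(
    {
        "type": "typed-literal",
        "datatype": XSD_STRING,
        "value": "België spelled correctly",
    }
)
print(ascii_term)
print(non_ascii_term)
```

The second term carries an explicit `^^<http://www.w3.org/2001/XMLSchema#string>` suffix although both literals were inserted without a datatype, so a statement built from it no longer has the same form as the original INSERT.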

Virtuoso version

Virtuoso Open Source Edition (Column Store) (multi threaded)
Version 7.2.10.3237-pthreads as of Jul 12 2023 (000000)
Compiled for Linux (x86_64-pc-linux-gnu)
Copyright (C) 1998-2023 OpenLink Software

Below are steps to reproduce the issue.

Let us know if you're able to reproduce it on your side.

Thanks!

Step: insert the data

INSERT DATA {
  GRAPH <http://debug> {
    <http://test/1> <http://bar> "België spelled correctly".
    <http://test/2> <http://bar> "Belgie spelled incorrectly".
  }
}

Step: check the data with ASCII-only

SELECT DISTINCT ?s ?p ?o WHERE {
  GRAPH <http://debug> {
    ?s ?p ?o
  }
  VALUES ?s {
    <http://test/2>
  }
  VALUES ?p {
   <http://bar>
  }
  VALUES ?o {
   "Belgie spelled incorrectly"
  }
}
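For anyone reproducing this outside the SPARQL web interface, the following Python sketch builds the same query as a SPARQL Protocol GET request with an explicit Accept header. The endpoint URL `http://localhost:8890/sparql` and the helper name `build_select_request` are assumptions for illustration; adjust them to your installation. The requests are only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a local Virtuoso instance exposing its SPARQL endpoint here.
ENDPOINT = "http://localhost:8890/sparql"

QUERY = """
SELECT DISTINCT ?s ?p ?o WHERE {
  GRAPH <http://debug> {
    ?s ?p ?o
  }
  VALUES ?s { <http://test/2> }
  VALUES ?p { <http://bar> }
  VALUES ?o { "Belgie spelled incorrectly" }
}
"""


def build_select_request(query: str, accept: str, endpoint: str = ENDPOINT) -> Request:
    """Build (but do not send) a SPARQL Protocol GET request."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": accept})


xml_request = build_select_request(QUERY, "application/sparql-results+xml")
json_request = build_select_request(QUERY, "application/sparql-results+json")
print(json_request.get_method(), json_request.full_url)
```

Passing either request to `urllib.request.urlopen` would return the corresponding serialization, which makes it easy to compare the XML and JSON results for the same query.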

Return as XML 🆗

<sparql
    xmlns="http://www.w3.org/2005/sparql-results#"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/sw/DataAccess/rf1/result2.xsd">
    <head>
        <variable name="s"/>
        <variable name="p"/>
        <variable name="o"/>
    </head>
    <results distinct="false" ordered="true">
        <result>
            <binding name="s">
                <uri>http://test/2</uri>
            </binding>
            <binding name="p">
                <uri>http://bar</uri>
            </binding>
            <binding name="o">
                <literal>Belgie spelled incorrectly</literal>
            </binding>
        </result>
    </results>
</sparql>

Return as application/sparql-results+json 🆗

{
  "head": {
    "link": [],
    "vars": [
      "s",
      "p",
      "o"
    ]
  },
  "results": {
    "distinct": false,
    "ordered": true,
    "bindings": [
      {
        "s": {
          "type": "uri",
          "value": "http://test/2"
        },
        "p": {
          "type": "uri",
          "value": "http://bar"
        },
        "o": {
          "type": "literal",
          "value": "Belgie spelled incorrectly"
        }
      }
    ]
  }
}

Step: check the data with non-ASCII

SELECT DISTINCT ?s ?p ?o WHERE {
  GRAPH <http://debug> {
    ?s ?p ?o
  }
  VALUES ?s {
    <http://test/1>
  }
  VALUES ?p {
   <http://bar>
  }
  VALUES ?o {
   "België spelled correctly"
  }
}

Return as XML 🆗

<sparql
    xmlns="http://www.w3.org/2005/sparql-results#"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/sw/DataAccess/rf1/result2.xsd">
    <head>
        <variable name="s"/>
        <variable name="p"/>
        <variable name="o"/>
    </head>
    <results distinct="false" ordered="true">
        <result>
            <binding name="s">
                <uri>http://test/1</uri>
            </binding>
            <binding name="p">
                <uri>http://bar</uri>
            </binding>
            <binding name="o">
                <literal>België spelled correctly</literal>
            </binding>
        </result>
    </results>
</sparql>

Return as application/sparql-results+json ❌

{
  "head": {
    "link": [],
    "vars": [
      "s",
      "p",
      "o"
    ]
  },
  "results": {
    "distinct": false,
    "ordered": true,
    "bindings": [
      {
        "s": {
          "type": "uri",
          "value": "http://test/1"
        },
        "p": {
          "type": "uri",
          "value": "http://bar"
        },
        "o": {
          "type": "typed-literal",
          "datatype": "http://www.w3.org/2001/XMLSchema#string",
          "value": "Belgi\u00EB spelled correctly"
        }
      }
    ]
  }
}

cecemel, Jul 22 '25 10:07

We are looking into this ...

HughWilliams, Jul 22 '25 11:07

The value "Belgi\u00EB spelled correctly" is a valid JSON-escaped Unicode string and will be parsed properly by any JSON-aware parser.

This can be seen if the Virtuoso JSON output:

{ "head": { "link": [], "vars": ["s", "p", "o"] },
  "results": { "distinct": false, "ordered": true, "bindings": [
    { "s": { "type": "uri", "value": "http://test/1" }	, "p": { "type": "uri", "value": "http://bar/" }	, "o": { "type": "typed-literal", "datatype": "http://www.w3.org/2001/XMLSchema#string", "value": "Belgi\u00EB spelled correctly" }} ] } }

is copied into a JSON-aware parser such as https://jsonformatter.org/json-parser, which returns:

{
  "head": {
    "link": [],
    "vars": [
      "s",
      "p",
      "o"
    ]
  },
  "results": {
    "distinct": false,
    "ordered": true,
    "bindings": [
      {
        "s": {
          "type": "uri",
          "value": "http://test/1"
        },
        "p": {
          "type": "uri",
          "value": "http://bar/"
        },
        "o": {
          "type": "typed-literal",
          "datatype": "http://www.w3.org/2001/XMLSchema#string",
          "value": "België spelled correctly"
        }
      }
    ]
  }
}
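The same check can be done locally. This small Python sketch parses the "o" binding from the output above with the standard `json` module; the variable names are illustrative only.

```python
import json

# Raw string, so the JSON escape \u00EB reaches the parser unchanged.
raw = (
    r'{"type": "typed-literal", '
    r'"datatype": "http://www.w3.org/2001/XMLSchema#string", '
    r'"value": "Belgi\u00EB spelled correctly"}'
)

binding = json.loads(raw)
print(binding["value"])     # the escape is decoded by the JSON parser
print(binding["type"])      # the type is untouched by decoding
print(binding["datatype"])  # so is the datatype
```

Decoding restores the "ë", but the "type" and "datatype" fields are still those of a typed literal, which differs from the plain "literal" returned for the ASCII-only string.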

HughWilliams, Jul 23 '25 11:07

@HughWilliams — If I understand correctly, the concern raised by @cecemel is that while the input datatypes are the same (or at least, are both unset) —

INSERT DATA {
  GRAPH <http://debug> {
    <http://test/1> <http://bar> "België spelled correctly".
    <http://test/2> <http://bar> "Belgie spelled incorrectly".
  }
}

— the output datatypes aren't the same, i.e. —

        "o": {
          "type": "literal",
          "value": "Belgie spelled incorrectly"
        }

vs

        "o": {
          "type": "typed-literal",
          "datatype": "http://www.w3.org/2001/XMLSchema#string",
          "value": "België spelled correctly"
        }

On that basis, I'm reopening the issue, pending indication by @cecemel that their concern is resolved.
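Until a fix lands, a client could normalize the bindings itself. The sketch below is one possible workaround, not something confirmed in this thread: it relies on RDF 1.1 treating a plain string literal as having datatype xsd:string, so dropping an explicit xsd:string datatype should not change the literal's meaning. The function name `normalize_string_literal` is hypothetical, and whether Virtuoso then matches the stored triple in a DELETE has not been verified here.

```python
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


def normalize_string_literal(binding: dict) -> dict:
    """Return a copy of a JSON result binding with xsd:string made implicit.

    Hypothetical client-side workaround; other bindings are copied unchanged.
    """
    is_literal = binding.get("type") in ("literal", "typed-literal")
    if is_literal and binding.get("datatype") == XSD_STRING:
        return {"type": "literal", "value": binding["value"]}
    return dict(binding)


plain = {"type": "literal", "value": "Belgie spelled incorrectly"}
typed = {
    "type": "typed-literal",
    "datatype": XSD_STRING,
    "value": "België spelled correctly",
}

normalized_plain = normalize_string_literal(plain)
normalized_typed = normalize_string_literal(typed)
print(normalized_plain)
print(normalized_typed)
```

After normalization both bindings have the same keys, so code that compares or re-serializes them sees one consistent form regardless of character content.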

TallTed, Jul 23 '25 17:07

That's indeed the issue, thanks for clarifying!

cecemel, Jul 25 '25 10:07

Yes, we are preparing a fix for that issue ...

HughWilliams, Jul 25 '25 10:07

Sorry to ask, any updates on this?

cecemel, Nov 24 '25 14:11