SolRDF
Dynamically Bootstrap Named Analysed Fields for Searching and Boosting
Hi @agazzarini, the current schema in SolRDF is mostly focused on the SPARQL endpoint use case, i.e. its object literals are indexed into unanalysed string fields. To accommodate a more common use case, where we also want analysed field searching and per-field boosting, we could write object literals into named fields derived from the QNames. As Solr provides the mechanism of dynamic fields, we propose the following enhancement:
Transform the QName and the optional datatype and language information into a field name of the following structure:
prefix_predicateName[_datatype][_lang]
Use abstract heuristics to provide a basic search schema. This can be adapted to the actual requirements of the dataset. We make the general assumption that all fields can have multiple values:
Map untyped literals without a language tag to text_general:
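To make the naming rule concrete, here is a minimal sketch of how a field name of the form prefix_predicateName[_datatype][_lang] could be derived. The function name `field_name` and its parameters are hypothetical, not part of SolRDF. It assumes that literals without a datatype are treated as xsd:string, which is what the `*_xsd_string` and `*_xsd_string_de` patterns in the schema proposal suggest.

```python
from typing import Optional


def field_name(prefix: str, local_name: str,
               datatype: Optional[str] = None,
               lang: Optional[str] = None) -> str:
    """Build a field name shaped like prefix_predicateName[_datatype][_lang]."""
    # Assumption: plain and language-tagged literals fall back to xsd:string.
    datatype = datatype or "xsd:string"
    parts = [prefix, local_name, datatype.replace(":", "_")]
    if lang:
        # Language tags are lower-cased so "DE" and "de" share one field.
        parts.append(lang.lower())
    return "_".join(parts)


print(field_name("skos", "notation"))                      # skos_notation_xsd_string
print(field_name("dcterms", "title", lang="de"))           # dcterms_title_xsd_string_de
print(field_name("ex", "pages", datatype="xsd:integer"))   # ex_pages_xsd_integer
```

Each resulting name would then be matched by one of the dynamic field patterns, so no predicate has to be declared in the schema up front.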
<dynamicField name="*_xsd_string" type="text_general" indexed="true" stored="true" multiValued="true"/>
Map literals with language information to corresponding language text fields:
<dynamicField name="*_xsd_string_de" type="text_de" indexed="true" stored="true" multiValued="true"/>
Map typed literals with datatypes to corresponding fields:
xsd:integer => <dynamicField name="*_xsd_integer" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:nonPositiveInteger => <dynamicField name="*_xsd_nonPositiveInteger" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:negativeInteger => <dynamicField name="*_xsd_negativeInteger" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:long => <dynamicField name="*_xsd_long" type="tlong" indexed="true" stored="true" multiValued="true"/>
xsd:unsignedLong => <dynamicField name="*_xsd_unsignedLong" type="tlong" indexed="true" stored="true" multiValued="true"/>
xsd:int => <dynamicField name="*_xsd_int" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:unsignedInt => <dynamicField name="*_xsd_unsignedInt" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:short => <dynamicField name="*_xsd_short" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:unsignedShort => <dynamicField name="*_xsd_unsignedShort" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:byte => <dynamicField name="*_xsd_byte" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:unsignedByte => <dynamicField name="*_xsd_unsignedByte" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:nonNegativeInteger => <dynamicField name="*_xsd_nonNegativeInteger" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:positiveInteger => <dynamicField name="*_xsd_positiveInteger" type="tint" indexed="true" stored="true" multiValued="true"/>
xsd:float => <dynamicField name="*_xsd_float" type="tfloat" indexed="true" stored="true" multiValued="true"/>
xsd:decimal => <dynamicField name="*_xsd_decimal" type="tfloat" indexed="true" stored="true" multiValued="true"/>
xsd:double => <dynamicField name="*_xsd_double" type="tdouble" indexed="true" stored="true" multiValued="true"/>
xsd:boolean => <dynamicField name="*_xsd_boolean" type="boolean" indexed="true" stored="true" multiValued="true"/>
xsd:string => <dynamicField name="*_xsd_string" type="text_general" indexed="true" stored="true" multiValued="true"/>
xsd:hexBinary => <dynamicField name="*_xsd_hexBinary" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:base64Binary => <dynamicField name="*_xsd_base64Binary" type="binary" indexed="true" stored="true" multiValued="true"/>
xsd:anyURI => <dynamicField name="*_xsd_anyURI" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:QName => <dynamicField name="*_xsd_QName" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:NOTATION => <dynamicField name="*_xsd_NOTATION" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:normalizedString => <dynamicField name="*_xsd_normalizedString" type="text_general" indexed="true" stored="true" multiValued="true"/>
xsd:token => <dynamicField name="*_xsd_token" type="text_general" indexed="true" stored="true" multiValued="true"/>
xsd:language => <dynamicField name="*_xsd_language" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:IDREFS => <dynamicField name="*_xsd_IDREFS" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:IDREF => <dynamicField name="*_xsd_IDREF" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:ENTITIES => <dynamicField name="*_xsd_ENTITIES" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:ENTITY => <dynamicField name="*_xsd_ENTITY" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:NMTOKENS => <dynamicField name="*_xsd_NMTOKENS" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:Name => <dynamicField name="*_xsd_Name" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:NCName => <dynamicField name="*_xsd_NCName" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:ID => <dynamicField name="*_xsd_ID" type="string" indexed="true" stored="true" multiValued="true"/>
Map date and dateTime types to a date field and supplement the missing values (e.g. "2015" => "2015-01-01T00:00:00Z"):
xsd:date => <dynamicField name="*_xsd_date" type="tdate" indexed="true" stored="true" multiValued="true"/>
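The "supplement the missing values" step could look roughly like the sketch below. The helper name `to_solr_date` is hypothetical. It assumes the input is a bare year, year-month or full date without time zone, and pads missing parts with the first month, first day and midnight UTC, as in the "2015" example above. It does not validate month or day ranges.

```python
import re

# Matches YYYY, YYYY-MM or YYYY-MM-DD (assumption: no time zone suffix).
_PARTIAL_DATE = re.compile(r"^(\d{4})(?:-(\d{2}))?(?:-(\d{2}))?$")


def to_solr_date(value: str) -> str:
    """Pad a partial date so it fits Solr's date field format."""
    match = _PARTIAL_DATE.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported date literal: {value!r}")
    year, month, day = match.groups()
    # Missing month or day default to "01"; the time defaults to midnight UTC.
    return f"{year}-{month or '01'}-{day or '01'}T00:00:00Z"


print(to_solr_date("2015"))        # 2015-01-01T00:00:00Z
print(to_solr_date("2015-06"))     # 2015-06-01T00:00:00Z
print(to_solr_date("2015-06-15"))  # 2015-06-15T00:00:00Z
```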
Map duration to a string field:
xsd:duration => <dynamicField name="*_xsd_duration" type="string" indexed="true" stored="true" multiValued="true"/>
Map Gregorian date fields to a string field:
xsd:gYearMonth => <dynamicField name="*_xsd_gYearMonth" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:gYear => <dynamicField name="*_xsd_gYear" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:gMonthDay => <dynamicField name="*_xsd_gMonthDay" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:gDay => <dynamicField name="*_xsd_gDay" type="string" indexed="true" stored="true" multiValued="true"/>
xsd:gMonth => <dynamicField name="*_xsd_gMonth" type="string" indexed="true" stored="true" multiValued="true"/>
Hi @ahagenbruch, this sounds really interesting. Many thanks for such a detailed proposal. I introduced the "Hybrid" mode for mixing Solr and plain RDF features, so this could be something that goes in that direction. I strongly agree with you that StrFields have limited power in terms of querying capabilities.
I have to read your proposal again and then investigate what impact it would have on the existing code. In the meantime, a question: let's suppose we changed the schema in such a way. What kind of queries are you issuing to SolRDF? I think that, using plain SPARQL, you won't get any benefit from such a schema. Do you want to use Solr's built-in parsers and get results as SPARQL results?
Thanks again
BTW: I created a user list on Google. Feel free to join us if you want. We could also discuss this with other (few at the moment) users.
@ahagenbruch I'm moving the discussion back here as these are concrete implementation details. Two doubts:
Field name
You said, in your proposal:
prefix_predicateName[_datatype][_lang]
What about the prefix? In your schema example we have a skos:notation and that's fine, skos is a widely used, standard namespace. But what about custom namespaces? It doesn't sound good to index something like:
pippo_mynote_xsd_string
because "pippo" might be known only at index time. At query time you might not be aware of the prefixes I previously used for indexing, or you could use the same namespace mapped to a different prefix (e.g. pluto:mynote at query time and pippo:mynote at index time, where pippo and pluto point to the same namespace URI).
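The prefix concern can be reproduced with a small sketch. Everything here is hypothetical: the namespace URI, the helper `field_name_for`, and the two prefix maps stand in for whatever mappings are active at index time and at query time.

```python
def field_name_for(predicate_uri: str, prefixes: dict) -> str:
    """Derive a field name for an untyped literal from the active prefix map."""
    for prefix, namespace in prefixes.items():
        if predicate_uri.startswith(namespace):
            local_name = predicate_uri[len(namespace):]
            return f"{prefix}_{local_name}_xsd_string"
    raise KeyError(f"no prefix registered for {predicate_uri}")


predicate = "http://example.org/ns#mynote"
index_time = {"pippo": "http://example.org/ns#"}  # prefix used while indexing
query_time = {"pluto": "http://example.org/ns#"}  # different prefix, same URI

print(field_name_for(predicate, index_time))  # pippo_mynote_xsd_string
print(field_name_for(predicate, query_time))  # pluto_mynote_xsd_string
```

The same predicate yields two different field names, so a query built with the second mapping would not hit the field written with the first.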
Multivalued fields
You said
We make the general assumption that all fields can have multiple values
Why? Each triple (i.e. each document) will have exactly one value for the object field, regardless of the schema we use. Am I missing something about your proposal?
On 18.05.15 at 15:11, Andrea Gazzarini wrote:
Hi Andrea,
You said, in your proposal:
prefix_predicateName[_datatype][_lang]
What about the prefix? In your schema example we have a skos:notation and that's fine, skos is a widely used, standard namespace. But what about custom namespaces? It doesn't sound good to index something like:
pippo_mynote_xsd_string
because "pippo" might be known only at index time. At query time you might not be aware of the prefixes I previously used for indexing, or you could use the same namespace mapped to a different prefix (e.g. pluto:mynote at query time and pippo:mynote at index time, where pippo and pluto point to the same namespace URI).
I see your point, but I had these two use cases in mind when I wrote the proposal:
- Fielded search: The user wants to search on a specific field instead of on an aggregated field for 'simple search'. In most cases this would be done in an advanced search form in the front end where the user doesn't have to know about the actual field name in the index but sees a field name for general consumption (e.g. 'dcterms_title_xsd_string' vs. 'Title'). The same would hold true if you exposed the document search via an API. It would be your responsibility to document the field names (possibly having a mapping in your API to more readable names to make them more developer friendly).
- Weighted fields in a request handler: If you expose your 'simple search' via a request handler that uses, for instance, an eDisMax query parser, you put your field names and boost values into the qf parameter and are thus in control of which fields are searched and how they contribute to the overall score. If you don't do this, you will probably feed your fields into an overall search field via copy fields in your schema. In either case you know the field names...
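As an illustration of the second use case, the sketch below builds the request parameters for an eDisMax query with per-field boosts. `defType` and `qf` are standard Solr parameters. The field names, the boost values, the `/select` path and the helper `edismax_params` are assumptions made up for this example.

```python
from urllib.parse import urlencode


def edismax_params(query: str, boosts: dict) -> dict:
    """Build Solr request parameters with a boosted qf field list."""
    # qf is a space-separated list of field^boost entries.
    qf = " ".join(f"{field}^{boost}" for field, boost in boosts.items())
    return {"q": query, "defType": "edismax", "qf": qf}


params = edismax_params(
    "statistics",
    {
        "dcterms_title_xsd_string": 3.0,      # hypothetical field and boost
        "skos_prefLabel_xsd_string_en": 2.0,  # hypothetical field and boost
    },
)
print(params["qf"])
print("/select?" + urlencode(params))
```

In a deployment the same qf value would more likely sit in the defaults of the request handler configuration, so clients never need to know the raw field names.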
Multivalued fields
You said
We make the general assumption that all fields can have multiple values
Why? Each triple (i.e. each document) will have exactly one value for the object field, regardless of the schema we use. Am I missing something about your proposal?
By document I mean the subject URI as the document ID, the predicates as field names and the object literals as their values. As we can't know in advance which of our predicates might hold a list of objects*, the safe way seems to be to make all fields multivalued in the most general schema I proposed. If (as in my other two example schemas) you tailor the fields more to your dataset's needs, you probably don't want to declare fields as multivalued when you know they are single valued...
* e.g.
<thsys/72180> a skos:Concept, zbwext:Thsys ; rdfs:label "Statistics"@en, "Statistik"@de ; ...
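The document model described here, one Solr document per subject with predicates as fields, can be sketched as follows. The helper `to_documents` is hypothetical, and the triples are assumed to arrive with their field names already derived. The two rdfs:label values come from the example above; the second English label is invented purely to show a field holding more than one value.

```python
def to_documents(triples):
    """Group (subject, field, value) tuples into one document per subject."""
    docs = {}
    for subject, field, value in triples:
        doc = docs.setdefault(subject, {"id": subject})
        # Every field is a list, mirroring multiValued="true" in the schema.
        doc.setdefault(field, []).append(value)
    return docs


triples = [
    ("thsys/72180", "rdfs_label_xsd_string_en", "Statistics"),
    ("thsys/72180", "rdfs_label_xsd_string_de", "Statistik"),
    # Invented value, only to illustrate a multivalued field.
    ("thsys/72180", "rdfs_label_xsd_string_en", "Statistical methods"),
]

docs = to_documents(triples)
print(docs["thsys/72180"])
```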