Pogues
Suggester - DDI
We devised a first implementation of the suggester component in DDI.
Proposal
We slightly modify the d:QuestionItem/d:CodeDomain for a single-answer question as follows:
<r:GenericOutputFormat controlledVocabularyID="INSEE-GOF-CV">suggester</r:GenericOutputFormat>
<r:CodeListReference isExternal="true">
<r:URN>urn:ddi:fr.insee:communes-2023:1</r:URN>
<r:TypeOfObject>CodeList</r:TypeOfObject>
<r:UserAttributePair>
<r:AttributeKey>SuggesterConfiguration</r:AttributeKey>
<r:AttributeValue>{ "queryParser": { "type": "tokenized", "params": { "language": "French", "pattern": "[\\w]+", "min": "1" } }, "stopWords": ["de", "la", "les", "du", "et", "au", "aux", "en"], "max": 12 }
</r:AttributeValue>
</r:UserAttributePair>
</r:CodeListReference>
List of changes
- We add a value to the controlled vocabulary of the r:GenericOutputFormat element: suggester.
- We add the attribute isExternal="true" to r:CodeListReference.
- The r:CodeListReference uses a single r:URN, which uniquely identifies the code list.
- Finally, we use a UserAttributePair inside the r:CodeListReference to pass the suggester configuration to Eno. The value (inside r:AttributeValue) holds the JSON snippet used by the Suggester component in Lunatic.
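To make the proposal concrete, here is a minimal Python sketch of how a consumer could read the URN and the JSON configuration from such a reference. It is not taken from Eno: the function name read_suggester_config is hypothetical, the snippet is wrapped with the DDI 3.3 r namespace declaration (omitted in the excerpt above), and the stop-word list is shortened.

```python
import json
import xml.etree.ElementTree as ET

# DDI 3.3 "reusable" namespace, declared here because the excerpt omits it.
NS = {"r": "ddi:reusable:3_3"}

# Raw string so that the JSON escape "\\w" reaches the JSON parser unchanged.
SNIPPET = r"""
<r:CodeListReference xmlns:r="ddi:reusable:3_3" isExternal="true">
  <r:URN>urn:ddi:fr.insee:communes-2023:1</r:URN>
  <r:TypeOfObject>CodeList</r:TypeOfObject>
  <r:UserAttributePair>
    <r:AttributeKey>SuggesterConfiguration</r:AttributeKey>
    <r:AttributeValue>{ "queryParser": { "type": "tokenized", "params": { "language": "French", "pattern": "[\\w]+", "min": "1" } }, "stopWords": ["de", "la"], "max": 12 }
    </r:AttributeValue>
  </r:UserAttributePair>
</r:CodeListReference>
"""


def read_suggester_config(reference):
    """Return (URN, parsed JSON configuration) for an external reference."""
    if reference.get("isExternal") != "true":
        raise ValueError("expected an external code list reference")
    urn = reference.findtext("r:URN", namespaces=NS)
    for pair in reference.findall("r:UserAttributePair", NS):
        key = pair.findtext("r:AttributeKey", namespaces=NS)
        if key == "SuggesterConfiguration":
            # json.loads tolerates the whitespace around the JSON object.
            value = pair.findtext("r:AttributeValue", namespaces=NS)
            return urn, json.loads(value)
    raise ValueError("no SuggesterConfiguration attribute found")


urn, config = read_suggester_config(ET.fromstring(SNIPPET))
print(urn, config["max"], config["stopWords"])
```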
Addition for the "multiple" suggester: in Lunatic, a suggester may fill several variables at a time.
The nomenclature has more than 2 columns. When the variable containing the code is collected, the variables taken from the other columns are obtained as well. Strictly speaking, these variables are calculated from the collected one, but Lunatic optimizes this calculation by "collecting" them.
They would be modeled as calculated variables whose d:GenerationInstruction would be:
left_join(aaa, bbb using ccc, ddd)
where :
- aaa is the name of the variable collected with the simple suggester
- bbb is the name of the nomenclature used for the suggester
- ccc is the name of the column containing the id of the nomenclature
- ddd is the name of the column containing the value of the calculated variable
Example :
- initial collected variable : "birth-country"
- nomenclature : "Countries" with 3 columns : id, label, continent
Formula for the calculated variable "birth-continent" : left_join(birth-country, Countries using id, continent)
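The intended semantics of left_join can be sketched in Python as a simple lookup. This is only an illustration of the formula, not how Lunatic or Eno evaluate it; the four parameters mirror aaa, bbb, ccc and ddd, and the sample rows are invented for the example.

```python
def left_join(collected_code, nomenclature, id_column, value_column):
    """Return value_column from the row whose id_column equals collected_code.

    As in a left join, a missing or unmatched collected value does not fail:
    the calculated variable is simply left empty (None).
    """
    if collected_code is None:
        return None
    for row in nomenclature:
        if row[id_column] == collected_code:
            return row[value_column]
    return None


# Invented sample of the "Countries" nomenclature: id, label, continent.
COUNTRIES = [
    {"id": "FR", "label": "France", "continent": "Europe"},
    {"id": "JP", "label": "Japan", "continent": "Asia"},
]

birth_country = "JP"  # value collected with the simple suggester
birth_continent = left_join(birth_country, COUNTRIES, "id", "continent")
print(birth_continent)
```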
Alternative proposal: the only change in the d:QuestionItem/d:CodeDomain is the r:GenericOutputFormat value "suggester". The CodeDomain refers to a codelist inside the questionnaire.
This codelist :
- contains no code
- refers to the external codelist with its URN
- contains the suggester parameters
If several responses use the same codelist, they share the same suggester parameters.
<r:GenericOutputFormat controlledVocabularyID="INSEE-GOF-CV">suggester</r:GenericOutputFormat>
<r:CodeListReference>
<r:Agency>fr.insee</r:Agency>
<r:ID>j334iumu</r:ID>
<r:Version>1</r:Version>
<r:TypeOfObject>CodeList</r:TypeOfObject>
</r:CodeListReference>
and
<l:CodeList>
<r:Agency>fr.insee</r:Agency>
<r:ID>j334iumu</r:ID>
<r:Version>1</r:Version>
<r:Label>
<r:Content xml:lang="fr-FR">communes-2023</r:Content>
</r:Label>
<l:HierarchyType>Regular</l:HierarchyType>
<l:Level levelNumber="1">
<l:CategoryRelationship>Ordinal</l:CategoryRelationship>
</l:Level>
<r:CodeListReference isExternal="true">
<r:URN>urn:ddi:fr.insee:communes-2023:1</r:URN>
<r:TypeOfObject>CodeList</r:TypeOfObject>
</r:CodeListReference>
<r:UserAttributePair>
<r:AttributeKey>SuggesterConfiguration</r:AttributeKey>
<r:AttributeValue>{ "queryParser": { "type": "tokenized", "params": { "language": "French", "pattern": "[\\w]+", "min": "1" } }, "stopWords": ["de", "la", "les", "du", "et", "au", "aux", "en"], "max": 12 }
</r:AttributeValue>
</r:UserAttributePair>
</l:CodeList>
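The indirection in this alternative (CodeDomain reference, then internal codelist, then external URN and parameters) can be sketched in Python as follows. This is an assumption about how a consumer might resolve it, not the actual Eno logic: the root wrapper element, the helper names identity and resolve, and the shortened JSON value are all hypothetical.

```python
import json
import xml.etree.ElementTree as ET

NS = {"r": "ddi:reusable:3_3", "l": "ddi:logicalproduct:3_3"}

# Hypothetical <root> wrapper holding the two excerpts side by side.
DOC = """
<root xmlns:r="ddi:reusable:3_3" xmlns:l="ddi:logicalproduct:3_3">
  <r:CodeListReference>
    <r:Agency>fr.insee</r:Agency><r:ID>j334iumu</r:ID><r:Version>1</r:Version>
    <r:TypeOfObject>CodeList</r:TypeOfObject>
  </r:CodeListReference>
  <l:CodeList>
    <r:Agency>fr.insee</r:Agency><r:ID>j334iumu</r:ID><r:Version>1</r:Version>
    <r:CodeListReference isExternal="true">
      <r:URN>urn:ddi:fr.insee:communes-2023:1</r:URN>
      <r:TypeOfObject>CodeList</r:TypeOfObject>
    </r:CodeListReference>
    <r:UserAttributePair>
      <r:AttributeKey>SuggesterConfiguration</r:AttributeKey>
      <r:AttributeValue>{ "queryParser": { "type": "tokenized" }, "max": 12 }</r:AttributeValue>
    </r:UserAttributePair>
  </l:CodeList>
</root>
"""


def identity(element):
    """(Agency, ID, Version) triple of a DDI item or of a reference to it."""
    return tuple(
        element.findtext(f"r:{tag}", namespaces=NS)
        for tag in ("Agency", "ID", "Version")
    )


def resolve(root):
    """Follow the CodeDomain reference to the external URN and parameters."""
    code_lists = {identity(cl): cl for cl in root.findall("l:CodeList", NS)}
    reference = root.find("r:CodeListReference", NS)  # the CodeDomain's one
    code_list = code_lists[identity(reference)]
    external = code_list.find("r:CodeListReference", NS)
    urn = external.findtext("r:URN", namespaces=NS)
    for pair in code_list.findall("r:UserAttributePair", NS):
        key = pair.findtext("r:AttributeKey", namespaces=NS)
        if key == "SuggesterConfiguration":
            value = pair.findtext("r:AttributeValue", namespaces=NS)
            return urn, json.loads(value)
    return urn, None


print(resolve(ET.fromstring(DOC)))
```

Because the parameters live on the codelist, every response whose reference resolves to the same (Agency, ID, Version) triple gets the same configuration.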
@BulotF please add the actual implementation of the <r:AttributeKey>SuggesterConfiguration</r:AttributeKey> value (it is an XML payload rather than JSON).
We will use r:UserID elements for identification. This will be documented.
Implementation, before replacing l:CodeListName with r:UserID:
<l:CodeList>
<r:Agency>fr.insee</r:Agency>
<r:ID>j334iumu</r:ID>
<r:Version>1</r:Version>
<r:UserAttributePair>
<r:AttributeKey>SuggesterConfiguration</r:AttributeKey>
<r:AttributeValue><![CDATA[<fields xmlns="http://xml.insee.fr/schema/applis/lunatic-h">
<name>id</name>
<rules>soft</rules>
</fields>
<queryParser xmlns="http://xml.insee.fr/schema/applis/lunatic-h">
<type>soft</type>
</queryParser>]]></r:AttributeValue>
</r:UserAttributePair>
<l:CodeListName>
<r:String xml:lang="fr-FR">in-error</r:String>
</l:CodeListName>
<r:Label>
<r:Content xml:lang="fr-FR">nomenclature in-error</r:Content>
</r:Label>
<r:CodeListReference isExternal="true">
<r:URN>urn:ddi:fr.insee:f7cbc001-29c7-482f-98ed-9121246db5a2:1</r:URN>
<r:TypeOfObject>CodeList</r:TypeOfObject>
</r:CodeListReference>
<l:HierarchyType>Regular</l:HierarchyType>
<l:Level levelNumber="1">
<l:CategoryRelationship>Ordinal</l:CategoryRelationship>
</l:Level>
</l:CodeList>
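One practical consequence of the XML payload: the CDATA section holds two sibling elements (fields and queryParser), so it is an XML fragment rather than a well-formed document and cannot be parsed on its own. A minimal Python sketch, assuming the consumer wraps the fragment in a temporary root element; the wrapper name, the function parse_payload and the shape of the returned dictionary are hypothetical and do not describe Lunatic's actual model.

```python
import xml.etree.ElementTree as ET

# Default namespace used inside the CDATA payload.
NS = {"h": "http://xml.insee.fr/schema/applis/lunatic-h"}

# Text content of r:AttributeValue once the CDATA section is unwrapped.
PAYLOAD = """<fields xmlns="http://xml.insee.fr/schema/applis/lunatic-h">
<name>id</name>
<rules>soft</rules>
</fields>
<queryParser xmlns="http://xml.insee.fr/schema/applis/lunatic-h">
<type>soft</type>
</queryParser>"""


def parse_payload(payload):
    """Parse the multi-root fragment by wrapping it in a temporary root."""
    wrapper = ET.fromstring(f"<wrapper>{payload}</wrapper>")
    fields = [
        {
            "name": field.findtext("h:name", namespaces=NS),
            "rules": field.findtext("h:rules", namespaces=NS),
        }
        for field in wrapper.findall("h:fields", NS)
    ]
    parser_type = wrapper.findtext("h:queryParser/h:type", namespaces=NS)
    return {"fields": fields, "queryParser": {"type": parser_type}}


print(parse_payload(PAYLOAD))
```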
Below is another modeling proposal.
Changes are:
- suggester parameters are set in the CodeListReference in the CodeDomain and not in the CodeList
- two UserIDs are added to the codeList (how to set their values remains to be studied). These UserIDs are used to match the code lists at collection time
- apart from the UserIDs, the codeList becomes a classic codeList:
- without suggester parameters
- without a CodeListReference inside
<?xml version="1.0" encoding="utf-8"?>
<ddi:FragmentInstance xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:ddi="ddi:instance:3_3" xmlns:r="ddi:reusable:3_3" xmlns:d="ddi:datacollection:3_3"
xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:l="ddi:logicalproduct:3_3"
xsi:schemaLocation="ddi:instance:3_3 https://www.ddialliance.org/Specification/DDI-Lifecycle/3.3/XMLSchema/instance.xsd">
<ddi:TopLevelReference>
<r:Agency>fr.insee</r:Agency>
<r:ID>8af075bd-3a65-4ce1-80d8-18e20cca72cd</r:ID>
<r:Version>1</r:Version>
<r:TypeOfObject>QuestionItem</r:TypeOfObject>
</ddi:TopLevelReference>
<ddi:Fragment>
<d:QuestionItem>
<r:Agency>fr.insee</r:Agency>
<r:ID>8af075bd-3a65-4ce1-80d8-18e20cca72cd</r:ID>
<r:Version>1</r:Version>
<d:QuestionItemName>
<r:String xml:lang="fr-FR">CITY</r:String>
</d:QuestionItemName>
<d:QuestionText>
<d:LiteralText>
<d:Text xml:lang="fr-FR">In which city do the Simpsons reside?</d:Text>
</d:LiteralText>
</d:QuestionText>
<d:CodeDomain>
<r:GenericOutputFormat controlledVocabularyID="INSEE-GOF-CV">suggester</r:GenericOutputFormat>
<r:CodeListReference>
<r:URN>urn:ddi:fr.insee:8af075bd-3a65-4ce1-80d8-18e20cca72cc:1</r:URN>
<r:Agency>fr.insee</r:Agency>
<r:ID>8af075bd-3a65-4ce1-80d8-18e20cca72cc</r:ID>
<r:Version>1</r:Version>
<r:TypeOfObject>CodeList</r:TypeOfObject>
<r:UserAttributePair>
<r:AttributeKey>SuggesterConfiguration</r:AttributeKey>
<r:AttributeValue><![CDATA[<fields xmlns="http://xml.insee.fr/schema/applis/lunatic-h">
<name>id</name>
<rules>soft</rules>
</fields>
<queryParser xmlns="http://xml.insee.fr/schema/applis/lunatic-h">
<type>soft</type>
</queryParser>]]></r:AttributeValue>
</r:UserAttributePair>
</r:CodeListReference>
<r:ResponseCardinality maximumResponses="1"/>
</d:CodeDomain>
</d:QuestionItem>
</ddi:Fragment>
<ddi:Fragment>
<l:CodeList>
<r:URN>urn:ddi:fr.insee:8af075bd-3a65-4ce1-80d8-18e20cca72cc:1</r:URN>
<r:Agency>fr.insee</r:Agency>
<r:ID>8af075bd-3a65-4ce1-80d8-18e20cca72cc</r:ID>
<r:Version>1</r:Version>
<!-- Illustrative values only; what to put here remains to be studied -->
<r:UserID typeOfUserID="url">https://collecte-api/web/classifications/geo/communes-2023-01-01</r:UserID>
<r:UserID typeOfUserID="url">https://collecte-api/offline/classifications/geo/communes-2023-01-01</r:UserID>
<l:CodeListName>
<r:String xml:lang="fr-FR">COMMUNES-2023-01-01</r:String>
</l:CodeListName>
<r:Label>
<r:Content xml:lang="fr-FR">Liste des communes au 1er janvier 2023</r:Content>
</r:Label>
<l:HierarchyType>Regular</l:HierarchyType>
<l:Level levelNumber="1">
<l:CategoryRelationship>Ordinal</l:CategoryRelationship>
</l:Level>
<l:Code>
<r:URN>urn:ddi:fr.insee:c6a0f7a1-c7dc-4a5e-a3df-da234057dd22:1</r:URN>
<r:Agency>fr.insee</r:Agency>
<r:ID>c6a0f7a1-c7dc-4a5e-a3df-da234057dd22</r:ID>
<r:Version>1</r:Version>
<r:CategoryReference>
<r:Agency>fr.insee</r:Agency>
<r:ID>916505d7-fe17-4e86-b32b-fb6a7783d7ef</r:ID>
<r:Version>1</r:Version>
<r:TypeOfObject>Category</r:TypeOfObject>
</r:CategoryReference>
<r:Value>75000</r:Value>
</l:Code>
<!-- etc. -->
</l:CodeList>
</ddi:Fragment>
</ddi:FragmentInstance>
@BulotF is the current implementation this one: https://github.com/InseeFr/Pogues/issues/682#issuecomment-1810438385 ?