integreat-cms
Giving feedback on the HIX value
Motivation
In response to a POST /benchmark/{id} request, the TextLab API returns some analysis details that could potentially be shown to users.
For example, there is a "dataTerms" attribute, which contains the following "categories":
- Filler Words: Filler words fill up your sentences without adding any content to it. In most cases you can simply drop them to improve your text.
- Finance and Insurance: Many financial or insurance terms are tough to understand for laypersons. TextLab helps you to detect such terms. Replace them with more common terms!
- etc.
If a word falls into one of these categories, it is listed in the "result" section of that category, with or without replacement suggestions. For example:
"result": [
  {
    "length": [
      1
    ],
    "position": [
      0
    ],
    "priority": 10,
    "replacement": [],
    "term": {
      "check_words": 0,
      "description": "Dieser Begriff kann negativ wirken.|Tipp: Versuchen Sie, auf den Begriff zu verzichten und neutral zu formulieren.",
      "lemma": [
        "unwiderruflich"
      ],
      "settings": {},
      "tag": [
        "ADJD"
      ],
      "term_id": 29531,
      "wordcount": 1,
      "words": [
        "unwiderruflich"
      ]
    }
  }
]
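A minimal sketch of how such a "result" list could be turned into hints for users. It assumes the entries have exactly the shape shown above and that the description uses "|" to separate the explanation from the tip. The function name `summarize_term_hits` and the output keys are hypothetical and not part of the CMS or the TextLab API.

```python
def summarize_term_hits(result):
    """Turn a TextLab-style "result" list into simple hint dicts.

    Assumes each entry looks like the example above; missing keys are
    tolerated and fall back to empty values.
    """
    hints = []
    for entry in result:
        term = entry.get("term", {})
        # Assumption: "|" separates the explanation from the tip.
        description = term.get("description", "")
        parts = [p.strip() for p in description.split("|") if p.strip()]
        hints.append(
            {
                "words": term.get("words", []),
                "positions": entry.get("position", []),
                "replacements": entry.get("replacement", []),
                "explanation": parts[0] if parts else "",
                "tip": parts[1] if len(parts) > 1 else "",
            }
        )
    return hints


# Shortened version of the entry shown above.
example = [
    {
        "length": [1],
        "position": [0],
        "priority": 10,
        "replacement": [],
        "term": {
            "description": "Dieser Begriff kann negativ wirken.|Tipp: Versuchen Sie, auf den Begriff zu verzichten und neutral zu formulieren.",
            "words": ["unwiderruflich"],
        },
    }
]

for hint in summarize_term_hits(example):
    print(hint["words"], hint["positions"], hint["tip"])
```

An empty "replacement" list simply yields an empty list of suggestions, so the report could show the tip alone in that case.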
Proposed Solution
I guess we could build some kind of report for users with this data?
Alternatives
TBD
User Story
TBD
Additional Context
Full response example: benchmark-response.txt
The text being checked (just a set of "difficult" words):
Unwiderrufliches Unterversicherung Unterjährigkeitszuschlag Rechtsschutzversicherungsgesellschaften
The response also includes a "languageTool" attribute with typos and grammatical errors, but I assume they don't affect the HIX value (do they?)
Design Requirements
TBD
Sometimes, however, I don't quite understand the HIX calculation logic. For example, if I analyze the sentence:
Willkommen in Augsburg
The HIX value is 11.72 (why?) but there are no "result" entries under "dataTerms".
There is also a "dataColorWords" attribute with the options "colorBlue", "colorGreen", "colorRed" and "colorYellow". Some words are sorted into these "colors", which probably affects the HIX value too, but the logic is not clear to me.
I think the text is too short to be properly analysed by HIX. There are no mistakes, so there is nothing you could do better, but percentage-wise the words Willkommen and Augsburg are too long in relation to the length of the text to allow for a higher HIX value. Does that make sense?
@osmers @ulliholtgrave not sure if this is the right issue, but I want to share the HIX parameters that we might display. You can find them in a table here: https://nextcloud.tuerantuer.org/index.php/f/5287173
Nesting sets Long sentences Long words spelling, orthography Abbreviations Composites Legal German Filler words attribute Phrases and stuffy stuff Negative formulations negation Passive sentences Infinitive constructions Sentences in nominal style Sentences in the future tense
@dkehne we had a short look at this during the conference and basically said that the current design is too complex and that we should just add the information to the HIX block, where we currently have the link to the wiki, right?
E.g. "Your text contains 3 words that are too long and 2 sentences that are too long, and uses 7 avoidable filler words."
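A rough sketch of how such a sentence could be assembled from per-category counts. The function name `build_hix_feedback`, the labels and the fallback message are made up for illustration; singular and plural forms and translation are not handled.

```python
def build_hix_feedback(counts):
    """Build a one-sentence summary from per-category hit counts.

    `counts` maps a human-readable label (plural form) to the number of
    hits; categories with zero hits are left out of the sentence.
    """
    parts = [f"{n} {label}" for label, n in counts.items() if n > 0]
    if not parts:
        return "No issues found in your text."
    if len(parts) == 1:
        listing = parts[0]
    else:
        listing = ", ".join(parts[:-1]) + " and " + parts[-1]
    return f"Your text contains {listing}."


print(
    build_hix_feedback(
        {
            "words that are too long": 3,
            "sentences that are too long": 2,
            "avoidable filler words": 7,
        }
    )
)
```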
Is that still the plan? This could also be a design (just without the colors):
We might also add a hix details block below the editor. Might be leaner...
Yeah, that could also be an option - either way I think we need a new design, right?
What would the details block contain? Here is Toni's design again: https://www.figma.com/file/6U7R7Xj4wL7sbjxKRmOG9D/CMS-Project?type=design&node-id=1013-645&mode=design&t=UgltU5RpOK1UXSYa-0
I think we could simply work with the version that's in the bottom two screens (showing what's good and what needs to be checked) in the design but display them just below the current editor (not via a new page, as suggested in the design)
@dkehne and I talked about this issue today and agreed on the following list of parameters that we want to give feedback on:
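- Nested sentences (Schachtelsätze)
- Long sentences (Lange Sätze)
- Long words (Lange Wörter)
- Abbreviations (Abkürzungen)
- Sentences in the passive voice (Sätze im Passiv)
- Infinitive constructions (Infinitiv-Konstruktionen)
- Sentences in nominal style (Sätze im Nominalstil)
- Sentences in the future tense (Sätze im Futur)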
@hauf-toni we could meet again to discuss the design, but I will also summarize it here in a moment.
@MizukiTemma I set the priority to high because we want to communicate to the municipalities that this issue is our next development, since their HIX value dropped significantly due to the last release. I think the next steps are:
- @hauf-toni will design the discussed changes
- we can meet and check if everything is clear
- this issue together with #2688 should be implemented
Any questions?
Updated Design can be found 📍 here @osmers @dkehne
I agree with the updated overall design. Great work. Just two comments from me.
@hauf-toni could you let us know once the comments/changes are done and the CMS team can start with the issue? :)
Currently we neither request nor store the HIX score when the “Ignore HIX value” flag is set. Should this be changed?
Hmm, good question - not sure if it is necessary, but if it is not too much work then we can just store the last value that was shown/calculated before the "Ignore HIX" field was checked, right?
You can start the issue, the design is ready ⛳
@osmers not too much, but it will make the HIX widget bigger, and it will display data that was current at an unspecified point in time, when the content could have been completely different. Would that be useful?
Yeah, you are right - then you can implement it accordingly without showing the HIX value, I don't think a new design is needed :)
@osmers I think we might need more information on how to extract this data from the Textlab API response, because for the desired categories it is not explicitly provided:
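- Nested sentences (Schachtelsätze)
- Long sentences (Lange Sätze)
- Long words (Lange Wörter)
- Abbreviations (Abkürzungen)
- Sentences in the passive voice (Sätze im Passiv)
- Infinitive constructions (Infinitiv-Konstruktionen)
- Sentences in nominal style (Sätze im Nominalstil)
- Sentences in the future tense (Sätze im Futur)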
For example, I'm checking this text:
Es war ein mal ein Mädchen mit einer roten Kappe, die es von seiner Großmutter, mit der zusammen es von einem Jäger, der gerade des Weges kam und den Wolf im Haus der Großmutter, die dem Wolf anschließend Wackersteine in den Bauch, den sie dann zugenäht hatte, gelegt hatte, damit der beim Saufen ins Wasser fällt.
This is the full json response
I can see in https://textlab.online/ that this text has 1 long sentence, 1 nested sentence, 1 sentence in Nominalstil.
But it doesn't seem to be quite obvious how to get it from json. There has to be some logic to how it's calculated.
UPD. The only clue I've found that looks relevant:
"countNominalStyle": [
  [
    34,
    41,
    0
  ]
],
"countNominalStyle": {
  "clix": 1, // it's not the number of sentences
  "paraPerc": 100,
  "res": 11.11111111111111,
  "scaleFrom": 20,
  "scaleTo": 10,
  "target": 12
},
But it's still raw data that needs to be interpreted.
@seluianova I sent your questions to our contact person at Textlab and I hope she will get back to us soon. I couldn't find anything either... I found the information on Nominal Style as well - I could imagine that the numbers represent the position of the words in the whole text but I'm not quite sure if that makes sense. Let's wait for her response.
Hi @seluianova I'll post the answer I received here and maybe you can provide me with the json response for a longer text with a few more of the parameters mentioned?
Thank you for your message. The values are included in the answer, but are not so easy to recognize as they have different names than on the surface. Here are the equivalents:
- Schachtelsätze (nested sentences) -> moreSentencesInClauses
- Lange Sätze (long sentences) -> moreSentencesInWords
- Lange Wörter (long words) -> moreWordsInLetters
- Abkürzungen (abbreviations) -> Abkürzungen (terminology lists are always referred to by their ID first; further down, "name" follows with the German and English names)
- Sätze im Passiv (sentences in the passive voice) -> countPassiveVoiceInSentence
- Infinitiv-Konstruktionen (infinitive constructions) -> countInfinitiveConstructions
- Sätze im Nominalstil (sentences in nominal style) -> countNominalStyle
- Sätze im Futur (sentences in the future tense) -> countFutureTenseInSentence
If there was a match for it in the text, then the results are in "result" for term lists. If the square bracket is empty, there was no match. For the other parameters, there is simply a colon directly after the name and then the square bracket. Here too: If the bracket is empty, there is no hit.
If there is a hit for terminology, then the length of the hit (i.e. how many words) is shown in "length" and the position in the text in "position". For "position", several positions can also be listed if the term occurs several times in the text.
For the remaining parameters at sentence level, the numbers are simply placed one below the other, i.e. start of the match, end of the match, length of the match. The length varies from the unit depending on the parameter. For nested sentences it indicates the number of sentence parts, for long sentences the number of words.
Long words only show the position of the hit and the length, in this case the number of letters.
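A small sketch of how the raw lists described above could be read. It assumes that sentence-level parameters arrive as [start, end, length] triples and long words as [position, length] pairs; the pair shape is inferred from the explanation and not confirmed by a sample response. `Match` and `parse_matches` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Match(NamedTuple):
    start: int
    end: Optional[int]  # assumed to be absent for long words
    length: int


def parse_matches(raw):
    """Convert raw match lists into Match tuples.

    Triples are read as (start, end, length), pairs as (position, length).
    Anything else raises an error instead of being guessed at.
    """
    matches = []
    for item in raw:
        if len(item) == 3:
            start, end, length = item
            matches.append(Match(start, end, length))
        elif len(item) == 2:
            position, length = item
            matches.append(Match(position, None, length))
        else:
            raise ValueError(f"Unexpected match shape: {item!r}")
    return matches


print(parse_matches([[34, 41, 0]]))
```

An empty list yields no matches, which corresponds to "if the bracket is empty, there is no hit".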
@osmers to be honest, it didn't get much clearer 🙈
So,
"countNominalStyle": [
  [
    34,
    41,
    0
  ]
],
If 34 is the "start of the match", what exactly is meant? Is it the number/index of the word with which the sentence in "nominal style" begins? Because it doesn't seem so. And if 0 is the "length of the match", what does 0 mean? 🤔
> maybe you can provide me with the json response for a longer text with a few more of the parameters mentioned?
I can look something up, or if you give me the right text, I can attach a response for it.
Ah, wait. I have a guess. When we have one set of values, like here:
"countNominalStyle": [
  [
    34,
    41,
    0
  ]
],
it means there is 1 sentence.
If we have 2 sets, like:
"countNominalStyle": [
  [
    2,
    65,
    56
  ],
  [
    70,
    133,
    56
  ]
],
it means there are 2 sentences.
If so, we don't need to know what these numbers actually mean.
I will check the other params too.
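If that guess holds, counting hits per category could look roughly like the sketch below. The key names come from the TextLab reply quoted earlier. The category labels, `CATEGORY_KEYS` and `count_hits` are made up for illustration, and `analysis` is assumed to be the part of the response in which each key holds a list of match sets (not the object with "clix", "res" etc. that shares the same key name).

```python
# Feedback categories mapped to the TextLab keys named in the reply above.
# "Abkürzungen" is a terminology list and is left out here.
CATEGORY_KEYS = {
    "nested_sentences": "moreSentencesInClauses",
    "long_sentences": "moreSentencesInWords",
    "long_words": "moreWordsInLetters",
    "passive_voice_sentences": "countPassiveVoiceInSentence",
    "infinitive_constructions": "countInfinitiveConstructions",
    "nominal_sentences": "countNominalStyle",
    "future_tense_sentences": "countFutureTenseInSentence",
}


def count_hits(analysis):
    """Count hits per category as the number of match sets under each key.

    A missing key or an empty list counts as zero hits.
    """
    return {
        category: len(analysis.get(key) or [])
        for category, key in CATEGORY_KEYS.items()
    }


analysis = {
    "countNominalStyle": [[2, 65, 56], [70, 133, 56]],
    "moreSentencesInWords": [],
}
print(count_hits(analysis)["nominal_sentences"])  # 2
```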
@osmers I have located all the data now, thanks for the input!
@seluianova perfect, then I will let her know that we have everything we need :) Thanks for checking the data. Since we have everything now, can you estimate when we will be able to test this in the test system? :)
@osmers I think I could try to open a PR this week for all categories, except Abkürzungen (because it's slightly different). I can't say about Abkürzungen yet, I'll be working on entitlementcard next week, I may have to pass it on to someone else if it's just as high a priority (is it? or maybe we could deliver it separately?)
@seluianova sorry for the late reply, I was busy with networking events for municipalities last week. I think it's ok if we do not deliver Abkürzungen (abbreviations) right now, since I don't think they influence the HIX directly; they just make a text less readable. So if you can already merge and publish the linked pull request, please do so :)
@osmers I have asked in the team. There is some refactoring left, but I can't finish it right now because I'm busy with entitlementcard.
@seluianova for how long will you be working on the entitlement card? Or who could take over this issue from you?
Abkürzungen (abbreviations) were moved to a separate task: https://github.com/digitalfabrik/integreat-cms/issues/2778