Surveyadmin Normalization Notes

Open SachaG opened this issue 3 months ago • 0 comments

Concepts

answer: a single freeform answer from a respondent. Some questions accept multiple answers.
token: a canonical ID for a topic or concept. Depending on the question, one answer can have one or more tokens.
- example: poor_developer_experience
entity: a special kind of token that also has additional metadata such as homepage URL, webFeatures ID, etc.
- example: wide_gamut_colors
entities repo: the repo listing all tokens and entities.
normalization: the process of finding matching tokens for freeform text answers. Can be re-run multiple times.
pattern matching: when a token is associated with an answer via regex pattern matching.
manual assignment: when a token is manually assigned to a specific answer.
responses: the database table that stores "raw" (un-normalized) responses.
normalized_responses: the database table that stores normalized responses.
tag: a special category ID that can be used to select many tokens at once.

Workflow

Some notes about my typical workflow when normalizing data.

1. Token Creation

Before anything else can happen, entities/tokens need to be added to the entities repo. Thankfully there is already a large corpus of entities/tokens that should cover a lot of ground.

Note that if a token is defined in a file called foo.yml, it will automatically belong to the foo tag, in addition to any tags defined in the token's own tags property.

Also, if foo.yml is contained in directory bar, which itself is contained in baz, that token will also carry the bar and baz tags.
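The file-based tagging rule described above can be sketched as follows. This is only an illustration, not the actual Surveyadmin code: the `derive_tags` function, the `root` parameter, and the assumption that the top-level `tokens` directory does not itself become a tag are all hypothetical.

```python
from pathlib import PurePosixPath


def derive_tags(token_file, own_tags=(), root="tokens"):
    """Return the tags implied by a token file's location, plus its own tags.

    Assumption: directories above `root` (and `root` itself) do not count.
    """
    path = PurePosixPath(token_file)
    parts = list(path.parent.parts)
    if root in parts:
        # Keep only the directories below the assumed root.
        parts = parts[parts.index(root) + 1:]
    # File name first (foo), then enclosing directories, nearest first (bar, baz).
    implied = [path.stem] + list(reversed(parts))
    tags = []
    for tag in implied + list(own_tags):
        if tag not in tags:  # drop duplicates, keep order
            tags.append(tag)
    return tags


print(derive_tags("tokens/baz/bar/foo.yml", own_tags=["extra"]))
# ['foo', 'bar', 'baz', 'extra']
```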

2. 🏷️ Question MatchTags

Tags are assigned as part of the survey's question.yml outline. For example:

    - id: forms_pain_points
      disallowedTokenIds: [form_issues]
      matchTags: [common_pain_points, features_html]
      template: textList

This means that the system will look for matches using any token with the common_pain_points or features_html tag, as well as any token tagged with the question's own id (in this case, the forms_pain_points tag).

Note that the form_issues token is disallowed, meaning it will be excluded from matching (in this case because all answers here concern form issues, so having it as a match would only add noise).

Once tags are assigned, the system will look in the entities repo for any entity or token belonging to one of those tags.
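As a rough sketch of the selection logic described above (not the actual implementation), candidate tokens for a question could be computed like this. The `candidate_tokens` function, the token dictionaries, their tags, and the features_css tag are hypothetical and only serve the example; the question fields mirror the outline shown earlier.

```python
def candidate_tokens(question, tokens):
    """Return IDs of tokens that may be matched against a question's answers."""
    # The question's own id acts as an extra match tag.
    wanted = set(question.get("matchTags", [])) | {question["id"]}
    disallowed = set(question.get("disallowedTokenIds", []))
    return [
        token["id"]
        for token in tokens
        if wanted & set(token["tags"]) and token["id"] not in disallowed
    ]


question = {
    "id": "forms_pain_points",
    "disallowedTokenIds": ["form_issues"],
    "matchTags": ["common_pain_points", "features_html"],
}

# Hypothetical tokens and tags, for illustration only.
tokens = [
    {"id": "bugs_and_stability_issues", "tags": ["common_pain_points"]},
    {"id": "form_issues", "tags": ["common_pain_points"]},
    {"id": "wide_gamut_colors", "tags": ["features_css"]},
]

print(candidate_tokens(question, tokens))
# ['bugs_and_stability_issues']
```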

To get a list of match tags, click Tokens -> About Question Match Tags.

3. Preliminary Normalization

Once a question has some tags assigned to it, matching tokens should be found for anywhere between 25% and 75% of answers.

4. 🔢 Word Frequencies

A good way to get ideas for additional tokens is to look at words that appear most frequently.
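The idea can be approximated outside the tool with a few lines of Python. This is a minimal sketch under stated assumptions: the stop-word list, the tokenizing regex, and the sample answers are made up for illustration, and the actual feature may count words differently.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in", "too"}


def word_frequencies(answers, top=5):
    """Count how often each non-stop-word appears across freeform answers."""
    counts = Counter()
    for answer in answers:
        words = re.findall(r"[a-z']+", answer.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts.most_common(top)


answers = [
    "Form validation is buggy",
    "Styling forms is hard",
    "validation messages are inconsistent",
]

print(word_frequencies(answers, top=1))
# [('validation', 2)]
```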

5. 🤖 AI Suggestions

All un-normalized answers can also be exported as a YAML file and fed to an LLM to get suggestions, which can then be re-imported into the system.

6. Adding Pattern-Matching Patterns

A good way to increase the number of matches is to add more patterns to existing entities/tokens in the [corresponding YAML file](https://github.com/Devographics/entities/blob/62a7da1bb285906027788ee75b5230b169c4208c/tokens/pain_points/common_pain_points/common_pain_points.yml#L516):

    - id: bugs_and_stability_issues
      name: Bugs and stability issues
      patterns:
        - buggy
        - flakey
        - flakiness
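To illustrate how patterns like these could turn freeform answers into tokens, here is a minimal sketch. It assumes each pattern is treated as a case-insensitive regular expression searched anywhere in the answer; the real matching rules may differ, and the `match_tokens` function is hypothetical.

```python
import re

# Mirrors the YAML entry above, loaded into a plain Python structure.
TOKENS = [
    {
        "id": "bugs_and_stability_issues",
        "patterns": ["buggy", "flakey", "flakiness"],
    },
]


def match_tokens(answer, tokens):
    """Return the IDs of all tokens with at least one pattern found in the answer."""
    matched = []
    for token in tokens:
        # Assumption: patterns are regexes, matched case-insensitively.
        if any(re.search(p, answer, re.IGNORECASE) for p in token["patterns"]):
            matched.append(token["id"])
    return matched


print(match_tokens("The date picker is really Buggy", TOKENS))
# ['bugs_and_stability_issues']
print(match_tokens("Too verbose", TOKENS))
# []
```

Answers that return an empty list here are the ones that would need a new pattern or a manual assignment.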

7. Manual Token Assignment

If an answer cannot be matched using regex pattern matching, a token can be manually assigned to it. Manually assigned tokens will appear with a green border.

Manual assignments are stored in the database separately from a respondent's answer.

8. New Token Suggestions

When no applicable token exists, you can submit a new one. It will appear with a red border until it's approved.

Sep 02 '25 12:09 SachaG