CorporaCreator
Demographic data is sometimes doubled per client ID
In some cases a given client_id might have more than one demographic datapoint (e.g. gender or age) linked to it. Often this is blank vs. male/female or blank vs. some age.
This is probably because people recorded some clips and then made a profile, or because they were logged out.
In any case it would be good (and probably safe) to replace blank in the field with the more specific datapoint if and only if there are no other datapoints associated with the client_id.
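To make the proposed rule concrete, here is a minimal sketch in plain Python. It is not CorporaCreator's actual code: the function name `fill_blank_demographics`, the row-as-dict layout, and the sample values are assumptions for illustration only. A blank is filled only when the client_id has exactly one distinct non-blank value for that field; if there are conflicting values (e.g. both male and female), the blank is left alone.

```python
from collections import defaultdict


def fill_blank_demographics(rows, field):
    """Return copies of `rows` in which a blank `field` is filled in,
    but only when the row's client_id has exactly one distinct
    non-blank value for that field. (Hypothetical helper.)"""
    # Collect the distinct non-blank values seen for each client_id.
    values_by_client = defaultdict(set)
    for row in rows:
        value = row[field].strip()
        if value:
            values_by_client[row["client_id"]].add(value)

    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are not mutated
        candidates = values_by_client.get(row["client_id"], set())
        # Fill only when blank AND there is one unambiguous candidate.
        if not row[field].strip() and len(candidates) == 1:
            new_row[field] = next(iter(candidates))
        filled.append(new_row)
    return filled


# Made-up sample rows, not taken from any real dataset.
clips = [
    {"client_id": "a", "gender": "", "age": ""},
    {"client_id": "a", "gender": "male", "age": "thirties"},
    {"client_id": "b", "gender": "", "age": ""},
    {"client_id": "b", "gender": "male", "age": ""},
    {"client_id": "b", "gender": "female", "age": ""},
]
fixed_gender = fill_blank_demographics(clips, "gender")
fixed_age = fill_blank_demographics(clips, "age")
```

Here the blank gender for client `a` would be filled in, while client `b` is left untouched because two different genders are linked to it. The same call works per field, so age and gender can be handled independently.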
Some examples from Turkish, with thanks to @harikalarkutusu!

I'm not sure of the reason though... In my experience, you can do 100 recordings per hour if done right (read silently / record / listen / re-record if necessary). If not done right, it may increase to 150-200 recs/hour...
As the ID is calculated from the session ID, that would mean (e.g. line 4) that someone made 374 recordings (2-4 hours) and then decided to register. This seems a bit odd. There are 26 such anomalies in the Turkish dataset.
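For anyone who wants to count such cases in another language, a rough sketch follows. It assumes rows are dicts with `client_id` and a demographic column such as `gender`; the helper name `clients_with_mixed_values` and the sample rows are hypothetical, not the query used for the Turkish numbers above.

```python
from collections import defaultdict


def clients_with_mixed_values(rows, field):
    """Map client_id -> set of distinct values for `field` (blank
    included), keeping only clients with more than one distinct
    value. (Hypothetical helper.)"""
    seen = defaultdict(set)
    for row in rows:
        seen[row["client_id"]].add(row[field].strip())
    return {cid: vals for cid, vals in seen.items() if len(vals) > 1}


# Made-up sample rows for illustration.
sample = [
    {"client_id": "a", "gender": ""},
    {"client_id": "a", "gender": "male"},
    {"client_id": "b", "gender": "female"},
    {"client_id": "b", "gender": "female"},
    {"client_id": "c", "gender": ""},
]
mixed = clients_with_mixed_values(sample, "gender")
anomaly_count = len(mixed)
```

In this sample only client `a` is reported, since it has both a blank and a non-blank gender.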
OK, I can see how this is possible. During/after the server upgrades many of us got kicked out of the system and had to re-login multiple times a day. I saw some people in our community complain about validating their own sentences, which made me aware of this issue.
If a user starts by registering and logging in with demographic info filled in, is later kicked out, but continues without logging in, this might happen.
> In any case it would be good (and probably safe) to replace blank in the field with the more specific datapoint if and only if there are no other datapoints associated with the client_id.
I think this will be a very logical solution.
Please see: https://discourse.mozilla.org/t/major-loss-in-demographic-data/92123
> In some cases a given client_id might have more than one demographic datapoint
Often, I create accounts on my phone/notebook to allow people to record and validate. The reasons are either that their phone is not supported or that they can't do it themselves; the elderly need a bigger screen to read, so I use a notebook. If the client_id is associated with the device, then you will find one client_id with many demographic datapoints.
The client_id should be associated with the browser session. Using a single session or single account with multiple people is advised against iirc.
It seems the session is not terminated when the tab or the Chrome browser is closed on Android. It's possible that when I create multiple accounts on the same device, they end up with the same client_id.
> is advised against iirc.
I'm not sure what iirc is?