CorporaCreator

Demographic data is sometimes doubled per client ID

Open ftyers opened this issue 4 years ago • 6 comments

In some cases a given client_id might have more than one demographic datapoint (e.g. gender or age) linked to it. Often this is blank vs. male/female or blank vs. some age.

This is probably because people recorded some clips then made a profile, or because they became logged out.
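For anyone who wants to check their own language, here is a minimal sketch of how the affected client IDs could be listed. It assumes each clip is a dict with `client_id`, `age` and `gender` keys (as in the released TSV files); the function name and sample rows are made up for illustration and are not part of CorporaCreator.

```python
from collections import defaultdict


def find_conflicting_clients(rows, fields=("age", "gender")):
    """Return {client_id: {field: values}} for ids with more than one value in a field.

    Hypothetical helper: a blank (missing or empty) entry counts as its own
    value, so "blank vs. male" is reported as a conflict.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for field in fields:
            seen[row["client_id"]][field].add(row.get(field) or "")
    return {
        client_id: {f: vals for f, vals in per_field.items() if len(vals) > 1}
        for client_id, per_field in seen.items()
        if any(len(vals) > 1 for vals in per_field.values())
    }


# Made-up sample rows, not taken from the real dataset.
sample_rows = [
    {"client_id": "a", "age": "", "gender": ""},
    {"client_id": "a", "age": "thirties", "gender": "male"},
    {"client_id": "b", "age": "twenties", "gender": "female"},
    {"client_id": "c", "age": "teens", "gender": ""},
    {"client_id": "c", "age": "fifties", "gender": ""},
]
conflicts = find_conflicting_clients(sample_rows)
```

Here `a` is the common "blank vs. value" case, while `c` has two different non-blank ages and would need a closer look.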

In any case it would be good (and probably safe) to replace blank in the field with the more specific datapoint if and only if there are no other datapoints associated with the client_id.
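A rough sketch of that rule, assuming clips are dicts with `client_id`, `age` and `gender` keys. The function name and sample rows are hypothetical and this is not existing CorporaCreator code: a blank is filled only when the client has exactly one distinct non-blank value for that field.

```python
from collections import defaultdict


def fill_blank_demographics(rows, fields=("age", "gender")):
    """Return copies of rows with blanks filled where the value is unambiguous."""
    # Collect the distinct non-blank values per client and field.
    values = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for field in fields:
            value = row.get(field) or ""
            if value:
                values[row["client_id"]][field].add(value)

    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        for field in fields:
            candidates = values[row["client_id"]][field]
            # Fill only if this entry is blank and there is exactly one candidate.
            if not (row.get(field) or "") and len(candidates) == 1:
                new_row[field] = next(iter(candidates))
        filled.append(new_row)
    return filled


# Made-up sample rows, not taken from the real dataset.
example_rows = [
    {"client_id": "a", "age": "", "gender": ""},
    {"client_id": "a", "age": "thirties", "gender": "male"},
    {"client_id": "c", "age": "", "gender": ""},
    {"client_id": "c", "age": "teens", "gender": ""},
    {"client_id": "c", "age": "fifties", "gender": ""},
]
result = fill_blank_demographics(example_rows)
```

Client `a` gets its blank row filled, while the blank age for client `c` stays blank because two different ages are associated with that ID.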

Some examples from Turkish, with thanks to @harikalarkutusu!

image

ftyers avatar Oct 14 '21 14:10 ftyers

I'm not sure of the reason though... In my experience, you can do 100 recordings per hour if done right (read silently / record / listen / re-record if necessary). If not done right, it may increase to 150-200 recs/hour...

As the id is calculated from session-id, that would mean (e.g. line 4) someone made 374 recordings (2-4 hours) and then decided to register. This seems a bit odd. There are 26 such anomalies in the Turkish dataset.

HarikalarKutusu avatar Oct 14 '21 14:10 HarikalarKutusu

OK, I can see how this is possible. During/after the server upgrades many of us got kicked out of the system and had to log in again multiple times a day. I saw some people in our community complain about validating their own sentences, which made me aware of this issue.

If a user starts by registering and logging in with demographic info filled in, is later kicked out, and then continues without logging in, this might happen.

> In any case it would be good (and probably safe) to replace blank in the field with the more specific datapoint if and only if there are no other datapoints associated with the client_id.

I think this will be a very logical solution.

HarikalarKutusu avatar Jan 25 '22 06:01 HarikalarKutusu

Please see: https://discourse.mozilla.org/t/major-loss-in-demographic-data/92123

HarikalarKutusu avatar Jan 29 '22 03:01 HarikalarKutusu

> In some cases a given client_id might have more than one demographic datapoint

I often create accounts on my phone/notebook to allow people to record and validate, either because their phone is not supported or because they can't do it themselves. Elderly people need a bigger screen to read, so I use a notebook. If the client_id is associated with the device, then you will find one client_id with many demographic datapoints.

danielinux7 avatar Jan 31 '22 00:01 danielinux7

The client_id should be associated with the browser session. Using a single session or single account with multiple people is advised against iirc.

ftyers avatar Jan 31 '22 10:01 ftyers

It seems the session is not terminated when the tab or the Chrome browser is closed on Android. It's possible that when I create multiple accounts on the same device, they end up with the same client_id.

> is advised against iirc.

I'm not sure what iirc is?

danielinux7 avatar Feb 01 '22 17:02 danielinux7