python-mapswipe-workers
python-mapswipe-workers copied to clipboard
discrepancy between project progress
The project progress calculation reveals different results for different parts of the mapswipe code.
During transfer results we use this function: https://github.com/mapswipe/python-mapswipe-workers/blob/master/mapswipe_workers/mapswipe_workers/firebase_to_postgres/update_data.py#L570
In generate stats we use another function: https://github.com/mapswipe/python-mapswipe-workers/blob/master/mapswipe_workers/mapswipe_workers/generate_stats/project_stats_by_date.py#L8
For the project -NFNr55R_LYJvxP7wmte
the transfer results gives us 100% whereas we get 94.74% for the generate stats workflow.
This is the query in transfer results:
select
avg(group_progress)::integer as progress
from
(
-- Get all groups for this project and
-- add progress for groups that have been worked on already.
-- Set progress to 0 if no user has worked on this group.
-- For groups that no users worked on
-- there are no entries in the results table.
select
g.group_id
,g.project_id
,case
when group_progress is null then 0
else group_progress
end as group_progress
from groups g
left join
(
-- Here we get the progress for all groups
-- for which results have been submitted already.
-- Progress for a group can be max 100
-- even if more users than required submitted results.
-- The verification number of a project is used here.
select
ms.group_id
,ms.project_id
,case
when count(distinct user_id) >= p.verification_number then 100
else 100 * count(distinct user_id) / p.verification_number
end as group_progress
from mapping_sessions ms, projects p
where ms.project_id = p.project_id
group by group_id, ms.project_id, p.verification_number
) bar
on bar.group_id = g.group_id and bar.project_id = g.project_id
where g.project_id = '-NFNr55R_LYJvxP7wmte'
) foo
group by project_id
It seems that the code in generate stats is not giving the correct results.
query:
select
date_trunc('day', start_time) as day
,sum(items_count)
from mapping_sessions ms
where project_id = '-NFNr55R_LYJvxP7wmte'
group by day
order by day
result:
2022-11-18 00:00:00.000 13008
2022-11-19 00:00:00.000 65850
2022-11-20 00:00:00.000 45312
2022-11-21 00:00:00.000 47940
2022-11-22 00:00:00.000 35064
2022-11-23 00:00:00.000 53868
2022-11-24 00:00:00.000 46140
2022-11-25 00:00:00.000 42840
2022-11-26 00:00:00.000 35286
2022-11-27 00:00:00.000 33426
2022-11-28 00:00:00.000 6930
and here the results from the csv file:
day | number_of_results |
---|---|
2022-11-18 | 13008 |
2022-11-19 | 62844 |
2022-11-20 | 40878 |
2022-11-21 | 46362 |
2022-11-22 | 31872 |
2022-11-23 | 50214 |
2022-11-24 | 45612 |
2022-11-25 | 41664 |
2022-11-26 | 32508 |
2022-11-27 | 31686 |
2022-11-28 | 5298 |
It's not yet clear why there are results missing in the second table.
What is special about the first day? There is no discrepancy there.