[Task]: Store full add-on manifest in the database
Description
We're frequently asked to find add-ons that use a specific manifest property while filtering the results by some other criteria, like only listed versions from add-ons with more than X users.
We currently can only do that for permissions (through WebextPermission model/webext_permissions table), but it would be useful for all manifest properties, so let's store the full JSON manifest in a JSONField on another model/table for each Version.
A one-off task & process_addons command to trigger it should be included to backfill data (which will take a while...)
Acceptance Criteria
### Milestones/checkpoints
- [ ] Manifest is available for all versions in a json field in a new model/table
- [ ] Backfill task is working
- [ ] Data is visible through the admin
- [ ] Ticket is filed for SRE to expose the table in redash
Checks
- [X] If the issue is ready to work on, I have removed the "needs:info" label.
- [X] If I have identified that the work is specific to a repository, I have removed "repository:addons-server" or "repository:addons-frontend"
┆Issue is synchronized with this Jira Task
I wonder if that would be performant enough, crawling through millions of json blobs. Could we denormalize the overall blob into specific table columns for each top level json property, perhaps?
The performance is obviously going to be worse than actual (possibly indexed) columns but I think denormalizing here would be premature. Especially as this is meant for adhoc searches for which taking a bit of extra time is acceptable. Also, as mentioned in the description we already have the webext_permissions available for more specific searches as well.
Couple recent examples of where it could have been useful:
- Figuring out how many add-ons have an update url
- Figuring out how many themes have images set
SRE ticket is https://mozilla-hub.atlassian.net/browse/SVCSE-2158
I'll trigger the task on dev & stage tomorrow.
Example using stage data:
>>> FileManifest.objects.count()
809622
>>> FileManifest.objects.filter(
Q(manifest_data__browser_specific_settings__gecko__update_url__isnull=False) |
Q(manifest_data__apps__gecko__update_url__isnull=False)).count()
30
>>> str(FileManifest.objects.only('file_id').filter(
Q(manifest_data__browser_specific_settings__gecko__update_url__isnull=False) |
Q(manifest_data__apps__gecko__update_url__isnull=False)).query)
'SELECT `files_filemanifest`.`file_id` FROM `files_filemanifest`
WHERE (
JSON_EXTRACT(`files_filemanifest`.`manifest_data`, $."browser_specific_settings"."gecko"."update_url") IS NOT NULL
OR JSON_EXTRACT(`files_filemanifest`.`manifest_data`, $."apps"."gecko"."update_url") IS NOT NULL
)'
JSON_CONTAINS_PATH variant (either are super fast on stage):
>>> FileManifest.objects.filter(
Q(manifest_data__browser_specific_settings__gecko__has_key='update_url') |
Q(manifest_data__apps__gecko__has_key='update_url')).count()
30
>>> str(FileManifest.objects.only('file_id').filter(
Q(manifest_data__browser_specific_settings__gecko__has_key='update_url') |
Q(manifest_data__apps__gecko__has_key='update_url')).query)
'SELECT `files_filemanifest`.`file_id` FROM `files_filemanifest`
WHERE (
JSON_CONTAINS_PATH(`files_filemanifest`.`manifest_data`, \'one\', $."browser_specific_settings"."gecko"."update_url")
OR JSON_CONTAINS_PATH(`files_filemanifest`.`manifest_data`, \'one\', $."apps"."gecko"."update_url")
)'