Data normalization
| Start Date | Project Lead | Actual Ship Date |
|---|---|---|
| 2023-09-01 | @krysal | TBD |
Description
This project aims to save the cleaned data produced by the Data Refresh process and remove those cleaning steps from the process itself to save time.
Documents
- #3848
Milestone / Issues
- Data normalization
- #3415
Prior Art
- https://github.com/WordPress/openverse/pull/904
- https://github.com/WordPress/openverse/pull/2557
- https://github.com/cc-archive/cccatalog/pull/517
Future work - Phase Two
- #244
Prerequisites
- #416
- #412
@obulat, when you are able, could you add some of the context / additional info you've mentioned previously for this project?
The implementation plan is up for discussion at #3848. Writing it helped me clarify where we were starting from and define a scope for the project, while indicating what could be done in a second phase, as suggested in the initial post. I hope others find it helpful too.
After its approval, the milestone should be complemented with some issues:
- [x] Modify Ingestion Server to upload TSV files to AWS S3 and save fixed tags
- [ ] Check the Ingestion Server's cleanup step times after running the batched update from files DAG.
Since the last update, the IP has been approved, and work has started on fixing duplicated tags. This has been somewhat delayed due to differences over the proposed solution, but once the catalog modification is resolved (#3926), we can delete the current duplicates in the upstream DB (#1566) and continue with the rest of the milestone (#23).
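For context, here is a rough sketch of the kind of deduplication involved, assuming a hypothetical image table whose tags column is a jsonb array; the real fix needs extra normalization beyond removing exact duplicates, so this is illustrative only.

```sql
-- Sketch only: hypothetical table and column names.
UPDATE image
SET tags = deduped.tags
FROM (
    -- Re-aggregate each row's tags, keeping a single copy of identical tag objects.
    SELECT identifier, jsonb_agg(DISTINCT tag) AS tags
    FROM image, jsonb_array_elements(tags) AS tag
    GROUP BY identifier
) AS deduped
WHERE image.identifier = deduped.identifier;
```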
Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
Done
- https://github.com/WordPress/openverse/issues/1566
- https://github.com/WordPress/openverse/issues/3885 – The code part is done; the DAG was triggered and is currently running.
In progress
- https://github.com/WordPress/openverse/pull/4163
Added
- https://github.com/WordPress/openverse/issues/4199 – @AetherUnbound recently told me about this. The tags will need extra processing before we can declare a definitive win over duplicates.
Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
Done
- https://github.com/WordPress/openverse/pull/4163
- https://github.com/WordPress/openverse-infrastructure/pull/883
In progress
The previously merged PRs should solve https://github.com/WordPress/openverse/issues/3912. I'm waiting for a run of the image data refresh to confirm that the files are saved and available; that run is currently blocked on https://github.com/WordPress/openverse/pull/4315, which should be resolved between today and tomorrow. I'm hoping the process resumes soon so we can have the files this week.
To do
In the meantime, I can work on the next step:
- https://github.com/WordPress/openverse/issues/3415
An image data refresh in production couldn't finish with the changes from #4163, so we added more logging (#4358), rolled back the production ingestion server, and decided to perform the cleanup process in the dev environment. An attempt with a data refresh limit resurrected an old problem (#736, #4381), which we already have a fix for: #4382. After merging #4382 on Monday, we must deploy the dev ingestion server and trigger the image data refresh to continue debugging.
The add_license_url DAG also ran into timeout issues and was refactored. The PR is pending review:
- #4198
Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
Done
- The project was restructured around how the updates are expected to be executed (mainly using the batched_update DAG or a similar process), and the scope was reduced since we no longer need to remove tags (prompted by the discussion in #4417). A rough sketch of this batching pattern follows below.
- The add_license_url DAG ran successfully twice with the latest update (#4370), although, strangely, a group of rows keeps reverting and losing the value. #4318 was created to track this problem.
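For anyone unfamiliar with the approach, this is a minimal sketch of the id-ranged batching pattern the updates rely on. It is not the DAG's actual code; the table, column, and example value are hypothetical, and the real batched_update DAG parameterizes all of them.

```sql
-- Sketch only: apply a fix in small id-ranged batches so no single
-- transaction locks the whole table for long.
DO $$
DECLARE
    batch_size bigint := 10000;
    start_id   bigint := 0;
    max_id     bigint;
BEGIN
    SELECT coalesce(max(id), 0) INTO max_id FROM image;
    WHILE start_id <= max_id LOOP
        UPDATE image
        SET license_url = 'https://creativecommons.org/licenses/by/4.0/'
        WHERE id > start_id
          AND id <= start_id + batch_size
          AND license_url IS NULL;
        COMMIT;  -- transaction control in DO blocks needs PostgreSQL 11+ and a top-level call
        start_id := start_id + batch_size;
    END LOOP;
END
$$;
```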
In progress
- We'll still use the TSV files with fixed URLs from the Ingestion Server, so the PR to upload them to S3 is up and ready for review: #4471
- @sarayourfriend created the configuration necessary to fix #4199 in a one-off DAG that, with the final touch in #4473, should be ready to run
- #4452 is being discussed and also worked on
To do
- I'm working on #3415. I did some manual tests and managed to load a table from the S3 files directly in staging, finding the extension for importing data into an RDS for PostgreSQL extremely useful. It can save us many headaches with networking, local file management, and potential disk space issues. The second part of the task is actually performing the updates; I'm looking into whether batch updates can be parallelized (as done on the ingestion server).
> I did some manual tests and managed to load a table from the S3 files directly in staging, finding the extension for importing data into an RDS for PostgreSQL extremely useful.
For what it's worth @krysal, you can definitely test that locally, rather than needing to use a live environment. We use the extension already for iNaturalist, so there are examples in the codebase of how to do it (including with support for local files for testing and development). Check this one out, for example: https://github.com/WordPress/openverse/blob/697f62f01a32cb7fcf2f4a7627650a113cba40da/catalog/dags/providers/provider_csv_load_scripts/inaturalist/observations.sql#L28
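In case it helps anyone following along, here is a minimal sketch of what that import call looks like; the table, bucket, and key names are made up, and the linked observations.sql shows the real usage.

```sql
-- Sketch only: hypothetical table, bucket, and key names.
-- The aws_s3 extension pulls in aws_commons via CASCADE.
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;

SELECT aws_s3.table_import_from_s3(
    'temp_cleaned_tags',           -- target table (must already exist)
    '',                            -- empty column list = import all columns
    '(FORMAT text)',               -- COPY options; TSV files are tab-delimited text
    aws_commons.create_s3_uri(
        'openverse-cleaned-data',  -- bucket
        'image/tags.tsv',          -- key
        'us-east-1'                -- region
    )
);
```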
@sarayourfriend I did not think of iNaturalist as a reference here, and the relationship had not been mentioned until now. That's good to know! I thought of testing in the staging DB first because, from the documentation, I understood the extension is specifically for Amazon RDS for PostgreSQL instances, so it's excellent to know it works for a local Postgres instance too. Thank you!
Done
- #4495. Unanticipated. Required to simulate working with S3 locally in the catalog.
- #3912
- Additionally required: #4471
- The changes were deployed live today, so next week, we should start getting freshly cleaned values directly into S3.
- #1566
In progress
- #4452
- Partially solved by #4475
- #3415. I couldn't work much on this while resolving other issues, but now (hopefully) I'll be able to focus on it.
- https://github.com/WordPress/openverse/pull/4475
- It wasn't possible to run the DAG, so the approach for #4452 must be changed.
- https://github.com/WordPress/openverse/pull/4610
- Created and awaiting review.
This week, maintainers were off from Openverse work, so the tasks will resume next week.
The catalog_cleaner DAG ran successfully for the planned fields, which unblocks #1411 and #700 for next week, after the data refresh, provided the process doesn't produce more files with changes :)
Besides that, what remains is #4452.
Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
Done
- #700 The Ingestion Server was deployed on Wednesday, so next week we will check the time gained with another data refresh run.
- #1663
To Do
- [ ] #4452 This issue has become more complex than initially planned. It was blocked by #4732; now that that is resolved, it can be resumed, and @sarayourfriend has expressed interest in continuing to work on it.
- [x] Verify the next image data refresh runs successfully.
Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
@WordPress/openverse-maintainers last week @krysal and I discussed the idea of sunsetting this project, with #4452 extracted out as a standalone issue to be worked on later this year.
In hindsight, this project was defined with two goals that were a bit less clear than we initially thought:
- The catalog database (upstream) contains the cleaned data outputs of the current Ingestion Server's cleaning steps
- The image Data Refresh process is simplified by significantly reducing cleaning times.
The first goal, in particular, is very open to interpretation and changes over time. Our data will never be perfect; does that mean we need to incorporate every new cleanup action into the scope of this work? That seems untenable.
The goal to remove the cleanup step from the data refresh has been met; I propose we close this project and move on.
If anyone objects: please share. Otherwise, I'll ask @krysal to move the project to success and close this issue next week.
I agree. The first goal actually is clear (in my reading), in that it specifies the "outputs of the current Ingestion Server's cleaning steps". I think, rather, we've let the scope get away from that boundary of the ingestion server cleaning steps, into a total "data cleaning" project.
Definitely okay closing this out based on that - we can prioritize the rest of the data cleaning issues that come up alongside other work!
This project has been closed and moved to success.