Data normalization
| Start Date | Project Lead | Actual Ship Date |
|---|---|---|
| 2023-09-01 | @krysal | TBD |
Description
This project aims to save the cleaned data produced by the Data Refresh process and remove those cleaning steps from the process itself to save time.
Documents
- #3848
Milestone / Issues
- Data normalization
- #3415
Prior Art
- https://github.com/WordPress/openverse/pull/904
- https://github.com/WordPress/openverse/pull/2557
- https://github.com/cc-archive/cccatalog/pull/517
Future work - Phase Two
- #244
Prerequisites
- #416
- #412
@obulat, when you are able, could you add some of the context / additional info you've mentioned previously for this project?
The implementation plan is up for discussion at #3848. Writing it helped me clarify where we were starting from and define a scope for the project, while indicating what could be done in a second phase, as suggested in the initial post. I hope others find it helpful too.
After its approval, the milestone should be complemented with some issues:
- [x] Modify Ingestion Server to upload TSV files to AWS S3 and save fixed tags
- [ ] Check the Ingestion Server's cleanup step times after running the batched update from files DAG.
Since the last update, the IP has been approved, and work has started on fixing duplicated tags. This has been somewhat delayed due to differences over the proposed solution, but once the catalog modification is resolved (#3926), we can delete the current duplicates in the upstream DB (#1566) and continue with the rest of the milestone (#23).
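For context, here is a rough sketch of the kind of deduplication involved, assuming a hypothetical image table whose tags column is a jsonb array; the real fix needs extra normalization beyond removing exact duplicates, so this is illustrative only.

```sql
-- Sketch only: hypothetical table and column names.
UPDATE image
SET tags = deduped.tags
FROM (
    -- Re-aggregate each row's tags, keeping a single copy of identical tag objects.
    SELECT identifier, jsonb_agg(DISTINCT tag) AS tags
    FROM image, jsonb_array_elements(tags) AS tag
    GROUP BY identifier
) AS deduped
WHERE image.identifier = deduped.identifier;
```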
Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
Done
- https://github.com/WordPress/openverse/issues/1566
- https://github.com/WordPress/openverse/issues/3885 – The code part is done; the DAG was triggered and is currently running.
In progress
- https://github.com/WordPress/openverse/pull/4163
Added
- https://github.com/WordPress/openverse/issues/4199 – @AetherUnbound recently told me about this. The tags will need extra processing before we can declare a definitive win over duplicates.
Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
Done
- https://github.com/WordPress/openverse/pull/4163
- https://github.com/WordPress/openverse-infrastructure/pull/883
In progress
The previously merged PRs should solve https://github.com/WordPress/openverse/issues/3912. I'm waiting for a run of the image data refresh to confirm that the files are saved and available; that run is currently blocked on https://github.com/WordPress/openverse/pull/4315, which should be resolved between today and tomorrow. I'm hoping the process resumes soon so we can have the files this week.
To do
In the meantime, I can work on the next step:
- https://github.com/WordPress/openverse/issues/3415
An image data refresh in production couldn't finish with the changes from #4163, so we added more logging (#4358), rolled back the production ingestion server, and decided to perform the cleanup process in the dev environment. An attempt with a data refresh limit resurrected an old problem (#736, #4381), which we already have a fix for: #4382. After merging #4382 on Monday, we must deploy the dev ingestion server and trigger the image data refresh to continue debugging.
The add_license_url DAG also ran into timeout issues and was refactored. The PR is pending review:
- #4198
Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
Done
- The project was restructured around how the updates are expected to be executed (mainly using the batched_update DAG or a similar process), and the scope was reduced since we no longer need to remove tags (prompted by the discussion in #4417). A rough sketch of this batching pattern follows below.
- The add_license_url DAG ran successfully twice with the latest update (#4370), although, strangely, a group of rows keeps reverting and losing the value. #4318 was created to track this problem.
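For anyone unfamiliar with the approach, this is a minimal sketch of the id-ranged batching pattern the updates rely on. It is not the DAG's actual code; the table, column, and example value are hypothetical, and the real batched_update DAG parameterizes all of them.

```sql
-- Sketch only: apply a fix in small id-ranged batches so no single
-- transaction locks the whole table for long.
DO $$
DECLARE
    batch_size bigint := 10000;
    start_id   bigint := 0;
    max_id     bigint;
BEGIN
    SELECT coalesce(max(id), 0) INTO max_id FROM image;
    WHILE start_id <= max_id LOOP
        UPDATE image
        SET license_url = 'https://creativecommons.org/licenses/by/4.0/'
        WHERE id > start_id
          AND id <= start_id + batch_size
          AND license_url IS NULL;
        COMMIT;  -- transaction control in DO blocks needs PostgreSQL 11+ and a top-level call
        start_id := start_id + batch_size;
    END LOOP;
END
$$;
```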
In progress
- We'll still use the TSV files with fixed URLs from the Ingestion Server, so the PR to upload them to S3 is up and ready for review: #4471
- @sarayourfriend created the configuration necessary to fix #4199 in a one-off DAG that, with the final touch in #4473, should be ready to run
- #4452 is being discussed and also worked on
To do
- I'm working on #3415. I did some manual tests and managed to load a table from the S3 files directly in staging, finding the extension for importing data into an RDS for PostgreSQL extremely useful. It can save us many headaches with networking, local file management, and potential disk space issues. The second part of the task is actually performing the updates; I'm looking into whether batch updates can be parallelized (as done on the ingestion server).
> I did some manual tests and managed to load a table from the S3 files directly in staging, finding the extension for importing data into an RDS for PostgreSQL extremely useful.
For what it's worth @krysal, you can definitely test that locally, rather than needing to use a live environment. We use the extension already for iNaturalist, so there are examples in the codebase of how to do it (including with support for local files for testing and development). Check this one out, for example: https://github.com/WordPress/openverse/blob/697f62f01a32cb7fcf2f4a7627650a113cba40da/catalog/dags/providers/provider_csv_load_scripts/inaturalist/observations.sql#L28
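In case it helps anyone following along, here is a minimal sketch of what that import call looks like; the table, bucket, and key names are made up, and the linked observations.sql shows the real usage.

```sql
-- Sketch only: hypothetical table, bucket, and key names.
-- The aws_s3 extension pulls in aws_commons via CASCADE.
CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;

SELECT aws_s3.table_import_from_s3(
    'temp_cleaned_tags',           -- target table (must already exist)
    '',                            -- empty column list = import all columns
    '(FORMAT text)',               -- COPY options; TSV files are tab-delimited text
    aws_commons.create_s3_uri(
        'openverse-cleaned-data',  -- bucket
        'image/tags.tsv',          -- key
        'us-east-1'                -- region
    )
);
```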
@sarayourfriend I did not think of iNaturalist as a reference here, and the relationship had not been mentioned until now. That's good to know! I thought of testing in the staging DB first because, from the documentation, I understood the extension is specifically for Amazon RDS for PostgreSQL instances, so it's excellent to know it works for a local Postgres instance too. Thank you!
Done
- #4495. Unanticipated. Required to simulate working with S3 locally in the catalog.
- #3912
- Additionally required: #4471
- The changes were deployed live today, so next week, we should start getting freshly cleaned values directly into S3.
- #1566
In progress
- #4452
- Partially solved by #4475
- #3415. I couldn't work much on this while resolving other issues, but now (hopefully) I'll be able to focus on it.
- https://github.com/WordPress/openverse/pull/4475
- It wasn't possible to run the DAG, so the approach for #4452 must be changed.
- https://github.com/WordPress/openverse/pull/4610
- Created and awaiting review.
This week, maintainers were off from Openverse work, so the tasks will resume next week.
The catalog_cleaner DAG ran successfully for the planned fields, which unblocks #1411 and #700 for next week, after the data refresh, provided the process doesn't produce more files with changes :)
Besides that, what remains is #4452.
Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
Done
- #700 The Ingestion Server was deployed on Wednesday, so next week we will check the time gained with another data refresh run.
- #1663
To Do
- [ ] #4452 This issue has become more complex than initially planned. It was blocked by #4732; now that that is resolved, it can be resumed, and @sarayourfriend has expressed interest in continuing to work on it.
- [x] Verify the next image data refresh runs successfully.
Hi @krysal, this project has not received an update comment in 14 days. Please leave an update comment as soon as you can. See the documentation on project updates for more information.
@WordPress/openverse-maintainers last week @krysal and I discussed the idea of sunsetting this project, with #4452 extracted out as a standalone issue to be worked on later this year.
In hindsight, this project was defined with two goals that were a bit less clear than we initially thought:
- The catalog database (upstream) contains the cleaned data outputs of the current Ingestion Server's cleaning steps
- The image Data Refresh process is simplified by significantly reducing cleaning times.
The first goal, in particular, is very open to interpretation and changes over time. Our data will never be perfect; does that mean we need to incorporate every new cleanup action into the scope of this work? That seems untenable.
The goal to remove the cleanup step from the data refresh has been met; I propose we close this project and move on.
If anyone objects: please share. Otherwise, I'll ask @krysal to move the project to success and close this issue next week.
I agree. The first goal actually is clear (in my reading), in that it specifies the "outputs of the current Ingestion Server's cleaning steps". I think, rather, we've let the scope get away from that boundary of the ingestion server cleaning steps, into a total "data cleaning" project.
Definitely okay closing this out based on that - we can prioritize the rest of the data cleaning issues that come up alongside other work!
This project has been closed and moved to success.