backend icon indicating copy to clipboard operation
backend copied to clipboard

Implement workflow to merge feeds into each other

Open pypt opened this issue 4 years ago • 1 comments

So, now that we came up with lists of media sources / feeds to be merged into each other (#799), let's try doing the actual merging.

Given that:

  • We need to merge some individual feeds into each other, and also merge some media sources into each other (the process of which includes merging feeds), as I've mentioned in https://github.com/mediacloud/backend/issues/799#issuecomment-925873281;
  • Temporal workflows is a bit new to our stack and neither of us have extensive hands-on knowledge with it;
  • General avoidance of scope creep;
  • The fact that merging feeds into each other is useful to us alone;

I'd suggest that we implement the feed merging first, run it with inputs from feed_actions, see what happens, correct course, and then move on to merging the media sources as a separate task.

As per https://github.com/mediacloud/backend/issues/799#issuecomment-948678606, the final database of what gets merged into what is:

https://drive.google.com/file/d/1sfQLMwq5OkooDtg3ZjYOTOyNEIzMv2HZ/view?usp=sharing

and for this task we'll need just the feed_actions table.

Outline

The workflow to merge one feed (<src_feeds_id>) into another (<dst_feeds_id>) will look as follows (adapted from https://github.com/mediacloud/backend/issues/799#issuecomment-925873281):

  1. Move all rows that reference the feeds table with <src_feeds_id>:
    1. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the downloads table
    2. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_stories_map table, taking into account that there could be duplicates
    3. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the scraped_feeds table, taking into account that there could be duplicates
    4. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_from_yesterday table, taking into account that there could be duplicates
    5. Set feeds_id = <dst_feeds_id> on rows with feeds_id = <src_feeds_id> in the feeds_tags_map table, taking into account that there could be duplicates
  2. Remove the row with feeds_id = <src_feeds_id> from the feeds table:
    1. Remove rows with feeds_id = <src_feeds_id> from the downloads table - there shouldn't be any left as we've just merged them
    2. Remove rows with feeds_id = <src_feeds_id> from the feeds_stories_map table - there shouldn't be any left as we've just merged them
    3. <...>
    4. Remove rows with feeds_id = <src_feeds_id> from the feeds_tags_map table - there shouldn't be any left as we've just merged them
    5. Remove the actual row from feeds.

Referencing tables

Merging feeds is a bit easier than merging media sources because the feed information don't end up on Solr (so we don't have to update its index in any way), there aren't that many tables that reference rows in feeds (open up https://github.com/mediacloud/backend/blob/f0c523e7c10ba29f11411e6b105e65d6b17dd036/apps/postgresql-server/pgmigrate/migrations/V0001__initial_schema.sql and Command+F for feeds_id).

Here's how it looks like on production (feel free to SSH in and look around yourself):

$ ssh woodward

woodward$ docker exec -it $(docker ps | grep postgresql-server | cut -d ' ' -f1) psql

psql# \d+ feeds

                                                                       Table "public.feeds"
            Column             |           Type           | Collation | Nullable |                 Default                 | Storage  | Stats target | Description 
-------------------------------+--------------------------+-----------+----------+-----------------------------------------+----------+--------------+-------------
 feeds_id                      | integer                  |           | not null | nextval('feeds_feeds_id_seq'::regclass) | plain    |              | 
 media_id                      | integer                  |           | not null |                                         | plain    |              | 
 name                          | character varying(512)   |           | not null |                                         | extended |              | 
 url                           | character varying(1024)  |           | not null |                                         | extended |              | 
 last_attempted_download_time  | timestamp with time zone |           |          |                                         | plain    |              | 
 type                          | feed_type                |           | not null | 'syndicated'::feed_type                 | plain    |              | 
 last_new_story_time           | timestamp with time zone |           |          |                                         | plain    |              | 
 last_checksum                 | text                     |           |          |                                         | extended |              | 
 last_successful_download_time | timestamp with time zone |           |          |                                         | plain    |              | 
 active                        | boolean                  |           | not null | true                                    | plain    |              | 
Indexes:
    "feeds_pkey" PRIMARY KEY, btree (feeds_id)
    "feeds_last_attempted_download_time" btree (last_attempted_download_time)
    "feeds_last_successful_download_time" btree (last_successful_download_time)
    "feeds_media" btree (media_id)
    "feeds_name" btree (name)
Foreign-key constraints:
    "feeds_media_id_fkey" FOREIGN KEY (media_id) REFERENCES media(media_id) ON DELETE CASCADE
Referenced by:
    TABLE "downloads" CONSTRAINT "downloads_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id)
    TABLE "feeds_stories_map_p_00" CONSTRAINT "feeds_stories_map_p_00_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_01" CONSTRAINT "feeds_stories_map_p_01_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_02" CONSTRAINT "feeds_stories_map_p_02_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_03" CONSTRAINT "feeds_stories_map_p_03_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_04" CONSTRAINT "feeds_stories_map_p_04_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_05" CONSTRAINT "feeds_stories_map_p_05_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_06" CONSTRAINT "feeds_stories_map_p_06_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_07" CONSTRAINT "feeds_stories_map_p_07_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_08" CONSTRAINT "feeds_stories_map_p_08_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_09" CONSTRAINT "feeds_stories_map_p_09_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_10" CONSTRAINT "feeds_stories_map_p_10_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_11" CONSTRAINT "feeds_stories_map_p_11_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_12" CONSTRAINT "feeds_stories_map_p_12_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_13" CONSTRAINT "feeds_stories_map_p_13_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_14" CONSTRAINT "feeds_stories_map_p_14_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_15" CONSTRAINT "feeds_stories_map_p_15_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_16" CONSTRAINT "feeds_stories_map_p_16_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_17" CONSTRAINT "feeds_stories_map_p_17_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_18" CONSTRAINT "feeds_stories_map_p_18_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_19" CONSTRAINT "feeds_stories_map_p_19_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_20" CONSTRAINT "feeds_stories_map_p_20_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_stories_map_p_21" CONSTRAINT "feeds_stories_map_p_21_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) MATCH FULL ON DELETE CASCADE
    TABLE "feeds_tags_map" CONSTRAINT "feeds_tags_map_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) ON DELETE CASCADE
    TABLE "scraped_feeds" CONSTRAINT "scraped_feeds_feeds_id_fkey" FOREIGN KEY (feeds_id) REFERENCES feeds(feeds_id) ON DELETE CASCADE
Access method: heap

So, a bunch of non-partitioned and partitioned tables reference feeds.feeds_id. One table that's missing from this list is feeds_from_yesterday - I remember that the lack of reference is deliberate, I just don't remember why :)

Here are the sizes of all of these tables:

psql# \d+

<...>

 public | downloads                                                       | partitioned table | mediacloud | permanent   | 0 bytes    | 
 public | downloads_error                                                 | table             | mediacloud | permanent   | 34 GB      | 
 public | downloads_feed_error                                            | table             | mediacloud | permanent   | 5828 MB    | 
 public | downloads_fetching                                              | table             | mediacloud | permanent   | 11 MB      | 
 public | downloads_in_past_day                                           | view              | mediacloud | permanent   | 0 bytes    | 
 public | downloads_media                                                 | view              | mediacloud | permanent   | 0 bytes    | 
 public | downloads_non_media                                             | view              | mediacloud | permanent   | 0 bytes    | 
 public | downloads_pending                                               | table             | mediacloud | permanent   | 11 MB      | 
 public | downloads_success                                               | partitioned table | mediacloud | permanent   | 0 bytes    | 
 public | downloads_success_content                                       | partitioned table | mediacloud | permanent   | 0 bytes    | 
 public | downloads_success_content_00                                    | table             | mediacloud | permanent   | 5322 MB    | 
 public | downloads_success_content_01                                    | table             | mediacloud | permanent   | 7182 MB    | 
 public | downloads_success_content_02                                    | table             | mediacloud | permanent   | 11 GB      | 
 public | downloads_success_content_03                                    | table             | mediacloud | permanent   | 12 GB      | 
 public | downloads_success_content_04                                    | table             | mediacloud | permanent   | 10 GB      | 
 public | downloads_success_content_05                                    | table             | mediacloud | permanent   | 10147 MB   | 
 public | downloads_success_content_06                                    | table             | mediacloud | permanent   | 12 GB      | 
 public | downloads_success_content_07                                    | table             | mediacloud | permanent   | 13 GB      | 
 public | downloads_success_content_08                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_09                                    | table             | mediacloud | permanent   | 13 GB      | 
 public | downloads_success_content_10                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_11                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_12                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_13                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_14                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_15                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_16                                    | table             | mediacloud | permanent   | 15 GB      | 
 public | downloads_success_content_17                                    | table             | mediacloud | permanent   | 16 GB      | 
 public | downloads_success_content_18                                    | table             | mediacloud | permanent   | 16 GB      | 
 public | downloads_success_content_19                                    | table             | mediacloud | permanent   | 15 GB      | 
 public | downloads_success_content_20                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_21                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_22                                    | table             | mediacloud | permanent   | 15 GB      | 
 public | downloads_success_content_23                                    | table             | mediacloud | permanent   | 15 GB      | 
 public | downloads_success_content_24                                    | table             | mediacloud | permanent   | 15 GB      | 
 public | downloads_success_content_25                                    | table             | mediacloud | permanent   | 14 GB      | 
 public | downloads_success_content_26                                    | table             | mediacloud | permanent   | 15 GB      | 
 public | downloads_success_content_27                                    | table             | mediacloud | permanent   | 16 GB      | 
 public | downloads_success_content_28                                    | table             | mediacloud | permanent   | 18 GB      | 
 public | downloads_success_content_29                                    | table             | mediacloud | permanent   | 19 GB      | 
 public | downloads_success_content_30                                    | table             | mediacloud | permanent   | 15 GB      | 
 public | downloads_success_content_31                                    | table             | mediacloud | permanent   | 15 GB      | 
 public | downloads_success_content_32                                    | table             | mediacloud | permanent   | 9981 MB    | 
 public | downloads_success_content_33                                    | table             | mediacloud | permanent   | 8192 bytes | 
 public | downloads_success_feed                                          | partitioned table | mediacloud | permanent   | 0 bytes    | 
 public | downloads_success_feed_00                                       | table             | mediacloud | permanent   | 8366 MB    | 
 public | downloads_success_feed_01                                       | table             | mediacloud | permanent   | 5427 MB    | 
 public | downloads_success_feed_02                                       | table             | mediacloud | permanent   | 7832 MB    | 
 public | downloads_success_feed_03                                       | table             | mediacloud | permanent   | 5881 MB    | 
 public | downloads_success_feed_04                                       | table             | mediacloud | permanent   | 6155 MB    | 
 public | downloads_success_feed_05                                       | table             | mediacloud | permanent   | 6575 MB    | 
 public | downloads_success_feed_06                                       | table             | mediacloud | permanent   | 5742 MB    | 
 public | downloads_success_feed_07                                       | table             | mediacloud | permanent   | 5534 MB    | 
 public | downloads_success_feed_08                                       | table             | mediacloud | permanent   | 4877 MB    | 
 public | downloads_success_feed_09                                       | table             | mediacloud | permanent   | 5132 MB    | 
 public | downloads_success_feed_10                                       | table             | mediacloud | permanent   | 5217 MB    | 
 public | downloads_success_feed_11                                       | table             | mediacloud | permanent   | 5072 MB    | 
 public | downloads_success_feed_12                                       | table             | mediacloud | permanent   | 5091 MB    | 
 public | downloads_success_feed_13                                       | table             | mediacloud | permanent   | 5257 MB    | 
 public | downloads_success_feed_14                                       | table             | mediacloud | permanent   | 5209 MB    | 
 public | downloads_success_feed_15                                       | table             | mediacloud | permanent   | 5229 MB    | 
 public | downloads_success_feed_16                                       | table             | mediacloud | permanent   | 3821 MB    | 
 public | downloads_success_feed_17                                       | table             | mediacloud | permanent   | 3078 MB    | 
 public | downloads_success_feed_18                                       | table             | mediacloud | permanent   | 3128 MB    | 
 public | downloads_success_feed_19                                       | table             | mediacloud | permanent   | 4509 MB    | 
 public | downloads_success_feed_20                                       | table             | mediacloud | permanent   | 5425 MB    | 
 public | downloads_success_feed_21                                       | table             | mediacloud | permanent   | 5463 MB    | 
 public | downloads_success_feed_22                                       | table             | mediacloud | permanent   | 5793 MB    | 
 public | downloads_success_feed_23                                       | table             | mediacloud | permanent   | 5041 MB    | 
 public | downloads_success_feed_24                                       | table             | mediacloud | permanent   | 5319 MB    | 
 public | downloads_success_feed_25                                       | table             | mediacloud | permanent   | 5468 MB    | 
 public | downloads_success_feed_26                                       | table             | mediacloud | permanent   | 5396 MB    | 
 public | downloads_success_feed_27                                       | table             | mediacloud | permanent   | 5224 MB    | 
 public | downloads_success_feed_28                                       | table             | mediacloud | permanent   | 5124 MB    | 
 public | downloads_success_feed_29                                       | table             | mediacloud | permanent   | 4987 MB    | 
 public | downloads_success_feed_30                                       | table             | mediacloud | permanent   | 5155 MB    | 
 public | downloads_success_feed_31                                       | table             | mediacloud | permanent   | 5228 MB    | 
 public | downloads_success_feed_32                                       | table             | mediacloud | permanent   | 3393 MB    | 
 public | downloads_success_feed_33                                       | table             | mediacloud | permanent   | 8192 bytes | 

downloads is partitioned by state column and then further by type, so, for example, a download with state = 'success' and type = 'feed' would end up in one of the downloads_success_feed tables.

For your purposes I think you can pretty much ignore the fact that it's partitioned and just UPDATE the base downloads table directly.

Another partitioned table is feeds_stories_map:

 public | feeds_stories_map                                               | view              | mediacloud | permanent   | 0 bytes    | 
 public | feeds_stories_map_p                                             | table             | mediacloud | permanent   | 0 bytes    | 
 public | feeds_stories_map_p_00                                          | table             | mediacloud | permanent   | 3148 MB    | 
 public | feeds_stories_map_p_01                                          | table             | mediacloud | permanent   | 4078 MB    | 
 public | feeds_stories_map_p_02                                          | table             | mediacloud | permanent   | 4079 MB    | 
 public | feeds_stories_map_p_03                                          | table             | mediacloud | permanent   | 4206 MB    | 
 public | feeds_stories_map_p_04                                          | table             | mediacloud | permanent   | 4208 MB    | 
 public | feeds_stories_map_p_05                                          | table             | mediacloud | permanent   | 4219 MB    | 
 public | feeds_stories_map_p_06                                          | table             | mediacloud | permanent   | 4223 MB    | 
 public | feeds_stories_map_p_07                                          | table             | mediacloud | permanent   | 4220 MB    | 
 public | feeds_stories_map_p_08                                          | table             | mediacloud | permanent   | 4217 MB    | 
 public | feeds_stories_map_p_09                                          | table             | mediacloud | permanent   | 4221 MB    | 
 public | feeds_stories_map_p_10                                          | table             | mediacloud | permanent   | 4227 MB    | 
 public | feeds_stories_map_p_11                                          | table             | mediacloud | permanent   | 4211 MB    | 
 public | feeds_stories_map_p_12                                          | table             | mediacloud | permanent   | 4173 MB    | 
 public | feeds_stories_map_p_13                                          | table             | mediacloud | permanent   | 4424 MB    | 
 public | feeds_stories_map_p_14                                          | table             | mediacloud | permanent   | 4293 MB    | 
 public | feeds_stories_map_p_15                                          | table             | mediacloud | permanent   | 4241 MB    | 
 public | feeds_stories_map_p_16                                          | table             | mediacloud | permanent   | 4251 MB    | 
 public | feeds_stories_map_p_17                                          | table             | mediacloud | permanent   | 4210 MB    | 
 public | feeds_stories_map_p_18                                          | table             | mediacloud | permanent   | 4215 MB    | 
 public | feeds_stories_map_p_19                                          | table             | mediacloud | permanent   | 4282 MB    | 
 public | feeds_stories_map_p_20                                          | table             | mediacloud | permanent   | 3383 MB    | 
 public | feeds_stories_map_p_21                                          | table             | mediacloud | permanent   | 0 bytes    | 

And then there are a few smaller tables:

 public | feeds_tags_map                                                  | table             | mediacloud | permanent   | 696 kB     | 

<...>

 public | scraped_feeds                                                   | table             | mediacloud | permanent   | 10072 kB   | 

<...>

 public | feeds_from_yesterday                                            | table             | mediacloud | permanent   | 36 MB      | 

Tips, tricks, notes and other things that came to mind

  • Bigger tables might have thousands if not hundreds of thousands of rows that reference feeds.feeds_id so you'll need to chunk your UPDATEs somehow. One way to do this is do get MIN(primary_key) and MAX(primary_key) from every referencing table with a specific feeds_id (make sure that an index exists that would allow you to do this in a timely manner!), and then UPDATE the referencing table in chunks based on primary_key.

  • Given the "do this, if that succeeds then do that, then ..., and make sure that it all works for thousands of inputs, and you better track progress of all of it, oh, and external components to be updated might go down at any point, and also it's unclear whether individual steps to be executed will work with production's amount of data" nature of this task, I think this is a good chance to try out Temporal. You can use my podcast ingest as a reference:

    • Interface: https://github.com/mediacloud/backend/blob/f0c523e7c10ba29f11411e6b105e65d6b17dd036/apps/podcast-transcribe-episode/src/python/podcast_transcribe_episode/workflow_interface.py
    • Implementation: https://github.com/mediacloud/backend/blob/f0c523e7c10ba29f11411e6b105e65d6b17dd036/apps/podcast-transcribe-episode/src/python/podcast_transcribe_episode/workflow.py
    • Test that writes some mock data, runs the workflow and checks what came out on the other end: https://github.com/mediacloud/backend/blob/f0c523e7c10ba29f11411e6b105e65d6b17dd036/apps/podcast-transcribe-episode/tests/python/test_workflow.py#L80-L183
    • Some docs: https://github.com/mediacloud/backend/blob/f0c523e7c10ba29f11411e6b105e65d6b17dd036/doc/workflows.markdown
  • IMHO this is one of these tasks that become easier if you write yourself a good test that confirms that your code is doing what you want it to do exactly. So, make sure to write at least one good test that preloads a testing database with some mock duplicate feeds (to both feeds and other referencing tables), runs the workflow and makes sure that feeds got merged and nothing got lost in the process. The test should also test out UPDATEs in chunks and other border cases that come to mind.

As always, complain loudly and early if something's unclear!

pypt avatar Oct 21 '21 15:10 pypt

As for the workflow-as-in-the-how-do-we-code-this-thing-together-workflow, whenever you come up with at least the pre-barebones / scaffolding of what would eventually end up being the feed merging workflow, could you submit a PR and then we'll iteratively stare at it and work it out together?

Or do you have any other ideas on how we could go about implementing this?

pypt avatar Oct 21 '21 15:10 pypt