
[feature] extract all unique blog websites from articles

vmeylan opened this issue 10 months ago

TODO

  • Create a script at https://github.com/mev-fyi/data/blob/main/src/populate_csv_files/get_article_content/get_websites_from_articles.py that extracts the unique authors' blog links from all the articles in https://github.com/mev-fyi/data/blob/main/data/links/articles_updated.csv (the article column). See the sketch after this list.
  • Create a second script that crawls all posts (URLs) from those websites.
    • Output: all URLs crawled from the authors' blog posts.
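
A minimal sketch of the first script, assuming the CSV exposes the article URLs under an article column (the "article header" mentioned above) and that pandas is available; the special case for Discourse-style /t/ paths is only there to reproduce the ethresear.ch example below:

```python
from urllib.parse import urlparse

import pandas as pd

ARTICLES_CSV = "data/links/articles_updated.csv"  # path from the issue


def blog_root(url: str) -> str:
    """Reduce an article URL to its blog root, e.g. scheme://netloc/.

    Keeps the leading 't/' segment for Discourse-style forums such as
    ethresear.ch (assumption made to match the example below).
    """
    parsed = urlparse(url)
    root = f"{parsed.scheme}://{parsed.netloc}/"
    path = parsed.path.strip("/")
    first_segment = path.split("/")[0] if path else ""
    if first_segment == "t":
        root += "t/"
    return root


def extract_unique_blogs(csv_path: str = ARTICLES_CSV) -> set[str]:
    """Return the set of unique blog roots found in the articles CSV."""
    df = pd.read_csv(csv_path)
    # The column holding article URLs is assumed to be named 'article'.
    return {blog_root(url) for url in df["article"].dropna()}


if __name__ == "__main__":
    for site in sorted(extract_unique_blogs()):
        print(site)
```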

Example:

https://ethresear.ch/t/burning-mev-through-block-proposer-auctions/14029 -> https://ethresear.ch/t/
https://taiko.mirror.xyz/7dfMydX1FqEx9_sOvhRt3V8hJksKSIWjzhCVu7FyMZ -> https://taiko.mirror.xyz/
https://figmentcapital.medium.com/the-proof-supply-chain-be6a6a884eff -> https://figmentcapital.medium.com/
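
Running the blog_root helper sketched above over these three URLs should reproduce that mapping:

```python
examples = [
    "https://ethresear.ch/t/burning-mev-through-block-proposer-auctions/14029",
    "https://taiko.mirror.xyz/7dfMydX1FqEx9_sOvhRt3V8hJksKSIWjzhCVu7FyMZ",
    "https://figmentcapital.medium.com/the-proof-supply-chain-be6a6a884eff",
]
for url in examples:
    print(url, "->", blog_root(url))
# -> https://ethresear.ch/t/
# -> https://taiko.mirror.xyz/
# -> https://figmentcapital.medium.com/
```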

Helper:

The regexp hashmap url_patterns, available in https://github.com/mev-fyi/data/blob/main/src/populate_csv_files/parse_new_data.py, identifies whether a link points directly to an article or to its website (e.g. the author's blog).
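
The patterns below are purely illustrative (the real url_patterns dict in parse_new_data.py will differ), but the lookup could look roughly like this:

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only; see parse_new_data.py for the real ones.
url_patterns = {
    "medium_article": re.compile(r"^https?://[\w-]+\.medium\.com/.+"),
    "mirror_article": re.compile(r"^https?://[\w-]+\.mirror\.xyz/.+"),
    "ethresearch_article": re.compile(r"^https?://ethresear\.ch/t/.+"),
}


def classify(url: str) -> Optional[str]:
    """Return the name of the first matching pattern, or None if nothing matches."""
    for name, pattern in url_patterns.items():
        if pattern.match(url):
            return name
    return None  # likely a website / blog root rather than a single article
```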

Challenge:

There can be several matches, e.g. some Medium authors' blog posts are in the format <author>.medium.com/<article> while others are in the format www.medium.com/<author>.
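
One way to normalize both Medium formats could be a small helper along these lines (a sketch, not the repo's actual handling):

```python
from typing import Optional
from urllib.parse import urlparse


def medium_blog_root(url: str) -> Optional[str]:
    """Normalize both Medium URL styles to the author's blog root.

    <author>.medium.com/<article>     -> https://<author>.medium.com/
    www.medium.com/<author>/<article> -> https://medium.com/<author>/
    """
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.endswith(".medium.com") and host != "www.medium.com":
        # Subdomain style: the author is encoded in the hostname.
        return f"https://{host}/"
    if host in ("medium.com", "www.medium.com"):
        # Path style: the author (or publication) is the first path segment.
        segments = [s for s in parsed.path.split("/") if s]
        if segments:
            return f"https://medium.com/{segments[0]}/"
    return None  # not a Medium URL
```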

End goal:

Get all the unique authors' blog websites. Then crawl each of those websites. Once all unique article URLs are indexed, scrape all articles and add them to the database.
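
A rough sketch of the crawling step, assuming a plain requests + BeautifulSoup pass over each blog's landing page (real blogs will likely need pagination, sitemaps, or JS rendering):

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl_post_urls(root_url: str) -> set[str]:
    """Collect candidate post URLs linked from a blog's landing page (sketch)."""
    response = requests.get(root_url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    root_host = urlparse(root_url).netloc
    urls = set()
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(root_url, anchor["href"])
        # Keep only links on the same host; assume those are candidate posts.
        if urlparse(absolute).netloc == root_host:
            urls.add(absolute.split("#")[0])
    return urls
```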

vmeylan · Mar 29 '24