
build training dataset

Open pghosh opened this issue 7 years ago • 5 comments

We will build training data for classifying actions in two ways.

Scrape websites and create a CSV with:

  1. action text
  2. action tag

Manually screen data from the emails.

The dataset will be saved on data.world and used for tagging actions identified in emails.
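As a rough illustration of the CSV layout described above (one row per action, with an action text column and an action tag column), here is a minimal sketch using only the Python standard library. The column names `action_text` and `action_tag`, the helper `write_actions_csv`, and the sample rows are placeholders made up for this example; they are not real scraped data or an agreed schema.

```python
import csv
import io

# Assumed column names; the issue only says "action text" and "action tag".
FIELDNAMES = ["action_text", "action_tag"]


def write_actions_csv(rows, out_file):
    """Write (text, tag) pairs to out_file as CSV with a header row."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES)
    writer.writeheader()
    for text, tag in rows:
        # Strip stray whitespace that scraped text often carries.
        writer.writerow({"action_text": text.strip(), "action_tag": tag.strip()})


# Made-up placeholder rows, only to show the shape of the data.
sample_rows = [
    ("Join the town hall on health care ", "Town Hall"),
    ("Call your senator about the budget vote", "Phone Call"),
]

buffer = io.StringIO()
write_actions_csv(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to a file-like object keeps the sketch independent of where the file ends up; passing `open("actions.csv", "w", newline="")` instead of the `StringIO` buffer would produce a file that could then be uploaded to data.world.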

Websites to start with: https://resistancenearme.org/ and www.risestronger.org

Business value: This task serves as the first step toward auto-tagging action items. This is data collection. The goal is to create a labeled dataset that can be used to train classifiers to auto-tag actions.

To start with, we should use 'event type' from resistancenearme as the tag. The scraping task should map each event text to exactly one tag.

For email, we need to analyze the text to see if we can find a pattern that makes the text/action fall into a category. The idea is that if we can identify a pattern, we can write scripts to do the tagging. If not, we should spin up a task to manually go through the emails and tag them. Some starting pointers:

* Check the email address; some organizations tend to organize certain kinds of tasks.
* Look at the verbs; the text might actually contain something like "rally".

These are just ideas. Feel free to add what works and what does not work.
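The two pointers above (sender address and telltale verbs) could be combined into a simple rule-based tagger along these lines. This is only a sketch under stated assumptions: the domain `example-rallies.org`, the keyword table, the tag names, and the function `tag_action` are all hypothetical, and real rules would have to come from inspecting the actual emails and the 'event type' values on resistancenearme.

```python
import re

# Hypothetical mapping from sender domain to a tag (assumption, not real data).
SENDER_DOMAIN_TAGS = {
    "example-rallies.org": "rally",
}

# Hypothetical mapping from keywords found in the text to a tag.
KEYWORD_TAGS = {
    "rally": "rally",
    "march": "rally",
    "call": "phone call",
    "phone": "phone call",
    "donate": "donation",
}


def tag_action(sender, text):
    """Return a tag for an email action, or None if no rule matches."""
    # Rule 1: some organizations tend to organize one kind of action.
    domain = sender.rsplit("@", 1)[-1].lower()
    if domain in SENDER_DOMAIN_TAGS:
        return SENDER_DOMAIN_TAGS[domain]

    # Rule 2: look for telltale words, in the order they appear in the text.
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in KEYWORD_TAGS:
            return KEYWORD_TAGS[word]

    # No pattern found: leave the item for manual tagging.
    return None


print(tag_action("info@example-rallies.org", "See you on Saturday"))
print(tag_action("someone@example.com", "Please call your senator today"))
print(tag_action("someone@example.com", "Monthly newsletter"))
```

Returning `None` for unmatched items keeps them separate from the tagged ones, which fits the fallback of tagging the remainder by hand.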

pghosh avatar May 05 '17 17:05 pghosh

I'm going to give this a try. Anyone who wants to help is welcome.

brucerowan avatar Jul 26 '17 04:07 brucerowan

I was able to create a .py scraper using Beautiful Soup to scrape the calls to action on risestronger.org. However, there are only 10 items. I will give resistancenearme.org/ a shot today.

brucerowan avatar Jul 28 '17 17:07 brucerowan

@brucerowan Any update?

crypdick avatar Aug 27 '17 22:08 crypdick

@crypdick Hi, sorry for the delayed response. We've made some good progress. I would check out this repository: https://github.com/brucerowan/indivisible/tree/scrap_websites/ingest/web_scraper

brucerowan avatar Aug 29 '17 22:08 brucerowan

@crypdick If you are good at object-oriented programming, that would help me out a lot. Basically, I need help understanding how to apply the base_scraper class to the resistancenear me.py file. Message me at @bruce_r on Slack.

brucerowan avatar Aug 29 '17 22:08 brucerowan