food-oasis

Explore web scraping as a data collection tool

Open GigiUxR opened this issue 1 year ago • 1 comments

Explore the benefits, feasibility, and effort required to implement an automated web scraping process to collect and parse raw data from relevant websites.

  • [ ] Determine web scraping benefits to FOLA
  • [ ] Determine feasibility of implementing web scraping
  • [ ] Determine whether implementing web scraping is worth the effort for FOLA
  • [ ] Create issue for first steps of implementing web scraping (assign to @ToledoPHD and @stephzh99 -- they have expressed interest in picking this up) or close out this issue if not moving forward

The idea of automated web scraping came to me while I was considering easier options for data validation. If we follow through with web scraping, it would open up possibilities for issue #996 (in which we could advise pantries to include information on their sites based on seeker demand as listed in our directory).

GigiUxR, Oct 02 '22 20:10

@GigiUxR Moving this to In Progress.

You might already be thinking this, but I would recommend discussing this with John D to see what is possible here.

Re: your comment in the overview about advising pantries: this is also something we can do by getting feedback from partner organizations, an idea I can roll into the Partnerships sub project.

staceyrebekahscott, Oct 18 '22 06:10

We did a LOT of work in this area in 2019. This competitive analysis document summarizes what other sites were doing at the time and evaluates them as possible sources of data. The rows at the top look at what details our "competitors" had about each listing, and the rows at the bottom assess each site as a possible source of data.

From a technical perspective, the techniques for importing data are, in order from most reliable and maintainable to least:

  1. Access to an API that allows us to import data in a well-defined structured format (usually XML or JSON),
  2. Access to a downloadable file (preferably a spreadsheet in a grid-like format),
  3. (As a last resort) web scraping their site.

The problems with web scraping are:

  1. The legality is questionable. Many sites have explicit rules about not scraping their data.
  2. The site's HTML must be structured in such a way that it is scrapable. This means that the elements we want to extract have HTML tags (for example, with ids or class names) that definitively identify the pieces of information. If the site uses paging to display long lists of entries on multiple pages, the programming is a bit more involved to page through all the data.
  3. Sites that use React or other dynamically generated content require a somewhat more sophisticated scraping library capable of running a "headless browser" which can execute the page's JavaScript to navigate the site.
  4. Each site you want to scrape is a separate programming project tailored to the html layout of the site.
  5. If the site design changes, you have to build the scraping component all over again from scratch.
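To make points 2 and 4 concrete, here is a minimal sketch of what scraping one statically rendered page might look like, using only Python's standard library. The markup and the class names (`listing`, `name`, `address`) are invented for illustration and do not come from any real site; a real scraper would have to be written against each site's actual HTML, and would need extra logic for paging.

```python
from html.parser import HTMLParser

# Hypothetical markup: these class names are assumptions, not a real site's HTML.
SAMPLE_PAGE = """
<ul>
  <li class="listing">
    <span class="name">Example Pantry</span>
    <span class="address">123 Main St, Los Angeles, CA</span>
  </li>
  <li class="listing">
    <span class="name">Sample Meal Program</span>
    <span class="address">456 Oak Ave, Pasadena, CA</span>
  </li>
</ul>
"""


class ListingParser(HTMLParser):
    """Collect one record per <li class="listing"> element."""

    FIELDS = {"name", "address"}

    def __init__(self):
        super().__init__()
        self.listings = []    # completed records
        self._current = None  # record currently being built
        self._field = None    # field whose text we are inside

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and "listing" in classes:
            self._current = {}
        elif self._current is not None:
            for cls in classes:
                if cls in self.FIELDS:
                    self._field = cls

    def handle_data(self, data):
        if self._current is not None and self._field:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current is not None:
            self.listings.append(self._current)
            self._current = None


def scrape_listings(html):
    """Return a list of dicts, one per listing found in the HTML."""
    parser = ListingParser()
    parser.feed(html)
    return parser.listings
```

Note how tightly the parser is coupled to the class names: if the site renamed `listing` to something else, `scrape_listings` would silently return an empty list, which is the fragility described in point 5.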

We actually did some importing and scraping of data for Food Oasis. The ones that I remember off the top of my head are:

  1. We imported data from 211.org via their public API
  2. After several attempts, someone (I think it was Jabari Brown) got hold of a spreadsheet from LA Regional Food Bank listing all their affiliates at the time, which we imported.
  3. We scraped data from the LA Public Library site (https://www.lapl.org/homeless-resources-food)

In each case, the result was a table of imported data. Once this is done, the real fun begins: we try to match the imported entries with our existing data to see whether they are listings we already know about, new ones, or ones that no longer exist. The most reliable matching tends to come from normalizing the address to a standard format, running it through a geocoder to get lat/long coordinates, and then running an algorithm to match against our existing locations. This is more reliable than matching on the name of the pantry. Matching by phone number is more definitive but only works some of the time.
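As a rough sketch of that matching step (not the code we actually used), the idea could look like the following. It assumes each record is a dict that has already been geocoded into hypothetical `lat` and `lon` keys, plus an optional `phone`; the abbreviation table and the 50-meter threshold are arbitrary illustrative choices.

```python
import math
import re

# Small illustrative table; a real one would need many more entries.
ABBREVIATIONS = {
    "street": "st", "avenue": "ave", "boulevard": "blvd",
    "north": "n", "south": "s", "east": "e", "west": "w",
}


def normalize_address(address):
    """Lowercase, drop punctuation, and abbreviate common street words."""
    words = re.sub(r"[^\w\s]", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def normalize_phone(phone):
    """Keep digits only, dropping a leading US country code if present."""
    digits = re.sub(r"\D", "", phone or "")
    if len(digits) == 11 and digits.startswith("1"):
        return digits[1:]
    return digits


def distance_meters(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two points, in meters."""
    earth_radius = 6371000
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius * math.asin(math.sqrt(a))


def find_match(imported, existing, max_meters=50):
    """Return the matching existing listing for an imported record, or None."""
    # Phone matches are the most definitive, so try them first.
    phone = normalize_phone(imported.get("phone"))
    if phone:
        for listing in existing:
            if normalize_phone(listing.get("phone")) == phone:
                return listing
    # Otherwise fall back to the closest listing within the threshold.
    closest, closest_distance = None, max_meters
    for listing in existing:
        d = distance_meters(imported["lat"], imported["lon"],
                            listing["lat"], listing["lon"])
        if d <= closest_distance:
            closest, closest_distance = listing, d
    return closest
```

`normalize_address` is meant to run before geocoding so that variants like "123 North Main Street" and "123 N. Main St" produce the same input. The geocoding call itself is left out because it depends on which service is used.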

Once we decide that an imported listing is a match, then we need to re-organize their fields to match ours and compare values. Then we need to decide if the imported record is more accurate than the one we already have. In the vast majority of cases, our information is newer.
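A minimal sketch of that re-mapping and comparison, assuming a hypothetical source whose columns are called `AgencyName`, `Addr1`, and `Tel` (these names, and the idea that both sides carry a last-updated date, are assumptions for illustration):

```python
from datetime import date

# Hypothetical mapping from a source's column names to our field names.
FIELD_MAP = {"AgencyName": "name", "Addr1": "address", "Tel": "phone"}


def remap(imported_row, field_map=FIELD_MAP):
    """Rename the source's fields to ours, ignoring columns we do not use."""
    return {ours: imported_row[theirs]
            for theirs, ours in field_map.items()
            if theirs in imported_row}


def diff_fields(ours, theirs):
    """Return {field: (our_value, their_value)} for fields that disagree."""
    return {field: (ours.get(field), value)
            for field, value in theirs.items()
            if ours.get(field) != value}


def imported_is_newer(our_updated, their_updated):
    """Prefer the imported record only when it has a known, later date."""
    if their_updated is None:
        return False
    return our_updated is None or their_updated > our_updated
```

For example, `diff_fields(our_listing, remap(row))` gives a reviewer the list of disagreements to look at, and `imported_is_newer(date(2022, 6, 1), date(2021, 3, 15))` returns `False`, reflecting the common case where our information is the more recent.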

For imported records that don't match an existing listing in Food Oasis, the questions are whether the location is still open, whether it lies within the county, and whether it would be properly categorized as a Food Pantry or Meal Program. If there is enough contact information in the imported record, we would then need to contact the pantry to gather at least the minimum amount of detail before adding the listing to Food Oasis.

If we had encountered a good source that was reliable, up-to-date, and had enough of the fields we needed to be useful, the next step would have been to work out a process to automate the above steps and re-import the data on a regular basis. We never found a source worth importing.

IMO, it is worthwhile to keep looking for a definitive data source. If one can be found, then we should explore what the best process is for obtaining the data and merging it with the Food Oasis data.

entrotech, Nov 03 '22 17:11

@entrotech @staceyrebekahscott Yes, I like the idea of identifying a definitive data source -- this will reduce the effort required for web scraping different sites.

To build off this idea, I read on a random website that food pantries are nonprofit organizations that must abide by state and federal regulations. The USDA requires agencies such as the Department of Human Services to regularly evaluate food pantries, but this can vary from state to state.

Therefore, somewhere there is a database of food pantries and possibly more. Here are some questions that come to mind:

  • Can we find out which regulatory agency or agencies are involved in inspecting places that distribute food to the public? (Tax related, health related, etc.)
  • Are other free food distributing services under similar regulations?
  • Who within the California network can we speak with that would be able to tell us more about this?
  • Or should we contact a state regulatory agency directly to begin asking?
  • https://www.chhs.ca.gov - We could begin our inquiry with these folks?

GigiUxR, Nov 03 '22 19:11

Starting at the top, much of the food distributed by pantries and meal programs is sourced from the USDA's Food and Nutrition Service (https://www.fns.usda.gov/). There are several programs that they administer, including TEFAP (The Emergency Food Assistance Program). Most nutrition assistance programs funded by FNS are administered at the state, territory, tribal, or local levels. Choosing California from the drop-down on their home page takes you to a page you can use to search by state and program. Using this, you can find the following contact info for the California TEFAP program: https://www.fns.usda.gov/fns-contacts?f%5B0%5D=fns_contact_state%3A286&f%5B1%5D=fns_contact_related_programs%3A27

Which leads to the California Department of Social Services Contacts at https://www.cdss.ca.gov/inforesources/fdu

Which leads to the list of TEFAP Providers here: https://www.cdss.ca.gov/inforesources/efap/stakeholders, which lists the LA County providers as

  • Los Angeles Regional Food Bank https://www.lafoodbank.org/
  • Food Bank of Southern California https://foodbankofsocal.org/

We have tried working with the LARFB, with very limited success as I mentioned in my last comment, but we could try building a better relationship with them to see if we can persuade them to share their data. They have a food finder page on their site (https://www.lafoodbank.org/find-food/pantry-locator/), but it isn't as good as ours, so it could be mutually beneficial if we were to provide our widget for their page and cooperate on keeping the listings up to date. FWIW, I volunteered once for LARFB and spent a few hours gleaning onions.

I thought LA Regional Food Bank was the sole provider of TEFAP food in LA County, so it was interesting to find the Food Bank of Southern California. To my knowledge, we have not contacted or tried to work with them. Though they do not have a listing of their outlets (they call them agencies) on their website, we should probably try to contact them to see if they might be willing to share information about their agencies with us, and offer to provide our widget for their site.

entrotech, Nov 03 '22 22:11

More questions:

  • I am very much coveting an existing graphic of this food flow (to be adapted for FOLA and saved in an info repository along with @entrotech's explanations). If anyone knows of a graphic, please enter a link here. Otherwise, I will begin an online search, or create one.
  • Are non-USDA sourced food pantries and programs also under state and federal regulations, and therefore in the same database(s) as USDA sourced pantries/programs?

GigiUxR, Nov 04 '22 16:11

@entrotech Thank you for all this terrific information.

@GigiUxR I would like to incorporate this into the data validation process project that has been discussed, but I am not yet ready to start that planning process. I am moving this into the Prioritized Backlog for now. I intend to get started on that planning in the next few weeks, and at that point I would very much like to continue working with you and the UX Research team on this.

staceyrebekahscott, Nov 09 '22 04:11

  • [ ] Issue overview will need to be rewritten to include the scope of the data validation process project, to be worked on with Gigi and the UX Research team.

staceyrebekahscott, Nov 09 '22 04:11