Add AcTOR query and img function

Open andschar opened this issue 5 years ago • 7 comments

Pull Request

This is the first part of the PR to add the ACToR data source to webchem (Issue #209). It is not finished yet (documentation etc. is missing) and is posted here for discussion.

I haven't found any explicit restrictions, and the EPA generally has rather open policies about its data, but this is still web scraping rather than an official API. It is probably best to ask them.

Once we have decided on this source, I will update the PR.

PR task list:

  • [ ] Update NEWS
  • [ ] Add tests (if appropriate)
  • [ ] Update documentation with devtools::document()
  • [ ] Check package passed

andschar avatar May 05 '20 13:05 andschar

Uh-oh, I searched a little more and found a robots.txt at https://www.epa.gov/robots.txt stating: Disallow: ACToR. However, the robots.txt also says that it aims to prevent crawling, not scraping.

andschar avatar May 05 '20 14:05 andschar

> Uh-oh, I searched a little more and found a robots.txt at https://www.epa.gov/robots.txt stating: Disallow: ACToR. However, the robots.txt also says that it aims to prevent crawling, not scraping.

The robots.txt here also disallows all user agents from scraping: https://actor.epa.gov/robots.txt

Since they've gone through the trouble of making the database available in their "download" tab, they probably don't mind web scraping, but I think it would be better to ask.
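To make the robots.txt point concrete, here is a minimal Python sketch of how such rules can be checked before scraping. The robots.txt body below is an assumption for illustration (a blanket disallow for all user agents, as described above), not a copy of the real file, and the user agent name "webchem" and the helper `is_allowed` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: illustrative robots.txt body that disallows everything for
# all user agents. The real file at actor.epa.gov may differ.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines, so no
    # network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical user agent name; the URL is the download page from this thread.
url = "https://actor.epa.gov/actor/download.xhtml"
print(is_allowed(ASSUMED_ROBOTS_TXT, "webchem", url))
```

A check like this only reports what the file says. Whether the operator actually objects to scraping is still a question for them.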

Aariq avatar May 06 '20 20:05 Aariq

> > Uh-oh, I searched a little more and found a robots.txt at https://www.epa.gov/robots.txt stating: Disallow: ACToR. However, the robots.txt also says that it aims to prevent crawling, not scraping.
>
> The robots.txt here also disallows all user agents from scraping: https://actor.epa.gov/robots.txt
>
> Since they've gone through the trouble of making the database available in their "download" tab, they probably don't mind web scraping, but I think it would be better to ask.

I have sent them an email. Let's wait for the reply.

andschar avatar May 07 '20 10:05 andschar

> The robots.txt here also disallows all user agents from scraping: https://actor.epa.gov/robots.txt

Do you think this affects the ACToR web service as well? https://actorws.epa.gov/actorws/

stitam avatar May 07 '20 13:05 stitam

> > The robots.txt here also disallows all user agents from scraping: https://actor.epa.gov/robots.txt
>
> Do you think this affects the ACToR web service as well? https://actorws.epa.gov/actorws/

Wow, I completely missed the ACToR web service! Where did you find it? It doesn't seem to be well documented ^^. I guess this makes my function obsolete, and we could do everything via the web service.

I don't think the robots.txt applies to a web service.

andschar avatar May 07 '20 13:05 andschar

It was difficult to find, I admit. I found it through this link: https://actor.epa.gov/actor/download.xhtml

stitam avatar May 07 '20 14:05 stitam

> It was difficult to find, I admit. I found it through this link: https://actor.epa.gov/actor/download.xhtml

Now that's really confusing. I have visited this page several times and always thought that one could only download the SQL dump there: actor_2015q3.sql.gz. I never looked into Details. Damn. More eyes definitely see more :)

I sent them a second email asking about the current state of the web service. I can change the function afterwards to use it.

andschar avatar May 07 '20 15:05 andschar