webchem
Add AcTOR query and img function
Pull Request
This is the first part of the PR to include the ACToR data source in webchem (Issue #209). It's not finished yet (documentation etc. is missing) and is here for discussion.
I haven't found anything that forbids this, and the EPA generally has rather open policies about its data, but this is still web scraping and not an official API. It is probably best to ask them.
Once we have decided on this source, I will update the PR.
PR task list:
- [ ] Update NEWS
- [ ] Add tests (if appropriate)
- [ ] Update documentation with `devtools::document()`
- [ ] Check package passed
Uh-oh, I searched a little more and found a robots.txt at https://www.epa.gov/robots.txt that states: Disallow: ACToR. However, the robots.txt also says that it aims to prevent crawling, not scraping.
The robots.txt here also disallows all user agents from scraping: https://actor.epa.gov/robots.txt
Since they've gone through the trouble of making the database available in their "download" tab, they probably don't mind web scraping, but I think it would be better to ask.
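To make the robots.txt point concrete, here is a minimal base R sketch of how a blanket rule blocks every path for every user agent. The `robots_txt` content and the example paths are assumptions for illustration, not a copy of the EPA files, and `is_disallowed()` is a hypothetical helper that only does prefix matching (no `Allow` rules, no wildcards).

```r
# Illustrative robots.txt content (an assumption, not the actual EPA file).
robots_txt <- c(
  "User-agent: *",
  "Disallow: /"
)

# Hypothetical helper: TRUE if `path` falls under a Disallow rule that
# applies to `agent`. Simplified: prefix matching only.
is_disallowed <- function(lines, path, agent = "*") {
  lines <- trimws(sub("#.*$", "", lines))          # strip comments
  lines <- lines[nzchar(lines)]                    # drop empty lines
  field <- tolower(trimws(sub(":.*$", "", lines))) # text before first colon
  value <- trimws(sub("^[^:]*:", "", lines))       # text after first colon

  applies <- FALSE    # does the current rule group apply to `agent`?
  in_agents <- FALSE  # still reading consecutive User-agent lines?
  for (i in seq_along(lines)) {
    if (field[i] == "user-agent") {
      if (!in_agents) applies <- FALSE             # a new group starts here
      in_agents <- TRUE
      if (value[i] == "*" || tolower(value[i]) == tolower(agent)) {
        applies <- TRUE
      }
    } else {
      in_agents <- FALSE
      # An empty Disallow value means "nothing is disallowed".
      if (field[i] == "disallow" && applies && nzchar(value[i]) &&
          startsWith(path, value[i])) {
        return(TRUE)
      }
    }
  }
  FALSE
}

is_disallowed(robots_txt, "/some/page")  # TRUE: "/" is a prefix of every path
```

Note that robots.txt is a convention for crawlers rather than an access control, so a check like this only tells us what the site operator has asked for; it does not replace asking them directly.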
I have sent them a mail. Let's just wait for the reply.
> The robots.txt here also disallows all user agents from scraping: https://actor.epa.gov/robots.txt
Do you think this affects the ACToR web service as well? https://actorws.epa.gov/actorws/
Wow, I had completely missed the ACToR web service! Where did you find it? It doesn't seem to be very well documented^^. I guess this makes my function obsolete and we could do everything via the web service.
I think the robots.txt doesn't have an influence on a webservice.
It was difficult to find, I admit. I found it through this link: https://actor.epa.gov/actor/download.xhtml
Now that's really confusing. I have been on this page several times and always thought that one could only download the SQL dump there: actor_2015q3.sql.gz. I never looked into the details. Damn. More eyes definitely see more :)
I wrote them a second mail and asked about the current state of the web service. I can change the function afterwards to use the web service.