WebTechnologies
{polite} package and web etiquette
I like the "Scraping ethics & legalities" section of R for Data Science (2e).
Before we get started discussing the code you’ll need to perform web scraping, we need to talk about whether it’s legal and ethical for you to do so....
I think many R users (like students, statisticians, data scientists) are not as familiar with web etiquette & conventions as web developers and most people who scrape regularly. It would be nice if our web scraping section referred the reader to this info, as well as to the polite package.
The three pillars of a polite session are seeking permission, taking slowly and never asking twice.
The package builds on awesome toolkits for defining and managing http sessions (httr and rvest), declaring the user agent string and investigating site policies (robotstxt), and utilizing rate-limiting and response caching (ratelimitr and memoise).
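For anyone who hasn't used it, a polite session looks roughly like the sketch below. The target URL, the user agent string and the 5-second delay are placeholders I picked for illustration, not recommendations from the package authors. It assumes polite and rvest (1.0.0 or later) are installed and that the machine has network access.

```r
library(polite)
library(rvest)

# Hypothetical values for illustration only; replace with your own.
target_url <- "https://www.r-project.org"
agent <- "my-scraper (contact: me@example.org)"

# Seeking permission: bow() introduces the client to the host,
# reads the site's robots.txt and stores the user agent and crawl delay.
session <- bow(target_url, user_agent = agent, delay = 5)
print(session)

# Taking slowly and never asking twice: scrape() is rate-limited
# and memoised. It returns NULL (with a warning) when scraping the
# path is not permitted or the request fails.
page <- scrape(session)

if (!is.null(page)) {
  # Pull out the page title as a trivial example of using the result.
  print(html_text(html_element(page, "title")))
}
```

To move to another path on the same host without re-introducing yourself, the package provides `nod(session, path = ...)`, whose result can be passed to `scrape()` in the same way.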
@pachadotdev, do you have thoughts? It's not the conventional material for a CRAN Task View. I'm thinking a few sentences and links. Nothing preachy, just pointing readers to these resources if they want to educate themselves.
@wibeasley this would be extremely helpful in my own case, since I have to scrape a lot of data, so I can write a part after Apr 21
@wibeasley I have a draft from a workshop I attended. I will put this in a separate branch
https://github.com/cran-task-views/WebTechnologies/tree/511
@pachadotdev, I like it. I think it will be helpful to some audiences.
Are you writing it in a separate file, and later combining it into the Task View when you're satisfied?
I converted it to semantic line breaks, which I've found helpful for maintaining files that a lot of people touch. I also made a few changes that I hope you like. Reject anything you think doesn't improve the clarity.
thanks! yes, I put that in a separate file
Will it stay in a separate file, or be integrated into the Task View?
If it stays in a separate file, I think the Task View should link to the page you wrote.
the idea should be to include it in the readme, once it's ready
Thanks for putting this together, I think this is very useful!
However, this should be in the task view, not in the README. The README is just in the GitHub repository and the main page that readers will consult is the task view itself, typically on a CRAN mirror. So please put it into the task view itself when you think it is ready.