apple-news-scraper
Code used for collecting and saving the Top Stories and Trending Stories in Apple News via Appium.
Apple News Scraper
This repository provides code and data used in the following paper:
Bandy, Jack and Nicholas Diakopoulos. "Auditing News Curation Systems: A Case Study Examining Algorithmic and Editorial Logic in Apple News." To Appear in Proceedings of the Fourteenth International AAAI Conference on Web and Social Media (ICWSM 2020).
Installation and Setup Instructions
Install Appium
Download appium-desktop: https://github.com/appium/appium-desktop/releases/latest (You can also try the brew/npm installation described at https://appium.io, but those releases have been buggier in my experience.)
Then install the Python client: pip install Appium-Python-Client
Carthage, another dependency, often does not install automatically. To be sure it is present, run brew install carthage
Install apple-news-scraper
After cloning this repository onto your computer,
- Run
instruments -s devices
in your terminal.
- Choose a device, something like
iPhone XS Max (12.1) [XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX] (Simulator)
- Open
get_stories.py
and replace the first few lines with your device information. Afterwards, it may look something like:
# user-defined variables
device_name_and_os = 'iPhone XS Max (12.1)'
device_os = '12.1'
udid = 'XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX'
- Change the output folder, where you want to save the data:
# easy relative path, keep data in repository
output_folder = 'data_output/'
or,
# put data in a folder on the desktop
output_folder = '~/apple_news_data/'
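The device line printed by instruments -s devices maps directly onto the user-defined variables above. The following is a minimal sketch of that mapping, not code from get_stories.py: the helper names parse_device_line and resolve_output_folder are hypothetical. The second helper is included because Python does not expand a leading ~ on its own; whether get_stories.py already expands it is not stated here, so treat this as a precaution rather than a required step.

```python
import os
import re

# Matches lines such as:
#   iPhone XS Max (12.1) [XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX] (Simulator)
DEVICE_LINE = re.compile(
    r"^(?P<name>.+?) \((?P<os>[\d.]+)\) \[(?P<udid>[^\]]+)\]"
)


def parse_device_line(line):
    """Split one device line into the three user-defined variables (hypothetical helper)."""
    match = DEVICE_LINE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized device line: %r" % line)
    name, device_os, udid = match.group("name", "os", "udid")
    return {
        "device_name_and_os": "%s (%s)" % (name, device_os),
        "device_os": device_os,
        "udid": udid,
    }


def resolve_output_folder(path):
    """Expand a leading ~ and return an absolute path ending in a separator."""
    expanded = os.path.abspath(os.path.expanduser(path))
    return os.path.join(expanded, "")
```

For example, resolve_output_folder('~/apple_news_data/') returns a path under your home directory, which can then be assigned to output_folder.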
Execution
First, run the simulator of choice and open the Apple News app: instruments -w "iPhone XS Max (12.1)"
Then run the scraper: python get_stories.py
To run repeatedly, I recommend cron. Just make sure you use absolute paths. For example, to run collection every five minutes, add something like this to your crontab:
*/5 * * * * /usr/local/bin/python /Users/jack/dev/apple-news-scraper/get_stories.py
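If you would rather not type the absolute paths by hand, a small helper can build the crontab entry for you. This is only a sketch: the function name crontab_line is hypothetical, and it assumes the Python interpreter you run it with is the same one cron should use.

```python
import os
import sys


def crontab_line(script_path, minutes=5, python_path=None):
    """Build a crontab entry that runs script_path every `minutes` minutes.

    Both the interpreter and the script are written as absolute paths,
    because cron does not run from your shell's working directory.
    """
    python_path = python_path or sys.executable
    return "*/%d * * * * %s %s" % (
        minutes,
        os.path.abspath(python_path),
        os.path.abspath(os.path.expanduser(script_path)),
    )
```

Calling print(crontab_line('get_stories.py')) from inside the repository prints a line you can paste into crontab -e.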
If you're in a hurry, you can also just hack out a shell script:
while true
do
    python get_stories.py &
    sleep 300
done
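The same loop can be sketched in Python if you prefer to stay in one language. As with the shell version, each run is started in the background and the loop sleeps for 300 seconds between starts. The function name run_repeatedly and the max_runs parameter are hypothetical additions for illustration; max_runs exists only so the loop can be stopped after a fixed number of runs.

```python
import subprocess
import sys
import time


def run_repeatedly(command=(sys.executable, "get_stories.py"),
                   interval_seconds=300, max_runs=None):
    """Start `command` every interval_seconds without waiting for it to finish.

    Runs forever when max_runs is None. Returns the number of runs started.
    """
    running = []
    runs = 0
    while max_runs is None or runs < max_runs:
        # Drop processes that have already exited so they are reaped.
        running = [proc for proc in running if proc.poll() is None]
        running.append(subprocess.Popen(list(command)))
        runs += 1
        if max_runs is None or runs < max_runs:
            time.sleep(interval_seconds)
    # Only reached when max_runs is set: wait for the remaining runs.
    for proc in running:
        proc.wait()
    return runs
```

Call run_repeatedly() from the repository directory to collect every five minutes until you interrupt it with Ctrl-C.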