arxiv-crawler
arxiv-crawler copied to clipboard
crawling arXiv paper and organize as a database
arxiv-crawler
Crawl arXiv paper and organize as a database
Modifying crawling range
# crawler.py
fields = ['CV']
months = ['{:0>2d}'.format(i+1) for i in range(12)]
years = ['{:0>2d}'.format(i) for i in range(6, 17)]
Launch the crawler
$ python crawler.py
Retrieving http://arxiv.org/list/cs.CV/0601?show=1000
...
Check the results
$ python
>>> import sqlite3
>>> conn = sqlite3.connect('arxiv_raw.sqlite')
>>> cur = conn.cursor()
>>> cur.execute('SELECT * FROM sqlite_master')
>>> print cur.fetchall() # print the information for all tables
Future work
Still figuring the best way to visualize papers