parslepy
Proof-of-concept shell based on IPython
Hey, @redapple !
Before anything, I'd like to emphasize that this is just a proof of concept: I know the code is far from perfect and the ideas here will need some polishing if we decide to go this route. The purpose is mostly to start the discussion. :)
I started this because my usual scraping workflow involves a lot of time in the shell: that's where I end up testing most of the CSS/XPath expressions. I think this is partly because you learn to distrust the browser tools for creating selectors after scraping a few sites with heavy JavaScript usage, so even with an extension like XPath Helper you usually still want to double-check the expressions in the shell.
So, I don't know how many other scraping devs work this way, but this PoC is meant to help users who are a bit like me. =)
Here is a sample session of using it:
$ python shell.py /tmp/my_so_parselet.json http://stackoverflow.com
Available functions:
test(expression) - test a CSS or XPath expression
fetch(url, [parselet]) - fetch a new URL, optionally starts new parselet
extract() - run the extraction using current parselet rules
add(name, expr) - add a simple property to current parselet
add_list(name, expr) - add a list property to current parselet
add_object(name, expr) - add an object property to current parselet
add_nested(parent, name, expr) - add a property to a list or object property
save() - saves current parselet to disk
>>> # great, shell is loaded and offering a few functions for me to try
>>> test('.question-summary')
[u'0 votes 0 answers 12 views Segfault upon instantiation of object with vector<vector<int> > member c++ vector segmentation-fault modified 43 secs ago Schilcote 390',
u'3 votes 1 answer 389 views Filter out metadata fields and only return source fields in elasticsearch elasticsearch modified 53 secs ago reevesy 2,586',
... LONG LIST HERE ...
u'1 vote 0 answers 3 views Attempt At Seeding With Faker and TestDummy For Efficiency php laravel laravel-5 modified 1 hour ago user3732216 41']
>>>
>>> # okay, so I'll add a list named "questions" for these
>>> add_list('questions', '.question-summary')
>>>
>>> # now I have to add elements to the list...
>>> test('.question-summary h3')
[u'Segfault upon instantiation of object with vector<vector<int> > member',
u'Filter out metadata fields and only return source fields in elasticsearch',
u'Why am I getting different results when using a list comprehension with coroutines with asyncio?',
...
]
>>> # great, it seems that I can get the title using h3, so let's add that
>>> add_nested('questions', 'title', 'h3')
>>> # let's try extracting now...
>>> extract()
{'questions': [{'title': u'Segfault upon instantiation of object with vector<vector<int> > member'},
{'title': u'Filter out metadata fields and only return source fields in elasticsearch'},
{'title': u'Why am I getting different results when using a list comprehension with coroutines with asyncio?'},
...
{'title': u'Attempt At Seeding With Faker and TestDummy For Efficiency'}]}
>>>
>>> # cool, it works! let's add the link:
>>> add_nested('questions', 'link', './/h3//a/@href')
>>> extract()
{'questions': [{'link': '/questions/29336596/segfault-upon-instantiation-of-object-with-vectorvectorint-member',
'title': u'Segfault upon instantiation of object with vector<vector<int> > member'},
{'link': '/questions/23283033/filter-out-metadata-fields-and-only-return-source-fields-in-elasticsearch',
'title': u'Filter out metadata fields and only return source fields in elasticsearch'},
....
>>> # and the list of tags:
>>> add_nested('questions', 'tags', '.tags')
>>> extract()
{'questions': [{'link': '/questions/29336596/segfault-upon-instantiation-of-object-with-vectorvectorint-member',
'tags': u'c++ vector segmentation-fault',
'title': u'Segfault upon instantiation of object with vector<vector<int> > member'},
{'link': '/questions/23283033/filter-out-metadata-fields-and-only-return-source-fields-in-elasticsearch',
'tags': u'elasticsearch',
'title': u'Filter out metadata fields and only return source fields in elasticsearch'},
....
>>> # now let's save our work:
>>> save()
Wrote parselet: /tmp/my_so_parselet.json
>>> ^D
$ cat /tmp/my_so_parselet.json
{
"questions(.question-summary)": [
{
"tags": ".tags",
"link": ".//h3//a/@href",
"title": "h3"
}
]
}
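By the way, the file the shell writes is just a plain parselet, so (unless I'm misremembering the API) it should be usable directly with parslepy outside the shell, something like:

import pprint
import parslepy

# load the parselet the shell just saved and run it against the same page
with open('/tmp/my_so_parselet.json') as fp:
    parselet = parslepy.Parselet.from_jsonfile(fp)

extracted = parselet.parse('http://stackoverflow.com')
pprint.pprint(extracted)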
Okay, that was fun, but it also shows a few problems in the current implementation:
- the output of test() shows only the text, but often you'd want to see the actual HTML
- it only supports one level of nesting, which is kind of silly
For problem 1, I think a proper solution would be to offer two testing functions, one returning the text content and another returning the matched HTML, but I can't think of good names... :P
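Something along these lines is what I mean. This is a standalone sketch using plain lxml: the document is passed explicitly just to keep the snippet self-contained, the CSS-or-XPath fallback is a simplified stand-in for what parslepy's default selector handler already does, and test_html is just a placeholder name.

import lxml.html
from lxml.cssselect import CSSSelector

def _select(doc, expression):
    # try the expression as CSS first, fall back to XPath
    # (simplified version of what parslepy's selector handler does)
    try:
        return CSSSelector(expression)(doc)
    except Exception:
        return doc.xpath(expression)

def test(doc, expression):
    # current behaviour: only the text content of each match
    return [el.text_content().strip() if hasattr(el, 'text_content') else el
            for el in _select(doc, expression)]

def test_html(doc, expression):
    # possible second function: the raw markup of each match,
    # so you can see which element you actually hit
    return [lxml.html.tostring(el) for el in _select(doc, expression)
            if hasattr(el, 'tag')]

# usage: doc = lxml.html.parse('http://stackoverflow.com').getroot()
#        test_html(doc, '.question-summary h3')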
For problem 2, I think a proper solution would involve being able to switch contexts between levels, so you could step "inside" the list selector, try out and add expressions in that context, and then go back up and continue at the upper level, something roughly like the sketch below. This will require some more thinking, so I'll wait for your feedback before pursuing it.
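To make that a bit more concrete, this is purely hypothetical and not part of this PR; the idea is just to keep a stack of contexts so nested rules are added relative to the level you're currently "in" (names are made up, the output format matches the parselet shown above):

class ParseletBuilder(object):
    def __init__(self):
        self.root = {}
        self.stack = [self.root]          # innermost context is stack[-1]

    def add(self, name, expr):
        # simple property in the current context
        self.stack[-1][name] = expr

    def add_list(self, name, expr):
        # list property in the current context, e.g. "questions(.question-summary)": [{}]
        rules = {}
        self.stack[-1]['%s(%s)' % (name, expr)] = [rules]
        return rules

    def enter_list(self, name, expr):
        # add a list property and make it the current context
        self.stack.append(self.add_list(name, expr))

    def up(self):
        # leave the current context and go back to the enclosing one
        if len(self.stack) > 1:
            self.stack.pop()

builder = ParseletBuilder()
builder.enter_list('questions', '.question-summary')
builder.add('title', 'h3')
builder.add('link', './/h3//a/@href')
builder.add('tags', '.tags')
builder.up()
# builder.root now holds the same structure the shell saved above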
Overall, it's pretty cool to be able to test expressions without worrying whether they're CSS or XPath, and the feedback loop in the shell feels so much tighter than trying out expressions in the browser/shell and then editing the parselet in a text editor.
I realize this was pretty long, sorry about that! Please let me know what you think. =)