Pythonista-Issues
lxml module
Implementing lxml would make web scraping a whole lot easier and faster.
I may not be understanding correctly but couldn't you just pip install it via StaSh?
@Hum4n01d That wouldn't work for lxml, as it's not pure Python.
@scj643 I usually use BeautifulSoup for web scraping, what makes lxml better/easier?
@omz
"The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API."
Hi @omz, since you're here, I just want to say thanks so much for building Pythonista! It's really cool and fun for side projects that I can work on from anywhere!
@omz As far as I can tell, lxml has to be installed for BeautifulSoup to be able to parse in XML mode (i.e. bs4.BeautifulSoup(source, "xml")). By default BS4 parses the source as HTML, which doesn't work properly for all XML, because an HTML parser makes assumptions about the meaning of some tags (where they may appear, whether they can or must be empty, etc.), whereas in XML tags have no predefined meaning.
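To illustrate the point about HTML parsers making assumptions, here is a minimal sketch that uses only the standard library as a stand-in, so it runs without lxml or BeautifulSoup. The sample document and the helper names (`StartTagCollector`, `html_start_tags`, `xml_tags`) are made up for this example. The standard library's HTML parser treats the content of `<script>` as raw text, so it never reports the nested `<line>` element, while an XML parser sees it as an ordinary child.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Well-formed XML in which <script> is an ordinary element with a child element.
# (Hypothetical sample document, not taken from any real feed.)
SOURCE = "<doc><script><line>hi</line></script></doc>"


class StartTagCollector(HTMLParser):
    """Records the name of every start tag the HTML parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_start_tags(source):
    """Parse `source` as HTML and return the start tags that were seen."""
    collector = StartTagCollector()
    collector.feed(source)
    collector.close()
    return collector.tags


def xml_tags(source):
    """Parse `source` as XML and return every element name in document order."""
    return [element.tag for element in ET.fromstring(source).iter()]


html_tags = html_start_tags(SOURCE)
xml_tag_list = xml_tags(SOURCE)
print("HTML parser start tags:", html_tags)
print("XML parser elements:   ", xml_tag_list)
```

The same kind of mismatch is why parsing arbitrary XML through BeautifulSoup's default HTML mode can lose structure.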
Also, I use lxml to neuter certain pages when making documentation.
Libxml is part of the iOS SDK, so the headers are provided by Apple.
Having lxml would be great!
It looks like Kivy for iOS previously had a recipe for getting lxml into apps, but it hasn't been ported to their current toolchain. But it looks like it should be possible to include in Pythonista at least.
https://github.com/kivy/kivy-ios
https://groups.google.com/forum/?nomobile=true#!topic/kivy-dev/86W5bPqrEUw
+1 for lxml. It's a dependency for many libraries I want to use with Pythonista.
If you use the lxml module outside of Pythonista, check out the performance boosts in version 4... http://lxml.de/4.0/changes-4.0.0.html
+1 for lxml. Is there an update on when we can expect this to be included in Pythonista? It's been almost a year now since this issue/ticket was opened up. Is it still in the works to be included or has it been rejected?
+1 lxml. Please include lxml module into pythonista.
+1 lxml. I have a package with a big dependency I can't work around.
+1 lxml. I have also been stuck several times trying to install packages that depend on lxml on iPhone and iPad. Having it would be great.
+1 lxml. I use the docx-mailmerge module which depends on it and I would like to be able to use it on my iPhone.
+1 lxml
I want to use some libraries for working with networking gear. Many of them use lxml to parse structured data returned from the networking devices.
+1. With lxml I could use python-pptx to process PowerPoint files.
https://develobile.com/pyto has lxml if you need it.
"@ColdGrub1384 Remove lxml (see lxml branch and #25)"
https://github.com/ColdGrub1384/Pyto/issues/25
@cclauss They removed lxml 9 hours after your post.
Maybe it is impossible to have lxml on iOS (forever?)
https://github.com/ColdGrub1384/Pyto/issues/25
lxml may not be possible on the App Store. It depends on the libxml, libxslt and libexslt C libraries. These are already included with Xcode, so there is no need to compile them. However, lxml calls many libxslt and libexslt functions that aren't declared in Apple's header files (they are declared by headers inside lxml), and Apple says they are private APIs.
@goldengrape I created a branch with lxml, and it works perfectly. It's just the App Store that rejects it.
For BeautifulSoup, the html5lib module might be slower than lxml but it performs adequately for web scraping on both Pythonista and Pyto.
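For code that has to run both where lxml is available and where it is not, one option is to pick the BeautifulSoup parser at runtime. The sketch below is only an illustration: `pick_parser` and `PARSER_PREFERENCE` are hypothetical names, and the preference order is an assumption. The strings "lxml", "html5lib" and "html.parser" are the parser names BeautifulSoup accepts.

```python
from importlib.util import find_spec

# Parser names accepted by bs4.BeautifulSoup, in the order this sketch prefers
# them. Each entry is (parser name, module that must be importable).
PARSER_PREFERENCE = [
    ("lxml", "lxml"),
    ("html5lib", "html5lib"),
    ("html.parser", "html.parser"),  # standard library
]


def pick_parser(is_available=None):
    """Return the first preferred parser whose backing module can be imported.

    `is_available` is an optional callable taking a module name; it exists so
    the choice can be exercised without installing or removing packages.
    """
    if is_available is None:
        is_available = lambda module: find_spec(module) is not None
    for parser_name, module_name in PARSER_PREFERENCE:
        if is_available(module_name):
            return parser_name
    # Fall back to the standard-library parser if nothing else matched.
    return "html.parser"


print("Selected parser:", pick_parser())
```

With BeautifulSoup installed, the result would be passed as the second argument, e.g. `bs4.BeautifulSoup(source, pick_parser())`. Note that the parsers do not always build identical trees from malformed HTML, so results can differ slightly between environments.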
A potential workaround would probably involve rebuilding lxml from the ground up so that any dependencies get linked in the Xcode project.
I have to rebuild libxslt and libexslt (lxml dependencies) and rename the functions flagged as private APIs by Apple. Then I have to link these libraries and recompile lxml with the renamed functions.
Also known as refactoring :)
Another big plus for lxml that I haven’t seen in the replies here yet is the XPath support. It makes getting fragments of HTML/XML sources a lot easier with carefully crafted XPath expressions.
Any progress with lxml? 😁
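As a rough idea of what path-based extraction looks like, here is a small sketch using the standard library's ElementTree, which supports only a limited subset of XPath. It is a stand-in so the example runs without lxml; lxml elements offer an `xpath()` method with full XPath 1.0 support. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample document for illustration.
SOURCE = """
<catalog>
  <book id="b1" lang="en"><title>First</title><price>10</price></book>
  <book id="b2" lang="de"><title>Zweites</title><price>12</price></book>
  <book id="b3" lang="en"><title>Third</title><price>8</price></book>
</catalog>
"""

root = ET.fromstring(SOURCE)

# Titles of all English-language books, anywhere below the root.
english_titles = [t.text for t in root.findall(".//book[@lang='en']/title")]

# Title of the book with a specific id (None if there is no such book).
b2_title = root.findtext(".//book[@id='b2']/title")

print(english_titles)
print(b2_title)
```

Expressions that go beyond this subset, such as XPath functions or axes like `following-sibling::`, are where lxml's full XPath engine becomes necessary.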
+1 for lxml