androguard
faster APK parsing?
I've briefly looked at faster parsing for fdroidserver.
This seems like something that could be useful to add directly to androguard instead of only using it in fdroidserver.
I'd be happy to (clean up that code a bit and) draft a PR if you'd like.
cc @eighthave
Yeah, that would be good to have here. androguard is widely used and forked, but I think it is currently not actively developed, so merge requests could be a slow process.
Is anyone still interested? I'm back on it.
Yes, definitely! A pure Python fast APK parser would be great to have. Specifically, I think directly parsing the binary AndroidManifest.xml without first converting it to XML would be the biggest win.
pyaxmlparser is a project which does it already - https://github.com/appknox/pyaxmlparser
I think fdroid already uses it
I'm not sure what the difference is between pyaxmlparser and the current code. It seems pyaxmlparser is using my code.
@eighthave @subho007 what are the improvements from fdroidserver?
Doesn't pyaxmlparser convert to XML first? Or does it have a way to directly read values from the binary without the intermediate XML conversion?
Uhm, no: you first need to parse the AndroidManifest.xml (AXML), and after that you can export it to classic XML or do anything else. That's why I don't understand the original request.
When I looked into it, it looked totally possible to directly parse values out
of the binary AndroidManifest.xml without thinking about XML at all. The most
important frequently used values like packageName, versionCode, version,
etc. should be easiest since they are in the beginning and in the attributes of
the binary <application> tag. If the rapid parsing only worked for the
<application> attributes, that would be a big win. My use case is for rapidly
identifying APKs when there are collections of 100s of thousands or even
millions of APKs. For example, the androzoo APK set is something like 14 million.
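To make the idea concrete, here is a minimal sketch (not androguard or fdroidserver code) of walking the chunk headers of a binary AndroidManifest.xml with only `struct`. It assumes the chunk layout from AOSP's `ResourceTypes.h`: each chunk starts with a little-endian `uint16 type`, `uint16 headerSize`, `uint32 size`. The function names `iter_chunks` and `first_start_element` are made up for illustration, and a real parser would still need the string pool to resolve attribute names and values.

```python
import struct

# Chunk type constants as defined in AOSP's ResourceTypes.h
RES_STRING_POOL_TYPE = 0x0001
RES_XML_TYPE = 0x0003
RES_XML_START_ELEMENT_TYPE = 0x0102


def iter_chunks(axml: bytes):
    """Yield (chunk_type, offset, size) for each chunk inside the XML document."""
    file_type, header_size, file_size = struct.unpack_from("<HHI", axml, 0)
    if file_type != RES_XML_TYPE:
        raise ValueError("not a binary XML document")
    offset = header_size
    end = min(file_size, len(axml))
    while offset + 8 <= end:
        chunk_type, _hdr_size, chunk_size = struct.unpack_from("<HHI", axml, offset)
        if chunk_size < 8:
            raise ValueError("corrupt chunk size")
        yield chunk_type, offset, chunk_size
        offset += chunk_size


def first_start_element(axml: bytes):
    """Return the offset of the first start-element chunk, or None if absent."""
    for chunk_type, offset, _size in iter_chunks(axml):
        if chunk_type == RES_XML_START_ELEMENT_TYPE:
            return offset
    return None
```

Nothing here builds an XML tree; it only skips from header to header, which is the part that could stay cheap for very large collections.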
OK, I get it now. You want direct access to some of the contents. But isn't get_apkid from apk.py what you are looking for? It just parses the AXML quickly and can extract this information without transforming it into XML.
@eighthave ?
Right. Parsing the <application> tag as fast as possible is the most
important. It would be great to have super fast parsing of the whole
AndroidManifest.xml. avast has something similar written in Go. But running
executables in Python can be a pain. That's at https://github.com/avast/apkparser
@obfusk Is this still relevant? Do you need anything more than what I said in previous comments?
The title says "faster apk parsing", but from the content I realize we are discussing faster axml parsing. The structure of the axml is well defined, and the way androguard does it at the moment might not be the fastest, but it is very robust, in the sense that it covers a lot of edge cases.
Having said that, maybe if you explain the use case you are interested in, we can come up with something.
FYI I have created an axml module as part of apkInspector. It will be faster than androguard but it will be more "raw". You can check more details about what is available within it here.
@erev0s https://github.com/androguard/androguard/issues/855#issuecomment-1173561585 is the use case I'm thinking of for this. It would be quite useful for large F-Droid repositories and also people managing large collections of APKs for malware research, etc. I think having a set of rapid parsing functions for key values in the axml would be the way to go. Here are two attempts I made that did have measurable speed improvements:
- https://github.com/f-droid/fdroidserver/blob/83cd04f3b6c80340ba280ad28f1312b0e6389a36/fdroidserver/common.py#L2688
- https://github.com/f-droid/fdroidserver/blob/83cd04f3b6c80340ba280ad28f1312b0e6389a36/fdroidserver/common.py#L2624
For APKs with large manifest files, I think it could be made even faster if it did not read in the whole AndroidManifest.xml file first, but instead parsed the streamed bytes.
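As a rough sketch of the streaming idea (an illustration under assumptions, not the fdroidserver code linked above): `zipfile.ZipFile.open()` returns a file-like object that decompresses lazily, so the manifest can be read chunk by chunk and reading can stop once the first chunk of interest has been seen. The function name `read_manifest_prefix` is hypothetical, the chunk header layout is assumed from AOSP's `ResourceTypes.h`, and whether this is actually faster for typical manifests would need measuring.

```python
import io
import struct
import zipfile

RES_XML_TYPE = 0x0003
RES_XML_START_ELEMENT_TYPE = 0x0102


def read_manifest_prefix(apk, wanted_type=RES_XML_START_ELEMENT_TYPE):
    """Read AndroidManifest.xml only up to the first chunk of wanted_type.

    `apk` is a path or a binary file-like object. Returns the bytes read so
    far (including that chunk), or None if no such chunk is found.
    """
    with zipfile.ZipFile(apk) as zf, zf.open("AndroidManifest.xml") as stream:
        buf = bytearray(stream.read(8))
        if len(buf) < 8:
            raise ValueError("manifest too short")
        file_type, header_size, _file_size = struct.unpack_from("<HHI", buf, 0)
        if file_type != RES_XML_TYPE or header_size < 8:
            raise ValueError("not a binary XML document")
        buf += stream.read(header_size - 8)  # normally nothing left to read
        while True:
            head = stream.read(8)
            if len(head) < 8:
                return None  # reached the end without finding the chunk
            chunk_type, _hdr_size, chunk_size = struct.unpack("<HHI", head)
            if chunk_size < 8:
                raise ValueError("corrupt chunk size")
            buf += head + stream.read(chunk_size - 8)
            if chunk_type == wanted_type:
                return bytes(buf)
```

The returned prefix contains the string pool and the first element, so everything after it in the manifest is never decompressed.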
I was working on this in my own AXML/ARSC/DEX parsing tools, but since I am no longer working on F-Droid or Android Reproducible Builds my related projects are now on hold indefinitely. Since this discussion is also no longer about androguard I'm closing this.
I can see some value in this although it might not be the best use case for androguard. androguard offers the 'full' thing, so changing the way it parses the axml seems like taking steps back. I will keep it in my backlog and see if it might be a better fit to be implemented in apkInspector instead.