androguard

faster APK parsing?

Open obfusk opened this issue 3 years ago • 13 comments

I've briefly looked at faster parsing for fdroidserver.

This seems like something that could be useful to add directly to androguard instead of only using it in fdroidserver.

I'd be happy to (clean up that code a bit and) draft a PR if you'd like.

obfusk avatar Apr 17 '21 13:04 obfusk

cc @eighthave

obfusk avatar Apr 17 '21 13:04 obfusk

Yeah, that would be good to have here. androguard is widely used and forked, but I think it is currently not actively developed, so merge requests could be a slow process.

eighthave avatar Apr 20 '21 09:04 eighthave

Is there still interest in this? I'm back on it.

totoag avatar Jun 30 '22 19:06 totoag

yes definitely! A pure python fast APK parser would be great to have. I think specifically directly parsing the binary AndroidManifest.xml without first converting to XML would be the biggest win

eighthave avatar Jul 01 '22 09:07 eighthave

pyaxmlparser is a project which does it already - https://github.com/appknox/pyaxmlparser

I think fdroid already uses it

subho007 avatar Jul 03 '22 07:07 subho007

I'm not sure what the difference is between pyaxmlparser and the current code. As far as I can see, pyaxmlparser is using my code.

@eighthave @subho007 what are the improvements in fdroidserver?

totoag avatar Jul 03 '22 08:07 totoag

Doesn't pyaxmlparser convert to XML first? Or does it have a way to directly read values from the binary without the intermediate XML conversion?

eighthave avatar Jul 03 '22 18:07 eighthave

Uhm, no: you first need to parse the AndroidManifest.xml (AXML), and after that you can export it to classic XML or do anything else with it. That's why I don't understand the original request.

totoag avatar Jul 04 '22 06:07 totoag

When I looked into it, it looked totally possible to directly parse values out of the binary AndroidManifest.xml without thinking about XML at all. The most important frequently used values like packageName, versionCode, version, etc. should be easiest since they are in the beginning and in the attributes of the binary <application> tag. If the rapid parsing only worked for the <application> attributes, that would be a big win. My use case is for rapidly identifying APKs when there are collections of 100s of thousands or even millions of APKs. For example, the androzoo APK set is something like 14 million.

eighthave avatar Jul 04 '22 09:07 eighthave
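The idea described above can be sketched in a few lines of pure Python. This is only a minimal illustration, not androguard's or fdroidserver's implementation: it walks the AXML chunks, keeps the string pool, and returns the attributes of the first start-element chunk, which in an ordinary manifest is the root `<manifest>` element carrying `package`, `versionCode` and `versionName`. The function names `quick_manifest_attrs` and `_read_string` are hypothetical. It assumes attribute names are present in the string pool (manifests that strip them and rely on the resource map alone would need more work) and returns raw integers for non-string values, so a `versionName` stored as a resource reference is not resolved.

```python
import struct

RES_STRING_POOL_TYPE = 0x0001
RES_XML_TYPE = 0x0003
RES_XML_START_ELEMENT_TYPE = 0x0102
TYPE_STRING = 0x03   # Res_value data type: index into the string pool
UTF8_FLAG = 1 << 8   # string pool flag: strings are UTF-8, not UTF-16


def _read_string(buf, pool, idx):
    """Decode string number idx; pool is (chunk offset, strings start, offsets, utf8)."""
    pool_off, strings_start, offsets, utf8 = pool
    pos = pool_off + strings_start + offsets[idx]
    if utf8:
        # Two length prefixes (character count, then byte count), 1 or 2 bytes each.
        for _ in range(2):
            n = buf[pos]
            pos += 1
            if n & 0x80:
                n = ((n & 0x7F) << 8) | buf[pos]
                pos += 1
        return buf[pos:pos + n].decode("utf-8", errors="replace")
    (n,) = struct.unpack_from("<H", buf, pos)
    pos += 2
    if n & 0x8000:  # long string: length spans two 16-bit units
        (low,) = struct.unpack_from("<H", buf, pos)
        n = ((n & 0x7FFF) << 16) | low
        pos += 2
    return buf[pos:pos + 2 * n].decode("utf-16-le", errors="replace")


def quick_manifest_attrs(buf):
    """Return {attribute name: value} for the first element in AXML bytes."""
    ftype, fhdr, fsize = struct.unpack_from("<HHI", buf, 0)
    if ftype != RES_XML_TYPE:
        raise ValueError("not a binary XML document")
    pool = None
    pos = fhdr
    end = min(fsize, len(buf))  # tolerate a truncated buffer
    while pos + 8 <= end:
        ctype, chdr, csize = struct.unpack_from("<HHI", buf, pos)
        if csize < 8:
            raise ValueError("corrupt chunk header")
        if ctype == RES_STRING_POOL_TYPE:
            count, _styles, flags, sstart, _ = struct.unpack_from("<5I", buf, pos + 8)
            offsets = struct.unpack_from("<%dI" % count, buf, pos + chdr)
            pool = (pos, sstart, offsets, bool(flags & UTF8_FLAG))
        elif ctype == RES_XML_START_ELEMENT_TYPE:
            if pool is None:
                raise ValueError("element found before string pool")
            ext = pos + chdr  # attribute header follows the node header
            _ns, _name, astart, asize, acount = struct.unpack_from("<IIHHH", buf, ext)
            attrs = {}
            for i in range(acount):
                a = ext + astart + i * asize
                _ans, aname, _raw, _vsize, _res0, dtype, data = struct.unpack_from(
                    "<IIIHBBI", buf, a)
                key = _read_string(buf, pool, aname)
                attrs[key] = _read_string(buf, pool, data) if dtype == TYPE_STRING else data
            return attrs
        pos += csize
    raise ValueError("no start element found")
```

Given the bytes of a binary AndroidManifest.xml, `quick_manifest_attrs(data).get("package")` would give the application ID without building any XML tree. The trade-off is robustness: none of the edge cases a full parser handles are covered here.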

OK, I get it now. You want direct access to some of the contents. But isn't get_apkid from apk.py what you are looking for? It just parses the AXML quickly and can extract this information without transforming it to XML.

totoag avatar Jul 04 '22 09:07 totoag

@eighthave ?

totoag avatar Jul 07 '22 07:07 totoag

Right. Parsing the <application> tag as fast as possible is the most important. It would be great to have super fast parsing of the whole AndroidManifest.xml. Avast has something similar written in Go, but running executables from Python can be a pain. That's at https://github.com/avast/apkparser

eighthave avatar Jul 07 '22 09:07 eighthave

@obfusk Is this still relevant? Do you need anything more than what I said in previous comments?

totoag avatar Jul 07 '22 12:07 totoag

The title says "faster APK parsing", but from the content I realize we are discussing faster AXML parsing. The structure of AXML is well defined, and the way androguard parses it at the moment might not be the fastest, but it is very robust in the sense that it covers a lot of edge cases.

Having said that, maybe if you explain the use case you are interested in, we can come up with something.

FYI I have created an axml module as part of apkInspector. It will be faster than androguard but it will be more "raw". You can check more details about what is available within it here.

erev0s avatar Dec 13 '23 23:12 erev0s

@erev0s https://github.com/androguard/androguard/issues/855#issuecomment-1173561585 is the use case I'm thinking of for this. It would be quite useful for large F-Droid repositories and also for people managing large collections of APKs for malware research, etc. I think having a set of rapid parsing functions for key values in the AXML would cover it. Here are two attempts I made that did have measurable speed improvements:

  • https://github.com/f-droid/fdroidserver/blob/83cd04f3b6c80340ba280ad28f1312b0e6389a36/fdroidserver/common.py#L2688
  • https://github.com/f-droid/fdroidserver/blob/83cd04f3b6c80340ba280ad28f1312b0e6389a36/fdroidserver/common.py#L2624

For APKs with large manifest files, I think it could be made even faster if it did not read in the whole AndroidManifest.xml file first, but instead parsed the streamed bytes.

eighthave avatar Jan 04 '24 11:01 eighthave
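The streaming idea in the comment above could look roughly like the following. This is a hedged sketch rather than code from fdroidserver or androguard, and the function name `read_manifest_prefix` is made up for illustration. It relies on `zipfile.ZipFile.open()` returning a file-like object that decompresses on demand, and it stops reading as soon as the first start-element chunk (type `0x0102`) has been consumed. Note that the total-size field in the first header still describes the full document, so whatever parses the returned prefix has to tolerate that.

```python
import struct
import zipfile

RES_XML_START_ELEMENT_TYPE = 0x0102


def read_manifest_prefix(apk, name="AndroidManifest.xml"):
    """Return manifest bytes up to and including the first start-element chunk.

    apk may be a filesystem path or a seekable binary file object.
    """
    with zipfile.ZipFile(apk) as zf, zf.open(name) as stream:
        out = bytearray(stream.read(8))  # header of the whole XML document
        if len(out) < 8:
            raise ValueError("truncated manifest")
        while True:
            header = stream.read(8)  # type (u16), header size (u16), chunk size (u32)
            if len(header) < 8:
                break  # end of data without seeing any element
            ctype, _hdr_size, csize = struct.unpack("<HHI", header)
            if csize < 8:
                raise ValueError("corrupt chunk header")
            out += header + stream.read(csize - 8)
            if ctype == RES_XML_START_ELEMENT_TYPE:
                break  # everything after this point is never decompressed
        return bytes(out)
```

How much this saves depends on the manifest: the string pool comes first and has to be read in full to resolve names, so the gain is largest when the element tree after the root element is big.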

I was working on this in my own AXML/ARSC/DEX parsing tools, but since I am no longer working on F-Droid or Android Reproducible Builds my related projects are now on hold indefinitely. Since this discussion is also no longer about androguard I'm closing this.

obfusk avatar Jan 04 '24 15:01 obfusk

I can see some value in this, although it might not be the best use case for androguard. androguard offers the 'full' thing, so changing the way it parses the AXML seems like a step backwards. I will keep it in my backlog and see if it might be a better fit to be implemented in apkInspector instead.

erev0s avatar Jan 04 '24 22:01 erev0s