
[Request] Can you add attackToExcel.get_stix_data_from("/path/to/export/folder") to make loading data much faster? Or some other more efficient cache file format?

Open · jt0dd opened this issue 2 years ago · 1 comment

Is your feature request related to a problem?

The example from the usage page that we've been using takes an extremely long time to load the data.

Describe the solution you'd like

Just make it a little clearer (in the basic usage example) how we can not only export, but also cache and import the ATT&CK matrix data rather than slowly re-downloading it every time.

Describe alternatives you've considered

There doesn't seem to be an alternative, since the documentation only mentions an export feature, not an import feature.

Additional context

import mitreattack.attackToExcel.attackToExcel as attackToExcel
import mitreattack.attackToExcel.stixToDf as stixToDf

# download and parse ATT&CK STIX data

# SUGGESTED ADDITION / PSEUDOCODE:
attackToExcel.export("enterprise-attack", "v8.1", "/path/to/export/folder")
# instead of:
# attackdata = attackToExcel.get_stix_data("enterprise-attack")
# allow:
attackdata = attackToExcel.get_stix_data_from("/path/to/export/folder")
# END ADDITION

# get Pandas DataFrames for techniques, associated relationships, and citations
techniques_data = stixToDf.techniquesToDf(attackdata, "enterprise-attack")

# show T1102 and sub-techniques of T1102
techniques_df = techniques_data["techniques"]
print(techniques_df[techniques_df["ID"].str.contains("T1102")]["name"])

And I don't really know whether exporting to Excel is the most efficient way to cache the data (probably not), but it seems to be the only format supported. My only goal is to get the data into a DataFrame as efficiently as possible, instead of having to take a 5-minute coffee break every time I restart my Jupyter kernel.

We're going to solve this on our end by adding some code that uses Apache Parquet to store the DataFrame efficiently, but that wouldn't make sense as a PR to a library designed for converting to Excel. That said, people shouldn't need to invent a caching solution for this, in my opinion. It would make sense to support it by default when the library takes 3-5 minutes to load the data into a DataFrame.
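For anyone hitting the same problem, here's a minimal sketch of the workaround we have in mind. It assumes a Parquet engine (pyarrow or fastparquet) is installed, the cache path is hypothetical, and the techniques DataFrame's columns serialize cleanly to Parquet:

import os
import pandas as pd
import mitreattack.attackToExcel.attackToExcel as attackToExcel
import mitreattack.attackToExcel.stixToDf as stixToDf

CACHE = "/path/to/cache/techniques.parquet"  # hypothetical cache location

if os.path.exists(CACHE):
    # fast path: reload the previously cached DataFrame in seconds
    techniques_df = pd.read_parquet(CACHE)
else:
    # slow path: download and parse the full STIX bundle (minutes)
    attackdata = attackToExcel.get_stix_data("enterprise-attack")
    techniques_df = stixToDf.techniquesToDf(attackdata, "enterprise-attack")["techniques"]
    techniques_df.to_parquet(CACHE)  # cache for subsequent kernel restarts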

Like I said, I don't know if it really fits into the library, since it's named as an Excel conversion tool, but I'm thinking something like:

attackToExcel.export_parquet("enterprise-attack", "v8.1", "/path/to/export/file")
attackdata = attackToExcel.import_parquet("/path/to/export/file")
techniques_data = stixToDf.techniquesToDf(attackdata, "enterprise-attack")

jt0dd commented Apr 28, 2022

Oh, I should've suggested Python's pickling feature rather than Parquet; pickle is better suited to very large and diverse data structures. I only have a year of Python experience, so I'd forgotten it was the right option here. Nonetheless, I still think caching the data to a file should be a built-in option in the library rather than something the user has to do manually. I could understand if the maintainers of the project feel differently; it's not that hard to cache with pandas.DataFrame.to_pickle. Just my suggestion / opinion.
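For reference, the manual version with pickle is only a couple of lines (the cache path is hypothetical, and techniques_df is the DataFrame from the earlier example):

import pandas as pd

# save the parsed DataFrame once...
techniques_df.to_pickle("/path/to/cache/techniques.pkl")
# ...then reload it almost instantly on subsequent runs
techniques_df = pd.read_pickle("/path/to/cache/techniques.pkl")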

jt0dd commented May 2, 2022