cti-python-stix2
feat: add 'pretty' parameter to optimize JSON serialization performance
This pull request lets callers pass a pretty parameter to the add method of the FileSystemSink class. The default setting pretty=True in the fp_serialize function slows down the insertion of large STIX objects. In some cases, such as XMitreCollection Enterprise ATT&CK, the insertion can fail and crash the thread. Setting pretty=False mitigates this problem and significantly improves performance.
Rationale
By allowing the pretty parameter to be specified in the add method, users can choose to disable "pretty" formatting when saving STIX objects, resulting in significant performance improvements, especially when dealing with large volumes of data.
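The shape of the change can be sketched in a few lines. This sketch does not use stix2: `ToySink` and `toy_serialize` are hypothetical stand-ins for FileSystemSink and fp_serialize, the standard-library json module replaces the real serializer, and the object and its id are purely illustrative. It only shows an add method that accepts a pretty flag and forwards it to the serializer.

```python
import io
import json


def toy_serialize(obj, fp, pretty=True):
    """Hypothetical stand-in for a serializer with a 'pretty' switch."""
    if pretty:
        # Human-readable: indented, with sorted keys.
        json.dump(obj, fp, indent=4, sort_keys=True)
    else:
        # Compact: no indentation, minimal separators.
        json.dump(obj, fp, separators=(",", ":"))


class ToySink:
    """Hypothetical sink that keeps serialized objects in memory by id."""

    def __init__(self):
        self.stored = {}

    def add(self, obj, pretty=True):
        # The flag is passed straight through to the serializer,
        # so callers can opt out of pretty formatting per call.
        buffer = io.StringIO()
        toy_serialize(obj, buffer, pretty=pretty)
        self.stored[obj["id"]] = buffer.getvalue()


sink = ToySink()
indicator = {"type": "indicator", "id": "indicator--example", "labels": ["a", "b"]}

sink.add(indicator)                # default: pretty output
pretty_text = sink.stored[indicator["id"]]

sink.add(indicator, pretty=False)  # compact output
compact_text = sink.stored[indicator["id"]]
```

Both calls store the same data; only the formatting differs. Keeping pretty=True as the default means existing callers see no change in behavior.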
Performance Tests
- **Performance**: Insertion with pretty=False is approximately twice as fast as with pretty=True.
Results with pretty=True:
Tue Jul 9 20:34:07 2024 ./pretty_true.txt
267530643 function calls (261758598 primitive calls) in 395.561 seconds
Ordered by: internal time
List reduced from 1060 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
157612 118.332 0.001 118.332 0.001 {method 'read' of '_ssl._SSLSocket' objects}
615930 24.570 0.000 25.648 0.000 {built-in method io.open}
557607/185998 18.579 0.000 108.664 0.001 /openstix-python/.venv/default/lib/python3.10/site-packages/stix2/base.py:115(__init__)
1231019 13.523 0.000 15.540 0.000 {built-in method posix.stat}
932494 10.714 0.000 10.714 0.000 {method 'write' of '_io.TextIOWrapper' objects}
186047 10.701 0.000 10.701 0.000 {built-in method posix.mkdir}
859476 10.171 0.000 10.719 0.000 {method '__exit__' of '_io._IOBase' objects}
185998 6.705 0.000 22.124 0.000 /openstix-python/.venv/default/lib/python3.10/site-packages/simplejson/encoder.py:306(iterencode)
371996 5.351 0.000 9.555 0.000 /usr/lib/python3.10/_strptime.py:309(_strptime)
391 5.151 0.013 5.151 0.013 {method 'do_handshake' of '_ssl._SSLSocket' objects}
Results with pretty=False:
Tue Jul 9 20:34:15 2024 ./pretty_false.txt
125050663 function calls (118632644 primitive calls) in 117.531 seconds
Ordered by: internal time
List reduced from 1015 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
327985 13.913 0.000 14.160 0.000 {built-in method io.open}
147785 13.138 0.000 13.138 0.000 {method 'read' of '_ssl._SSLSocket' objects}
571531 5.123 0.000 5.692 0.000 {method '__exit__' of '_io._IOBase' objects}
655506 5.101 0.000 5.879 0.000 {built-in method posix.stat}
126659/42220 4.598 0.000 26.854 0.001 /openstix-python/.venv/default/lib/python3.10/site-packages/stix2/base.py:115(__init__)
42264 2.481 0.000 2.481 0.000 {built-in method posix.mkdir}
5826222/2237607 2.266 0.000 14.854 0.000 /openstix-python/.venv/default/lib/python3.10/site-packages/simplejson/encoder.py:677(_iterencode)
243585 1.851 0.000 31.258 0.000 /usr/lib/python3.10/zipfile.py:1664(_extract_member)
243856 1.676 0.000 1.676 0.000 {method 'decompress' of 'zlib.Decompress' objects}
1393227/464409 1.654 0.000 4.674 0.000 /openstix-python/.venv/default/lib/python3.10/site-packages/stix2/serialization.py:159(find_property_index)
- **Testing with a large dataset**: Using the mitre-atlas dataset:
With pretty=True, the process was stopped after 301 seconds, generating 596K of incomplete data.
Wed Jul 10 20:49:26 2024 ./pretty_true_atlas.txt
1644951185 function calls (1349693667 primitive calls) in 301.613 seconds
Ordered by: internal time
List reduced from 971 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
220871233/18172 124.048 0.000 286.845 0.016 /openstix-python/.venv/default/lib/python3.10/site-packages/stix2/serialization.py:159(find_property_index)
590821359/590821355 52.875 0.000 105.371 0.000 {built-in method builtins.isinstance}
73587421/18160 42.840 0.000 286.805 0.016 /openstix-python/.venv/default/lib/python3.10/site-packages/stix2/serialization.py:137(_find_property_in_seq)
221211256 26.982 0.000 52.497 0.000 /usr/lib/python3.10/abc.py:117(__instancecheck__)
221211256 25.511 0.000 25.514 0.000 {built-in method _abc._abc_instancecheck}
220871233 10.677 0.000 10.677 0.000 {method 'isdigit' of 'str' objects}
73551105 4.198 0.000 4.198 0.000 {method 'values' of 'dict' objects}
924 3.480 0.004 3.480 0.004 {method 'read' of '_ssl._SSLSocket' objects}
52483/1 1.613 0.000 9.912 9.912 /openstix-python/.venv/default/lib/python3.10/site-packages/stix2/base.py:115(__init__)
1375284 0.605 0.000 1.170 0.000 /usr/lib/python3.10/collections/__init__.py:1000(__contains__)
With pretty=False, the process completed in 17 seconds, generating 90M of data.
Wed Jul 10 21:21:29 2024 ./pretty_false_atlas.txt
31057698 function calls (30303084 primitive calls) in 17.029 seconds
Ordered by: internal time
List reduced from 979 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
52925/2 1.599 0.000 9.891 4.946 /openstix-python/.venv/default/lib/python3.10/site-packages/stix2/base.py:115(__init__)
21995 0.725 0.000 0.736 0.000 {built-in method io.open}
22006 0.643 0.000 0.643 0.000 {built-in method posix.mkdir}
1385596 0.607 0.000 1.174 0.000 /usr/lib/python3.10/collections/__init__.py:1000(__contains__)
21991 0.602 0.000 2.389 0.000 /openstix-python/.venv/default/lib/python3.10/site-packages/simplejson/encoder.py:306(iterencode)
66011 0.535 0.000 0.535 0.000 {built-in method posix.stat}
959 0.523 0.001 0.523 0.001 {method 'read' of '_ssl._SSLSocket' objects}
884643 0.465 0.000 0.796 0.000 /openstix-python/.venv/default/lib/python3.10/site-packages/simplejson/encoder.py:39(encode_basestring)
1385596 0.438 0.000 1.807 0.000 /usr/lib/python3.10/collections/__init__.py:988(get)
44034 0.373 0.000 0.635 0.000 /usr/lib/python3.10/_strptime.py:309(_strptime)
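For readers who want to produce reports in the same format as the ones above, here is a minimal sketch using the standard-library cProfile and pstats modules. The `workload` function and the `profile.out` file name are placeholders, not the code or files used for these measurements; swap in the insertion code path being measured.

```python
import cProfile
import io
import json
import os
import pstats
import tempfile


def workload():
    """Placeholder workload; swap in the code path being measured."""
    data = [{"id": i, "name": f"object-{i}"} for i in range(2000)]
    return json.dumps(data, indent=4)


# Collect profiling data and write it to a stats file.
stats_path = os.path.join(tempfile.mkdtemp(), "profile.out")
profiler = cProfile.Profile()
profiler.runcall(workload)
profiler.dump_stats(stats_path)

# Load the file, sort by internal time (tottime), and keep the top 10 rows.
report = io.StringIO()
stats = pstats.Stats(stats_path, stream=report)
stats.sort_stats("tottime").print_stats(10)
report_text = report.getvalue()
print(report_text)
```

Sorting by tottime ranks functions by time spent in their own bodies, excluding callees. That ordering is what brings find_property_index to the top of the pretty=True run on mitre-atlas.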
Checklist
- [X] I have signed the Individual Contributor License Agreement (CLA)
- [X] I have read the Contribution Guide
- [X] I ran the unit tests and they all passed
- [X] Added/updated documentation as necessary
Thanks a lot for the contribution.