cti-python-stix2

feat: add 'pretty' parameter to optimize JSON serialization performance

Open PedroHenriqueFernandes opened this issue 1 year ago • 1 comments

This pull request optimizes the add method of the FileSystemSink class by exposing a pretty parameter. The current default of pretty=True when serializing with fp_serialize slows the insertion of large STIX objects. In some cases, such as XMitreCollection Enterprise ATT&CK, the insertion can fail and crash the thread. Setting pretty=False mitigates this problem and significantly improves performance.

Rationale

Exposing the pretty parameter on the add method lets users disable "pretty" formatting when saving STIX objects, which significantly improves performance, especially with large volumes of data.
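
To illustrate the trade-off in isolation, the sketch below compares indented and compact output using only Python's standard json module. It does not call stix2, fp_serialize, or FileSystemSink; the make_bundle helper and its sample data are hypothetical and only loosely STIX-like. Timings depend on the machine, so none are quoted here.

```python
import json
import timeit


def make_bundle(n_objects: int) -> dict:
    """Build a hypothetical STIX-like bundle (illustrative data only)."""
    objects = [
        {
            "type": "indicator",
            "id": f"indicator--{i:08d}",
            "labels": ["malicious-activity"],
            "pattern": f"[file:name = 'sample-{i}.exe']",
        }
        for i in range(n_objects)
    ]
    return {"type": "bundle", "objects": objects}


bundle = make_bundle(2000)

# "Pretty" output: indented, one key per line.
pretty = json.dumps(bundle, indent=4)
# Compact output: no indentation, minimal separators.
compact = json.dumps(bundle, separators=(",", ":"))

# Both forms carry identical data; only the formatting differs.
assert json.loads(pretty) == json.loads(compact)
print(f"pretty:  {len(pretty)} characters")
print(f"compact: {len(compact)} characters")

# Rough timing comparison; results vary from machine to machine.
t_pretty = timeit.timeit(lambda: json.dumps(bundle, indent=4), number=5)
t_compact = timeit.timeit(
    lambda: json.dumps(bundle, separators=(",", ":")), number=5
)
print(f"pretty:  {t_pretty:.3f} s for 5 runs")
print(f"compact: {t_compact:.3f} s for 5 runs")
```

The compact form is always smaller on disk. How much time it saves inside stix2 depends on the library's own serialization path, which this sketch does not exercise.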

Performance Tests

  1. Performance: Insertion with pretty=False is approximately twice as fast as with pretty=True.

Results with pretty=True:

Tue Jul  9 20:34:07 2024    ./pretty_true.txt

         267530643 function calls (261758598 primitive calls) in 395.561 seconds

   Ordered by: internal time
   List reduced from 1060 to 10 due to restriction <10>

   ncalls       tottime  percall  cumtime  percall filename:lineno(function)
   157612       118.332    0.001  118.332    0.001 {method 'read' of '_ssl._SSLSocket' objects}
   615930       24.570    0.000   25.648    0.000 {built-in method io.open}
557607/185998   18.579    0.000  108.664    0.001 /openstix-python/.venv/default/lib/python3.10/site-packages/stix2/base.py:115(__init__)
  1231019       13.523    0.000   15.540    0.000 {built-in method posix.stat}
   932494       10.714    0.000   10.714    0.000 {method 'write' of '_io.TextIOWrapper' objects}
   186047       10.701    0.000   10.701    0.000 {built-in method posix.mkdir}
   859476       10.171    0.000   10.719    0.000 {method '__exit__' of '_io._IOBase' objects}
   185998       6.705    0.000   22.124    0.000 /openstix-python/.venv/default/lib/python3.10/site-packages/simplejson/encoder.py:306(iterencode)
   371996       5.351    0.000    9.555    0.000 /usr/lib/python3.10/_strptime.py:309(_strptime)
      391       5.151    0.013    5.151    0.013 {method 'do_handshake' of '_ssl._SSLSocket' objects}

Results with pretty=False:

Tue Jul  9 20:34:15 2024    ./pretty_false.txt

         125050663 function calls (118632644 primitive calls) in 117.531 seconds

   Ordered by: internal time
   List reduced from 1015 to 10 due to restriction <10>

   ncalls       tottime  percall  cumtime  percall filename:lineno(function)
   327985       13.913    0.000   14.160    0.000 {built-in method io.open}
   147785       13.138    0.000   13.138    0.000 {method 'read' of '_ssl._SSLSocket' objects}
   571531       5.123    0.000    5.692    0.000 {method '__exit__' of '_io._IOBase' objects}
   655506       5.101    0.000    5.879    0.000 {built-in method posix.stat}
126659/42220    4.598    0.000   26.854    0.001 /openstix-python/.venv/default/lib/python3.10/site-packages/stix2/base.py:115(__init__)
    42264       2.481    0.000    2.481    0.000 {built-in method posix.mkdir}
5826222/2237607 2.266    0.000   14.854    0.000 /openstix-python/.venv/default/lib/python3.10/site-packages/simplejson/encoder.py:677(_iterencode)
   243585       1.851    0.000   31.258    0.000 /usr/lib/python3.10/zipfile.py:1664(_extract_member)
   243856       1.676    0.000    1.676    0.000 {method 'decompress' of 'zlib.Decompress' objects}
1393227/464409  1.654    0.000    4.674    0.000 /openstix-python/.venv/default/lib/python3.10/site-packages/stix2/serialization.py:159(find_property_index)

  2. Testing with a large dataset, mitre-atlas:

With pretty=True, the process was stopped after 301 seconds, generating 596K of incomplete data.

Wed Jul 10 20:49:26 2024    ./pretty_true_atlas.txt

         1644951185 function calls (1349693667 primitive calls) in 301.613 seconds

   Ordered by: internal time
   List reduced from 971 to 10 due to restriction <10>

   ncalls           tottime  percall  cumtime  percall filename:lineno(function)
220871233/18172     124.048    0.000  286.845    0.016 /openstix-python/.venv/default/lib/python3.10/site-packages/stix2/serialization.py:159(find_property_index)
590821359/590821355 52.875    0.000  105.371    0.000 {built-in method builtins.isinstance}
73587421/18160      42.840    0.000  286.805    0.016 /openstix-python/.venv/default/lib/python3.10/site-packages/stix2/serialization.py:137(_find_property_in_seq)
221211256           26.982    0.000   52.497    0.000 /usr/lib/python3.10/abc.py:117(__instancecheck__)
221211256           25.511    0.000   25.514    0.000 {built-in method _abc._abc_instancecheck}
220871233           10.677    0.000   10.677    0.000 {method 'isdigit' of 'str' objects}
 73551105           4.198    0.000    4.198    0.000 {method 'values' of 'dict' objects}
      924           3.480    0.004    3.480    0.004 {method 'read' of '_ssl._SSLSocket' objects}
  52483/1           1.613    0.000    9.912    9.912 /openstix-python/.venv/default/lib/python3.10/site-packages/stix2/base.py:115(__init__)
  1375284           0.605    0.000    1.170    0.000 /usr/lib/python3.10/collections/__init__.py:1000(__contains__)

With pretty=False, the process completed in 17 seconds, generating 90M of data.

Wed Jul 10 21:21:29 2024    ./pretty_false_atlas.txt

         31057698 function calls (30303084 primitive calls) in 17.029 seconds

   Ordered by: internal time
   List reduced from 979 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  52925/2    1.599    0.000    9.891    4.946 /openstix-python/.venv/default/lib/python3.10/site-packages/stix2/base.py:115(__init__)
    21995    0.725    0.000    0.736    0.000 {built-in method io.open}
    22006    0.643    0.000    0.643    0.000 {built-in method posix.mkdir}
  1385596    0.607    0.000    1.174    0.000 /usr/lib/python3.10/collections/__init__.py:1000(__contains__)
    21991    0.602    0.000    2.389    0.000 /openstix-python/.venv/default/lib/python3.10/site-packages/simplejson/encoder.py:306(iterencode)
    66011    0.535    0.000    0.535    0.000 {built-in method posix.stat}
      959    0.523    0.001    0.523    0.001 {method 'read' of '_ssl._SSLSocket' objects}
   884643    0.465    0.000    0.796    0.000 /openstix-python/.venv/default/lib/python3.10/site-packages/simplejson/encoder.py:39(encode_basestring)
  1385596    0.438    0.000    1.807    0.000 /usr/lib/python3.10/collections/__init__.py:988(get)
    44034    0.373    0.000    0.635    0.000 /usr/lib/python3.10/_strptime.py:309(_strptime)
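
Reports in the format shown above can be generated with the standard library's cProfile and pstats modules. The following is a minimal sketch, not the script used for these measurements: save_objects is a hypothetical stand-in workload that writes small JSON files, and the dump file name is arbitrary.

```python
import cProfile
import io
import json
import os
import pstats
import tempfile


def save_objects(directory: str, count: int, pretty: bool) -> None:
    """Hypothetical workload: write `count` small JSON files to `directory`."""
    for i in range(count):
        path = os.path.join(directory, f"object-{i}.json")
        with open(path, "w", encoding="utf-8") as fp:
            json.dump(
                {"id": i, "name": f"object-{i}"},
                fp,
                indent=4 if pretty else None,
            )


def profile_to_report(pretty: bool, top: int = 10) -> str:
    """Profile save_objects and return a pstats report as text."""
    with tempfile.TemporaryDirectory() as workdir:
        dump_path = os.path.join(workdir, "profile.out")

        profiler = cProfile.Profile()
        profiler.enable()
        save_objects(workdir, 200, pretty)
        profiler.disable()
        profiler.dump_stats(dump_path)  # stats dump that pstats can reload

        buffer = io.StringIO()
        stats = pstats.Stats(dump_path, stream=buffer)
        # Sorting by "tottime" produces the "Ordered by: internal time" header;
        # the integer argument limits the listing to that many rows.
        stats.sort_stats("tottime").print_stats(top)
        return buffer.getvalue()


report = profile_to_report(pretty=True)
print(report)
```

Calling profile_to_report(pretty=False) produces the matching report for the compact case, so the two listings can be compared side by side.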

Checklist

  • [X] I have signed the Individual Contributor License Agreement (CLA)
  • [X] I have read the Contribution Guide
  • [X] I ran the unit tests and they all passed
  • [X] Added/updated documentation as necessary

PedroHenriqueFernandes, Jul 11 '24 14:07

CLA assistant check
All committers have signed the CLA.

CLAassistant, Jul 11 '24 14:07

Thanks a lot for the contribution.

adulau, Oct 15 '24 09:10