
Improving Performance


Let's say we have an event file containing 10^6 scalar events:

import os
from torch.utils.tensorboard import SummaryWriter

N_EVENTS = 10 ** 6
log_dir = "./tmp"
writer = SummaryWriter(os.path.join(log_dir, 'run'))
for i in range(N_EVENTS):
    # Log value i at step i under the tag 'y=2x'.
    writer.add_scalar('y=2x', i, i)
writer.close()  # flush the events to disk

and compare the loading time between pivot=False and pivot=True:

import time
from tbparse import SummaryReader

def time_tbparse():
    # A tuple (not a set) keeps the iteration order deterministic,
    # so pivot=False is always timed first.
    for use_pivot in (False, True):
        start = time.time()
        reader = SummaryReader("./tmp", pivot=use_pivot)
        df = reader.scalars
        end = time.time()
        print(f"pivot={use_pivot}:", end - start)

time_tbparse()

The results are 11 seconds and 24 seconds, respectively, on my Intel i7-9700 CPU and Seagate ST8000DM004 HDD. Using pivot=True takes roughly twice as long as pivot=False, and the gap grows much worse when parsing multiple files.

If we profile the code with cProfile:

import cProfile
cProfile.run('time_tbparse()')

we can see the results:

         206029117 function calls (191028625 primitive calls) in 66.427 seconds

   Ordered by: standard name
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
...
        6    0.000    0.000   34.819    5.803 apply.py:143(agg)
        3    0.000    0.000   34.819   11.606 apply.py:308(agg_list_like)
...
  3000000    5.838    0.000   24.541    0.000 summary_reader.py:209(_merge_values)
      6/2    0.001    0.000   35.408   17.704 summary_reader.py:237(get_events)
...
        2    0.001    0.001   35.409   17.705 summary_reader.py:304(scalars)
      6/2    0.169    0.028   31.403   15.701 summary_reader.py:61(__init__)
...
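
To focus on the hot path, the profile can also be sorted by cumulative time and filtered with pstats from the standard library (the output above was left in the default "standard name" order):

import cProfile
import pstats

# Write the profile to a file, then sort by cumulative time and
# show only the frames from summary_reader.py.
cProfile.run('time_tbparse()', 'tbparse.prof')
stats = pstats.Stats('tbparse.prof')
stats.sort_stats('cumtime').print_stats('summary_reader')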

The bottleneck is located in the _merge_values function called here, which is not executed when pivot=False.

I believe the _merge_values function can be optimized to improve the performance when using pivot=True.
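
For illustration, here is a minimal sketch of why the per-group aggregation is costly (the DataFrame below is illustrative, not tbparse's internal layout): a Python-level aggregation function runs once per group, while a vectorized pivot stays in compiled code.

import pandas as pd

# Illustrative long-format scalar table: one row per (tag, step) event.
long_df = pd.DataFrame({
    "step":  [0, 0, 1, 1],
    "tag":   ["y=2x", "loss", "y=2x", "loss"],
    "value": [0.0, 1.0, 2.0, 0.5],
})

# Per-group Python aggregation, similar in spirit to _merge_values:
# the per-group function-call overhead dominates for 10^6 events.
merged = long_df.groupby("step")["value"].aggregate(list)

# A vectorized pivot produces one column per tag without any
# per-group Python calls.
pivoted = long_df.pivot_table(index="step", columns="tag", values="value")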

Moreover, it would be nice to provide some benchmarks and document the performance analysis in the README file, which will be useful for future optimizations.

j3soon avatar Apr 21 '22 02:04 j3soon

Performance is slightly improved in commits 5d69fa1214b10cd36c02d82d59e2a4f6390941c1 and 4bd87404040fd85bdd72b89fefd21e9c6486d26a. Several benchmarks are provided in tbparse/profiling.

To further accelerate the parsing process, there are two potential solutions: Numba (supported by pandas) and cuDF.

For parsing single event files, the bottleneck is located in get_cols(...) and grouped.aggregate(self._merge_values).

  • Accelerating _merge_values with Numba is not straightforward due to the object data type and the unknown length of the resulting output.
  • As for get_cols(...), we know the number of rows/columns and the data type beforehand (based on the tensorboard event data). Therefore, it's possible to replace the lists with fixed-length NumPy arrays of non-object data type.

So the next step is to rewrite the get_cols(...) functions in NumPy-array style and provide an option that allows Numba to JIT-compile them, as sketched below.
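
For illustration, a minimal sketch of that style (pivot_scalars is a hypothetical function, not tbparse's API): it assumes the steps have already been remapped to dense indices 0..n_steps-1 and the tags to column indices, so every input is a fixed-length, non-object NumPy array that Numba can JIT-compile.

import numpy as np
from numba import njit  # optional dependency, assumed installed

@njit(cache=True)
def pivot_scalars(tag_idx, steps, values, n_steps, n_tags):
    # Fill a dense (n_steps, n_tags) table from parallel event arrays.
    out = np.full((n_steps, n_tags), np.nan)
    for i in range(steps.shape[0]):
        out[steps[i], tag_idx[i]] = values[i]
    return out

# Example: two tags, three steps, all dtypes known up front.
tag_idx = np.array([0, 1, 0, 1, 0], dtype=np.int64)
steps = np.array([0, 0, 1, 1, 2], dtype=np.int64)
values = np.array([0.0, 1.0, 2.0, 0.5, 4.0], dtype=np.float64)
table = pivot_scalars(tag_idx, steps, values, 3, 2)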

Update (2022/11/17): Similar to Numba, cuDF also does not support the object data type as mentioned here.

j3soon avatar May 18 '22 14:05 j3soon

When parsing many event files inside a deep filesystem hierarchy, parsing can be very slow.

This is due to the recursive tree-parsing logic (a bad design) that combines the DataFrames constructed in each subroutine, making the worst-case time complexity $O(n^2)$ for $n$ files.

The solution is to remove the recursive parsing logic and combine all DataFrames at once, improving the worst-case time complexity to $O(n)$.
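
A minimal sketch of the difference, where frames stands in for the per-file DataFrames (assuming files of similar size):

import pandas as pd

def combine_quadratic(frames):
    # Each concat copies every row accumulated so far,
    # so n files cost O(n^2) row copies overall.
    df = pd.DataFrame()
    for f in frames:
        df = pd.concat([df, f])
    return df

def combine_linear(frames):
    # Collect all DataFrames first and combine them once: O(n).
    return pd.concat(frames, ignore_index=True)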

j3soon avatar Aug 06 '22 16:08 j3soon