Improving Performance
Let's say we have an event file containing 10^6 scalar events:

```python
import os

from torch.utils.tensorboard import SummaryWriter

N_EVENTS = 10 ** 6
log_dir = "./tmp"
writer = SummaryWriter(os.path.join(log_dir, 'run'))
for i in range(N_EVENTS):
    writer.add_scalar('y=2x', i, i)
writer.close()
```
and compare the loading time between `pivot=False` and `pivot=True`:
```python
import time

from tbparse import SummaryReader

def time_tbparse():
    for use_pivot in [False, True]:
        start = time.time()
        reader = SummaryReader("./tmp", pivot=use_pivot)
        df = reader.scalars
        end = time.time()
        print(f"pivot={use_pivot}:", end - start)

time_tbparse()
```
The results are 11 seconds and 24 seconds respectively on my Intel i7-9700 CPU and Seagate ST8000DM004 HDD. Using `pivot=True` takes roughly twice as long as `pivot=False`, and the gap is much worse when parsing multiple files.
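As an illustration of what the two modes return (plain pandas here, not tbparse internals): `pivot=False` keeps scalars in long format (one row per event), while `pivot=True` reshapes the result so that each tag becomes a column, roughly like a `pivot_table`:

```python
import pandas as pd

# Long format: one row per (step, tag, value) event, as with pivot=False.
long_df = pd.DataFrame({
    'step': [0, 1, 0, 1],
    'tag': ['y=2x', 'y=2x', 'loss', 'loss'],
    'value': [0.0, 2.0, 1.0, 0.5],
})

# Wide format: one column per tag, as with pivot=True.
wide_df = long_df.pivot_table(index='step', columns='tag', values='value')
print(wide_df)
```

The extra reshaping and value-merging work is where the additional time goes.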
If we profile the code with cProfile:
```python
import cProfile

cProfile.run('time_tbparse()')
```
we can see the results:
```
         206029117 function calls (191028625 primitive calls) in 66.427 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      ...
        6    0.000    0.000   34.819    5.803 apply.py:143(agg)
        3    0.000    0.000   34.819   11.606 apply.py:308(agg_list_like)
      ...
  3000000    5.838    0.000   24.541    0.000 summary_reader.py:209(_merge_values)
      6/2    0.001    0.000   35.408   17.704 summary_reader.py:237(get_events)
      ...
        2    0.001    0.001   35.409   17.705 summary_reader.py:304(scalars)
      6/2    0.169    0.028   31.403   15.701 summary_reader.py:61(__init__)
      ...
```
The bottleneck is located in the `_merge_values` function called here, which is not executed when `pivot=False`.
I believe the `_merge_values` function can be optimized to improve the performance when using `pivot=True`.
Moreover, it would be nice to provide some benchmarks and document the performance analysis in the README file, which will be useful for future optimizations.
The performance is slightly improved in commit 5d69fa1214b10cd36c02d82d59e2a4f6390941c1 and 4bd87404040fd85bdd72b89fefd21e9c6486d26a. Several benchmarks are provided in tbparse/profiling.
To further accelerate the parsing process, there are two potential solutions: Numba (supported by pandas) and cuDF.
For parsing single event files, the bottleneck is located in `get_cols(...)` and `grouped.aggregate(self._merge_values)`.
- Accelerating `_merge_values` with Numba is not straightforward due to the `object` data type and the unknown length of the outcome results.
- As for `get_cols(...)`, we know the number of rows/columns and the data type beforehand (based on the tensorboard event data). Therefore, it's possible to replace the `list`s with `numpy` arrays of fixed length and a non-object data type.
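The `object`-dtype issue above can be illustrated with a minimal pandas sketch (not tbparse's actual code): aggregating each group's values into Python lists forces an `object`-dtype result, which neither pandas nor Numba can vectorize.

```python
import pandas as pd

# Grouping scalar events by tag and merging each group's values into lists,
# similar in spirit to grouped.aggregate(self._merge_values).
df = pd.DataFrame({
    'tag': ['a', 'a', 'b'],
    'value': [1.0, 2.0, 3.0],
})
grouped = df.groupby('tag')['value']
merged = grouped.aggregate(list)  # object-dtype Series of Python lists
print(merged)
```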
So the next step is to rewrite the `get_cols(...)` functions in numpy-array style and provide an option that allows Numba to JIT-compile these functions.
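A sketch of what this rewrite could look like, under the assumption that the number of events and their dtypes are known beforehand (the function names here are hypothetical, not tbparse's):

```python
import numpy as np

def collect_with_lists(events):
    # Current style: grow Python lists of unknown length, then convert.
    # This pattern is not Numba-friendly.
    steps, values = [], []
    for step, value in events:
        steps.append(step)
        values.append(value)
    return np.array(steps), np.array(values)

def collect_preallocated(events, n):
    # Numba-friendly style: preallocate fixed-length arrays with a known,
    # non-object dtype and fill them in place.
    steps = np.empty(n, dtype=np.int64)
    values = np.empty(n, dtype=np.float64)
    for i, (step, value) in enumerate(events):
        steps[i] = step
        values[i] = value
    return steps, values
```

The second form uses only fixed-shape numeric arrays, so a decorator such as `numba.njit` could in principle compile it.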
Update (2022/11/17): Similar to Numba, cuDF also does not support the `object` data type, as mentioned here.
When parsing many event files inside a deep filesystem hierarchy, the parsing speed might be very slow.
This is due to the use of recursive tree-parsing logic (bad design) to combine the DataFrames constructed in each subroutine, making the worst-case time complexity $O(n^2)$ for $n$ files.
The solution is to remove the recursive parsing logic and combine all DataFrames at once, improving the worst-case time complexity to $O(n)$.
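A minimal sketch of the two combining strategies (hypothetical helpers, not tbparse's actual code):

```python
import pandas as pd

def combine_recursively(dfs):
    # Bad design: concatenating at every level of the tree copies the
    # already-combined rows again and again, giving O(n^2) worst-case
    # behavior for n files.
    combined = None
    for df in dfs:
        combined = df if combined is None else pd.concat(
            [combined, df], ignore_index=True)
    return combined

def combine_at_once(dfs):
    # Collect all per-file DataFrames first, then concatenate once:
    # each row is copied a constant number of times, so the worst case
    # is O(n).
    return pd.concat(list(dfs), ignore_index=True)
```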