Performance improvement
I am using your library with Pandas, and performance is not great: it takes 1-2 seconds to process a full year of data. The reasons for this are:
- operations are performed sequentially, while they could be partially vectorised.
- everything is decoded, even when you only need a few fields.
The way I see things:
- use pandas.read_fwf for the mandatory sections
- use apply method for the remaining part of the string (additional fields + remarks).
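To make the first point concrete, here is a minimal sketch of the `pandas.read_fwf` approach. The column offsets and field names below are placeholders for illustration, not the real ISH/ISD record layout:

```python
import io
import pandas as pd

# Illustrative fixed-width sample; the offsets below are placeholders,
# NOT the real ISH/ISD mandatory-section layout.
raw = io.StringIO(
    "0123202001010600+0123\n"
    "0456202001011200-0032\n"
)

# (start, end) offsets for each mandatory field (assumed for this sketch)
colspecs = [(0, 4), (4, 12), (12, 16), (16, 21)]
names = ["station", "date", "time", "air_temp"]

# Read everything as strings, then convert column-wise (vectorised)
df = pd.read_fwf(raw, colspecs=colspecs, names=names, dtype=str)
df["date"] = pd.to_datetime(df["date"], format="%Y%m%d")
df["air_temp"] = pd.to_numeric(df["air_temp"]) / 10.0  # scaled integer -> degrees C
```

The point is that each conversion runs once over a whole column instead of once per record.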
Usually you know which information you are trying to extract (and probably not every field that is present). The idea would be to provide a list of desired fields; based on that list, we could perform only the necessary decoding and return a Pandas DataFrame (or a list of records).
That would increase speed a lot.
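A sketch of what such a field-selection API could look like. The registry, field names, and offsets here are all hypothetical, not the library's real API:

```python
import pandas as pd

# Hypothetical field registry: each entry maps a field name to its slice
# of the raw record plus a vectorised decoder. Names and offsets are
# illustrative only.
FIELD_SPECS = {
    "station":  ((0, 4),   lambda s: s),
    "air_temp": ((16, 21), lambda s: pd.to_numeric(s) / 10.0),
}

def decode(records, fields):
    """Decode only the requested fields from raw fixed-width records."""
    raw = pd.Series(records)
    out = {}
    for name in fields:
        (start, stop), decoder = FIELD_SPECS[name]
        out[name] = decoder(raw.str[start:stop])
    return pd.DataFrame(out)

records = ["0123202001010600+0123", "0456202001011200-0032"]
df = decode(records, ["air_temp"])  # only air_temp is decoded
```

Fields that were not requested are never sliced or decoded, which is where the speed-up comes from.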
Are you interested in such an evolution for your library?
Thanks, Vincent
i am 100% interested in this :)
OK, I sat and thought about this for a few minutes:
- vectorized == parallelized or threaded?
- love the idea of requesting only specific fields; that would speed it up dramatically
Vectorized means not scalar: instead of applying a function to one scalar at a time and iterating over a list of scalars, we apply the same function to a whole vector (of dimension 1 x n). This is the main principle behind NumPy and Pandas, and it is much quicker.
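A tiny NumPy example of the difference (the scaled-integer temperatures here are made up for illustration):

```python
import numpy as np

# Raw "scaled integer" temperatures, in tenths of a degree
temps = np.arange(100_000, dtype=np.float64)

# Scalar approach: call the operation once per element in a Python loop
scaled_loop = [t / 10.0 for t in temps]

# Vectorised approach: one operation on the whole 1 x n vector; the loop
# runs in compiled code, which is why NumPy/Pandas are so much faster
scaled_vec = temps / 10.0

assert np.allclose(scaled_loop, scaled_vec)
```

Both produce the same values; only the vectorised version avoids the per-element Python overhead.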
I'll try to get something started by September.
Oh, I see what you are saying. I don't think it would be crazy hard to make a layer above ish_report that vectorizes the individual ish_report objects so they can be used in a library like that.
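The shape of that layer might look like the sketch below. `DummyReport` is a stand-in: the real `ish_report` class and its attributes almost certainly differ, and the offsets are invented for illustration:

```python
import pandas as pd

# Stand-in for ish_report; the real class and attribute names differ.
class DummyReport:
    def __init__(self, line):
        self.station = line[0:4]                 # assumed offsets
        self.air_temp = int(line[16:21]) / 10.0  # scaled integer -> degrees C

def reports_to_frame(lines, fields):
    """Parse each raw line into a report object, then collect only the
    requested attributes into one DataFrame."""
    reports = [DummyReport(line) for line in lines]
    return pd.DataFrame({f: [getattr(r, f) for r in reports] for f in fields})

df = reports_to_frame(["0123202001010600+0123"], ["station", "air_temp"])
```

Per-record decoding still happens object by object here; the layer only batches the results, so the bigger win would still come from requesting fewer fields.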