ish_parser

Performance improvement

Open vtoupet opened this issue 4 years ago • 4 comments

I am using your library with Pandas. Performance is not that good (it takes 1-2 seconds to process a full year). The reasons for this are:

  • operations are performed sequentially, although they could be partially vectorised.
  • everything is decoded, even when only some of the fields are needed.

The way I see things:

  • use pandas.read_fwf for the mandatory sections.
  • use the apply method for the remaining part of the string (additional fields + remarks).

Usually, you know what information you are trying to get (and it is probably not every field that is present). The idea would be to provide a list of desired fields. Based on that list, we could perform only the necessary decoding and return a pandas DataFrame (or a list of records).
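A minimal sketch of that idea, assuming a made-up fixed-width layout: the `LAYOUT` offsets, the field names, the `read_fields` helper, and the sample records are all hypothetical and are not the real ISH column positions or part of ish_parser. Only the requested columns are sliced by `pandas.read_fwf`, and the numeric decoding is done on the whole column at once.

```python
import io

import pandas as pd

# Hypothetical layout: field name -> (start, end), 0-based, end exclusive.
# These offsets are illustrative only, NOT the real ISH specification.
LAYOUT = {
    "station": (0, 6),
    "date": (6, 14),
    "time": (14, 18),
    "wind_dir": (18, 21),
    "air_temp": (21, 26),
}


def read_fields(text, fields):
    """Slice only the requested fixed-width fields into a DataFrame of strings."""
    colspecs = [LAYOUT[name] for name in fields]
    return pd.read_fwf(
        io.StringIO(text),
        colspecs=colspecs,
        names=list(fields),
        header=None,
        dtype=str,  # keep raw text; decode afterwards, column by column
    )


# Two made-up records that follow the hypothetical layout above.
SAMPLE = (
    "722050202107221500270+0256\n"
    "722050202107221600280-0012\n"
)

df = read_fields(SAMPLE, ["time", "air_temp"])
# Decode the whole column in one call (here assumed to be tenths of a degree).
df["air_temp"] = pd.to_numeric(df["air_temp"]) / 10
print(df)
```

The variable-length tail of each record (additional fields + remarks) does not fit `read_fwf` and would still need a per-row `apply`, but only for the fields actually requested.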

That would increase speed a lot.

Are you interested in such an evolution for your library?

Thanks, Vincent

vtoupet avatar Jul 22 '21 15:07 vtoupet

I am 100% interested in this :)

haydenth avatar Jul 22 '21 17:07 haydenth

OK, sat and thought about this for a few mins.

  • vectorized == parallelized or threaded?
  • love the idea of requesting only specific fields; that would speed it up dramatically

haydenth avatar Jul 22 '21 17:07 haydenth

Vectorized means not scalar. Instead of applying a function to one scalar at a time while iterating over a list of scalars, we apply the same function once to a whole vector (of dimension 1 x n). This is the main principle of NumPy and pandas, and it is much quicker.

I'll try to initiate something by September.
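To make the distinction concrete, here is a small sketch that decodes the same values both ways. The `raw` values and the "tenths of a unit" scaling are assumptions made up for the illustration, and `decode_scalar` is a hypothetical helper.

```python
import pandas as pd

# Made-up raw field values, stored as signed text.
raw = pd.Series(["+0256", "-0012", "+0103"])


def decode_scalar(value):
    """Decode a single value: one Python-level call per element."""
    return int(value) / 10


# Scalar approach: iterate and call the function once per element.
looped = [decode_scalar(v) for v in raw]

# Vectorized approach: one call converts and scales the whole column.
vectorized = pd.to_numeric(raw) / 10

print(looped)
print(vectorized.tolist())
```

Both approaches give the same numbers; the vectorized one moves the loop out of Python and into compiled code, which is where the speed-up on large columns comes from.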

vtoupet avatar Jul 23 '21 08:07 vtoupet

Oh, I see what you are saying. I don't think it would be crazy hard to make a layer above ish_report that vectorizes the individual ish_report objects so they can be used in a library like that.
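One possible shape for such a layer, as a rough sketch only: the `Report` class below is a hypothetical stand-in for a parsed record, not the actual ish_report API, and `reports_to_frame` is an invented helper name. It gathers the requested attributes from already-parsed objects into columns.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class Report:
    """Hypothetical stand-in for one parsed record (not the real ish_report)."""
    station: str
    air_temp: float
    wind_speed: float


def reports_to_frame(reports, fields):
    """Collect only the requested attributes into a column-oriented DataFrame."""
    return pd.DataFrame(
        {name: [getattr(r, name) for r in reports] for name in fields}
    )


reports = [
    Report("722050", 25.6, 3.1),
    Report("722050", 24.9, 2.6),
]
frame = reports_to_frame(reports, ["air_temp", "wind_speed"])
print(frame)
```

Note that this only makes downstream analysis vectorized; each record is still parsed one at a time, so it would not remove the parsing cost raised at the top of the thread.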

haydenth avatar Jul 23 '21 14:07 haydenth