pydash.get is slow

Open av1m opened this issue 2 years ago • 2 comments

I'm very interested in being able to retrieve data from nested lists and dictionaries. I wanted the equivalent of jsonata in Python, but I couldn't find anything. I therefore wrote my own function, and then I wondered how well it performed.

That is how I discovered pydash and pydash.get. Suffice it to say that I find the project amazing, but when I compared pydash.get with the function I had written, I was shocked. I also put a comparison in this gist.

I tested my code with Python 3.10.1 on macOS (M1).

av1m • Apr 25 '22 15:04

I took a look at your gist and reproduced similar results locally.

Inspecting pydash.get, I found that your implementation does not handle the same scenarios that pydash does. Some things that pydash.get does differently, and which make it slower:

  • Path keys can be like "foo.bar.0" and like "foo.bar[0]".
  • Path key delimiters can be escaped with backslashes like "foo\.bar" to get keys that contain ".".
  • Path keys like "0" will work for both integers and string keys in the target object (e.g. {0: True} and {"0": True} can be accessed with "0" as the path key).
  • In addition to accessing list/dicts, pydash supports namedtuples and class objects (i.e. attribute access).
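
To make those scenarios concrete, here is a rough stdlib-only sketch of a lookup that covers them. It is not pydash's actual implementation: `parse_path`, `get_deep`, and the token regex are hypothetical and only illustrate why supporting these cases costs extra work per call.

```python
import re
from collections import namedtuple

# Hypothetical helpers for illustration; this is NOT pydash's implementation.
# A token is either a bracketed index like [0], or a run of characters in
# which a backslash escapes the next character (so "\." is a literal dot).
_TOKEN = re.compile(r"\[(\d+)\]|((?:\\.|[^.\[\]\\])+)")
_MISSING = object()


def parse_path(path):
    """Split "foo.bar[0]" or "foo.bar.0" into ["foo", "bar", "0"]."""
    keys = []
    for index, name in _TOKEN.findall(path):
        # Bracketed indexes are kept as strings; escapes are removed from names.
        keys.append(index if index else re.sub(r"\\(.)", r"\1", name))
    return keys


def _get_one(obj, key):
    """Try the key as given, then as an int, then as an attribute name."""
    candidates = [key]
    if isinstance(key, str) and key.isdigit():
        candidates.append(int(key))
    for candidate in candidates:
        try:
            return obj[candidate]
        except (KeyError, IndexError, TypeError):
            pass
    if isinstance(key, str):
        return getattr(obj, key, _MISSING)
    return _MISSING


def get_deep(obj, keys, default=None):
    """Walk obj along keys, returning default as soon as a step fails."""
    for key in keys:
        obj = _get_one(obj, key)
        if obj is _MISSING:
            return default
    return obj


Point = namedtuple("Point", ["x", "y"])
data = {"foo": {"bar": ["a", "b"]}, "foo.bar": "escaped", "pt": Point(1, 2)}

dotted = get_deep(data, parse_path("foo.bar.0"))
bracketed = get_deep(data, parse_path("foo.bar[0]"))
escaped = get_deep(data, parse_path(r"foo\.bar"))
attribute = get_deep(data, parse_path("pt.x"))
```

Every string path goes through the regex and every key may need several attempts, which is the kind of overhead a plain chain of `obj[key]` lookups avoids.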

But ignoring most of the differences, I found that the biggest time sink is the regular expressions used by pydash to parse the path keys (i.e. supporting deep access with "items.0"/"items[0]" and backslash-escaping):

https://github.com/dgilland/pydash/blob/24ad0e43b51b367d00447c45baa68c9c03ad1a52/src/pydash/utilities.py#L1265-L1284

That bit of code takes up a large percentage of the overall execution time, but there is a way around that to improve performance without changing anything in pydash: use a list of path keys instead of a string. So instead of pydash.get(data, "0.repo.url") it would be pydash.get(data, [0, "repo", "url"]). That bypasses the regular expression evaluations and helps speed things up significantly.

I also have another library that is more performant and has similar features: fnc (but the argument order is different, so it would be fnc.get([0, "repo", "url"], data) instead).

If I update your gist to use the following pydash and fnc implementations:

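import fnc
import pydash


# Pass the path as a list of keys so pydash skips string-path parsing.
def get_pydash(data):
    return [pydash.get(data, [i, "repo", "url"]) for i in range(len(data) - 1)]


# fnc takes the path first and the data second.
def get_fnc(data):
    return [fnc.get([i, "repo", "url"], data) for i in range(len(data) - 1)]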

Then I get timing profiles like this:

Execution time for getdeep: 5.847658754999999
Execution time for pydash.get: 16.512301255
Execution time for fnc.get: 8.430758072

Still not as fast as your implementation, but not nearly as bad (with fnc being twice as fast as pydash).

dgilland • Apr 26 '22 02:04

Thank you for this comprehensive feedback.

Except for builtins, or where it isn't possible, I usually compare implementations using only lists or only str paths.

Regarding the use of a list or a str, I have the same constraint. The list is faster and the str must be parsed into a list...

I think it would be worthwhile to favor the use of lists (and to mention in the documentation that they are faster).

A possible improvement for pydash.get would be to inspect the str first and try to minimize parsing time, because a basic path such as "0.repo.url" shouldn't need a very complex algorithm.
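
A minimal sketch of that idea, assuming the only special syntax is backslash escapes and "[0]"-style indexes. `to_keys` and the regex are hypothetical, not part of pydash: simple paths take a plain `str.split`, anything else falls back to a regex, and results are cached so a repeated path is parsed only once.

```python
import re
from functools import lru_cache

# Hypothetical fallback tokenizer (not pydash's): a bracketed index like [0],
# or a run of characters where a backslash escapes the next character.
_TOKEN = re.compile(r"\[(\d+)\]|((?:\\.|[^.\[\]\\])+)")


@lru_cache(maxsize=1024)
def to_keys(path):
    """Return the path as a tuple of string keys, parsing as little as possible."""
    if "\\" not in path and "[" not in path:
        # Simple case such as "0.repo.url": a plain split, no regex involved.
        return tuple(path.split("."))
    # Slow path: only taken when escapes or brackets are actually present.
    return tuple(
        index if index else re.sub(r"\\(.)", r"\1", name)
        for index, name in _TOKEN.findall(path)
    )


simple = to_keys("0.repo.url")
bracketed = to_keys("items[0].name")
escaped = to_keys(r"a\.b.c")
repeated = to_keys("0.repo.url")  # served from the cache
```

The keys are returned as strings, so deciding whether "0" means an integer index or a string key is still left to the lookup step.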

If it helps, you can use the gist implementation.

av1m • Apr 26 '22 10:04

+1 for a list of path keys. In JS everything is an object, and a numeric key can be indexed with a string or a number. Python doesn't and shouldn't work this way. IMO you should deprecate the string behavior, switch to List[Union[str, int]], and index explicitly, so that pydash.get(obj, ["key", 0, "item_key"]) and pydash.get(obj, ["key", "0", "item_key"]) each reflect exactly what is intended.
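
A sketch of what that stricter behavior could look like. `get_strict` and `PathKey` are hypothetical names, not an existing pydash API: each key is used exactly as given, so 0 and "0" are never interchangeable.

```python
from typing import Any, List, Union

PathKey = Union[str, int]


def get_strict(obj: Any, path: List[PathKey], default: Any = None) -> Any:
    """Index obj with each key exactly as given; no str/int coercion."""
    for key in path:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            # The wrong key type (e.g. "0" on a list) is treated as a miss.
            return default
    return obj


obj = {
    "key": [{"item_key": "from list"}],
    "other": {"0": {"item_key": "from dict"}},
}

by_int = get_strict(obj, ["key", 0, "item_key"])
by_str_on_list = get_strict(obj, ["key", "0", "item_key"])
by_str = get_strict(obj, ["other", "0", "item_key"])
by_int_on_dict = get_strict(obj, ["other", 0, "item_key"])
```

Here only `by_int` and `by_str` find a value; the two mismatched lookups fall back to the default instead of guessing what the caller meant.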

drice • Apr 28 '23 20:04