filesystem_spec icon indicating copy to clipboard operation
filesystem_spec copied to clipboard

Proposal to change implementation of globbing for possible performance increase

Open tfelbr opened this issue 1 year ago • 9 comments

Hello, I am using the build-in SFTP implementation to access an SFTP server. Unfortunately, this server is pretty slow, so I used globbing with the hope it will speed up things.

The files on this server are ordered in different subdirectories. Only files in all subdirs that start with an "A" are relevant to me. The url I tried does look like this:

sftp://user:password@sftp_host/root/path/A*/*.zip

The outcome was as I expected, only files inside the dirs that start with an "A" were returned. But the performance was still bad so I decided to debug the whole thing and discovered that the filesystem still lists all subdirectories regardless of their name. I had a look into the implementation of the glob() function in the AbstractFileSystem and believe that these lines are responsible for that:

allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
# Escape characters special to python regex, leaving our supported
# special characters in place.
# See https://www.gnu.org/software/bash/manual/html_node/Pattern-Matching.html
# for shell globbing details.
pattern = (
    "^"
    + (
        path.replace("\\", r"\\")
        .replace(".", r"\.")
        .replace("+", r"\+")
        .replace("//", "/")
        .replace("(", r"\(")
        .replace(")", r"\)")
        .replace("|", r"\|")
        .replace("^", r"\^")
        .replace("$", r"\$")
        .replace("{", r"\{")
        .replace("}", r"\}")
        .rstrip("/")
        .replace("?", ".")
    )
    + "$"
)
pattern = re.sub("/[*]{2}", "=SLASH_DOUBLE_STARS=", pattern)
pattern = re.sub("[*]{2}/?", "=DOUBLE_STARS=", pattern)
pattern = re.sub("[*]", "[^/]*", pattern)
pattern = re.sub("=SLASH_DOUBLE_STARS=", "(|/.*)", pattern)
pattern = re.sub("=DOUBLE_STARS=", ".*", pattern)
pattern = re.compile(pattern)

out = {
    p: allpaths[p]
    for p in sorted(allpaths)
    if pattern.match(p.replace("//", "/").rstrip("/"))
}

It seems the filesystem first looks up all files and filters them afterwards. While this certainly delivers the expected results it is still slow as the server has to list all directories nevertheless.

I would like to propose to change the implementation and make it more performant, not only for SFTP. My first naive idea is to expand the existing find() and walk() functions (or add new ones) with the ability to apply regex filters while traversing the tree. Therefore the globbed path could be split up and a regex could be created for every subdirectory, as it is the case right now with the complete path. Then this list of regex strings could be used to filter the path of every subdirectory.

Please tell me what you think and if this could be a possible solution. In case I missed or overlooked something feel free to correct me!

tfelbr avatar Sep 08 '23 13:09 tfelbr

Ref: https://github.com/fsspec/filesystem_spec/pull/1263

martindurant avatar Sep 08 '23 13:09 martindurant

One thing to note, is that in remote blob filesystems, what counts is the number of calls, and find() is often one-shot and the fastest way to proceed, as opposed to listing each level and possible sublevel. However, when the code goes explicitly through walk, you have a good point.

I linked a recent PR which allows for the directories to walk to be edited by the caller during iteration if in top-down (depth-first) mode, and this could handily solve the case for you.

martindurant avatar Sep 08 '23 13:09 martindurant

Sorry for my late reply, there were some external circumstances that prevented me from responding.

So thank you at first for your answer! That is actually a really practical and good to know feature, but I'm still not sure if that can help me in my case. I am using the fsspec.open_files() function which in turn will call glob() and then find(). find() by itself, in its base implementation, propagates this call to walk(). As I understand it, neither the call to open_files() nor glob() nor find() lets me control the iteration of walk() and modify the list of dirnames. I would have to call walk() manually but then I would lose all the features the other methods and functions provide, especially glob().

Thank you for your advice again, and again please correct me if I missed anything here.

tfelbr avatar Sep 22 '23 12:09 tfelbr

I see what you are saying.

I don't quite picture how you might change the API in open_files and all the way down to allow filtering/pruning of the filesystem walk. Indeed you could use the existing way I mention above to get listings only in parts of the tree you require, and extract a list of matching paths to pass to open_files; the actual pattern matching in glob is pretty simple.

martindurant avatar Sep 25 '23 14:09 martindurant

hi, dont know whether this issue is the same as what you raise above.

I'm experiencing a performance problem when using a glob pattern like hdfs:///user/hive/warehouse/db/table/part_date=20240802*/*.parquet in the HDFS filesystem. It either hangs or takes 10+ minutes to return results. However, using the same pattern with the hdfs dfs -ls command returns results in just a few seconds.

Interestingly, when I use fsspec.glob with path hdfs:///user/hive/wareshoue/db/table/part_date=20240802*, the result returned in just a few secs

%timeit -r3
fs.glob("hdfs:///user/hive/warehouse/db/table/part_date=20240822*")

['/user/hive/warehouse/db/table/part_date=2024082200', '/user/hive/warehouse/db/table/part_date=2024082201', '/user/hive/warehouse/db/table/part_date=2024082202', '/user/hive/warehouse/db/table/part_date=2024082203']

2.79 s ± 98.3 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
  1. It seems like fsspec glob might be walking through all files in the root path hdfs:///user/hive/warehouse/db/table to fetch files, and then it tries to match the pattern. Is that correct?
  2. do you have any tips to increase performance in this case?

thanks

ntnhaatj avatar Aug 25 '24 06:08 ntnhaatj

You should be able to turn on logging to see what exact operations are being performed: are directories excluded by the upper part of the path being listed?

It's worth noting that HDFS support is provided via pyarrow; we could intercept glob and imrpove it, but as things are, this is not really calling fsspec code (I think).

martindurant avatar Aug 26 '24 13:08 martindurant

@martindurant I ran profiling the fs.glob function on my HDFS directory

# 1
profiled_glob("hdfs:///user/hive/warehouse/a.db/table/part_date=20240825*")
# 2
profiled_glob("hdfs:///user/hive/warehouse/a.db/table/part_date=20240825*/*.c000")

in both cases, the result always take most of the time at line which returned allpaths under root

allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
  • case 1
Timer unit: 1e-09 s

Total time: 2.71698 s
File: /home/n/venv/lib/python3.9/site-packages/fsspec/spec.py
Function: glob at line 545

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   545                                               def glob(self, path, maxdepth=None, **kwargs):
   546                                                   """
   547                                                   Find files by glob-matching.
   548                                           
   549                                                   If the path ends with '/', only folders are returned.
   550                                           
   551                                                   We support ``"**"``,
   552                                                   ``"?"`` and ``"[..]"``. We do not support ^ for pattern negation.
   553                                           
   554                                                   The `maxdepth` option is applied on the first `**` found in the path.
   555                                           
   556                                                   kwargs are passed to ``ls``.
   557                                                   """
   558         1       1552.0   1552.0      0.0          if maxdepth is not None and maxdepth < 1:
   559                                                       raise ValueError("maxdepth must be at least 1")
   560                                           
   561         1       2433.0   2433.0      0.0          import re
   562                                           
   563         1       1748.0   1748.0      0.0          seps = (os.path.sep, os.path.altsep) if os.path.altsep else (os.path.sep,)
   564         1       2089.0   2089.0      0.0          ends_with_sep = path.endswith(seps)  # _strip_protocol strips trailing slash
   565         1     145822.0 145822.0      0.0          path = self._strip_protocol(path)
   566         2       3828.0   1914.0      0.0          append_slash_to_dirname = ends_with_sep or path.endswith(
   567         1      10524.0  10524.0      0.0              tuple(sep + "**" for sep in seps)
   568                                                   )
   569         1       2993.0   2993.0      0.0          idx_star = path.find("*") if path.find("*") >= 0 else len(path)
   570         1       2135.0   2135.0      0.0          idx_qmark = path.find("?") if path.find("?") >= 0 else len(path)
   571         1       1081.0   1081.0      0.0          idx_brace = path.find("[") if path.find("[") >= 0 else len(path)
   572                                           
   573         1       2500.0   2500.0      0.0          min_idx = min(idx_star, idx_qmark, idx_brace)
   574                                           
   575         1       1193.0   1193.0      0.0          detail = kwargs.pop("detail", False)
   576                                           
   577         1      11975.0  11975.0      0.0          if not has_magic(path):
   578                                                       if self.exists(path, **kwargs):
   579                                                           if not detail:
   580                                                               return [path]
   581                                                           else:
   582                                                               return {path: self.info(path, **kwargs)}
   583                                                       else:
   584                                                           if not detail:
   585                                                               return []  # glob of non-existent returns empty
   586                                                           else:
   587                                                               return {}
   588         1       1666.0   1666.0      0.0          elif "/" in path[:min_idx]:
   589         1       2499.0   2499.0      0.0              min_idx = path[:min_idx].rindex("/")
   590         1        831.0    831.0      0.0              root = path[: min_idx + 1]
   591         1       2788.0   2788.0      0.0              depth = path[min_idx + 1 :].count("/") + 1
   592                                                   else:
   593                                                       root = ""
   594                                                       depth = path[min_idx + 1 :].count("/") + 1
   595                                           
   596         1        793.0    793.0      0.0          if "**" in path:
   597                                                       if maxdepth is not None:
   598                                                           idx_double_stars = path.find("**")
   599                                                           depth_double_stars = path[idx_double_stars:].count("/") + 1
   600                                                           depth = depth - depth_double_stars + maxdepth
   601                                                       else:
   602                                                           depth = None
   603                                           
   604         1 2696123724.0    3e+09     99.2          allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
   605                                           
   606         1     285860.0 285860.0      0.0          pattern = glob_translate(path + ("/" if ends_with_sep else ""))
   607         1       5354.0   5354.0      0.0          pattern = re.compile(pattern)
   608                                           
   609         2   17518646.0    9e+06      0.6          out = {
   610                                                       p: info
   611         1    2848568.0    3e+06      0.1              for p, info in sorted(allpaths.items())
   612                                                       if pattern.match(
   613                                                           (
   614                                                               p + "/"
   615                                                               if append_slash_to_dirname and info["type"] == "directory"
   616                                                               else p
   617                                                           )
   618                                                       )
   619                                                   }
   620                                           
   621         1       1150.0   1150.0      0.0          if detail:
   622                                                       return out
   623                                                   else:
   624         1       3190.0   3190.0      0.0              return list(out)

  • case 2: hang up since a seems take a lot of time to finish (~50 mins)
Timer unit: 1e-09 s

Total time: 44.7239 s
File: /home/n/venv/lib/python3.9/site-packages/fsspec/spec.py
Function: glob at line 545

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   545                                               def glob(self, path, maxdepth=None, **kwargs):
   546                                                   """
   547                                                   Find files by glob-matching.
   548                                           
   549                                                   If the path ends with '/', only folders are returned.
   550                                           
   551                                                   We support ``"**"``,
   552                                                   ``"?"`` and ``"[..]"``. We do not support ^ for pattern negation.
   553                                           
   554                                                   The `maxdepth` option is applied on the first `**` found in the path.
   555                                           
   556                                                   kwargs are passed to ``ls``.
   557                                                   """
   558         1       1061.0   1061.0      0.0          if maxdepth is not None and maxdepth < 1:
   559                                                       raise ValueError("maxdepth must be at least 1")
   560                                           
   561         1       1657.0   1657.0      0.0          import re
   562                                           
   563         1       1552.0   1552.0      0.0          seps = (os.path.sep, os.path.altsep) if os.path.altsep else (os.path.sep,)
   564         1       1239.0   1239.0      0.0          ends_with_sep = path.endswith(seps)  # _strip_protocol strips trailing slash
   565         1      49725.0  49725.0      0.0          path = self._strip_protocol(path)
   566         2       1632.0    816.0      0.0          append_slash_to_dirname = ends_with_sep or path.endswith(
   567         1       3535.0   3535.0      0.0              tuple(sep + "**" for sep in seps)
   568                                                   )
   569         1        955.0    955.0      0.0          idx_star = path.find("*") if path.find("*") >= 0 else len(path)
   570         1        731.0    731.0      0.0          idx_qmark = path.find("?") if path.find("?") >= 0 else len(path)
   571         1        444.0    444.0      0.0          idx_brace = path.find("[") if path.find("[") >= 0 else len(path)
   572                                           
   573         1        739.0    739.0      0.0          min_idx = min(idx_star, idx_qmark, idx_brace)
   574                                           
   575         1        504.0    504.0      0.0          detail = kwargs.pop("detail", False)
   576                                           
   577         1       5415.0   5415.0      0.0          if not has_magic(path):
   578                                                       if self.exists(path, **kwargs):
   579                                                           if not detail:
   580                                                               return [path]
   581                                                           else:
   582                                                               return {path: self.info(path, **kwargs)}
   583                                                       else:
   584                                                           if not detail:
   585                                                               return []  # glob of non-existent returns empty
   586                                                           else:
   587                                                               return {}
   588         1        635.0    635.0      0.0          elif "/" in path[:min_idx]:
   589         1       1503.0   1503.0      0.0              min_idx = path[:min_idx].rindex("/")
   590         1        480.0    480.0      0.0              root = path[: min_idx + 1]
   591         1       1396.0   1396.0      0.0              depth = path[min_idx + 1 :].count("/") + 1
   592                                                   else:
   593                                                       root = ""
   594                                                       depth = path[min_idx + 1 :].count("/") + 1
   595                                           
   596         1        429.0    429.0      0.0          if "**" in path:
   597                                                       if maxdepth is not None:
   598                                                           idx_double_stars = path.find("**")
   599                                                           depth_double_stars = path[idx_double_stars:].count("/") + 1
   600                                                           depth = depth - depth_double_stars + maxdepth
   601                                                       else:
   602                                                           depth = None
   603         1      63764.0  63764.0      0.0          print(kwargs)
   604         1        4e+10    4e+10    100.0          allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
   605                                                   pattern = glob_translate(path + ("/" if ends_with_sep else ""))
   606                                                   pattern = re.compile(pattern)
   607                                           
   608                                                   out = {
   609                                                       p: info
   610                                                       for p, info in sorted(allpaths.items())
   611                                                       if pattern.match(
   612                                                           (
   613                                                               p + "/"
   614                                                               if append_slash_to_dirname and info["type"] == "directory"
   615                                                               else p
   616                                                           )
   617                                                       )
   618                                                   }
   619                                           
   620                                                   if detail:
   621                                                       return out
   622                                                   else:
   623                                                       return list(out)

Total time: 44.7239 s
File: /tmp/ipykernel_7873/3826323558.py
Function: profiled_glob at line 9

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================

it seems that with the current glob implementation, it tried to get all paths under root then matching pattern at client side, can we have an option to push down the glob filter to the filesystem to boost performance?

ntnhaatj avatar Aug 28 '24 07:08 ntnhaatj

It would appear that find on HDFS is not respecting the depth parameter. The first of your calls should only be effectively a single call to ls, and the second a number of additional calls to the first level subdirectories. Unbounded find should only be called when the pattern contains "**". Of course, find can be implemented with ls/walk, so it woudl be worthwhile finding out exactly what is getting called here.

martindurant avatar Aug 28 '24 14:08 martindurant

@martindurant I see, my usage of fsspec is within DuckDB to interact with HDFS, which implicitly calls glob to scan directories. Unfortunately, the performance isn't good, likely due to reasons related to the large number of files, partitions, and HDFS namenode limitations.

ntnhaatj avatar Aug 31 '24 09:08 ntnhaatj