beets icon indicating copy to clipboard operation
beets copied to clipboard

Use pathlib everywhere in beets

Open brunal opened this issue 9 years ago • 16 comments

Pathlib is awesome. It abstracts path the platform for path manipulation. It would allow us to have more robust code and get rid to all the calls to beets.util.*_path() functions. It could be extended to provide desired repr().

I feel like working on it. It is huge though.

  • Am I missing a delicate point that prevents us from using it
  • Should it be a 1.3.12 milestone?

brunal avatar Apr 09 '15 16:04 brunal

I'm all for it. Our current path handling stuff is too fiddly.

I have one important concern: encodings. The core problem is that Unix paths need to be bytes and Windows paths need to be Unicode strings (more or less). We need to store paths in a database, so eventually they need to be represented internally as one or the other. Beets currently chooses byte strings, and on Windows the "true" paths are encoded with UTF-8 just for storage and uniform manipulation. I'm not quite clear on how pathlib would interact with this constraint.

I also don't know how pathlib handles long names on Windows (the issue described here).

Maybe the right approach here would be to start with a small prototype, see how it goes on different platforms, and then attempt the full migration.

sampsyo avatar Apr 09 '15 22:04 sampsyo

Currently,

  • all paths are byte strings
  • when there is a path in the input command we encode it with beets.util._fsencoding()
  • when we display a path we decode it with that same encoding
  • displayable_path() does that operation
  • syspath() does some ugly dark magic
  • bytestring_path() ensures always gives a byte string path, no matter the input
  • normpath() expands everything and gives a byte string

Is that right? Am I forgetting something?

Python 3.2 introduced os.fs{en,de}code(): https://docs.python.org/3/library/os.html#os.fsencode The source is pretty simple (Lib/os.py:800) and shares some traits with our functions. However mbcs encoding only triggers another error mode: strict instead of surrogateescape.

I count ~100 {en,de}code calls in the beets/ folder. It'd be good to get rid of most of them.

brunal avatar Apr 10 '15 10:04 brunal

  • surrogateescape is not natively available in python 2
  • pathlib expects unicode paths
  • pathlib port on python 2 is not up-to-date with the 3.4 version and does not use surrogateescape (since it's not natively available)

I think the solution is to use python-future (https://github.com/PythonCharmers/python-future) which provides a future.utils.surrogateescape module.

brunal avatar Apr 10 '15 11:04 brunal

Yes, surrogate escaping is probably now the Pythonic way of doing this. A few thoughts/questions:

  • How hard will it be to be to make pathlib use surrogate escaping on Python 2 via python-future?
  • How will this work with people's existing databases? Shall we incrementally "upgrade" current bytestring paths to surrogate-escaped Unicode?
  • How good is our current test coverage for unexpected, badly-encoded filenames? A few more tests might be useful to make sure we don't break anything.
  • I'm not sure what to do about that "strict" mode on Windows in the new fsencode. We'll at least need to handle this case to avoid crashing.
  • A cursory glance at the source suggests that pathlib does not, by itself, do the \\?\ prefix on Windows that syspath does. It's possible I'm missing something, but if that's the case, it's an important oversight—we'll need to find some way to add that back. :grimacing:

sampsyo avatar Apr 10 '15 16:04 sampsyo

Roughly,

  • I don't know yet. I feel like I'll won't be able to use the proposed Pathlib backport but another one that would install surrogate escaping from python-future before the real pathlib code.
  • I'm not sure updating the db model is needed: just take the bytes before sending it to the db
  • It's decent but not that good. Right now it breaks every few weeks (due to future.unicode_literals)
  • I believe Windows only provides Unicode filenames and deals with encoding problems itself, so we don't need to do anything
  • Indeed it does not! Another thing missing is path expansion (~ → /home/foo). Since Pathlib is OO, it's just a matter of subclassing it and adding/overriding required methods.

brunal avatar Apr 10 '15 17:04 brunal

OK, this migration plan sounds good.

On Windows exceptions: Yes, if the OS can be trusted to always supply Unicode, then we should be OK. Hopefully, paths do not come from any other source that could contain surrogate escapes.

On the backport: FWIW, it looks like someone has tried starting a maintained backport, but it appears abandoned.

sampsyo avatar Apr 10 '15 18:04 sampsyo

I don't think that pathlib is able to replace truncate/sanitize _path from what I've seen of it. It's a module for providing many of the functions of os.path with some corrections in an object-oriented way, and making some useful information about the path available (eg. root, suffix, filename). While this in itself is useful and could clean up a lot of code, it looks to me like we would still have to do the encoding/decoding of the path, and removal of unsafe characters.

LordSputnik avatar Jul 14 '15 21:07 LordSputnik

Agreed; while pathlib will definitely be nice, it will not solve our encoding and sanitation problems. Those will still be 100% up to us.

sampsyo avatar Jul 14 '15 22:07 sampsyo

there's now a maintained pathlib port that matches the stdlib version https://github.com/mcmtroffaes/pathlib2

@brunal : do you still wanna work on this? I'd love to help out.

jrobeson avatar Jul 08 '16 23:07 jrobeson

I'd like to tackle this and I want to make sure I'm considering all the potential issues. The proposed solution I am thinking is using pathlib everywhere and storing paths as strings in the database (using pathlib's string representation).

PEP 519 on pathlib

The hope is that people will gravitate towards path objects like pathlib and that will move people away from operating directly with bytes.

Some current issues:

  1. Long windows paths

    • If I'm understanding it correctly, the issue is basically defined here.

    Issue: Pathlib doesn't seem to correctly deal with these situations, as illustrated by the answer to that stackoverflow question. Potential Solution: Don't attempt to solve this on beet's end. Instead, direct users to enable long path support explicitly. This is the same decision that python made.

  2. Dealing with user's current databases Issue: Currently, all user's paths are bytes, and pathlib doesn't accept bytes. Potential Solution: I'm not the most experienced with databases, but is this something that could be solved with a database migration?

  3. Unix uses bytes Issue: So pathlib abstracts this layer and has it's own method for opening files and such. However, how do we deal with native Unix utilities that return and deal with bytes? Potential Solution: Python3 has surrogate escapes, which can represent bytes as strings, which can then be passed to pathlib to create a path object. These path objects then get passed around instead of the surrogate escape strings, which should hopefully prevent any unforeseen issues that come with using surrogate escapes directly.

Other notes:

  1. As far as enforcement, type hints and then static type checking may prove to be useful.
  2. PEP 519 (mentioned above) introduces a lot of nice adoption features for pathlib, notably, most (all?) os modules and open now accept pathlib objects. The issue is that the above PEP was introduced to Python 3.6. Whether or not these features are necessary for beet's adoption, I'm not sure.

These are my rough initial thoughts on the matter, and I'm only just now looking into these issues, so I'm sure I'm missing something and am hoping others can help fill in the gaps so we can work this out.

Other relevant PEPS: 529, 538, 540, 383

jtpavlock avatar Jul 28 '20 06:07 jtpavlock

Hi! Thanks for being interested in taking this on!

I think the big obstacle here is about Unicode, surrogate escapes, and SQLite storage. Namely, because surrogate escapes are a Python-specific quirk, SQLite cannot store strings that contain them. For example:

>>> weird_string = 'café'.encode('latin1').decode('utf8', 'surrogateescape')
>>> import sqlite3
>>> sqlite3.connect(':memory:').execute('select ?;', (weird_string,)).fetchall()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 3: surrogates not allowed

Basically, SQLite stores values as bytes internally. Python tries to "pretend" that it's storing Unicode objects by relying on UTF-8 encoding.

Regardless, it seems easier to attempt such a migration while leaving the database intact. At some point, we will need to convert between Path objects and "primitive" values that can go into the database. We might as well continue using the same conversion we use now—namely, represent the paths as the actual bytes that came from the operating system.


Separately, one thing I'm pretty worried about here is the interaction between command-line argument encodings and filesystem encodings. On Unix, if you specify a file as a CLI argument, that's bytes—and probably exactly the same bytes that appear on the filesystem. Python 3, however, will try to decode that using the argument encoding and then, when you send it to the filesystem, re-encode it using the filesystem encoding—neither of which might actually match the filename bytes as written! Surrogate escapes don't really solve this because it has to do with representable bytes rather than unrepresentable ones. I'm not sure how pathlib will make this worse or better.


Finally, for long filenames on Windows, I wonder whether we couldn't find a nice pathlib-native solution. I don't think the idea from SO where the actual Path object contains the magic prefix is the way to go, but maybe we can convince the library to add the prefix just before calling an OS function, as we currently do manually with syspath. Maybe with a subclass of WindowsPath or something.

An ambitious alternative would be to roll our own PEP 519-based library that uses a native representation everywhere.


Finally finally, you briefly remarked that type annotations could help with this. I agree! Maybe a good plan of attack would be (after dropping Python 2 support) just go whole-hog with type annotations in parts of the code that deal with paths. Then a later phase can explore holistic bytes -> Path conversion.

sampsyo avatar Jul 28 '20 11:07 sampsyo

Regardless, it seems easier to attempt such a migration while leaving the database intact. At some point, we will need to convert between Path objects and "primitive" values that can go into the database. We might as well continue using the same conversion we use now—namely, represent the paths as the actual bytes that came from the operating system.

I'll keep it as is, at least for the first iteration.

Separately, one thing I'm pretty worried about here is the interaction between command-line argument encodings and filesystem encodings. On Unix, if you specify a file as a CLI argument, that's bytes—and probably exactly the same bytes that appear on the filesystem. Python 3, however, will try to decode that using the argument encoding and then, when you send it to the filesystem, re-encode it using the filesystem encoding—neither of which might actually match the filename bytes as written! Surrogate escapes don't really solve this because it has to do with representable bytes rather than unrepresentable ones. I'm not sure how pathlib will make this worse or better.

Hopefully this is something the abstraction of the os module and pathlib objects will take care of. I suppose we'll see if that's true or not.

Finally, for long filenames on Windows, I wonder whether we couldn't find a nice pathlib-native solution. I don't think the idea from SO where the actual Path object contains the magic prefix is the way to go, but maybe we can convince the library to add the prefix just before calling an OS function, as we currently do manually with syspath. Maybe with a subclass of WindowsPath or something.

An ambitious alternative would be to roll our own PEP 519-based library that uses a native representation everywhere.

Might be worth digging into. If it begins to create more problems than it's worth, then maybe we can decide otherwise.

jtpavlock avatar Jul 28 '20 15:07 jtpavlock

My initial plan is to pass pathlib.PurePath objects around everywhere, converting to pathlib.Path objects if actual file operations are needed, and converting the objects to and from bytes for the database. Consider the following example:

>>> weird_bytes = os.fsencode('café')
>>> weird_bytes
b'caf\xc3\xa9'
>>> weird_str = os.fsdecode(weird_bytes)
>>> weird_str
'café'
>>> p = Path(weird_str)
>>> p
PosixPath('café')
>>> bytes(p)
b'caf\xc3\xa9'

Does this seem reasonable?

Note, it is possible to call str(path) for all windows paths and bytes(path) for all unix paths for storing in the database. I'm not sure if that makes things better or not.

The more I get into it though, the more it seems life would be much easier if we had the benefits of PEP 519 i.e. dropping python 3.5 support. For reference, here's the download stats for the last 180 days:

$ pypinfo -d 180 beets pyversion 
Served from cache: False
Data processed: 643.30 GiB
Data billed: 643.30 GiB
Estimated cost: $3.15

| python_version | download_count |
| -------------- | -------------- |
| 3.8            |          4,146 |
| 3.7            |          2,570 |
| 2.7            |          1,923 |
| 3.6            |          1,861 |
| 3.5            |            132 |
| 3.9            |             77 |
| 3.4            |             32 |
| 3.3            |              2 |
| Total          |         10,743 |

I wish I could do more, but I hit my free quota on these stats 😆

jtpavlock avatar Jul 28 '20 23:07 jtpavlock

Does this seem reasonable?

Note, it is possible to call str(path) for all windows paths and bytes(path) for all unix paths for storing in the database. I'm not sure if that makes things better or not.

That seems reasonable! The only difference is that we will want to make sure we are doing the same str-to-bytes conversion as we currently do on Windows for storage.

No objection to dropping Python 3.5 at this point.

sampsyo avatar Jul 29 '20 01:07 sampsyo

No objection to dropping Python 3.5 at this point.

Great, this would be much harder to implement without the 3.6 improvements, and would probably be worth putting off at that point. I should have this done by the time beets 1.5.x (or whatever the next one is) gets released.

jtpavlock avatar Jul 29 '20 05:07 jtpavlock

I'm happy to give this a shot, it seems like it'd be very helpful.

Serene-Arc avatar Sep 13 '22 06:09 Serene-Arc