Traceback error on unicode character in file name

Open paddylandau opened this issue 3 years ago • 29 comments

tl;dr

When any file name in the entire archive has a unicode character, even with export LANG=en_US.UTF-8:

  • When mounting the entire repository, the archive cannot be viewed ("Input/output error")
  • When attempting to mount an archive, a traceback error happens

Have you checked borgbackup docs, FAQ, and open Github issues?

Yes

Is this a BUG / ISSUE report or a QUESTION?

BUG

System information. For client/server mode post info for both machines.

Your borg version (borg -V).

borg 1.2.2

SHA256 of the executable:

29f68bd4f8b524f0c2c530d5679ea1a7fcce6bb6ffe16dbe7d07b19dbebf794a

Operating system (distribution) and version.

Linux Ubuntu 22.04 LTS

Hardware / network configuration, and filesystems used.

Hardware: Dell OptiPlex
Filesystems: ext4
Network: None (backup done locally on the machine itself to an external USB drive)

How much data is handled by borg?

Tested separately (independent repositories):

  • Small test area (less than 1 MB)
  • A large backup of 66 GB

Full borg commandline that lead to the problem (leave away excludes and passwords)

borg mount -o noatime /media/paddy/mp1bu/borg/glinda::09-02T18-21 /media/paddy/diff

Describe the problem you're observing.

Can you reproduce the problem? If so, describe how. If not, describe troubleshooting steps you took before opening the issue.

Yes, reliably reproducible on any repository, including a brand new one. Full details below, including symptom, traceback error, and what causes it.

(I posted this on Reddit before I discovered the cause. I have repeated the information below.)

Basic information about the repository

$ borg info /media/paddy/mp1bu/borg/glinda
Repository ID: [retracted]
Location: /media/paddy/mp1bu/borg/glinda
Encrypted: No
Cache: /home/paddy/.cache/borg/[retracted]
Security dir: /home/paddy/.config/borg/security/[retracted]
------------------------------------------------------------------------------
                       Original size      Compressed size    Deduplicated size
All archives:               66.79 GB             63.30 GB             61.35 GB

                       Unique chunks         Total chunks
Chunk index:                   84729               128757

List of archives (just the one so far)

$ borg list /media/paddy/mp1bu/borg/glinda
09-02T18-21                          Fri, 2022-09-02 18:21:33 [retracted]

Listing the archive contents works correctly:

$ borg list /media/paddy/mp1bu/borg/glinda::09-02T18-21
[lots of files that all look correct]

First symptom

If I mount my entire repository, at first it seems to work, but then…

$ borg mount -o noatime /media/paddy/mp1bu/borg/glinda /media/paddy/diff
$ cd /media/paddy/diff
$ ls -l
total 0
drwxr-xr-x 1 paddy paddy 0 Sep  2 18:21 09-02T18-21
$ cd 09-02T18-21
$ ls -l
ls: cannot open directory '.': Input/output error

As you can see, I can't view the mounted archive directory.

I can umount OK:

$ cd
$ borg umount /media/paddy/diff

Traceback error

If I try mounting just the archive instead of the entire repository, I get a traceback error.

$ borg mount -o noatime /media/paddy/mp1bu/borg/glinda::09-02T18-21 /media/paddy/diff
Mounting filesystem
Local Exception
Traceback (most recent call last):
  File "borg/archiver.py", line 5159, in main
  File "borg/archiver.py", line 5090, in run
  File "borg/archiver.py", line 1349, in do_mount
  File "borg/archiver.py", line 183, in wrapper
  File "borg/archiver.py", line 1359, in _do_mount
  File "borg/fuse.py", line 545, in mount
  File "borg/fuse.py", line 278, in _create_filesystem
  File "borg/fuse.py", line 355, in _process_archive
  File "os.py", line 812, in fsencode
UnicodeEncodeError: 'ascii' codec can't encode character '\u2026' in position 81: ordinal not in range(128)

Platform: Linux glinda 5.15.0-47-generic #51-Ubuntu SMP Thu Aug 11 07:51:15 UTC 2022 x86_64
Linux: Unknown Linux  
Borg: 1.2.2  Python: CPython 3.9.13 msgpack: 1.0.4 fuse: llfuse 1.4.2 [pyfuse3,llfuse]
PID: 55441  CWD: /home/paddy
sys.argv: ['borg', 'mount', '-o', 'noatime', '--verbose', '/media/paddy/mp1bu/borg/glinda::09-02T18-21', '/media/paddy/diff']
SSH_ORIGINAL_COMMAND: None

The cause

After a process of elimination, I found that the error happens whenever a file name (not the file contents) contains a unicode character.

It can even be as simple as an accented character such as é.

I have tested this on a brand new repository with just one file in the archive. It works when the file name doesn't have a unicode character, and crashes when the file name has a unicode character.

This is mentioned in the FAQ, but the proposed solution doesn't work.

  • My default language is LANG=en_GB.UTF-8
  • I deleted the repository, set export LANG=en_US.UTF-8 as per the FAQ, and recreated the repository from scratch. It made no difference.

If I exclude all files with a unicode character, BorgBackup works correctly. Unfortunately, this isn't a suitable workaround for me, as I am backing up large numbers of files with such unicode characters, many of which I'm not at liberty to rename.

paddylandau avatar Sep 03 '22 11:09 paddylandau

  File "os.py", line 812, in fsencode
UnicodeEncodeError: 'ascii' codec can't encode character '\u2026' in position 81: ordinal not in range(128)

Notable:

  • that is os.fsencode, a python standard library function
  • it uses the ascii encoder, not the utf-8 encoder as one would expect with LANG=en_GB.UTF-8.

Had a look at the code lines as seen in the traceback, didn't see anything that's obviously incorrect there.
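To illustrate the two points above: on Unix, os.fsencode encodes with the filesystem encoding and the surrogateescape error handler. The following is only a minimal sketch of that behaviour with the codec passed explicitly; the helper try_encode and the file name are made up for illustration and are not borg code.

```python
def try_encode(name: str, encoding: str) -> bytes:
    """Encode a file name roughly as os.fsencode does on Unix,
    but with the codec given explicitly (illustrative helper only)."""
    return name.encode(encoding, "surrogateescape")


# Hypothetical file name containing an accented character (é).
name = "caf\u00e9.txt"

# With a UTF-8 filesystem encoding, the name encodes without trouble.
utf8_bytes = try_encode(name, "utf-8")

# With an ASCII filesystem encoding, the same call raises UnicodeEncodeError,
# because surrogateescape only rescues lone surrogates, not real characters.
try:
    try_encode(name, "ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

print(utf8_bytes)
print(ascii_error)
```

So the traceback is what one would expect if the borg process believes the filesystem encoding is ASCII, whatever LANG says.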

ThomasWaldmann avatar Sep 03 '22 13:09 ThomasWaldmann

Is the locale you set actually available on your system?

Try:

dpkg-reconfigure -plow locales
# select all locales you need and at least one UTF-8 locale you intend to use with borg.

ThomasWaldmann avatar Sep 03 '22 13:09 ThomasWaldmann

dpkg-reconfigure -plow locales

These two are already marked:

  • en_GB.UTF-8 UTF-8
  • en_US.UTF-8 UTF-8

Would it help if I tried a different combination? If so, which ones?

paddylandau avatar Sep 03 '22 14:09 paddylandau

No, guess these are fine. Just make sure that:

  • there is no typo or so in LANG=
  • that setting is also active (and exported) in the borg environment (might be different user/shell/whatever).

ThomasWaldmann avatar Sep 03 '22 14:09 ThomasWaldmann

https://docs.python.org/3/library/os.html#python-utf-8-mode

There are some further things to try that will likely solve your problem, although it would be interesting to know why it does not work as you have it now. Maybe check the current values of the other env vars mentioned in these docs.

ThomasWaldmann avatar Sep 03 '22 15:09 ThomasWaldmann

  • there is no typo or so in LANG=
  • that setting is also active (and exported) in the borg environment (might be different user/shell/whatever).

I have checked and double-checked, and done it multiple times. I used copy-and-paste (specifically from the FAQ), and made sure that I included export, because your documentation makes that clear.

paddylandau avatar Sep 03 '22 15:09 paddylandau

What's the LC_CTYPE in the borg env?

ThomasWaldmann avatar Sep 03 '22 15:09 ThomasWaldmann

https://docs.python.org/3/library/os.html#python-utf-8-mode

There are some further things to try that will likely solve your problem, although it would be interesting to know why it does not work as you have it now. Maybe check the current values of the other env vars mentioned in these docs.

Unfortunately, that link goes way above my head. I can program in Bash, and that's it; I wouldn't know where to start with Python.

I'm using the standalone binary downloaded from your website.

paddylandau avatar Sep 03 '22 15:09 paddylandau

Can you try this:

$ python3
>>> import sys
>>> sys.getfilesystemencoding()
'utf-8'

ThomasWaldmann avatar Sep 03 '22 15:09 ThomasWaldmann

What's the LC_CTYPE in the borg env?

The command echo $LC_CTYPE returns nothing; the environment variable is unset.

What should I set it to?

paddylandau avatar Sep 03 '22 15:09 paddylandau

$ python3
>>> import sys
>>> sys.getfilesystemencoding()
'utf-8'

I get the same as you:

$ python3
Python 3.10.4 (main, Jun 29 2022, 12:14:53) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getfilesystemencoding()
'utf-8'
>>> 

paddylandau avatar Sep 03 '22 15:09 paddylandau

On my Mac, I have this:

$ echo $LANG

$ echo $LC_CTYPE
en_DE.UTF-8

ThomasWaldmann avatar Sep 03 '22 15:09 ThomasWaldmann

Oh, that's interesting.

But I'd assume that if sys.getfilesystemencoding() returns utf-8, it should not use the ascii encoder.

Hmm, do we have some strange effect related to the pyinstaller-made binary?

ThomasWaldmann avatar Sep 03 '22 15:09 ThomasWaldmann

On my Mac, I have this:

$ echo $LANG

$ echo $LC_CTYPE
en_DE.UTF-8

Mine is the opposite way around!

$ echo $LANG
en_GB.UTF-8
$ echo $LC_TYPE

Shall I export LC_TYPE=$LANG?

paddylandau avatar Sep 03 '22 15:09 paddylandau

You can try, but iirc the fallback of LC_CTYPE might be the value in LANG anyway.

ThomasWaldmann avatar Sep 03 '22 15:09 ThomasWaldmann

You can try, but iirc the fallback of LC_CTYPE might be the value in LANG anyway.

I'll set up a test, and get back to you soon.

paddylandau avatar Sep 03 '22 15:09 paddylandau

Please add the sha256 hash of the binary you use in the toplevel post after the version number.

ThomasWaldmann avatar Sep 03 '22 15:09 ThomasWaldmann

export LC_TYPE=$LANG made no difference, as you expected.

Please add the sha256 hash of the binary you use in the toplevel post after the version number.

Done

paddylandau avatar Sep 03 '22 15:09 paddylandau

Please also check whether using borg mount --foreground ... makes a difference (if you use that, the borg process will not fork and run in the background, but will instead keep running in the foreground [and block that terminal, so you'll need to switch to another one to continue]).

ThomasWaldmann avatar Sep 03 '22 15:09 ThomasWaldmann

borg mount --foreground ...

Unfortunately, it still crashed with the same Traceback error.

$ borg mount -o noatime --foreground /media/paddy/mp1general/borgtest::ascii /media/paddy/diff
Local Exception
Traceback (most recent call last):
  File "borg/archiver.py", line 5159, in main
  File "borg/archiver.py", line 5090, in run
  File "borg/archiver.py", line 1349, in do_mount
  File "borg/archiver.py", line 183, in wrapper
  File "borg/archiver.py", line 1359, in _do_mount
  File "borg/fuse.py", line 545, in mount
  File "borg/fuse.py", line 278, in _create_filesystem
  File "borg/fuse.py", line 355, in _process_archive
  File "os.py", line 812, in fsencode
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 27: ordinal not in range(128)

Platform: Linux glinda 5.15.0-47-generic #51-Ubuntu SMP Thu Aug 11 07:51:15 UTC 2022 x86_64
Linux: Unknown Linux  
Borg: 1.2.2  Python: CPython 3.9.13 msgpack: 1.0.4 fuse: llfuse 1.4.2 [pyfuse3,llfuse]
PID: 117027  CWD: /home/paddy
sys.argv: ['borg', 'mount', '-o', 'noatime', '--foreground', '/media/paddy/mp1general/borgtest::ascii', '/media/paddy/diff']
SSH_ORIGINAL_COMMAND: None

paddylandau avatar Sep 03 '22 15:09 paddylandau

I switched to my ubuntu 20.04 machine and did some experiments:

$ echo $LANG
en_US.UTF-8
$ echo $LC_CTYPE

$ mkdir test
$ cd test
$ mkdir input mnt
$ touch input/123
$ touch input/äöü   # non-ascii chars
$ wget https://github.com/borgbackup/borg/releases/download/1.2.2/borg-linux64
$ sha256sum borg-linux64  # same as in top post
$ chmod +x borg-linux64
$ ./borg-linux64 init -e none repo
$ ./borg-linux64 create repo::arch input
$ ./borg-linux64 mount repo::arch mnt
$ ls mnt/input
123 äöü

So, works for me.

ThomasWaldmann avatar Sep 03 '22 15:09 ThomasWaldmann

So, works for me.

OK, I'll create a VM with fresh installations of Ubuntu 20.04 and another with Ubuntu 22.04 to see how they work.

Then we can see if it's specific to Ubuntu 22.04 or just to my setup (which is a fresh setup, installed just 6 days ago).

I don't have time left today, so I'll get back to you once I've done this.

paddylandau avatar Sep 03 '22 15:09 paddylandau

Shot in the dark: Does forcing the Python I/O encoding via

export PYTHONIOENCODING="utf8"

before running borg work? Or similarly trying to run it as env PYTHONIOENCODING=utf8 /path/to/borg instead of plain /path/to/borg.

mihailim avatar Sep 03 '22 16:09 mihailim

export PYTHONIOENCODING="utf8"

Or similarly trying to run it as env PYTHONIOENCODING=utf8 /path/to/borg

I did try both, but neither made a difference. The error message was the same. I set PYTHONIOENCODING to an invalid value (to see what would happen), and Python complained about that, so we know that it is being looked at.

paddylandau avatar Sep 03 '22 17:09 paddylandau

I made time to test this in a VM, and although I haven't by any means solved the problem, we can certainly narrow it down.

My VM installations of Ubuntu 20.04 and Ubuntu 22.04 both work. But my main machine with Ubuntu 22.04 (a fresh installation just 6 days old) doesn't.

Nevertheless, I spotted something.

Here's my output from both VM versions when I list the file in the terminal:

-rw-rw-r-- 1 paddy paddy 5 Sep  3 18:04 äöü

But, here's the output from my main machine when I list the file in the terminal:

-rw------- 1 paddy paddy 10 Sep  3 11:15 ''$'\303\244\303\266\303\274'

If you happen to know what this means, please let me know, otherwise I'll attend to it tomorrow.
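For reference, the escaped name appears to be the raw bytes of the file name written as octal escapes, which is how ls tends to quote characters it considers non-printable in the current locale. A small sketch, assuming the bytes are UTF-8 (the variable names are illustrative):

```python
# Octal escapes exactly as shown in the ls output: \303\244 \303\266 \303\274
raw = bytes([0o303, 0o244, 0o303, 0o266, 0o303, 0o274])

# Interpreted as UTF-8, the six bytes form three two-byte characters.
decoded = raw.decode("utf-8")

# Interpreted as ASCII, none of them are valid, since all are above 127.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(decoded)
print(ascii_ok)
```

The file name itself is the same in both cases; what differs is whether the terminal tools treat those bytes as UTF-8 characters, which points at the locale of the session rather than at the file.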

paddylandau avatar Sep 03 '22 17:09 paddylandau

@paddylandau Try running locale-gen en_GB.UTF-8 en_US.UTF-8 and then reproduce the problem. Also check LC_CTYPE in your ~/.profile, and check /etc/ssh/sshd_config for any LANG or LC_* settings.

infectormp avatar Sep 03 '22 18:09 infectormp

I set LC_CTYPE in my profile and ran locale-gen en_GB.UTF-8 en_US.UTF-8. I don't have a file /etc/ssh/sshd_config, but I'm not using SSH anyway; this is local on my machine with the Borg backup directly onto a USB hard drive.

On top of all that, I also tried this solution.

I rebooted, but sadly none of this helped.

I shall ask on the Ubuntu Forums for help. I'll update this post with the link, and post back here should I find the answer.

Thank you for all of your time on this matter. I do appreciate it.

paddylandau avatar Sep 04 '22 09:09 paddylandau

Well, I finally found the problem — and it's nothing to do with BorgBackup!

I had LC_ALL=C. This messed up everything!

I've unset LC_ALL, and everything works correctly now.
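The reason LC_ALL=C wins is the usual POSIX lookup order for the character-type locale: LC_ALL first, then LC_CTYPE, then LANG, with empty values treated as unset. Here is a minimal sketch of that order; the function effective_ctype is a made-up helper for illustration, and the "C" fallback when nothing is set is an assumption about the typical default.

```python
import os


def effective_ctype(env):
    """Return (variable, value) that decides the character-type locale,
    following the POSIX precedence LC_ALL > LC_CTYPE > LANG.
    Illustrative helper only; empty strings count as unset."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var, "")
        if value:
            return var, value
    # Assumed default when none of the variables is set.
    return None, "C"


# My broken setup: LANG was fine, but LC_ALL overrode it.
broken = {"LANG": "en_GB.UTF-8", "LC_ALL": "C"}
# After unsetting LC_ALL, LANG takes effect again.
fixed = {"LANG": "en_GB.UTF-8"}

print(effective_ctype(broken))
print(effective_ctype(fixed))
print(effective_ctype(dict(os.environ)))  # what the current shell would give
```

This is also why changing LANG or LC_CTYPE made no difference while LC_ALL was set.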

Thank you again for all the time and effort that you have put into this. Sorry to have wasted your time, but you definitely did help push me in the right direction to find the solution.

I hope that this helps someone else.

paddylandau avatar Sep 04 '22 12:09 paddylandau

Maybe we could add this to the docs. ^^^

Can someone make a pull request against master branch?

ThomasWaldmann avatar Sep 04 '22 18:09 ThomasWaldmann