Use locale.nl_langinfo in `_strptime.py`
| BPO | 8915 |
|---|---|
| Nosy | @abalkin |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
assignee = 'https://github.com/abalkin'
closed_at = None
created_at = <Date 2010-06-06.03:35:08.311>
labels = ['3.7', 'type-feature', 'library']
title = 'Use locale.nl_langinfo in _strptime'
updated_at = <Date 2016-09-10.18:34:48.582>
user = 'https://github.com/brettcannon'
bugs.python.org fields:
activity = <Date 2016-09-10.18:34:48.582>
actor = 'belopolsky'
assignee = 'belopolsky'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2010-06-06.03:35:08.311>
creator = 'brett.cannon'
dependencies = []
files = []
hgrepos = []
issue_num = 8915
keywords = []
message_count = 3.0
messages = ['107181', '107445', '125958']
nosy_count = 1.0
nosy_names = ['belopolsky']
pr_nums = []
priority = 'low'
resolution = None
stage = 'needs patch'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue8915'
versions = ['Python 3.7']
It might perform better to use locale.nl_langinfo to get the current locale's datetime information instead of reverse-engineering from strftime (need to benchmark to see if this is true). This would need to be conditional as the datetime info might not be exposed through nl_langinfo.
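As a rough illustration of the conditional approach described above, the following sketch asks `locale.nl_langinfo` for the month names when it is available and otherwise falls back to the existing strategy of formatting sample dates with `strftime`. The helper name `month_names` is hypothetical and is not part of `_strptime.py`. The sketch assumes the `MON_1` … `MON_12` constants, which the `locale` module only exposes on some platforms.

```python
import datetime
import locale


def month_names():
    """Return the 12 full month names for the current LC_TIME locale.

    Hypothetical helper: prefers nl_langinfo() and falls back to
    strftime() when the platform does not expose the needed constants.
    """
    if hasattr(locale, "nl_langinfo") and hasattr(locale, "MON_1"):
        try:
            return [
                locale.nl_langinfo(getattr(locale, f"MON_{m}"))
                for m in range(1, 13)
            ]
        except (ValueError, AttributeError):
            # Constant missing or not supported: use the fallback below.
            pass
    # Fallback: derive the names by formatting one date per month.
    return [datetime.date(2001, m, 1).strftime("%B") for m in range(1, 13)]
```

Whether this is actually faster than the `strftime` route would still need to be benchmarked, as the report says.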
See also bpo-8957. If this happens, I would like to add a pure Python implementation of strftime. See also bpo-7989.
I would also like to consider using OS strptime on platforms with a decent implementation.
Not fully possible: nl_langinfo doesn't support LC_TIME.
We can use it partially, though, which is ~33% faster:
def _getlang():
    lang = locale.setlocale(locale.LC_TIME, None)
    encoding = locale.nl_langinfo(locale.CODESET)
    return lang, encoding
$ python3.14 -m timeit -s "from _strptime import _getlang" "_getlang()"
1000000 loops, best of 5: 298 nsec per loop
$ ./python -m timeit -s "from _strptime import _getlang" "_getlang()"
1000000 loops, best of 5: 203 nsec per loop # ~33% faster
Caching the information will result in a decent performance gain.
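One way the caching idea could look is sketched below: the locale-dependent table is built once per `LC_TIME` setting and reused until the setting changes. The names `_names_cache`, `_build_month_names` and `cached_month_names` are hypothetical, and this is not necessarily how `_strptime.py` organizes its own caches.

```python
import datetime
import locale

# Hypothetical module-level cache, keyed by the current LC_TIME setting.
_names_cache = {}


def _build_month_names():
    """Build the (comparatively expensive) table of lowercase month names."""
    return tuple(
        datetime.date(2001, m, 1).strftime("%B").lower()
        for m in range(1, 13)
    )


def cached_month_names():
    """Return the month-name table, rebuilding it only when LC_TIME changes."""
    key = locale.setlocale(locale.LC_TIME, None)  # query only, no change
    try:
        return _names_cache[key]
    except KeyError:
        names = _names_cache[key] = _build_month_names()
        return names
```

The lookup still queries the locale on every call so that a later `setlocale()` is noticed; only the table construction is skipped.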
> I would also like to consider using OS strptime on platforms with a decent implementation.
This should be a separate issue.
@picnixz _strptime is currently pure Python. This does not need the extension-modules label. It does however need the performance label :-)
I never know which part of the date/time API is duplicated in C and which one is not, so thanks.
nl_langinfo() supports LC_TIME.
But there is another issue: the month and weekday names returned by nl_langinfo() and strftime() can differ.
- In br_FR locale: "'" vs "ʼ" (U+02BC).
- In ast_ES, ca_AD, ca_ES, ca_FR, ca_IT, oc_FR and wa_BE locales: "'" vs "’" (U+2019).
- In yi_US locale: "אַ" (U+FB2E) vs "אַ" (U+05D0 U+05B7) and many others.
I suppose this is because strftime() is implemented using wcsftime() if it is available, whereas the result of nl_langinfo() needs to be decoded from the current locale encoding. 8-bit encodings (ISO8859-1 for br_FR and ca_FR, CP1255 for yi_US, etc.) can replace some Unicode characters with other similar-looking characters.
This issue also exists in the current code: strptime() is not always able to parse a string formatted in C or another language, or in Python on another platform, even if they support the same locale. strptime() should be more lenient and accept different forms of apostrophes and different normalization forms. This is a different issue, but we cannot just use nl_langinfo() without breaking existing tests until it is fixed.
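A minimal sketch of the kind of leniency meant here: fold the apostrophe variants listed above into the ASCII apostrophe and apply a Unicode normalization form before comparing names. The helper `normalize_name` is hypothetical. Choosing NFD and lowercasing are assumptions made for the illustration, not a decision taken in this issue.

```python
import unicodedata

# Apostrophe variants mentioned above, mapped to the ASCII apostrophe.
_APOSTROPHES = str.maketrans({"\u02bc": "'", "\u2019": "'"})


def normalize_name(name):
    """Return a comparison key for a month or weekday name.

    NFD decomposes precomposed characters such as U+FB2E into
    U+05D0 U+05B7, so both spellings produce the same key.
    """
    decomposed = unicodedata.normalize("NFD", name)
    return decomposed.translate(_APOSTROPHES).lower()
```

With such a key, two names would be treated as equal when `normalize_name(a) == normalize_name(b)`, regardless of which apostrophe or composition form each platform produced.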
On glibc platforms we can also use a private API to get the Unicode result of nl_langinfo() without going through the intermediate locale encoding. This may be faster and avoids temporarily switching the current locale. It would also help a Python implementation of strftime(). This is also a different issue, but it would help in using nl_langinfo().
There is another issue. In some locales the names of months and weekdays are genuinely different; it is not a matter of normalization:
- be_BY.utf8: "чэрвеня" vs "červienia".
- tt_RU.utf8: "июнь" vs "yün".
- nan_TW@latin: "6goe̍h" vs "六月".
- sr_RS@latin: "jun" vs "јун".
- ug_CN@latin: "Seper" vs "ئىيۇن".
- uz_UZ@cyrillic: "Июн" vs "Iyun".
- sd_IN@devanagari: "जूनि" vs "جون".
It seems that the modifier (@latin, @cyrillic, @devanagari) is simply ignored in some cases (this is a bug in the locale module). But I do not know why there is a difference for be_BY.utf8 and tt_RU.utf8.
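Differences like the ones listed above can be looked for with a small comparison script. The sketch below assumes a POSIX platform where `locale.nl_langinfo` and the `MON_*` constants exist and where the requested locale is installed. The function name `month_name_mismatches` is hypothetical. It temporarily switches `LC_CTYPE` and `LC_TIME` and restores them afterwards.

```python
import datetime
import locale


def month_name_mismatches(loc):
    """List months whose nl_langinfo() and strftime() names differ in loc.

    Returns a list of (month, name_from_nl_langinfo, name_from_strftime).
    Raises locale.Error if the locale is not available.
    """
    saved_ctype = locale.setlocale(locale.LC_CTYPE)
    saved_time = locale.setlocale(locale.LC_TIME)
    try:
        # LC_CTYPE is switched too, because nl_langinfo() results are
        # decoded from the current locale encoding.
        locale.setlocale(locale.LC_CTYPE, loc)
        locale.setlocale(locale.LC_TIME, loc)
        mismatches = []
        for m in range(1, 13):
            from_langinfo = locale.nl_langinfo(getattr(locale, f"MON_{m}"))
            from_strftime = datetime.date(2001, m, 1).strftime("%B")
            if from_langinfo != from_strftime:
                mismatches.append((m, from_langinfo, from_strftime))
        return mismatches
    finally:
        # Always restore the previous settings, even on error.
        locale.setlocale(locale.LC_TIME, saved_time)
        locale.setlocale(locale.LC_CTYPE, saved_ctype)
```

For example, `month_name_mismatches("be_BY.utf8")` could be used to check the reported difference on a system that has that locale installed.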