<regex>: basic_regex wants regex_traits to provide things not required by [re.req]
Describe the bug
basic_regex relies on non-standard extensions of regex_traits
Command-line test case
d:\Temp2>type repro.cpp
#include <array>
#include <iostream>
#include <regex>
struct custom_traits : private std::regex_traits<char>
{
using base_t = std::regex_traits<char>;
using base_t::regex_traits;
using base_t::char_type;
using base_t::locale_type;
using base_t::string_type;
using base_t::char_class_type;
using base_t::length;
using base_t::translate;
using base_t::translate_nocase;
using base_t::transform;
using base_t::transform_primary;
using base_t::lookup_collatename;
using base_t::lookup_classname;
using base_t::isctype;
using base_t::value;
using base_t::imbue;
using base_t::getloc;
#ifdef WORKAROUND
using base_t::_Ch_upper;
using base_t::_Uelem;
using base_t::_Ch_alpha;
#endif
};
int main()
{
std::basic_regex<char, custom_traits> r("3");
return 0;
}
d:\Temp2>cl /EHsc /W4 /WX .\repro.cpp
Microsoft (R) C/C++ Optimizing Compiler Version 19.27.29009.1 for x86
Copyright (C) Microsoft Corporation. All rights reserved.
repro.cpp
c:\Program Files (x86)\Microsoft Visual Studio\2019\Preview\VC\Tools\MSVC\14.27.29009\include\regex(3975): error C2247: 'std::_Regex_traits_base::_Ch_upper' not accessible because 'custom_traits' uses 'private' to inherit from 'std::regex_traits<char>'
c:\Program Files (x86)\Microsoft Visual Studio\2019\Preview\VC\Tools\MSVC\14.27.29009\include\regex(238): note: see declaration of 'std::_Regex_traits_base::_Ch_upper'
.\repro.cpp(5): note: see declaration of 'custom_traits'
c:\Program Files (x86)\Microsoft Visual Studio\2019\Preview\VC\Tools\MSVC\14.27.29009\include\regex(393): note: see declaration of 'std::regex_traits<char>'
c:\Program Files (x86)\Microsoft Visual Studio\2019\Preview\VC\Tools\MSVC\14.27.29009\include\regex(3964): note: while compiling class template member function 'bool std::_Parser<_FwdIt,_Elem,_RxTraits>::_CharacterClassEscape(bool)'
** a lot of more errors **
d:\Temp2>cl /EHsc /W4 /WX /DWORKAROUND .\repro.cpp
Microsoft (R) C/C++ Optimizing Compiler Version 19.27.29009.1 for x86
Copyright (C) Microsoft Corporation. All rights reserved.
repro.cpp
Microsoft (R) Incremental Linker Version 14.27.29009.1
Copyright (C) Microsoft Corporation. All rights reserved.
/out:repro.exe
repro.obj
Expected behavior Should compile without workaround. May compile or not compile with workaround,
STL version
Microsoft Visual Studio Professional 2019 Preview
Version 16.7.0 Preview 3.1
Additional context This item is also tracked on Developer Community as DevCom-322328 and by Microsoft-internal VSO-673897 / AB#673897.
vNext note: Resolving this issue will require breaking binary compatibility. We won't be able to accept pull requests for this issue until the vNext branch is available. See #169 for more information.
Which part(s) is considered vNext? It seems that dependency on _Ugly names can be easily avoided, but there'll be incorrect runtime behavior when the character classification is customized (where the fix is unknown to me).
I suspect vNext on this issue was really "we want to rewrite this entire mess in vNext and that will be 'soon' so let's avoid wasting time on it for now".
I'd be surprised if we couldn't fix this ABI-compatibly with a bit of compile-time middleware that uses the _Ugly names if present and works around the lack if they are not present. I still don't think it is a high priority to fix, but it shouldn't be impossible to fix.
Beware: I have concerns that regex_traits isn't fully specified to the degree necessary to allow program-defined specializations. It's been too long since I looked so I don't recall specifics, but I'll throw some FUD here so anyone who does look will go in with eyes open.
#5403 will remove the only usage of _Ch_upper. We can also easily avoid the usage of _Ch_alpha because it is used in a context where the ECMAScript grammar only allows A-Z and a-z and extending it to more alphabetic characters doesn't make sense.
The difficult case is _Uelem. <regex> already assumes in some places that make_unsigned<_Elem> is well-defined. So one natural fix is to replace all uses of _Uelem by make_unsigned<_Elem>.
Alas, I suspect this approach is going to break this piece of ~~UB art~~ code in libbuild: https://github.com/build2/build2/blob/d0e7e2a408a5d72d1310eef39127629ab0c9b38e/libbuild2/script/regex.hxx. The reason is that make_unsigned<_Elem> does not actually yield an unsigned type, so I think the following static_assert is going to fail:
https://github.com/microsoft/STL/blob/5762e6bcaf7f5f8b5dba0a9aabf0acbd0e335e80/stl/inc/regex#L1325
And since I already found one project that will break with this change, there is a good chance there are more.
Here are a few alternative approaches how we can tackle this:
- Use
_Uelemin the traits class when it is there and otherwise fall back tomake_unsigned<_Elem>. This is 100 % backward compatible, but adds some complexity to<regex>. - Make the code work with some user-defined types for
make_unsigned<_Elem>. One possibility is to change the static_assert to assert!is_signed_v<_Elem>(I'm not in favor of removing it completely because it can catch actual bugs in our regex code). I haven't tried if this change is sufficient for libbuild. - Fix the code in the projects that will break to the extent we can. Due to the underspecification of regex traits, I think UB can't be avoided completely when using a user-defined character type. In libbuild's case, it might be good enough to change the specialization of
make_unsigned<line_char>to yieldunsigned intor similar, but it would have to be checked that this change would also work with the regex implementations of the other standard libraries.
Any thoughts on this? I'm leaning towards alternative 1 to avoid trouble.
Users aren't allowed to specialize make_unsigned. Unfortunately we aren't enforcing that, and we don't have a type trait optimization for this one, so they're getting away with it.
Alternative 1 (use _Uelem, fallback to make_unsigned_t<_Elem>) is no worse than the status quo and shouldn't be that much complexity (and importantly, it'll be centralized), so I'm fine with that.
It would be nice to notify build2 that they're doing something squirrelly and non-Standard in at least 2 ways.
I think the build2 maintainers (or at least the people who wrote the regex adapter) are aware that they did some very questionable things there. But I guess you don't have much choice there if you want to use regex with a custom character type: Because the requirements for the character type are underspecified and the traits class is missing some member functions and types, all standard libraries add some requirements of their own. And one relatively common one seems to be that make_unsigned<_Elem> makes sense.
Maybe make_unsigned and _Uelem can even be avoided completely, but such a change will be much more complex than adding a fallback for missing _Uelem.
I did a quick proof-of-concept and am reasonably confident now that it is actually possible to eliminate _Uelem completely. We only have to impose two requirements:
- Converting between integral types and the character type must be possible (including lossy conversions from the character type to some integral type).
- There must be a one-to-one-mapping between integers (for an integral type that is big enough) and characters, and the conversions must be consistent with it (up to lossiness).
Note that the standard already implies that integers must be convertible to characters because some escape sequences need this. I'm not sure about the other direction, but as far as I know all implementations of this regex interface (including Boost.Regex) do assume that this is possible as well. For example, it wouldn't be possible to represent a character class using a bitmap otherwise.
However, some of the resulting code tends to look ugly because of chained casts. As an example, let's consider the test whether the code point for a character _Ch fits into the bitmap for the first 256 characters. With _Uelem, you can do static_cast<_Uelem>(_Ch) < _Bmp_max to perform this test. Without _Uelem, this check turns into static_cast<_Elem>(static_cast<unsigned char>(_Ch)) == _Ch.
In some other places, we have to delegate some operations to char_traits, but I think this is fine as well because the standard requires that there must be a working implementation of std::char_traits for the provided character type.