Boost.locale makes std::regex not match anything

Open Lord-Kamina opened this issue 11 months ago • 4 comments

I had initially posted a comment in #35, but maybe it deserves its own issue instead. I think it's essentially the same problem, except that I'm on macOS 13:

$ clang++ -v
Apple clang version 15.0.0 (clang-1500.1.0.2.5)
Target: x86_64-apple-darwin22.6.0.

I'm using boost 1.86.0, built against ICU 74.2

I had seen this behavior before and never found a real solution. I have now stumbled upon it again in a project. I spent about two days trying to tune my regex, thinking I must have made a mistake. I kept simplifying it further and further, but the problem did not go away.

Eventually, I decided to make a minimal example to test it, so I have the following code:

#include <boost/locale.hpp>
#include <iostream>
#include <locale>
#include <regex>
#include <string>

int main() {
	boost::locale::generator locGen;
	const std::locale loc = locGen("en_US.UTF-8");
// 	std::locale::global(loc);  
	auto pattern = std::regex(R"(^(?:\s)*([_[:alnum:].-]+)\s*=\s*([^;#\n\r]+)*)");
// 	pattern.imbue(loc);
	const std::string text{"  pozo = mani"};
	std::smatch result;
	std::regex_search(text, result, pattern);
	std::cout << "ready: " << result.ready() << ", size: " << result.size() << std::endl;
	for (size_t i=0; i < result.size(); i++) {
		std::cout << "match[" <<i<<"]: " << result[i] <<std::endl;
	}
	return 0;
}

This outputs:

$ clang++ -o regex_test regex_test.cpp -std=c++17 -I/opt/local/include/ -lboost_locale-mt -lboost_system-mt -L/opt/local/lib && ./regex_test
ready: 1, size: 3
match[0]:   pozo = mani
match[1]: pozo
match[2]: mani

If I uncomment the std::locale::global line (with or without the pattern.imbue), this happens instead:

clang++ -o regex_test regex_test.cpp -std=c++17 -I/opt/local/include/ -lboost_locale-mt -lboost_system-mt -L/opt/local/lib && ./regex_test
ready: 1, size: 0

I tried changing facets gradually, OR'ing them in one by one, and it always worked until I added std::locale::collate. From that point on, removing all the others and keeping just std::locale::collate still makes the regex not work.

#include <boost/locale.hpp>
#include <iostream>
#include <locale>
#include <regex>
#include <string>

int main() {
	boost::locale::generator locGen;
	const std::locale loc = locGen("en_US.UTF-8");
	std::locale testLoc = std::locale(std::locale::classic(), loc, std::locale::collate);
	std::locale::global(testLoc);
	auto pattern = std::regex(R"(^(?:\s)*([_[:alnum:].-]+)\s*=\s*([^;#\n\r]+)*)");
// 	pattern.imbue();
	const std::string text{"  pozo = mani"};
	std::smatch result;
	std::regex_search(text, result, pattern);
	std::cout << "ready: " << result.ready() << ", size: " << result.size() << std::endl;
	for (size_t i=0; i < result.size(); i++) {
		std::cout << "match[" <<i<<"]: " << result[i] <<std::endl;
	}
	return 0;
}

That alone is already enough to make the regex fail to match.

Lord-Kamina, Dec 19 '24 19:12

Of note, this does not seem to happen with gcc and libstdc++. I have not yet tried mixing clang with libstdc++ or gcc with libc++.

Lord-Kamina, Dec 19 '24 21:12

In #35 it is also reported to fail with libc++. Quoting from there:

"C and POSIX work always ok. Every locale is affected: Even en_US.UTF-8."

The collation_facet seems to be the culprit, as was reported there, and your example suggests the same.

Flamefire, Jan 12 '25 18:01

I was able to reproduce this on Linux with Clang 14.0.0 and libc++, but not with Clang 15.0.7.

The minimal reproducer seems to be matching R"(\s=)" against " =". Using a space instead of \s or a letter or number instead of the equals sign succeeds.

Edit: There seems to be a bug in libc++:

#include <boost/locale.hpp>
#include <iostream>
#include <locale>
#include <regex>
#include <string>

int main() {
	boost::locale::generator locGen;
	const std::locale loc = locGen("en_US.UTF-8");
	//std::locale::global(loc); // Uncomment this and it will fail
	std::regex pattern;
	pattern.imbue(loc);
	pattern = R"(\s=)";
	const std::string text{" ="};
	std::smatch result;
	std::regex_search(text, result, pattern);
	std::cout << "ready: " << result.ready() << ", size: " << result.size() << std::endl;
	for (size_t i=0; i < result.size(); i++) {
		std::cout << "match[" <<i<<"]: " << result[i] <<std::endl;
	}
	return !result.size();
}

The issue doesn't happen when using imbue but only when changing the global locale.

Flamefire, Jan 13 '25 12:01

I commented in https://github.com/llvm/llvm-project/issues/39399 as I suspect this is a bug in libc++ which doesn't expect the behavior of the collation facet of Boost.Locale. Let's see what they say about that.

There certainly is a bug in libc++ as imbue does not have an effect on the matching, only the global locale matters.

Flamefire, Jan 17 '25 10:01