Significant performance degradation of Poco::DateTimeParser
Consider the following code:
#include <Poco/DateTimeFormat.h>
#include <Poco/DateTimeFormatter.h>
#include <Poco/DateTimeParser.h>
#include <stdexcept>
std::string toString(const Poco::DateTime& dt)
{
return Poco::DateTimeFormatter::format(dt, Poco::DateTimeFormat::ISO8601_FRAC_FORMAT);
}
Poco::DateTime parsePoco(const std::string& text)
{
int tzd = 0;
const auto format = Poco::DateTimeFormat::ISO8601_FRAC_FORMAT;
auto parsed = Poco::DateTimeParser::parse(format, text, tzd);
parsed.makeUTC(tzd);
return parsed;
}
void ASSERT_EQ(const std::string& a, const std::string& b)
{
if (a != b)
{
throw std::runtime_error("Test FAILED");
}
}
int main()
{
for (int i = 0; i < 100'000; ++i)
{
auto parsed = parsePoco("2024-06-21T09:43:11.340159Z");
ASSERT_EQ(toString(parsed), "2024-06-21T09:43:11.340159Z");
parsed = parsePoco("2024-06-21T09:43:11.340159+07:00");
ASSERT_EQ(toString(parsed), "2024-06-21T02:43:11.340159Z");
}
return 0;
}
Execution time (Release configuration):
Poco 1.12.5: 96 ms
Poco 1.13.3: 17179 ms
With the format Poco::DateTimeFormat::ISO8601_FRAC_FORMAT + " " (with an additional space) execution time becomes better for 1.13.3.
I guess the reason is the following commit: https://github.com/pocoproject/poco/commit/4f1cf683079cc3b00be96259fc42f5c291ab4c77 (https://github.com/pocoproject/poco/pull/4330).
Up to 8 regex expressions are compiled on each call
https://github.com/pocoproject/poco/blob/4f1cf683079cc3b00be96259fc42f5c291ab4c77/Foundation/src/DateTimeFormat.cpp#L153-L156
if the provided format is known:
https://github.com/pocoproject/poco/blob/4f1cf683079cc3b00be96259fc42f5c291ab4c77/Foundation/src/DateTimeParser.cpp#L47
Yeah, well - it's either that or we silently accept garbage, as we used to. If you have an improvement proposal, send a PR
bool DateTimeFormat::isValid(const std::string& dateTime)
{
static const RegularExpression regs[] = {
RegularExpression(DateTimeFormat::ISO8601_REGEX),
RegularExpression(DateTimeFormat::RFC822_REGEX),
RegularExpression(DateTimeFormat::RFC1123_REGEX),
RegularExpression(DateTimeFormat::HTTP_REGEX),
RegularExpression(DateTimeFormat::RFC850_REGEX),
RegularExpression(DateTimeFormat::RFC1036_REGEX),
RegularExpression(DateTimeFormat::ASCTIME_REGEX),
RegularExpression(DateTimeFormat::SORTABLE_REGEX)
};
for (const auto& f : regs)
{
if (f.match(dateTime)) return true;
}
return false;
}
Creating the RegularExpressions once in isValid takes it from over 12s to .16s for me.
Why are all regexes checked, and not just one? E.g. for the following code
const auto format = Poco::DateTimeFormat::ISO8601_FRAC_FORMAT;
auto parsed = Poco::DateTimeParser::parse(format, text, tzd);
I would expect that only ISO8601_REGEX will be checked.
And why is user input not checked against these regexes if DateTimeFormat::hasFormat(fmt) is false?
About handling garbage input. Let's consider the following code:
https://github.com/pocoproject/poco/blob/4f1cf683079cc3b00be96259fc42f5c291ab4c77/Foundation/src/DateTimeParser.cpp#L108-L110
https://github.com/pocoproject/poco/blob/4f1cf683079cc3b00be96259fc42f5c291ab4c77/Foundation/src/DateTimeParser.cpp#L25-L26
https://github.com/pocoproject/poco/blob/4f1cf683079cc3b00be96259fc42f5c291ab4c77/Foundation/src/DateTimeParser.cpp#L37-L38
Points for discussion:
SKIP_JUNKaccepts garbage inputPARSE_NUMBER_Ndoesn't report an error if the input doesn't contain a number at all
I hope the parser body can be improved so that it is more strict and regular expressions can be avoided.
@andrewauclair can you look at this issue and propose improvements?