gumbo-query
Trimming strings for advanced datasets
After some digging around, I found a way to clean up the unformatted strings (containing '\r', '\v', '\f', '\n', '\t' and ' ') that this library returns when parsing HTML files. Text with runs of spaces and line breaks can be very annoying when, for example, you are training an ML algorithm on data fetched with libcurl. The function 'reduce' below transforms a string like this:
You can modify the text in the box to the left any way you like, and ss
then click the "Show Page" button below the box to display the
result here. Go ahead and do this as often and as long as you like.
To something like this:
You can modify the text in the box to the left any way you like, and ss then click the "Show Page" button below the box to display the result here. Go ahead and do this as often and as long as you like.
The code:
#include <string>

// Strip leading and trailing whitespace characters from str.
std::string trim(
    const std::string& str,
    const std::string& whitespace = " \t\n\r\v\f")
{
    const auto strBegin = str.find_first_not_of(whitespace);
    if (strBegin == std::string::npos)
        return ""; // no content
    const auto strEnd = str.find_last_not_of(whitespace);
    const auto strRange = strEnd - strBegin + 1;
    return str.substr(strBegin, strRange);
}
// Trim str, then replace every inner run of whitespace with fill.
std::string reduce(
    const std::string& str,
    const std::string& fill = " ",
    const std::string& whitespace = " \t\n\r\v\f")
{
    // trim first, so every whitespace run below is followed by content
    auto result = trim(str, whitespace);
    // replace sub ranges
    auto beginSpace = result.find_first_of(whitespace);
    while (beginSpace != std::string::npos)
    {
        const auto endSpace = result.find_first_not_of(whitespace, beginSpace);
        const auto range = endSpace - beginSpace;
        result.replace(beginSpace, range, fill);
        // continue searching after the fill that was just inserted
        const auto newStart = beginSpace + fill.length();
        beginSpace = result.find_first_of(whitespace, newStart);
    }
    return result;
}
I got this from a Reddit post, but it did not list an author.
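If you only ever need a single space as the fill, the same result can probably be had in one pass over the string. The sketch below is my own variant, not from the Reddit post and not part of gumbo-query; the name collapse_whitespace is made up, and it assumes the default "C" locale, where std::isspace matches exactly the six characters listed above.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of gumbo-query): collapses every run of
// whitespace into one space and drops leading and trailing whitespace.
std::string collapse_whitespace(const std::string& input)
{
    std::string out;
    out.reserve(input.size());
    bool pendingSpace = false; // a whitespace run was seen after some content
    for (unsigned char ch : input) {
        if (std::isspace(ch)) {
            // Leading whitespace is ignored because out is still empty.
            pendingSpace = !out.empty();
        } else {
            // Emit one space for the whole preceding run, then the character.
            if (pendingSpace) {
                out += ' ';
                pendingSpace = false;
            }
            out += static_cast<char>(ch);
        }
    }
    // A trailing run only sets pendingSpace and is never written out.
    return out;
}
```

Unlike reduce, this variant cannot take a custom fill or a custom whitespace set, so reduce is still the better fit if you need either of those.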
This was the test HTML file:
<html>
<head>
<title>Something</title>
<style type="text/css">
</style>
</head>
<body bgcolor = "#ffffcc" text = "#000000">
<div id="ly-title">
<h1>Hello, World!</h1>
</div>
<div id="ly-body">
<p>
You can modify the text in the box to the left any way you like, and ss
then click the "Show Page" button below the box to display the
result here. Go ahead and do this as often and as long as you like.
</p>
<p>
You can also use this page to test your Javascript functions and local
style declarations. Everything you do will be handled entirely by your own
browser; nothing you type into the text box will be sent back to the
server.
</p>
</div>
</body>
</html>