spirit icon indicating copy to clipboard operation
spirit copied to clipboard

Boost Spirit X3 Parser recursive

Open Hackman1993 opened this issue 3 years ago • 13 comments

I'm trying to write a parser to parse html with boost spirit x3, and I wrote parsers below:

Sorry about the long code, I've put them together so you can understand what I'm thinking。 The problem is these code can't compile. Error is :

fatal error C1202: recursive type or function dependency context too complex

I know this error comes out because of my parser html_element_ references tag_block_, and tag_block_ references html_element_, but I don't know how to make it work.

#include <boost/spirit/home/x3.hpp>
#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/spirit/home/x3/support/ast/position_tagged.hpp>
#include <boost/spirit/home/x3/support/ast/variant.hpp>
#include <iostream>
using namespace boost::spirit::x3;
struct tag_name{};
struct html_tag;
struct html_comment;
struct attribute_data : boost::spirit::x3::position_tagged {
  std::string name;
  boost::optional<std::string> value;
};


struct tag_header :  boost::spirit::x3::position_tagged {
  std::string name;
  std::vector<attribute_data> attributes;
};

struct self_tag: boost::spirit::x3::position_tagged {
  tag_header header;
};

struct html_element : boost::spirit::x3::position_tagged, boost::spirit::x3::variant< std::string, self_tag, boost::recursive_wrapper<html_tag>>{
  using base_type::base_type;
  using base_type::operator=;
};



struct html_tag: boost::spirit::x3::position_tagged {
  tag_header header;
  std::vector<html_element> children;
};

BOOST_FUSION_ADAPT_STRUCT(attribute_data, name, value);
BOOST_FUSION_ADAPT_STRUCT(tag_header, name, attributes);
BOOST_FUSION_ADAPT_STRUCT(self_tag, header);
BOOST_FUSION_ADAPT_STRUCT(html_tag,header,children);

// These are the attributes parser, seems fine
struct attribute_parser_id;
auto attribute_identifier_= rule<attribute_parser_id, std::string>{"AttributeIdentifier"} = lexeme[+(char_ - char_(" /=>"))];
auto attribute_value_= rule<attribute_parser_id, std::string>{"AttributeValue"} =
                           lexeme["\"" > +(char_ - char_("\"")) > "\""]|lexeme["'" > +(char_ - char_("'")) > "'"]|
                           lexeme[+(char_ - char_(" />"))];
auto single_attribute_ = rule<attribute_parser_id, attribute_data>{"SingleAttribute"} = attribute_identifier_ > -("=">  attribute_value_);
auto attributes_ = rule<attribute_parser_id, std::vector<attribute_data>>{"Attributes"} = (*single_attribute_);


struct tag_parser_id;


auto tag_name_begin_func = [](auto &ctx){
  get<tag_name>(ctx) = _attr(ctx).name;
  //_val(ctx).header.name = _attr(ctx);
  std::cout << typeid(_val(ctx)).name() << std::endl;

};
auto tag_name_end_func = [](auto &ctx){
  _pass(ctx) = get<tag_name>(ctx) == _attr(ctx);
};

auto self_tag_name_action = [](auto &ctx){
  _val(ctx).header.name = _attr(ctx);
};
auto self_tag_attribute_action = [](auto &ctx){
  _val(ctx).header.attributes = _attr(ctx);
};

auto inner_text = lexeme[+(char_-'<')];
auto tag_name_ = rule<tag_parser_id, std::string>{"HtmlTagName"} = lexeme[*(char_ - char_(" />"))];
auto self_tag_ = rule<tag_parser_id, self_tag>{"HtmlSelfTag"} = '<' > tag_name_[self_tag_name_action] > attributes_[self_tag_attribute_action] > "/>";
auto tag_header_ = rule<tag_parser_id, tag_header>{"HtmlTagBlockHeader"} = '<' > tag_name_ > attributes_ > '>';

rule<tag_parser_id, html_tag> tag_block_;

rule<tag_parser_id, html_element> html_element_ = "HtmlElement";

auto tag_block__def = with<tag_name>(std::string())[tag_header_[tag_name_begin_func] > (*html_element_) > "</" > omit[tag_name_[tag_name_end_func]] > '>'];
auto html_element__def = inner_text | self_tag_ | tag_block_ ;

BOOST_SPIRIT_DEFINE(tag_block_, html_element_);
int main()
{
  std::string source = "<div data-src=\"https://www.google.com\" id='hello world'></div>";
  html_element result;
  auto const parser = html_element_;
  auto parse_result = phrase_parse(source.begin(), source.end(), parser, ascii::space, result);
}

Hackman1993 avatar Dec 30 '22 16:12 Hackman1993

Study the examples. There are lots of examples with such recursion. You might also want to break things up in smaller units in separate TUs. Again, there are examples you can study.

djowel avatar Dec 31 '22 00:12 djowel

Study the examples. There are lots of examples with such recursion. You might also want to break things up in smaller units in separate TUs. Again, there are examples you can study.

Ty for the reply,I've read and try examples in past few days. In X3 examples, parser can be cut in to 3 parts, so reference chain can be A -> B ->C ->A. I thingk that's why the example won't trigger this problem. But in my case, In order to match end tag and begin tag, I have to use with to provide a context so I can do the match later, so I can't cut the parser. That's cause my reference chain like A -> B -> A. I think this is the reason makes templates too complex(if I'm right) . But In my case I don't know how to break it?

Hackman1993 avatar Dec 31 '22 02:12 Hackman1993

Take a look at the mini_xml in the Qi examples. It makes use of inherited attributes, which is alas not available in X3. You might want to just use Qi. It is rather tricky to do in X3 without inherited attributes.

djowel avatar Dec 31 '22 03:12 djowel

Take a look at the mini_xml in the Qi examples. It makes use of inherited attributes, which is alas not available in X3. You might want to just use Qi. It is rather tricky to do in X3 without inherited attributes.

This question was also posted on stack overflow where sehe gave his usual excellent answer using the context to store a stack of strings for the tags.

cppljevans avatar Dec 31 '22 05:12 cppljevans

This question was also posted on stack overflow where sehe gave his usual excellent answer using the context to store a stack of strings for the tags.

Wonderful! I was about to give a follow answer on that. You beat me to the punch.

djowel avatar Jan 01 '23 16:01 djowel

Yeah, thanks to sehe on this part, and you may noticed that semantic actions may broke attribute propagation. Does it designed to be like that or it's a bug?

Hackman1993 avatar Jan 04 '23 05:01 Hackman1993

Take a look at the mini_xml in the Qi examples. It makes use of inherited attributes, which is alas not available in X3. You might want to just use Qi. It is rather tricky to do in X3 without inherited attributes.

This question was also posted on stack overflow where sehe gave his usual excellent answer using the context to store a stack of strings for the tags.

I have to confess I did not try to compile the code, but when I did today, I got a compile error:

C:/msys64/home/evansl/prog_dev/boost.org/1_80_0download/download/boost/spirit/home/x3/support/traits/container_traits.hpp:74:56: error: no type named 'value_type' in 'Ast::html_element'
      : detail::remove_value_const<typename Container::value_type>
                                   ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~

@Hackman1993, were you able to successfully compile sehe's code?

cppljevans avatar Jan 04 '23 18:01 cppljevans

Take a look at the mini_xml in the Qi examples. It makes use of inherited attributes, which is alas not available in X3. You might want to just use Qi. It is rather tricky to do in X3 without inherited attributes.

This question was also posted on stack overflow where sehe gave his usual excellent answer using the context to store a stack of strings for the tags.

I have to confess I did not try to compile the code, but when I did today, I got a compile error:

C:/msys64/home/evansl/prog_dev/boost.org/1_80_0download/download/boost/spirit/home/x3/support/traits/container_traits.hpp:74:56: error: no type named 'value_type' in 'Ast::html_element'
      : detail::remove_value_const<typename Container::value_type>
                                   ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~

@Hackman1993, were you able to successfully compile sehe's code?

OOPS, never mind. I was using wrong include flags. Sorry for noise :(

cppljevans avatar Jan 04 '23 18:01 cppljevans

Take a look at the mini_xml in the Qi examples. It makes use of inherited attributes, which is alas not available in X3. You might want to just use Qi. It is rather tricky to do in X3 without inherited attributes.

This question was also posted on stack overflow where sehe gave his usual excellent answer using the context to store a stack of strings for the tags.

I have to confess I did not try to compile the code, but when I did today, I got a compile error:

C:/msys64/home/evansl/prog_dev/boost.org/1_80_0download/download/boost/spirit/home/x3/support/traits/container_traits.hpp:74:56: error: no type named 'value_type' in 'Ast::html_element'
      : detail::remove_value_const<typename Container::value_type>
                                   ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~

@Hackman1993, were you able to successfully compile sehe's code?

OOPS, never mind. I was using wrong include flags. Sorry for noise :(

Aha, that's always happens.spirit's variant and boost's variant is not equal.

Hackman1993 avatar Jan 05 '23 00:01 Hackman1993

Take a look at the mini_xml in the Qi examples. It makes use of inherited attributes, which is alas not available in X3. You might want to just use Qi. It is rather tricky to do in X3 without inherited attributes.

This question was also posted on stack overflow where sehe gave his usual excellent answer using the context to store a stack of strings for the tags.

I have to confess I did not try to compile the code, but when I did today, I got a compile error:

C:/msys64/home/evansl/prog_dev/boost.org/1_80_0download/download/boost/spirit/home/x3/support/traits/container_traits.hpp:74:56: error: no type named 'value_type' in 'Ast::html_element'
      : detail::remove_value_const<typename Container::value_type>
                                   ~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~

@Hackman1993, were you able to successfully compile sehe's code?

OOPS, never mind. I was using wrong include flags. Sorry for noise :(

Aha, that's always happens.spirit's variant and boost's variant is not equal.

Hi Hackman1993,

The problem with my code was more involved than just #include'ing the wrong variant. I had used code with LOTS of debug print modifications, and, apparently, one of those modifications was wrong.

Nevertheless, I've attached a much simplified code which should make it easier, I hope, to explore how to avoid using semantic actions. You'll soon note that the only primitive parser used in uint_. IOW, there's no strings parsed, the only parsers used only detect the <X> and <X/> and then just Y where all of X and Y are uint_. There's also debug prints from the semantic actions.

Here's the output from the run:

-*- mode: compilation; default-directory: "~/prog_dev/boost.org/boost.exploratory/boost.replacements/examples/realworld/issue749/" -*-
Compilation started at Sun Jan  8 13:24:13

make -k --always-make run
clang++ -c -O0 -g -ggdb -ftemplate-backtrace-limit=0  -std=c++2b -ftemplate-backtrace-limit=0 -fdiagnostics-show-template-tree -fno-elide-type -fmacro-backtrace-limit=0 -fexceptions -fcxx-exceptions   -I/home/evansl/prog_dev/boost.org/boost.exploratory/boost.additions -I/home/evansl/prog_dev/boost.org/boost.exploratory/boost.replacements/common  -I/home/evansl/prog_dev/boost.org/1_80_0download/download -DUSE_SEMANTIC_ACTIONS -DUSE_RULE_DEFINITIONS  -DBOOST_NO_CXX98_FUNCTION_BASE -DUSE_SEMANTIC_ACTIONS -DUSE_RULE_DEFINITIONS    -ftemplate-depth=200  coliru_sehe-debug-simple.cpp -MMD -o c:/msys64/tmp/build/clangxx14_0_3/boost.org/boost.exploratory/boost.replacements/examples/realworld/issue749-80/coliru_sehe-debug-simple.o 
clang++    c:/msys64/tmp/build/clangxx14_0_3/boost.org/boost.exploratory/boost.replacements/examples/realworld/issue749-80/coliru_sehe-debug-simple.o   -o c:/msys64/tmp/build/clangxx14_0_3/boost.org/boost.exploratory/boost.replacements/examples/realworld/issue749-80/coliru_sehe-debug-simple.exe
c:/msys64/tmp/build/clangxx14_0_3/boost.org/boost.exploratory/boost.replacements/examples/realworld/issue749-80/coliru_sehe-debug-simple.exe 
USE_SEMANTIC_ACTIONS=yes;
USE_RULE_DEFINITIONS=yes;
{tag=html_element_ pass;
{tag=tag_name_begin_func;
:tag_beg=0;
}tag=tag_name_begin_func;
{tag=tag_name_end_func;
:tag_beg=0;
:tag_end=0;
:tags_matched=true;
}tag=tag_name_end_func;
"<0></0>" -> Ok
{tag=tag_name_begin_func;
:tag_beg=0;
}tag=tag_name_begin_func;
{tag=tag_name_end_func;
:tag_beg=0;
:tag_end=0;
:tags_matched=true;
}tag=tag_name_end_func;
"<0>0</0>" -> Ok
{tag=tag_name_begin_func;
:tag_beg=0;
}tag=tag_name_begin_func;
{tag=tag_name_begin_func;
:tag_beg=1;
}tag=tag_name_begin_func;
{tag=tag_name_end_func;
:tag_beg=1;
:tag_end=1;
:tags_matched=true;
}tag=tag_name_end_func;
{tag=tag_name_begin_func;
:tag_beg=2;
}tag=tag_name_begin_func;
{tag=tag_name_end_func;
:tag_beg=2;
:tag_end=2;
:tags_matched=true;
}tag=tag_name_end_func;
{tag=tag_name_end_func;
:tag_beg=0;
:tag_end=0;
:tags_matched=true;
}tag=tag_name_end_func;
"<0><1>11</1><2>22</2></0>" -> Ok
}tag=html_element_ pass;
{tag=html_element_ fail;
{tag=tag_name_begin_func;
:tag_beg=0;
}tag=tag_name_begin_func;
{tag=tag_name_end_func;
:tag_beg=0;
:tag_end=1;
:tags_matched=false;
}tag=tag_name_end_func;
Fails as expected: "<0></1>"
{tag=tag_name_begin_func;
:tag_beg=0;
}tag=tag_name_begin_func;
{tag=tag_name_end_func;
:tag_beg=0;
:tag_end=1;
:tags_matched=false;
}tag=tag_name_end_func;
Fails as expected: "<0>999</1>"
}tag=html_element_ fail;

Compilation finished at Sun Jan  8 13:24:16

and here's the code:

//OriginalSource:
//  http://coliru.stacked-crooked.com/a/b4de81a919bec94a
//LinkedToFrom:
//  https://stackoverflow.com/questions/74963861/parse-html-with-boost-spirit-x3
//===============
#include <iomanip>
#include <iostream>
#include <string>
//#define ISSUE749_EXE_TRACE
#ifdef ISSUE749_EXE_TRACE
  #pragma push_macro("FILE_SHORT")
  #define FILE_SHORT "issue749/coliru_sehe.cpp"
  #include <boost/iostreams/utility/templ_expr/demangle_fmt_type.hpp>
  #include <boost/utility/trace_scope.hpp>
  using boost::trace_scope;
#else
  struct trace_scope
    {
        std::string tag;
        trace_scope(std::string tag_)
          : tag(tag_)
          {  std::cout<<"{tag="<<tag<<";\n";
          }
        ~trace_scope()
          {  std::cout<<"}tag="<<tag<<";\n";
          }
    };
#endif//ISSUE749_EXE_TRACE

//#define BOOST_SPIRIT_X3_DEBUG
#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/x3/support/ast/variant.hpp>
#include <stack>

namespace x3 = boost::spirit::x3;
using namespace std::string_literals;

namespace Ast {
    struct html_tag;

    struct tag_header {
        unsigned int     name;
    };

    using html_element =
        x3::variant
        < unsigned int
        , boost::recursive_wrapper<html_tag>
        >;
        
    using html_elements = std::vector<html_element>;

    struct html_tag {
        tag_header    header;
        html_elements children;
    };
} // namespace Ast

BOOST_FUSION_ADAPT_STRUCT(Ast::tag_header, name)
BOOST_FUSION_ADAPT_STRUCT(Ast::html_tag, header, children)

namespace Parser {
      struct 
    tag_stack 
      {};

  //#define USE_RULE_DEFINITIONS
  //#define USE_SEMANTIC_ACTIONS
  #ifdef USE_SEMANTIC_ACTIONS

    auto tag_name_begin_func = [](auto& ctx) 
      { 
        trace_scope ts("tag_name_begin_func");
        auto& tag_stk = get<tag_stack>(ctx);
        auto& tag_beg=_attr(ctx).name;
        std::cout<<":tag_beg="<<tag_beg<<";\n";
        tag_stk.push(tag_beg);
      };

    auto tag_name_end_func = [](auto& ctx) 
      {
        trace_scope ts("tag_name_end_func");
        auto& tag_stk = get<tag_stack>(ctx);
        auto& tag_beg = tag_stk.top();
        std::cout<<":tag_beg="<<tag_beg<<";\n";
        auto tag_end = _attr(ctx);
        std::cout<<":tag_end="<<tag_end<<";\n";
        auto tags_matched=tag_beg == tag_end;
        std::cout<<":tags_matched="<<tags_matched<<";\n";
        _pass(ctx) = tags_matched;
        tag_stk.pop();
      };
  #endif//USE_SEMANTIC_ACTIONS
  
      auto 
    tag_name_
      = x3::rule<struct HtmlTagName_tag, unsigned int>{"HtmlTagName"}
      = x3::uint_
      ;
      auto 
    tag_header_
    #ifdef USE_RULE_DEFINITIONS
      = x3::rule<struct HtmlTagBlockHeader_tag, Ast::tag_header>{"HtmlTagBlockHeader"} 
    #endif
      = '<' 
      >> tag_name_ 
      >> '>'
      ;

      x3::rule<struct tag_block__tag, Ast::html_tag>
    tag_block_    = "TagBlock"
      ;
      x3::rule<struct html_element__tag, Ast::html_element> 
    html_element_ = "HtmlElement"
      ;
      auto 
    tag_block__def =
        tag_header_
      #ifdef USE_SEMANTIC_ACTIONS
        [tag_name_begin_func]
      #endif
      >> *html_element_ 
      >> "</" 
      >> x3::omit
        [ tag_name_
        #ifdef USE_SEMANTIC_ACTIONS
          [tag_name_end_func]
        #endif
        ] 
      >> '>'
      ;
    auto inner_text        = x3::uint_;
    auto html_element__def = 
        inner_text 
      | tag_block_
      ;
    BOOST_SPIRIT_DEFINE(tag_block_, html_element_)
}

namespace unit_tests {
    template <bool ShouldSucceed = true, typename P>
    void test(P const& rule, std::initializer_list<std::string_view> cases) {
        auto start=
        #ifdef USE_SEMANTIC_ACTIONS
          x3::with<Parser::tag_stack>(std::stack<unsigned int>{})
          [
        #endif
            rule
        #ifdef USE_SEMANTIC_ACTIONS
          ]
        #endif
          ;      
        for (auto input : cases) {
            if constexpr (ShouldSucceed) {
                typename x3::traits::attribute_of<P, x3::unused_type>::type attr_of;
              #ifdef ISSUE749_EXE_TRACE
                std::cout<<":attr_of=\n"<<demangle_fmt_type(attr_of)<<";\n";
              #endif
                auto ok = phrase_parse(input.begin(), input.end(), start, x3::space, attr_of);
                std::cout << quoted(input) << " -> " << (ok ? "Ok" : "FAILED") << std::endl;
            } else {
                auto ok = phrase_parse(input.begin(), input.end(), start, x3::space);
                if (!ok)
                    std::cout << "Fails as expected: " << quoted(input) << std::endl;
                else
                    std::cout << "SHOULD HAVE FAILED: " << quoted(input) << std::endl;
            }
        }
    }
}

int main() {
  #if 1 && defined(ISSUE749_EXE_TRACE)
      boost::iostreams::indent_scoped_ostreambuf<char>
    indent_outbuf(std::cout,2);
  #endif
    std::cout<<std::boolalpha;
    std::cout<<"USE_SEMANTIC_ACTIONS=";
  #ifdef USE_SEMANTIC_ACTIONS
    std::cout<<"yes";
  #else
    std::cout<<"not";
  #endif
    std::cout<<";\n";
    std::cout<<"USE_RULE_DEFINITIONS=";
  #ifdef USE_RULE_DEFINITIONS
    std::cout<<"yes";
  #else
    std::cout<<"not";
  #endif
    std::cout<<";\n";
  #if 0
    { 
        trace_scope ts("tag_name_");
        unit_tests::test
        ( Parser::tag_name_
        , { R"(abc)"
          , R"(def)"
          }
        );
    }
  #endif
  #if 1
    {
        trace_scope ts("html_element_ pass");
        unit_tests::test
        ( Parser::html_element_
        , { R"(<0></0>)"
        #if 1
          , R"(<0>0</0>)"
          , R"(<0><1>11</1><2>22</2></0>)"
        #endif
          }
        );
    }
  #endif
  #if 1
    {
        trace_scope ts("html_element_ fail");
        unit_tests::test<false>
        ( Parser::html_element_
        , { R"(<0></1>)"
          , R"(<0>999</1>)"
          }
        );
    }
  #endif                            
}
#ifdef ISSUE749_EXE_TRACE
  #pragma pop_macro("FILE_SHORT")
#endif//ISSUE749_EXE_TRACE

Joel, maybe you could modify the code to avoid the semantic actions?

-Larry

cppljevans avatar Jan 08 '23 19:01 cppljevans

I think it's hard to solve without semantic actions in this problem .In order to match the begin tag, we always have to match the begin tag name, Spirit won't do that for us unless we got a built-in support,. As Joel said , using Spirit Qi is more tricky. but even in Qi's mini_xml example , Joel still using a semantic action to do that. So I'm going to stick on the semantic action way and stop digging deeper.

Hackman1993 avatar Jan 09 '23 07:01 Hackman1993

I'm with sehe: just use an html parser. mini_xml is just an example. But if you really want to do it using Spirit, I'd just collect the parse-tree structure in the parsing step and do the tag matching in a post processing, semantic analysis step by walking the generated tree from the prior parsing step.

djowel avatar Jan 09 '23 10:01 djowel

I think it's hard to solve without semantic actions in this problem .In order to match the begin tag, we always have to match the begin tag name, Spirit won't do that for us unless we got a built-in support,. As Joel said , using Spirit Qi is more tricky. but even in Qi's mini_xml example , Joel still using a semantic action to do that. So I'm going to stick on the semantic action way and stop digging deeper.

@Hackman1993, the code here doesn't use semantic actions. Instead, it uses new make_transform_parser and transform_attribute_tagged. transform_attribute_tagged is pretty close to the transform_attribute template in existing nonterminal/detail/transform_attribute.hpp. The main difference is that post returns bool instead of void, and it's there that the check is done for equality of beg/end tags.

Or you could use the pattern shown here which allows using either semantic action or the new make_transform_parser.hpp. The former pollutes the Parser namespace, that latter does not.

cppljevans avatar Jan 24 '23 00:01 cppljevans