ctags RFC: Using FQ name in scope field ( Asciidoc: structure isn't resembled )

The name of the parser: ctags

The command line you used to run ctags:

$ ctags --options=NONE -o - b.adoc

The content of input file:

= Title

Lorem Ipsum.

== a

Lorem Ipsum.

=== a.a

Lorem Ipsum.

==== a.a.a

Lorem Ipsum.

[[label-a.a.a]]

Lorem Ipsum.

===== a.a.a.a

Lorem Ipsum.

== b

Lorem Ipsum.

=== b.a

Lorem Ipsum.

==== b.a.a

Lorem Ipsum.

[[label-b.a.a]]

Lorem Ipsum.

===== b.a.a.a

Lorem Ipsum.

The tags output you are not satisfied with:

ctags: Notice: No options will be read from files or environment
Title   b.adoc  /^= Title$/;"   c
a       b.adoc  /^== a$/;"      s       chapter:Title
a.a     b.adoc  /^=== a.a$/;"   S       section:a
a.a.a   b.adoc  /^==== a.a.a$/;"        t       subsection:a.a
a.a.a.a b.adoc  /^===== a.a.a.a$/;"     T       subsubsection:a.a.a
b       b.adoc  /^== b$/;"      s       chapter:Title
b.a     b.adoc  /^=== b.a$/;"   S       section:b
b.a.a   b.adoc  /^==== b.a.a$/;"        t       subsection:b.a
b.a.a.a b.adoc  /^===== b.a.a.a$/;"     T       subsubsection:b.a.a
label-a.a.a     b.adoc  /^[[label-a.a.a]]$/;"   a       subsubsection:a.a.a
label-b.a.a     b.adoc  /^[[label-b.a.a]]$/;"   a       subsubsection:b.a.a
...

The tags output you expect:

ctags: Notice: No options will be read from files or environment
Title   b.adoc  /^= Title$/;"   c
a       b.adoc  /^== a$/;"      s       chapter:Title
a.a     b.adoc  /^=== a.a$/;"   S       section:Title""a
a.a.a   b.adoc  /^==== a.a.a$/;"        t       subsection:Title""a""a.a
a.a.a.a b.adoc  /^===== a.a.a.a$/;"     T       subsubsection:Title""a""a.a""a.a.a
b       b.adoc  /^== b$/;"      s       chapter:Title
b.a     b.adoc  /^=== b.a$/;"   S       section:Title""b
b.a.a   b.adoc  /^==== b.a.a$/;"        t       subsection:Title""b""b.a
b.a.a.a b.adoc  /^===== b.a.a.a$/;"     T       subsubsection:Title""b""b.a""b.a.a
label-a.a.a     b.adoc  /^[[label-a.a.a]]$/;"   a       subsubsection:a.a.a
label-b.a.a     b.adoc  /^[[label-b.a.a]]$/;"   a       subsubsection:b.a.a
...

The version of ctags:

$ ctags --version
Universal Ctags 0.0.0, Copyright (C) 2015 Universal Ctags Team
Universal Ctags is derived from Exuberant Ctags.
Exuberant Ctags 5.8, Copyright (C) 1996-2009 Darren Hiebert
  Compiled: Mar 26 2019, 14:11:21
  URL: https://ctags.io/
  Optional compiled features: +wildcards, +regex, +iconv, +option-directory, +xpath, +json, +interactive, +sandbox, +yaml

How do you get ctags binary:

source package from Debian buster and added patch https://github.com/universal-ctags/ctags/pull/2062

I noticed the problem with vim-tagbar. See https://github.com/majutsushi/tagbar/pull/529 for some screenshots of the problem. The tags aren't structured correctly. I compared the result with the Latex output, which does reflect the document structure correctly.

Two decisions have to be made:

1. In the above "expected output" I used two double quotes to separate the parts of the hierarchy (the rightmost column in the ctags output). I did this only because that is what the Latex parser does. Please decide whether this is the correct separator for asciidoc, too.
2. I am not sure about the anchors. In the Latex parser they are not put into the structure, but are kept separate. The current asciidoc parser puts them in the hierarchy below the section they are defined in. I am not sure which is the best approach. Maybe the asciidoc parser should do the same as the Latex parser?

Disclaimer: I am not used to ctags, so the above mentioned "expected output" may be wrong. I just wrote it based on what I saw in the Latex output. Please look closely, I may have done something wrong there.

Mar 26 '19 14:03 hupfdule

I wonder what scope fields (and fully qualified tags) should look like. As far as I know, nothing says that a scope field must hold a fully qualified name. The C and Python parsers, both well-maintained modern parsers, fill scope fields with fully qualified names. The PuppetManifest parser does the same.

However, the RestructuredText parser doesn't use fully qualified names.

Should all parsers use FQ names for scope fields?

Introducing an fqscope: field is one idea. Introducing a hash-number index is another.

def a:
  def b:
      def c:

a hash:3424242
b hash:1231123 scopeHash:3424242
c hash:0234235 scopeHash:1231123

This allows a client tool to build a perfect name tree. However, the tags file becomes larger and unreadable.

The use case for a documentation language is a bit different from that for a programming language.

I would like to get more comments on this topic from various people. Are there Geany developers here? From the Geany side, do you have a comment or request about the scope field? Is using FQ tags for the scope field better? If your answer is yes, what kind of scope separator should we use? If a client tool doesn't know the scope separators used in the scope fields, I guess the tool cannot build a namespace tree. If we use FQ names in scope fields, the separators used in the names must be agreed between ctags and the client tool. If we don't use FQ names in scope fields, a client tool doesn't need such knowledge.

My original idea was to use pseudo tags to pass separator information from ctags to client tools. ...There are so many issues around this.
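
To make the hash-index idea concrete, here is a minimal client-side sketch. It assumes hypothetical `hash` and `scopeHash` fields like the ones above (ctags does not emit such fields) and uses made-up hash values chosen so that each `scopeHash` equals its parent's `hash`. The helper names `build_index` and `full_path` are illustrative only.

```python
from collections import defaultdict

# Hypothetical records in the shape sketched above: (name, hash, scope_hash).
# Neither the field names nor the values are real ctags output.
records = [
    ("a", "3424242", None),
    ("b", "1231123", "3424242"),
    ("c", "0234235", "1231123"),
]


def build_index(records):
    """Index records by hash and collect child names per parent hash."""
    by_hash = {h: (name, parent) for name, h, parent in records}
    children = defaultdict(list)
    for name, _h, parent in records:
        children[parent].append(name)
    return by_hash, dict(children)


def full_path(by_hash, h):
    """Follow parent links from a hash up to the root; return names root-first."""
    path = []
    while h is not None:
        name, h = by_hash[h]
        path.append(name)
    return list(reversed(path))


by_hash, children = build_index(records)
print(full_path(by_hash, "0234235"))  # ['a', 'b', 'c']
```

The client never has to know a separator here, because parent links are followed by identifier and not by parsing names. That is the property that makes the approach attractive to a machine despite the readability cost.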

Mar 27 '19 10:03 masatake

As far as I can tell this issue also applies to the Markdown and RestructuredText parsers. Both seem to generate the same "flat" structure as the AsciiDoc parser.

a hash:3424242
b hash:1231123 scopeHash:3424242
c hash:0234235 scopeHash:1231123

This seems a bit too unreadable to me. It may be well suited to being parsed by a machine, but I like the fact that the generated tags file is human readable.

The use case for a documentation language is a bit different from that for a programming language.

I think that is important. Programming languages usually have a format for FQ names, mostly separating the parts with a dot. However, no such thing exists in documentation languages. Therefore an artificial solution must be used for documentation languages.

If a client tool doesn't know the scope separators used in the scope fields, I guess the tool cannot build a namespace tree.

I think so. For example in tagbar, you have to define that separator explicitly.

Since all documentation languages (AsciiDoc, Markdown, RestructuredText, Latex) are very similar with regard to document structure, I would use the already existing one, Latex, as the reference. The Latex parser uses two double quotes as the separator for FQ names, and it is already working well.

It may not be the best solution, but if another solution is to be found for AsciiDoc, Markdown, and RestructuredText, it should be applied to Latex, too, for consistency. However, that would mean a breaking change in the Latex parser.

Apr 01 '19 07:04 hupfdule

What you wrote is persuasive. Give me time.

Apr 01 '19 08:04 masatake

I'm working on markdown parser now.

Apr 09 '19 12:04 masatake

Filling scope fields with FQ names may be better than using non-FQ names. FQ names in scope fields may help client tools reconstruct the name tree of the input.

Using "" is also a good idea.

However, I found an interesting comment in asciidoc.c:

-			 * This doesn't use Cork, but in this case I think this is better,
-			 * because Cork would record the scopes of all parents in the chain
-			 * which is weird for text section identifiers, and also this is
-			 * what the rst.c reStructuredText parser does.
-			 */

Here, Cork is the name of an API that makes it easy to build FQ names. A scope field holding an FQ name can be very long.

I'm pasting a diff that solves this issue. I feel we need more feedback about using FQ names in the scope field.

[jet@living]~/var/ctags% git diff | cat
git diff | cat
diff --git a/parsers/asciidoc.c b/parsers/asciidoc.c
index ca6cb2a0..002dacef 100644
--- a/parsers/asciidoc.c
+++ b/parsers/asciidoc.c
@@ -45,18 +45,29 @@ typedef enum {
 	K_ANCHOR
 } asciidocKind;
 
+static scopeSeparator AsciidocSeparators [] = {
+	{ KIND_WILDCARD_INDEX, "\"\"" },
+};
+
 /*
  * The following kind letters are based on the markdown parser kinds,
  * and thus different than geany's.
  */
 static kindDefinition AsciidocKinds[] = {
-	{ true, 'c', "chapter",       "chapters"},
-	{ true, 's', "section",       "sections" },
-	{ true, 'S', "subsection",    "level 2 sections" },
-	{ true, 't', "subsubsection", "level 3 sections" },
-	{ true, 'T', "l4subsection",  "level 4 sections" },
-	{ true, 'u', "l5subsection",  "level 5 sections" },
-	{ true, 'a', "anchor",        "anchors" }
+	{ true, 'c', "chapter",       "chapters",
+	  ATTACH_SEPARATORS(AsciidocSeparators) },
+	{ true, 's', "section",       "sections",
+	  ATTACH_SEPARATORS(AsciidocSeparators) },
+	{ true, 'S', "subsection",    "level 2 sections",
+	  ATTACH_SEPARATORS(AsciidocSeparators) },
+	{ true, 't', "subsubsection", "level 3 sections",
+	  ATTACH_SEPARATORS(AsciidocSeparators) },
+	{ true, 'T', "l4subsection",  "level 4 sections",
+	  ATTACH_SEPARATORS(AsciidocSeparators) },
+	{ true, 'u', "l5subsection",  "level 5 sections",
+	  ATTACH_SEPARATORS(AsciidocSeparators) },
+	{ true, 'a', "anchor",        "anchors",
+	  ATTACH_SEPARATORS(AsciidocSeparators) },
 };
 
 static char kindchars[SECTION_COUNT]={ '=', '-', '~', '^', '+' };
@@ -109,16 +120,7 @@ static int makeAsciidocTag (const vString* const name, const int kind, const boo
 		}
 
 		if (parent && (parent->kindIndex < kind))
-		{
-			/*
-			 * This doesn't use Cork, but in this case I think this is better,
-			 * because Cork would record the scopes of all parents in the chain
-			 * which is weird for text section identifiers, and also this is
-			 * what the rst.c reStructuredText parser does.
-			 */
-			e.extensionFields.scopeKindIndex = parent->kindIndex;
-			e.extensionFields.scopeName = parent->name;
-		}
+			e.extensionFields.scopeIndex = nl->corkIndex;
 
 		r = makeTagEntry (&e);
 	}
@@ -405,8 +407,9 @@ extern parserDefinition* AsciidocParser (void)
 	def->patterns = patterns;
 	def->extensions = extensions;
 	def->parser = findAsciidocTags;
-	/* do we even need to use Cork? */
+
 	def->useCork = true;
+	def->requestAutomaticFQTag = true;
 
 	return def;
 }
[jet@living]~/var/ctags% cat input.adoc 
cat input.adoc 
= Title

Lorem Ipsum.

== a

Lorem Ipsum.

=== a.a

Lorem Ipsum.

==== a.a.a

Lorem Ipsum.

[[label-a.a.a]]

Lorem Ipsum.

===== a.a.a.a

Lorem Ipsum.

== b

Lorem Ipsum.

=== b.a

Lorem Ipsum.

==== b.a.a

Lorem Ipsum.

[[label-b.a.a]]

Lorem Ipsum.

===== b.a.a.a

Lorem Ipsum.
[jet@living]~/var/ctags% ./ctags -o - input.adoc
./ctags -o - input.adoc
Title	input.adoc	/^= Title$/;"	c
a	input.adoc	/^== a$/;"	s	chapter:Title
a.a	input.adoc	/^=== a.a$/;"	S	section:Title""a
a.a.a	input.adoc	/^==== a.a.a$/;"	t	subsection:Title""a""a.a
a.a.a.a	input.adoc	/^===== a.a.a.a$/;"	T	subsubsection:Title""a""a.a""a.a.a
b	input.adoc	/^== b$/;"	s	chapter:Title
b.a	input.adoc	/^=== b.a$/;"	S	section:Title""b
b.a.a	input.adoc	/^==== b.a.a$/;"	t	subsection:Title""b""b.a
b.a.a.a	input.adoc	/^===== b.a.a.a$/;"	T	subsubsection:Title""b""b.a""b.a.a
label-a.a.a	input.adoc	/^[[label-a.a.a]]$/;"	a	subsubsection:Title""a""a.a""a.a.a
label-b.a.a	input.adoc	/^[[label-b.a.a]]$/;"	a	subsubsection:Title""b""b.a""b.a.a
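
As a rough illustration of what a client tool can do with this output, the sketch below rebuilds the section tree from a few of the lines above. It assumes that the only extension field after the kind letter is the scope field, and that the separator is the `""` chosen in this thread. The helpers `parse_line` and `build_tree` are hypothetical names, not part of ctags or tagbar.

```python
SEPARATOR = '""'  # scope separator used for AsciiDoc in the output above

# A subset of the tag lines shown above (fields are tab-separated).
TAG_LINES = [
    'Title\tinput.adoc\t/^= Title$/;"\tc',
    'a\tinput.adoc\t/^== a$/;"\ts\tchapter:Title',
    'a.a\tinput.adoc\t/^=== a.a$/;"\tS\tsection:Title""a',
    'a.a.a\tinput.adoc\t/^==== a.a.a$/;"\tt\tsubsection:Title""a""a.a',
    'b\tinput.adoc\t/^== b$/;"\ts\tchapter:Title',
    'b.a\tinput.adoc\t/^=== b.a$/;"\tS\tsection:Title""b',
]


def parse_line(line):
    """Return (name, kind letter, scope path as a list of names)."""
    # Everything after ';"' + TAB is the extension-field part of the line.
    head, _, ext = line.partition(';"\t')
    name = head.split("\t", 1)[0]
    fields = ext.split("\t")
    kind = fields[0]
    scope_path = []
    # Assumption: any further field is the scope, written "<kind>:<FQ name>".
    for field in fields[1:]:
        _scope_kind, _, scope_name = field.partition(":")
        scope_path = scope_name.split(SEPARATOR)
    return name, kind, scope_path


def build_tree(lines):
    """Build a nested dict keyed by section name from FQ scope fields."""
    root = {}
    for line in lines:
        name, _kind, scope_path = parse_line(line)
        node = root
        for part in scope_path + [name]:
            node = node.setdefault(part, {})
    return root


tree = build_tree(TAG_LINES)
print(sorted(tree["Title"]))  # ['a', 'b']
```

Note that the split on `""` only works because none of the section names contains that sequence, which is exactly the agreement between ctags and the client discussed earlier in the thread.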

Apr 09 '19 16:04 masatake

Which parsers need to change to follow the policy "using FQ names in scope fields"?

[x] Markdown
[ ] Html
[ ] Latex
[ ] RestructuredText
[ ] YAML?
[ ] XML?

Apr 09 '19 16:04 masatake

I think in general FQ scopes are better because they at least make it easier to build an appropriate tree, and sometimes they are the only way to do so. Most of the time it's actually possible to resolve ambiguities by looking at tag lines, as scope and line order mostly match, but that can't solve all cases.

From a machine POV your idea of unique scope identifiers, like IDs, is quite nice. But yeah, I agree it's not very nice in a tags file because it's virtually impossible for a human to decipher.
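
A small sketch of the ambiguity, using a made-up document in which two chapters each contain a section named "Install". The entries and the `group_children` helper are hypothetical. Each entry carries both a parent-only scope and a fully qualified scope path so the two can be compared.

```python
from collections import defaultdict

# Hypothetical entries: (name, parent-only scope, fully qualified scope path).
entries = [
    ("Linux", None, ()),
    ("Install", "Linux", ("Linux",)),
    ("From source", "Install", ("Linux", "Install")),
    ("Windows", None, ()),
    ("Install", "Windows", ("Windows",)),
    ("Installer", "Install", ("Windows", "Install")),
]


def group_children(entries, use_fq):
    """Group tag names by their scope, keyed by either the short or the FQ scope."""
    groups = defaultdict(list)
    for name, short_scope, fq_scope in entries:
        key = fq_scope if use_fq else short_scope
        groups[key].append(name)
    return dict(groups)


short = group_children(entries, use_fq=False)
fq = group_children(entries, use_fq=True)

# With parent-only scopes both subsections land under one "Install".
print(short["Install"])            # ['From source', 'Installer']
# With FQ scopes they stay apart.
print(fq[("Linux", "Install")])    # ['From source']
print(fq[("Windows", "Install")])  # ['Installer']
```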

Then, there are 2 questions: what separators should be used, and how to communicate that to the client tool (maybe that last one was solved already?). Anyway, for the separator, we need something that cannot appear in the tag itself, and that is allowed in the scope field in the tags format. I see two distinct cases:

1. As already pointed out, for programming languages it's fairly easy as usually there's already a scope separator in the language itself (. or :: in most cases) which can then be used and feel natural.
   - nothing fancy to do here, just use the language's scope separator.
2. In text formats like AsciiDoc, reStructuredText, Markdown and more, it's however trickier as AFAIK anything is e.g. a valid title (and it should be, as it's plain text anyway).
   - here I see 2 options, with varying degrees of robustness and ease of use:
     1. use a reasonably unlikely separator. In Geany we have been using ::: in reStructuredText and conffiles at some point. We've been using ASCII ETX (0x03) as well in the client side, mostly in case we didn't want to match anything ever (for parsers that didn't generate FQ scopes). I think using an ASCII control character like ETX (maybe FS, GS, RS or US) makes sense if it's valid in the tags file and we accept it's sufficiently unlikely in text files, which I think it is as they are not printable characters.
     2. The other solution would be to use whichever separator we like, and have a way to escape it when it's part of the name itself and not actually a separator. It's more robust, but more likely to cause compatibility issues. Maybe combining a highly unlikely separator plus escaping is the best, but maybe it's overkill as well.
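
The escaping option can be sketched as follows. This is only an illustration of the idea under an assumed scheme (backslash as the escape character, every double quote inside a name escaped). It is not something ctags is known to implement, and `join_scope` and `split_scope` are hypothetical names.

```python
SEP = '""'   # separator between scope parts
ESC = "\\"   # assumed escape character (a single backslash)


def join_scope(parts):
    """Escape backslashes and double quotes in each name, then join with SEP."""
    escaped = [p.replace(ESC, ESC + ESC).replace('"', ESC + '"') for p in parts]
    return SEP.join(escaped)


def split_scope(scope):
    """Inverse of join_scope: split on unescaped SEP and drop the escapes."""
    parts, current, i = [], [], 0
    while i < len(scope):
        ch = scope[i]
        if ch == ESC and i + 1 < len(scope):
            # An escaped character is taken literally.
            current.append(scope[i + 1])
            i += 2
        elif scope.startswith(SEP, i):
            parts.append("".join(current))
            current = []
            i += len(SEP)
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


names = ["Title", 'He said ""hi""', "a\\b"]
encoded = join_scope(names)
print(split_scope(encoded) == names)    # True: the round trip is lossless
print(SEP.join(names).split(SEP) == names)  # False: naive splitting breaks
```

The compatibility cost mentioned above shows up here as well: a client that only knows to split on `""` will misread the escaped form, so both sides have to agree on the escaping rule, not just on the separator.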

Anyway, that's what I could gather on the subject from the top of my head.

Apr 25 '19 20:04 b4n

It seems that I didn't read @b4n's very valuable advice. I chose "" for documentation languages. @techee gave me the same advice at #2716.

Dec 11 '20 04:12 masatake
