ctags Cxx: extract reference tags

See also https://github.com/universal-ctags/ctags/pull/569.

Signed-off-by: Masatake YAMATO [email protected]

TODO:

[x] evaluate reference tags on Citre
[ ] update manual pages
[ ] merge #569
[ ] talk to geany people about adding Y to C parser

Nov 15 '22 02:11 masatake

Codecov Report

Patch coverage: 95.96% and project coverage change: +0.05 :tada:

Comparison is base (7e5c8dc) 82.83% compared to head (ef7e809) 82.88%.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #3535      +/-   ##
==========================================
+ Coverage   82.83%   82.88%   +0.05%     
==========================================
  Files         226      227       +1     
  Lines       54754    54870     +116     
==========================================
+ Hits        45353    45478     +125     
+ Misses       9401     9392       -9

Impacted Files	Coverage Δ
parsers/cxx/cxx_parser_namespace.c	`80.29% <ø> (ø)`
parsers/cxx/cxx_parser_template.c	`74.71% <ø> (ø)`
parsers/cxx/cxx_token_chain.c	`82.41% <ø> (ø)`
parsers/cxx/cxx_reftag.c	`91.17% <91.17%> (ø)`
parsers/cxx/cxx_parser_variable.c	`92.30% <96.42%> (+0.61%)`	:arrow_up:
parsers/cxx/cxx_parser.c	`86.60% <100.00%> (ø)`
parsers/cxx/cxx_parser_function.c	`93.09% <100.00%> (+<0.01%)`	:arrow_up:
parsers/cxx/cxx_parser_lambda.c	`93.33% <100.00%> (ø)`
parsers/cxx/cxx_parser_tokenizer.c	`92.60% <100.00%> (+0.12%)`	:arrow_up:
parsers/cxx/cxx_parser_typedef.c	`92.90% <100.00%> (ø)`
... and 3 more

... and 1 file with indirect coverage changes

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

Nov 15 '22 02:11 codecov[bot]

For the latest linux kernel source, ctags with this pull request generates 8GB tags file. It is not small, however, citre works fine. The option I used:

time ~/bin/ctags -o ~/.citre/upstream-linux.tags --kinds-all='*' --fields='*' --fields-all='*' --extras='*' --extras-all='*' --roles-all.'*'='*' -R ~/var/linux

It takes about 10 minutes for generating a tags file. I used a PC with 256GB RAM + HDD.

Citre works fine but I think we can improve UI for reference tags.

This unknown kind +ref role approach can be applied to all token-based parsers like python and javascript parsers.

Nov 15 '22 05:11 masatake

I pressed M-. on a definition, I expected reference tags are listed first.
I want more emphasized representation for ref tags. Just <R> is not enough.
in the future, roles must be displayed.

Nov 15 '22 05:11 masatake

No built-in parser users Y nor W as a kind letter. Let's use of one them instead of X for representing "unknown" kind.

Nov 15 '22 12:11 masatake

Citre works fine but I think we can improve UI for reference tags.

Citre now takes a pluggable backend design, citre-find-reference-backends supports the following functionalities:

xref-find-references
citre-jump-to-reference
citre-peek-references, citre-ace-peek-references, citre-peek-through-references

Currently the only "find reference" backend is global backend. I could create a tags backend.

Nov 18 '22 17:11 AmaiKinono

@AmaiKinono Thank you for your response.

I have used this branch + citre (87e2cbf3b2ae6d59ec919a2dcb38e56ccfa5ec14, a bit othreeder) for 3 days to read Linux kernel. M-. becomes slower maybe because readtags reports too many tags. This slowness is acceptable if the command I use is M-?. However, I would like to keep M-. faster.

The question is whether M-. of citre users becomes slower or not if I merge this branch today.

--roles-C.{unknown}=ref (--roles-all.{unknown}=ref, --roles-all.{unknown}=*, or --roles-all.*=*) is needed to make ctags emit the reference tags newly introduced in this pull request. Just adding --extras=+r --fields=+r is not enough. I intentionally made ctgas require the extra option so that users don't make too large tags file unexpectedly.

The instructions (https://github.com/universal-ctags/citre/blob/master/docs/user-manual/about-tags-file.md#create-informative-tags-file) about the way to run ctags, do not enable the "ref" role of "unknown" kind. So I can say merging this branch today doesn't get citre users in trouble.

Nov 18 '22 21:11 masatake

M-. becomes slower maybe because readtags reports too many tags.

The problem is until now, Citre assumes tags files don't contain "reference tags" in the general sense. u-ctags only emits reference tags that point to external modules/files (let's call them dependency references for now). They are useful and aren't too many, so Citre doesn't filter them out when finding definitions.

But after this PR, uctags emits reference tags pointing to general entities (let's call them general reference for now). Citre should exclude them when finding definitions, and include them when finding references. We could do this in 2 ways:

Only exclude general reference tags. This has the advantage of keeping M-. behavior of Citre unchanged. We could do this by excluding tags that have "unknown" kind and "ref" role, but if in the future, they are tagged with the correct kind (variable, function, etc.), this technique won't work anymore.
Exclude all reference tags. In this way, M-. on a module/file name won't list its references. You have to use M-? for that.

Personally I feel the second way is better.

The question is whether M-. of citre users becomes slower or not if I merge this branch today.

It won't become slower as you already pointed out.

P.S. Actually I think tagsfile is not the best format for references. If we create a tagline for every symbol in a codebase, the tagsfile could become larger than the codebase itself. A binary database (like GNU global does) could save much space. But this is only my own thoughts. It may be unrealistic to make necessary changes to uctags for solving this problem.

Nov 19 '22 00:11 AmaiKinono

@AmaiKinono You wrote what I thought but I could not write.

P.S. Actually I think tagsfile is not the best format for references. If we create a tagline for every symbol in a codebase, the tagsfile could become larger than the codebase itself. A binary database (like GNU global does) could save much space. But this is only my own thoughts. It may be unrealistic to make necessary changes to uctags for solving this problem.

I agree with you. The "space" is one of the biggest reasons why I have not implemented reference tags.

Nov 20 '22 03:11 masatake

let's call them dependency references for now let's call them general reference for now

Good ideas. These proposals help me when extending our man pages.

Nov 20 '22 14:11 masatake

Python may extract too many 'self'.

Nov 20 '22 14:11 masatake

EmacsLisp            u      unknown               yes     no      0      NONE   unknown type of definitions
Go                   u      unknown               yes     yes     1      NONE   unknown
Lisp                 u      unknown               yes     no      0      NONE   unknown type of definitions
Python               x      unknown               yes     no      2      NONE   name referring a class/variable/function/module defined in other module

Though breaking compatibilities, I will change the letters for unknown to Y.

Nov 29 '22 13:11 masatake

4615250..fe0c93a can be merged in advance.

Dec 03 '22 02:12 masatake

As I expected, the new idea, "assigning a cork index to a token", allows us to implement reference tags more than the "unknown" kind.

input1.c:

struct ops file_ops = {
    .readfn = file_readfn,
    .writefn = file_writefn,
};

output1.tags

ops ...         kind:unknown ...  roles:vardef
file_ops ...    kind:variable ... roles:def                ... typeref:struct:ops
readfn ...      kind:member ...   roles:initialized        ... scope:variable:file_ops
file_readfn ... kind:unknown ...  roles:value              ... scope:variable:file_ops
writefn ...     kind:member ....  roles:initialized        ... scope:variable:file_ops
file_writefn ...kind:unknown ...  roles:value              ... scope:variable:file_ops

Dec 03 '22 18:12 masatake

$ cat /tmp/foo.c
#include <stdio.h>
int main(void)
{
  return puts("hello, world\n");
}
$ ./ctags --kinds-C=+Y --roles-C.Y=+'{ref}'  --extras=+r --fields=+r -o - /tmp/foo.c
main	/tmp/foo.c	/^int main(void)$/;"	f	typeref:typename:int	roles:def
puts	/tmp/foo.c	/^  return puts("hello, world\\n");$/;"	Y	function:main	roles:applied
stdio.h	/tmp/foo.c	/^#include <stdio.h>/;"	h	roles:system

There are still many TODO items, ctags running locally successfully extracted puts, a macro expanded or a function called in main. See the "roles:" field of "puts". The scope was also filled. As I wrote ago, citre works well.

We should add the rule, "unknown+ref should have lower priority" to Citre's sorting engine.

Jan 03 '23 20:01 masatake

@AmaiKinono

When I looked up a rather generic name like "inode", M-. took very long time. For looking up reference tags, it is acceptable; I will just do C-g. However, when I was interested only in its definition, taking too long time was not acceptable.

So for evaluation of this pull request, I added some code to Citre:

diff --git a/citre-tags.el b/citre-tags.el
index 4ce793e..bdcc520 100644
--- a/citre-tags.el
+++ b/citre-tags.el
@@ -375,6 +375,7 @@ completion can't be done."
         (file-path (citre-get-property 'file-path symbol)))
     `(not
       (or
+       ,(citre-readtags-filter 'extras "reference" 'csv-contain)
        ;; Don't excluded "anonymous" here as such symbols can appear in typeref
        ;; or scope fields of other tags, which may be shown in an xref buffer,
        ;; so we should be able to find their definitions.
@@ -413,6 +414,40 @@ is found."
      :require '(name ext-abspath pattern)
      :optional '(ext-kind-full line typeref scope extras))))
 
+;;;; Find references
+
+(defun citre-tags-reference-default-filter (symbol)
+  (let ((tags-file (citre-get-property 'tags-file symbol))
+        (file-path (citre-get-property 'file-path symbol)))
+    `(not
+      (or
+       ;; Don't excluded "anonymous" here as such symbols can appear in typeref
+       ;; or scope fields of other tags, which may be shown in an xref buffer,
+       ;; so we should be able to find their definitions.
+       ,(if file-path
+            (citre-tags-filter-local-symbol-in-other-file file-path tags-file)
+          'false)))))
+
+(defvar citre-tags-reference-default-sorter
+  (citre-readtags-sorter
+   citre-tags-sorter-arg-put-references-below
+   'input '(length name +) 'name
+   citre-tags-sorter-arg-size-order))
+
+(defun citre-tags-get-references (&optional symbol tagsfile)
+  (when-let* ((symbol (or symbol (citre-tags-get-symbol tagsfile)))
+              (tagsfile (or tagsfile (citre-get-property 'tags-file symbol))))
+    (citre-tags-get-tags
+     tagsfile symbol 'exact
+     :filter (or (citre-tags--get-value-in-language-alist
+                  :reference-filter symbol)
+                 (citre-tags-reference-default-filter symbol))
+     :sorter (or (citre-tags--get-value-in-language-alist
+                  :reference-sorter symbol)
+                 citre-tags-reference-default-sorter)
+     :require '(name ext-abspath pattern)
+     :optional '(ext-kind-full line typeref scope extras))))
+
 ;;;; Completion backend
 
 (defvar citre-tags--completion-cache
@@ -483,7 +518,9 @@ The result is a list (BEG END TAGS), see
     (citre-tags-get-definitions symbol tagsfile)))
 
 (defvar citre-tags--find-definition-for-id-filter
-  `(not ,(citre-readtags-filter 'extras "anonymous" 'csv-contain))
+  `(not (or
+         ,(citre-readtags-filter 'extras "reference" 'csv-contain)         
+         ,(citre-readtags-filter 'extras "anonymous" 'csv-contain)))
   "Filter for finding definitions when the symbol is inputted by user.")
 
 (defvar citre-tags--id-list-cache
@@ -546,6 +583,16 @@ simple tag name matching.  This function is for it."
  :identifier-list-func #'citre-tags-get-identifiers
  :get-definitions-for-id-func #'citre-tags--get-definition-for-id)
 
+;;;; Find refernces backend
+(defun citre-tags-get-references-at-point ()
+  "Get definitions of symbol at point."
+  (when-let ((tagsfile (citre-tags-file-path))
+             (symbol (citre-tags-get-symbol)))
+    (citre-tags-get-references symbol tagsfile)))
+
+(citre-register-find-reference-backend
+ 'tags #'citre-tags-get-references-at-point)
+
 ;;;; Tags in buffer backend
 
 (declare-function tramp-get-remote-tmpdir "tramp" (vec))

Jan 22 '23 12:01 masatake

So for evaluation of this pull request, I added some code to Citre:

Thanks! This is really helpful.

May I ask you for a PR to Citre, or can I use your code directly?

Jan 22 '23 12:01 AmaiKinono

May I ask you for a PR to Citre, or can I use your code directly?

This is a quick hack. Can you please add a new code for M-?. I'd be happy if my code could be used as a base for the new code.

What I didn't add was code for 'C-u M-?'. I want to have it.

You wrote:

But after this PR, uctags emits reference tags pointing to general entities (let's call them general reference for now). Citre should exclude them when finding definitions, and include them when finding references. We could do this in 2 ways:

Only exclude general reference tags. This has the advantage of keeping M-. behavior of Citre unchanged. We could do this by excluding tags that have "unknown" kind and "ref" role, but if in the future, they are tagged with the correct kind (variable, function, etc.), this technique won't work anymore.

Exclude all reference tags. In this way, M-. on a module/file name won't list its references. You have to use M-? for that.

Personally I feel the second way is better.

I agree with you.

P.S. Actually I think tagsfile is not the best format for references. If we create a tagline for every symbol in a codebase, the tagsfile could become larger than the codebase itself. A binary database (like GNU global does) could save much space. But this is only my own thoughts. It may be unrealistic to make necessary changes to uctags for solving this problem.

I agree with you. However, I would like to use the tags output till I understand what I want.

Jan 22 '23 15:01 masatake

What I didn't add was code for 'C-u M-?'. I want to have it.

This one is hard.

When finding references of a user-inputted symbol, what xref does is get the symbol by xref-backend-identifier-completion-table, then pass it to xref-backend-references, with the text property stripped.

Citre is an xref backend and itself also has backends. Suppose a Citre backend A can give us all symbols in a project (so it could support xref-backend-identifier-completion-table), but doesn't support finding references, while another Citre backend B supports finding references of a symbol. When I press C-u M-?, Citre tries the backends in turn, and this could happen: we get the symbol name from backend A, and feed it to backend B to find the references. This is wrong, we should get the symbol from backend B.

Before passing a symbol to xref-backend-references, xref removes its text properties, so we could not know which Citre backend gives us the symbol. This makes designing a "xref backend with multiple backends" (like Citre) very hard.

This may be solved by some jump wire code but it needs some design... We are too away from the theme of this PR. Now I know what you want and I'll see how to implement it ;)

Jan 22 '23 16:01 AmaiKinono