bazel-compile-commands-extractor file based filter

https://github.com/hedronvision/bazel-compile-commands-extractor/issues/93

Usage

bazel run //:refresh_compile_commands -- --file=a/b/c.c

[ ] # TODO we should always be quoting targets when we splice them in https://github.com/xinzhengzhang/bazel-compile-commands-extractor/pull/1
[ ] # TODO --file https://github.com/xinzhengzhang/bazel-compile-commands-extractor/pull/1

# TODO: Let's detect out-of-bazel, absolute  paths 
# TODO full paths on windows
# TODO be careful about this case with --file

[ ] # TODO consider running a preliminary aquery to make this more specific https://github.com/xinzhengzhang/bazel-compile-commands-extractor/pull/1
[ ] # TODO simplify loop
[ ] # TODO for consistency, let's have this return a list
[ ] # TODO switch to --include_param_files
[ ] # TODO simplify/unify logic around warning about not being able to find commands * 3
[ ] # TODO for --file, don't continue traversing targets after the first command has been found

Dec 23 '22 15:12 xinzhengzhang

Header-only targets will still present a problem, I think, since they don't have any compile actions in the targets that list them in attrs. Any ideas on how we should solve this?

Although they cannot be found in the target action but they should be defined in the BUILD file in attr unless using something like build --spawn_strategy=standalone and it is not recommended. The only question is to find which attr and I think srcs and hers is enough in most case

attrs seems to operate on a stringified list of targets, rather than paths. Maybe we can use this to be more specific somehow, but if not, let's document a little better what we're doing. It's probably okay to accidentally update a few extra compile commands as as long as it remains fast enough for you, though, of course, being specific is ideal.

Since bazel query runs on the post-loading phase target graph it doesn't have enough information to get e.g. the real file path. So we have two ways

Write clearly in the doc, if it is a header file, you need to pass in the relative path starting from BUILD
Only query the file name, like now (It will contain possibly more targets but is easier to use）

Update in place: Let's put the logic that updates compile_commands.json with the new entries for the --file in this tool, rather than the plugin! That way it can be eventually shared across editor plugins and you don't have to deal with overwriting.

Do you think we need a flag to mark override or merge? Or only --file uses merge?

Dec 24 '22 10:12 xinzhengzhang

Not all (C++) headers end with .h. (Nor, indeed, any extension at all, sadly.) Suggestion: Flip the if statement the other way, checking if the file has a source extension, otherwise treating it as a header.

https://github.com/hedronvision/bazel-compile-commands-extractor/pull/99/commits/8e834df6ebecbed782134c39cade264ca4372125

I'm guessing that using a let variable definition for the header query is more efficient, saving a double evaluation of the target_statement, though I'm not certain.

https://github.com/hedronvision/bazel-compile-commands-extractor/pull/99/commits/4cac1f6e94ae860e9b5886ed4b29ff3016cfa36d

Let's put the logic that updates compile_commands.json with the new entries for the --file in this tool, rather than the plugin! That way it can be eventually shared across editor plugins and you don't have to deal with overwriting.

https://github.com/hedronvision/bazel-compile-commands-extractor/pull/99/commits/79cf4abd0e9933cbcdb825a1b14f2f510c5437d0

I think we probably want to omit the slow header search when not needed with --file?

I don't really understand the meaning. If we don't specify --file=a.h, won't we go here? Or do we need an additional flag to filter out users even if they come in?

Thank you for caring and I wish you Merry Christmas~

Dec 24 '22 11:12 xinzhengzhang

Hey @xinzhengzhang! Sorry for being slow--got buried in tasks over here. Resurfacing. I'm so sorry; I promise I'm usually faster, like with my initial replies.

Thanks for getting back on these and for making improvements. As before, I also added some quick commits on small items that seemed better to fix than to spend time discussing. Hopefully that's okay. Explanations in the commit messages. If there's anything you don't like, just say or change them. I appreciate your giving me write access.

For the things that are still open, I'll try to expand on each in more detail:

The (relatively common) case of header-only targets: Let's consider the following simplified example: Say we're trying for --file=templated.hpp

cc_binary(
    name = "main",
    srcs = ["main.cpp"],
    deps = ["lib"],
)
cc_library(
    name = "lib",
    hdrs = ["templated.hpp"], # Or srcs
)

Won't our aquery currently first filter to just lib, since it's the one with templated.hpp in hdrs...and then find no actions, since no actions are actually generated by lib, since it's header-only?

If I'm right, we'll need a creative fix here. I think I suggested this one before, but the best I've thought of so far is to first try what you did here and then fall back to the inputs query type that we use for source files, running header search until we find the first one that actually uses the header. I'd love it if you'd think on it and try to come up with something even better!

[There's also the case of system headers, which won't be listed in an attribute, but I think those would also be caught by the above. There's also the case of a library that may have compile actions but none that include the header being requested via --file. More on that in the next section...]

Omitting the slow header search Nearly all of the runtime of this tool is finding which headers are included for a given source file in _get_headers by running the clang/gcc preprocessor--that's the code that'd turned off by exclude_headers. We should similarly skip doing that work for when we use --file= for a source file, since we're only asking about the one source file, not the headers, and since you care about this running really fast.

The exception is when running --file= for a header. We'll need to use _get_headers there to figure out if an action actually includes the header specified (probably just taking the first for speed)! And if none do, then, as above, we'll need to fall back to the inputs based query.

Attr and stringified targets What about using a preliminary aquery and the outputs function to get the target for a given path? Maybe there're even better ways.

Override vs merge I think it's totally fine to just have --file imply mergeing with no separate flag, unless you see value otherwise?

I'm quite excited about this! But I'll need you to take the lead on making this robust.

Cheers, Chris

Jan 30 '23 04:01 cpsauer

Wishing you also a happy lunar new year :)

Jan 30 '23 04:01 cpsauer

For headers how about we divide the header extractor into these steps

query targets which attrs(hdrs&srcs) contains the header file. If there has other source extensions like .c then we just extract this one if not we go to step 2 to process header-only library
query actions that using this header file as input (Can only be found without using module) if not found we go to step 3
traverse the target graph in postorder sort which depends the header-only lib and find the first one that actually uses the header. If we still not found then we go to step 4 (the case of system headers)
traverse the complete target graph in topological sort and compile-test a dummy source which include this header. To prevent excessive timeouts we assume a maximum of 50 attempts

# pseudocode
header = "/a/b/c.hpp"
targets = query(root_target, hdrs|srcs, header)
is_header_only_lib = targets.filter(srcs.contains(source_extension))

if targets.length:
    if is_header_only_lib:
      action = aquery(inputs, header_statement).first
      if action:
          return action.args
      else:
          targets = query(root_target, deps, targets_statement).order_by(postorder)
          args = targets.srcs.find( {
              src.get_headers().contains(header)
          } )
          if args:
              return args
    else:
        return targets.srcs.find( {
              src.get_headers().contains(header)
        } )

targets = query(deps(root_targets)).order_by(topological)
test_file = "include {header}"
args = targets[:50].find( {
    compile_test(test_file, target.actions.first.args) == true
})
if not args:
    return "extract failed"

Attr and stringified targets What about using a preliminary aquery and the outputs function to get the target for a given path? Maybe there're even better ways.

Sorry, I don't quite understand what this given path is used for

Jan 31 '23 08:01 xinzhengzhang

Something like that! I'd suggest a couple tweaks:

For 1: I don't think we can assume that the .c will include the header. Instead, we'll have to run the header search process to check, falling back to 2 if not.

(I'm not quite sure I understand the module part; as much as possible we'll want to turn those off for our analysis so that we don't depend on a build.)

For 4: I think it's okay to require an example of use rather than creating a dummy target, like we currently do for header only libraries, since clangd will automatically fall back to generic flags anyway. But if you think otherwise, please do say!

As above, attrs seems to operate on a stringified list of targets, rather than paths, which looked like it was getting in the way and making the query less specific. So, rather than just getting the filename, I was proposing that we might want to run another aquery, using outputs() to figure out the name of the target for a given file path. Does that make sense?

Feb 06 '23 07:02 cpsauer

Hey, @xinzhengzhang! I wanted to check back in on these to see how you're doing. Thanks for all your efforts; I know this is hard to get right!

Feb 28 '23 00:02 cpsauer

Hey, @xinzhengzhang! I wanted to check back in on these to see how you're doing. Thanks for all your efforts; I know this is hard to get right!

Sorry to be too busy with work recently, I will implement a basic version this weekend

Mar 03 '23 03:03 xinzhengzhang

Sorry again for the long time of inactivity due to work.

I just rebased the main branch and squash the commits. Regarding the header lib problem discussed above, I have carried out a basic implementation https://github.com/hedronvision/bazel-compile-commands-extractor/pull/99/commits/f4441c3f88949679f46f07b87feab1f668175b0c I added a big loop in _get_commands to check if the found commands match the expected, if this approach is ok, I will continue to add logic such as short-circuit finding process so as not to parse the complete graph and max attempt for finding header.

Mar 03 '23 05:03 xinzhengzhang

Totally get the busyness. I feel it, too :)

We'll definitely need to invoke query multiple times. A fixed-depth breaking loop like that works, but you might end up finding it cleaner/easier to factor out the query call-(as a nested function if you need to capture?) and then have control flow that explicitly falls through each case if empty?

(Some smaller things, trying to give fast feedback: finded -> found. And you'll want to be careful about the no-action error case, since that's now tolerated for the file_flags case.)

Mar 03 '23 08:03 cpsauer

I have simplify the loop using nested function and fixed no-action case in _get_commands https://github.com/hedronvision/bazel-compile-commands-extractor/pull/99/commits/251bbb2cc281b77b508538c3afc9a38fecfee808 Terminate the loop if header has been found or reached the max attempt times https://github.com/hedronvision/bazel-compile-commands-extractor/pull/99/commits/8df22384e2cd1f6a30c4331eef8965c047c41516

Since I don't have much experience with Python, and it is a weak type language, the implementation of passing lambda to thread handler seems very strange. If there is a good way to modify it, please also help me to modify it

Mar 03 '23 14:03 xinzhengzhang

Hi, Chris @cpsauer I would like to ask if there is anything else that needs to be modified for this pr?

In addition, I want to change my vscode extension to bzlmod form, so I want to patch the dependency compile-commands. I would like to ask if you have the possibility of tagging and being referenced? Of course, it is also possible for me to use extension to use commit integration

Mar 16 '23 11:03 xinzhengzhang

Since the entire aquery logic is nested, there are many conflicts. I just rebase main and push -f

Mar 18 '23 19:03 xinzhengzhang

Hey @xinzhengzhang. Sorry for causing--and thanks for rebasing. Apologies for being slow--got sick over here and I'm kinda dragging and backlogged. I'll try to get through this soon.

Mar 18 '23 22:03 cpsauer

Ran out of time--and am now traveling for the week. Will try to get to this weekend.

I definitely saw some bugs on a quick scan through. Could I ask you to polish in the meantime?

Mar 27 '23 04:03 cpsauer

Thanks for getting back to me. I'll do my best to identify and fix the bugs. Have a great trip!

Mar 27 '23 05:03 xinzhengzhang

Sorry, maybe due to preconceived notions, although I reviewed it again, I didn't find too many bugs... if it's ok, you can directly modify my pr as you want.

Apr 02 '23 15:04 xinzhengzhang

Hey, @xinzhengzhang! I put in some work over the past few days to move this along. It's tricky to get right!

Among a bunch of small things, I did the following:

flattened out the threading to make things simpler
got us an alternative to inputs() that works for headers (which will come up blank because bazel hides headers behind a "middleman")
Squashed some bugs, new and old.
Made it so we only run header search when we really need to, as proposed above, which should really speed things up for source files.
Marked remaining things that need to be considered with TODOs and short explanations.

From here, I'd love to pass the torch back to you in resolving some of those TODOs. It might be best to start with with the target_statement_candidates TODOs, then work through the others, saving fixing warning about no output for last. I'd love it if you'd test it out for your use case as you go!

Sorry to not be just carrying this across the finish line, but I think this is more of an enterprise-scale need for you guys and I figure you all have some resources as a big company. (We're a small startup trying to get off the ground, so the time I can spend doing other free work is limited.) But hopefully this helps unstick things, providing structure to the remaining items and the dependencies between them!

[There's another guy with a similar need who spun up a PR of his own. He works with Windows, IIRC, so I'll see if he's willing to join in and help with that part. Backlinks landing shortly.]

May 04 '23 05:05 cpsauer

TODO discuss the below. If we're focusing on a header, we return just that header. Why? Only source files are useful for background indexing, since they include headers, and we'll already save the work for header finding in our internal caches.

When we focusing on a header the response we expected should be {file: <HEADER_FILE>, arguments: [-I<SEARCH_PATH>]}. For clangd it can jump to the header file of its dependency in search path. I think that for the header file, the key is to find its transitive search path, which is why we need to find a dependency of his source file, right?

May 05 '23 07:05 xinzhengzhang

Hi, @cpsauer I have done some implementations for TODO, please help to review it :)

timeout=3 if focused_on_file else None # If running in fast, interactive mode with --file, we need to cap latency. #TODO test timeout--if it doesn't work, cap input length here, before passing in array. Might also want to divide timeout/cap by #targets

I made a simple calculation-based implementation https://github.com/hedronvision/bazel-compile-commands-extractor/pull/99/commits/89f583a40cad5bca085130e319f9bf049fcfc6fa , but found a problem in my project. The actions in the front row are usually completed very quickly due to the header cache, resulting in a very large number of actions that can be executed in timeout

TODO be careful about this case with --file

From the sort through code logic, I see that headers are produced through .d file, but I don't quite understand its relationship with --file.

TODO: Let's detect out-of-bazel, absolute paths and run this if and only if we're looking for a system header. We need to think about how we want to handle absolute paths more generally, perhaps normalizing them to relative if possible, like with the windows absolute path issue,

I tried to make a relative path if the incoming path is absolute path and it is the relative path of the current cwd to distinguish whether it is a syetem header. I don't know if it matches your idea. https://github.com/hedronvision/bazel-compile-commands-extractor/pull/99/commits/6a6777a8a5054c2cf8302537863998d3783f9a9b

switch to --include_param_files (somewhat confusingly named, but includes the contents of param files on the command line) and test on Windows. Including in this diff to not create conflicts.

Is that means if --include_param_files can be passed on windows then we change the feature flag to bazel flag?

May 05 '23 09:05 xinzhengzhang

Hi, @xinzhengzhang! Thanks for your quick work, but most of these are going to require some more experimenting and thought before they work robustly enough that we can get them in. I'd love to have this feature land for you and others--and happy to clarify/help--but I'll still need you to really think through the cases and how to handle them robustly.

Replying to each. There are some important things for us to discuss, starting with the high level. Could I get you to first read down to the first horizontal bar (---) and reply and then we'll return to the code-review level?

Re: TODO discuss the below.

I'm not sure quite what you're asking. Could I ask you to rephrase?

To save round trips, here's what's in my head: This comes up in the case where someone opens, say, my_header.h and their editor asks for --file=my_header.h, needing a compile command for my_header.h in order to provide good autocomplete. Only source files are run through the compiler directly. Therefore, to understand how my_header.h is compiled, we need to get the command for a source file that includes my_header.h, directly or indirectly. That command contains a bunch of things relevant to how the autocomplete should go--both flags that adjust the explicit include paths, like you list, but also many other flags that affect how the source is interpreted, like, for example, additional defines (-D). Does that align with your understanding and needs?

What that discussion item is about: As we search to find a source file that includes my_header.h, we get, for free, commands that apply to other files, like the other headers included by that source file. We need to decide which of those (file, command) pairs we should add to compile_commands.json.

On one hand, we could just add all the pairs we got for free. But that might kick off a lot of background indexing work, and my understanding is that you want this feature in order to minimize the indexing work required to browse files. Adding a few source files in the background is potentially quite useful for go-to-definition and the discovery of other symbols. But other headers seem fairly pointless and high cost: they don't introduce any symbols not already seen from processing a source file (which pulls in its includes) and they're likely to keep getting new (but equivalent) commands that trigger a lot of reindexing. Hence that tradeoff: Only add/update the command for my_header.h and no other headers--but still add/update commands for source files found along the way. What do you think of that tradeoff and why? Does that seem optimal or would you like something else? That was my guess, but you know your use case better than I do (like with allpaths!). You can also go with this for now, experiment around as you use it, and then adjust it if needed. Either way, we should eventually document what we decide and why, so please add to the comments until they'd have been sufficient for you to understand on a first read.

Aside: reload behavior

Are you doing anything special to get good reload behavior from clangd in the editor? I'm seeing that clangd needs a modify/save event to trigger it getting a new command. Is that what you're seeing?

--include_param_files

Right. (If I'm correctly understanding what you're writing.) Do you use Windows and understand what's going on with param files for Windows? (Totally ok if no on either.) Assuming I'm right in remembering that LoSealL does some development on Windows machines--we might want to ask him if he'd look into these Windows things. Either way, would you be willing to give LoSealL commit access to this branch of your fork so he can help out as needed?

TODO be careful about this case with --file

Also windows-specific stuff. No .d files involved here. Are you developing on windows at all, or should we be looking to ask LoSealL on these?

[To explain more: The concern is that MSVC on windows is producing absolute paths for all headers, including in the workspace, which then won't match in statements elsewhere like if file_path in headers.]

timeouts

I saw your email (now edited) about timeout not working. I was similarly having trouble getting it to trip--hence the ask that you investigate. How did that resolve? Were you able to get the timeout to activate?

It's not obvious to me what you're going for with the timeout code you added--nor why it's bad to be able to traverse many actions if it's done fast. Could you explain?

[I had been thinking that, if we could get the timeout to work, we might want to allocate fewer seconds if there were, e.g., 3 sequential calls to aquery for different top-level targets specified in refresh_compile_commands. Perhaps 1s each instead of 3s each, to maintain a total latency number. I wasn't envisioning any fancy calculating, but you may have thought of something good I haven't!]

TODO: Let's detect out-of-bazel, absolute paths and run this if and only if we're looking for a system header. We need to think about how we want to handle absolute paths more generally, perhaps normalizing them to relative if possible, like with the windows absolute path issue,

This is going to require thinking about what you might pass in from the outside. What about absolute paths to the canonical location of bazel-out or external (i.e. at the other end of the symlink)?

Further, once we've recognized system headers (i.e. absolute paths in none of those locations?) aren't those the only ones we want to run the (expensive) deps(target) query for?

[From your email/past edits] inputs(headers) and allpaths() being too slow

How did this resolve? We definitely want this to be fast and fast enough for you. To answer the questions you'd asked: I thought I was seeing that headers never appeared under inputs, because they were hidden behind middlemen. Are you seeing otherwise? (It looks like you're seeing the same from you edits, but I want to make sure.) If that's right, then we need an alternative. If allpaths is too slow, could you see if the --universe_scope & rdeps(,) alternative I'd suggested there might work? Or think about other possibilities? I really need your help here, since you're the one with the enormous codebase to test on.

Moving to commits

More generally, could I ask you to not delete TODOs until we're confident they've been handled robustly? And put back the ones you did? Makes tracking much easier.

Some significant problems remain [see below], so maybe give these a quick skim, but read the "looking back at the commits" sub-section at the bottom before editing.

From ef335095c91eb5ca53a7c02d6e4ba7b15bccf3f0

Come on--we'll need to do quoting in all places where we splice a target into an aquery command, as it said! Please be proactive on these :) Wish I'd had it right from the start--but I figure you'd need to know for mutating the query and don't want to cause an unnecessary merge conflict.

From f8b046e3e23e4fce4a65bac0eef8ac57cb0f5e4f

Not obvious this makes things better? Appreciate the effort, but how about the aquery with inputs(), to get the targets that produce the file, suggested for exploration in the TODO? If I've misunderstood and this works well, or that doesn't work, please tell me.

Not found warnings 9ce17168b54a2838705114b94c28bba3fb3efb84

Again, let's return to these at the end. We're still going to get a lot of erroneous ones of the first kind, I think, for the reasons in the (deleted) TODO.

40b9131c9dd4821b538c750aaf98c084762ec026

Thanks. Can you put back the documenting comments from the yield and not use SimpleNamespace, since I think we'll have to strip it for #118. I'm also realizing this part of the comment is obsolete ", with commas after each entry,"--could you delete? (more generally, if you happen upon mistakes like that, please just fix).

0453be435112dfec20ff549273ab69347c28df4b

I see you moved the loop inside--but not the target_flag_pairs template filled by bazel. Don't we want that inside, too? I think the structure here may simplify significantly if we end up with one aquery command per branch (but maybe we won't be able to get there.

Looking back at the commits

If it were me, I'd be pretty tempted to reset the branch back to bd7bde160b16d2258b82f75010c5e2fd2d97acc0 and move forward from there. There are some good things in the commits. But since each needs work and a bunch of (TODO and other) lines need to be restored anyway, I'd think restoring to bd7bde160b16d2258b82f75010c5e2fd2d97acc0 and then moving forward, copying in useful parts of the previous commits, might be the way to go. Maybe I should be just doing that, but I don't want to step on your toes too hard.

Moving forward, I'd like to ask that you make sure you've got higher confidence before deleting those TODOs and pushing code for review. It's totally ok to not resolve all the TODOs, especially all at once. But it's taking too much time to review and discuss rough versions of the code with issues I suspect you could easily have caught without me. Better to clarify first, discussing if needed, and tell me where you aren't sure or need me to do something. Then, focus individually on the ones where you have confidence, doing some self-review before pushing changes, making sure you've actually resolved the things in the comments and from our discussion. Since this is a big change, another tactic for changes where you're uncertain would be to file PRs against your own branch. That way, we can easily discuss each in its own context. Please don't get me wrong; I massively appreciate your energy and enthusiasm to contribute. I'm just trying to find more efficient ways for us to work together to make this happen.

Thanks, Chris

May 06 '23 02:05 cpsauer

Hi, @cpsauer!

Very very thanks for your super patiently mentoring. There are some places I really didn't get what you wanted to express, such as the part in #discuss, but I fully understand after your detailed delineation As for deleting TODO, it may be my habit to rebase frequently based on a single commit, which will cause the history to be lost. It is indeed inconvenient to quote and discuss. I will revert all this commits and file PRs which fix TODO from other branches in the future.

Re: TODO discuss the below.

Yes, it totally align with my understanding and needs.

We need to decide which of those (file, command) pairs we should add to compile_commands.json.

In my opinion, the header file should not write anything other than the frame header file, for example in the iOS should just import < Foundation/Foundation.h > and < UIKit/UIKit.h > All other nested dependencies should be in the form of previous declarations. In this view, in fact, we just need to find one source file that directive contains this header, other more indexes are just waste and interference. Something like Xcode that indexes the entire project can also cause the header file to actually have the ability to prompt a large number of files, and we've even had a case where we wrote a source file with the wrong dependency in the header file.

// We had a bug like this, no matter how we modify file_c.m it didn't work cause file_a.m include all content in file_c.m
// When we modified file_c.m no changed with file_b.h and no dependencies changed with file_a.m cause it use cache.
# file_a.m
# import file_b.h

# file_b.h
# import file_c.m

That's why I lean towards this

Are you doing anything special to get good reload behavior from clangd in the editor? I'm seeing that clangd needs a modify/save event to trigger it getting a new command. Is that what you're seeing?

In the vscode extension, I observed change of BUILD and WORKSPACE to refresh the compile_commands

--include_param_files TODO be careful about this case with --file

About file_path part, I will give commit access to @LoSealL to improve the windows part together. ButI digged at _get_headers_msvc code even if it returns the absolute path did not see his relationship with --file, I understand that --file is just to narrow the scope of the query action, understand wrong?

I will reply to other comments later

May 06 '23 04:05 xinzhengzhang

timeouts I saw your email (now edited) about timeout not working. I was similarly having trouble getting it to trip--hence the ask that you investigate. How did that resolve? Were you able to get the timeout to activate?

It's not obvious to me what you're going for with the timeout code you added--nor why it's bad to be able to traverse many actions if it's done fast. Could you explain?

I have write minimum code replicate this bug I'm guessing that if you iterate outside of scope, the internal implementation of ThreadPoolExecutor will not pass the call next so that no timeoutError is thrown, and the full execution is performed

import os
import concurrent.futures
import time

def _foo(i):
    time.sleep(1)
    return i

print("worker count = {}".format(os.cpu_count()))

with concurrent.futures.ThreadPoolExecutor(
    max_workers=min(32, (os.cpu_count() or 1) + 4)
) as threadpool:
    task = [i for i in range(100)]
    output_iterator = threadpool.map(
        lambda i: _foo(i),
        task,
        timeout=3
    )
#   # work
#   try:
#       print("before worker current time = {}".format(time.time()), flush=True)
#       for output in output_iterator:
#           print("current time = {} output = {}".format(time.time(), output), flush=True)
#           
#   except concurrent.futures.TimeoutError:
#       print("timeout")

# not work
try:
    print("before worker current time = {}".format(time.time()), flush=True)
    for output in output_iterator:
        print("current time = {} output = {}".format(time.time(), output), flush=True)

except concurrent.futures.TimeoutError:
    print("timeout")

// https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor
map(func, *iterables, timeout=None, chunksize=1)
Similar to [map(func, *iterables)](https://docs.python.org/3/library/functions.html#map) except:

the iterables are collected immediately rather than lazily;
func is executed asynchronously and several calls to func may be made concurrently.
The returned iterator raises a [TimeoutError](https://docs.python.org/3/library/exceptions.html#TimeoutError) if [__next__()](https://docs.python.org/3/library/stdtypes.html#iterator.__next__) is called and the result isn’t available after timeout seconds from the original call to [Executor.map()](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.map). timeout can be an int or a float. If timeout is not specified or None, there is no limit to the wait time.

It's not obvious to me what you're going for with the timeout code you added--nor why it's bad to be able to traverse many actions if it's done fast. Could you explain?

Now that this bug is solved, this complex estimation method is no longer needed

May 06 '23 06:05 xinzhengzhang

Hey! Yes. Thanks for being so coachable and energetic.

You're right that some of my todo notes were fairly inscrutable. Sorry about that. They were initially meant to be notes to myself and then, well, I figured I'd just clean the up a little bit, hoping they'd either make sense or you ask. Wish I'd done a better job.

Other items:

I'm not understanding what you're saying about the headers. Just language barrier, I think. Could I ask you to take another shot at explaining? Wish I could read Chinese as well as you can English :)

Reload: Ah, sorry, I mean getting clangd to reload the new changes from an updated compile commands.json. Does that make more sense?

Thanks for giving LoSealL access! (I think just ignore that stuff for now. But if you're curious but the problem will be as above. If you pass in a relative path it won't match the absolute paths returned by MSVC.)

May 06 '23 06:05 cpsauer

Re timeout: Whoa, that is very surprising to me. If you figure out why, I would be very curious. Sorry about leaving that bug for you.

(I'll consult with a friend who's a python expert to try to figure out what's going on. My guess is that the threadpool tries to finish all of its actions before exiting scope, but then neglects to check for timeouts?)

May 06 '23 06:05 cpsauer

Hey! Yes. Thanks for being so coachable and energetic.

You're right that some of my todo notes were fairly inscrutable. Sorry about that. They were initially meant to be notes to myself and then, well, I figured I'd just clean the up a little bit, hoping they'd either make sense or you ask. Wish I'd done a better job.

Other items:

I'm not understanding what you're saying about the headers. Just language barrier, I think. Could I ask you to take another shot at explaining? Wish I could read Chinese as well as you can English :)

Reload: Ah, sorry, I mean getting clangd to reload the new changes from an updated compile commands.json. Does that make more sense?

Thanks for giving LoSealL access! (I think just ignore that stuff for now. But if you're curious but the problem will be as above. If you pass in a relative path it won't match the absolute paths returned by MSVC.)

header

In chines we usually call file which exposed interface are "头文件". Literally translated as header file is also the header I mentioned. For some high-level languages such as swift or python or cpp module usually we do not use the word header to describe their interface.

In case I'm not making my point, I'm going to go into more detail.

Under the c system language, using like #include.h or some variant like #import is actually to expand its content into source. There is no specification for what should be written in the header file, so there are many usages, such as definitions, macros, umbrella headers, etc. Many usages. But I think the main use of header files should be exposed as interfaces. If exposed as a module(source file) interface, then if there is more interference from other contexts of the complete project, it will greatly interfere with the writing of R & D. And in fact, we have also had a lot of hidden problems that cause problems by writing casually in the header file. So I will require only dependency system headers in our team header file, because single responsibility one module should not have too many dependencies other as exposed.

Reload

I'm using clangd wrapped via sourcekit-lsp, and it works well when compile_commands changes, I'm not too deep into his specific implementation, you can start from https://github.com/apple/sourcekit-lsp/blob/90fd3addb28a7b3cb2ef40acca6670cd57cbdd53/Sources/SourceKitLSP/SourceKitServer.swift#L730 to refer to his line of sight if necessary

abspath for msvc

From here I saw that you will transfer absolute path extracted from process to relative path. It seems that if you pass in the relative path, it can match. If you pass in the abs path, it will be converted to the relative path in the input. But I don't have the windows environment and experience now. I will test it on the windows platform later.

def _file_is_in_main_workspace_and_not_external(file_str: str):
    file_path = pathlib.PurePath(file_str)
    if file_path.is_absolute():
        # MSVC emits absolute paths for all headers, so we need to handle the case where there are absolute paths into the workspace
        # TODO be careful about this case with --file
        workspace_absolute = pathlib.PurePath(os.environ["BUILD_WORKSPACE_DIRECTORY"])
        if not _is_relative_to(file_path, workspace_absolute):
            return False
        file_path = file_path.relative_to(workspace_absolute)
    # You can now assume that the path is relative to the workspace.
    # [Already assuming that relative paths are relative to the main workspace.]

    # some/file.h, but not external/some/file.h
    # also allows for things like bazel-out/generated/file.h
    if _is_relative_to(file_path, pathlib.PurePath("external")):
        return False

May 06 '23 07:05 xinzhengzhang

Sorry--still totally don't understand the point about the headers. Like I understand all the words but not what you think we should do here.

Reload: Sounds like sourcekit has better behavior here. Heads up that clangd doesn't reload until the file is modified or saved, sadly. Not a problem for you using sourcekit, but something we'll need to keep in mind for this project.

Re msvc absolute header paths: Totally agree that _file_is_in_main_workspace_and_not_external itself works fine. I was just concerned about the behavior documented there causing issues elsewhere. Is that the confusion? If so, totally my fault for writing the TODO ambiguously and poorly--feel free to move it or delete it.

May 06 '23 08:05 cpsauer

TODO: Let's detect out-of-bazel, absolute paths and run this if and only if we're looking for a system header. We need to think about how we want to handle absolute paths more generally, perhaps normalizing them to relative if possible, like with the windows absolute path issue, This is going to require thinking about what you might pass in from the outside. What about absolute paths to the canonical location of bazel-out or external (i.e. at the other end of the symlink)?

Further, once we've recognized system headers (i.e. absolute paths in none of those locations?) aren't those the only ones we want to run the (expensive) deps(target) query for?

For now (in my extension) all --file are passed in as relative paths and to be honest, this is indeed the most bazelify input excepted. Source file: srcs/a/b/c.m External file: external/a/b/c.h

For absolute path contains WORKSPACE root /Users/xx/project/srcs/a/b/c I think it is easy to handle and it is current implementation. But for some path like bazel-out/<CONFIGURATION>/bin/srcs/a/b/c.m /<OUTPUT_BASE>/execroot/_main/srcs/a/b/c.m I can indeed identify some features such as execroot/_main bezel-out to force the match, but I think this robustness is too poor and may be affected by the bazel change. Regarding this, your suggestion is that we directly prompt the error, or help them as much as possible to convert to the desired above two kind of paths?

inputs(headers) and allpaths() being too slow

How did this resolve? We definitely want this to be fast and fast enough for you. To answer the questions you'd asked: I thought I was seeing that headers never appeared under inputs, because they were hidden behind middlemen. Are you seeing otherwise? (It looks like you're seeing the same from you edits, but I want to make sure.) If that's right, then we need an alternative. If allpaths is too slow, could you see if the --universe_scope & rdeps(,) alternative I'd suggested there might work? Or think about other possibilities? I really need your help here, since you're the one with the enormous codebase to test on.

What I seeing

The reason I modified the comment is because I found that the reason why I can find the header file through inputs and it is not hidden by middleman is that I use objc_library dependency swift_library, and the implementation of swift is to provide the header in the https://bazel.build/rules/lib/providers/CcInfo.html#compilation_context so that I can query, but this is not a common practice

Benchmark

I have tested with two header file one is upper module in srcs/app and one is underlying module in srcs/base bazel_version: 6.1.1 cmd: time bazel aquery "mnemonic('(Objc|Cpp)Compile', allpaths(//bili-universal:bili-universal, let v = deps(//bili-universal:bili-universal) in attr(hdrs, '//srcs/base/xxx:include/xx.h', \$v) + attr(srcs, '//srcs/base/xxx:include/xx.h', \$v)))" --output=jsonproto --include_artifacts=false --ui_event_filters=-info --noshow_progress --features=-compiler_param_file --features=-layering_check > /dev/null

srcs/app/x.h

allpaths 1.14s user 2.17s system 8% cpu 39.409 total rdeps 1.17s user 2.13s system 8% cpu 37.865 total rdeps with --infer_universe_scope 1.14s user 2.16s system 8% cpu 37.424 total

srcs/base/x.h

allpaths 1.17s user 2.13s system 8% cpu 39.709 total rdeps 1.17s user 2.14s system 8% cpu 38.149 total rdeps with --infer_universe_scope 1.18s user 2.16s system 8% cpu 39.708 total

other possibilities

Why not just add a switcher for expensive search for allpaths and deps to user? If we can't find their dependency from attr, it is unlikely not used in current practice. When it appears, it also jumps to some useless header files that do not need to be parsed.

May 06 '23 08:05 xinzhengzhang

What that discussion item is about: As we search to find a source file that includes my_header.h, we get, for free, commands that apply to other files, like the other headers included by that source file. We need to decide which of those (file, command) pairs we should add to compile_commands.json.

On one hand, we could just add all the pairs we got for free. But that might kick off a lot of background indexing work, and my understanding is that you want this feature in order to minimize the indexing work required to browse files. Adding a few source files in the background is potentially quite useful for go-to-definition and the discovery of other symbols. But other headers seem fairly pointless and high cost: they don't introduce any symbols not already seen from processing a source file (which pulls in its includes) and they're likely to keep getting new (but equivalent) commands that trigger a lot of reindexing. Hence that tradeoff: Only add/update the command for my_header.h and no other headers--but still add/update commands for source files found along the way. What do you think of that tradeoff and why? Does that seem optimal or would you like something else? That was my guess, but you know your use case better than I do (like with allpaths!). You can also go with this for now, experiment around as you use it, and then adjust it if needed. Either way, we should eventually document what we decide and why, so please add to the comments until they'd have been sufficient for you to understand on a first read.

Only add/update the command for my_header.h and no other headers--but still add/update commands for source files found along the way. I prefer this, the header file should has as few indexable symbols as possible. The reason is that for example a library has 10 source files and 10 header files, and there should be no or as few dependencies between header files and dependencies should be implemented in source files.

May 06 '23 08:05 xinzhengzhang

From https://github.com/hedronvision/bazel-compile-commands-extractor/commit/f8b046e3e23e4fce4a65bac0eef8ac57cb0f5e4f TODO consider running a preliminary aquery to make this more specific, getting the targets that generate the given file. Use outputs() aquery function. Should also test the case where the file is generated/is coming in as a filegroup--that's likely to be fixed by this change.

Not obvious this makes things better? Appreciate the effort, but how about the aquery with inputs(), to get the targets that produce the file, suggested for exploration in the TODO? If I've misunderstood and this works well, or that doesn't work, please tell me.

See The reason I modified the comment is because I found that the reason why I can find the header file through inputs and it is not hidden by middleman is that I use objc_library dependency swift_library, and the implementation of swift is to provide the header in the https://bazel.build/rules/lib/providers/CcInfo.html#compilation_context so that I can query, but this is not a common practice

When I try to use aquery to get outputs which passed into next aquery process, there was hundred of actions was queried

# bazel aquery "(inputs('x.h', deus('<TARGET>'))" --output=summary
Mnemonics:
  Middleman: 2
  SwiftCompile: 118
  SwiftDumpAST: 118

Take SwiftCompile as an example, it dependency header to generate modulemap, but has nothing to do with this header in the compile parameter. Join all outputs from queried outputs not seems a good idea. But if I add a Middleman as Mnemonics qualifier, it can't handle logic like generated file. So I chose to use bazel query, both source files and generated files can be involved. The tradeoff is we can't handle the file_path like 'bazel-out/xx/generated_file.h' In fact, when I was writing, I also considered reverse condition

# Currently
file_candidates = list(filter(lambda label: file_path in label.replace(':', '/'), label_candidates))

# Maybe
# The generated file always keeps its label path as a subset
filter( lambda label: label.normalize() in file_path, label_candidates))

May 06 '23 10:05 xinzhengzhang

bazel-compile-commands-extractor bazel-compile-commands-extractor copied to clipboard

file based filter

srcs/app/x.h

srcs/base/x.h

bazel-compile-commands-extractor
bazel-compile-commands-extractor copied to clipboard