feat: lake: add `lake dump-config` command
This adds a mechanism to dump most of lake's workspace configuration state (everything except the functions, more or less) to standard out in JSON format. This is roughly equivalent to the lake-ext command from lean-cache, except that it doesn't save the results to disk and instead lets the querying process do what it wants to with the information. It also includes a version field, although I don't know how much to worry about versioning the output format here.
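As a rough illustration of the shape of this (not the PR's actual code; the types and field names below are simplified stand-ins for Lake's real configuration structures), the mechanism amounts to deriving ToJson for the configuration data and printing a versioned object:

```lean
import Lean

open Lean (ToJson toJson)

/-- Hypothetical, simplified stand-in for the per-package data such a dump
could contain; field names are illustrative, not Lake's actual schema. -/
structure PackageDump where
  name : String
  dir : String
  dependencies : Array String
  deriving ToJson

/-- Top-level dump carrying an explicit schema version, as mentioned above. -/
structure ConfigDump where
  version : Nat := 1
  packages : Array PackageDump
  deriving ToJson

/-- Print the dump as JSON to standard out. -/
def dumpConfig (cfg : ConfigDump) : IO Unit :=
  IO.println (toJson cfg).pretty
```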
- 🟡 Mathlib branch lean-pr-testing-2542 build this PR didn't complete normally. (2023-09-14 09:47:44) View Log
- 🟡 Mathlib branch lean-pr-testing-2542 build this PR didn't complete normally. (2023-09-14 11:18:41) View Log
@digama0 I am still not sold on the utility of this and it seems like a burden to maintain. Why is this necessary?
Doing this outside of lean core has proven to be very flaky, as the user needs to make sure to have the right version of lake-ext for the toolchain or else it will segfault. It's one of the main reasons lean-cache can't be distributed right now. And the maintenance should be quite simple: most of it is just ToJson derivers which should adapt to any changes to the core (currently this has to be done manually for every change to lake). This is currently causing trouble for @semorrison who was trying to use lean-cache in mathlib's nightly-testing: lean-cache needs to have its own nightly branch even though it is much more stable across lean nightlies than lean is.
Having machine readable outputs from a build system is very useful for interop with other build systems as well. This is one of the simplest changes I can think of that just gets this data out of lake so it can be read by the server or any other system that wants to work with lake without having to call lake for each query (and incur the startup cost) which is what lean currently has to do for print-paths.
The main thing that seems like it could be a maintenance issue is keeping the json output for Package consistent with any new fields. (1) this isn't exactly necessary, you can just leave the existing fields and things will still basically work, and (2) I think this is actually a feature request for @[json_skip] annotations on fields, which would make this very easy to keep up to date.
@digama0
as the user needs to make sure to have the right version of lake-ext for the toolchain or else it will segfault. It's one of the main reasons lean-cache can't be distributed right now.
It seems more reasonable to me for Lake to call lean-cache with the set of files to fetch than the other way round (i.e., have lean-cache extract data from Lake via lean-ext or this lake dump-config). We also already want to collect the information on what lake would build anyway (see #2528).
I anticipate your objection to this is that your goal is to avoid the Lake startup cost. Since all cache fetches will inevitably be followed by a lake build or a lake print-paths in the server, this startup cost will need to be paid anyway. Thus, this does not seem persuasive.
Having machine readable outputs from a build system is very useful for interop with other build systems as well
I do not yet see an example of a notable interop tool that needs this data (i.e., could not be done the other way round). If you can find one, I would then be motivated to make a change like this.
which is what lean currently has to do for print-paths
print-paths also builds files, so it needs the full lake startup anyway
The main thing that seems like it could be a maintenance issue is keeping the json output for Package consistent with any new fields. (1) this isn't exactly necessary, you can just leave the existing fields and things will still basically work
Yes, this is the maintenance burden I mean. It is not acceptable to expose features that the core team does not intend to maintain. Others will depend on them and be confused if they do not work as expected. So even if it will basically work for your use case without maintenance, that is insufficient.
It seems more reasonable to me for Lake to call lean-cache with the set of files to fetch than the other way round (i.e., have lean-cache extract data from Lake via lean-ext or this lake dump-config). We also already want to collect the information on what lake would build anyway (see #2528).
I want access to this information so I can do arbitrary other things with it, not just lean-cache. That's just today's application. There are plenty of other tools that can make use of this information, and having to write a protocol that changes with every version is very difficult to work with. Lean is supposed to be a hackable platform, so keeping this data under wraps seems like the wrong direction.
I anticipate your objection to this is that your goal is to avoid the Lake startup cost. Since all cache fetches will inevitably be followed by a lake build or a lake print-paths in the server, this startup cost will need to be paid anyway. Thus, this does not seem persuasive.
lake env can also be sped up using cached workspace data. I'm seriously considering just forking lake and/or writing a lake2 with this kind of mentality. I just don't know what to say when the slightest bit of interop is too much for you.
I do not yet see an example of a notable interop tool that needs this data (i.e., could not be done the other way round). If you can find one, I would then be motivated to make a change like this.
Sure, I'll just reimplement lean. And also ship it every night so as to be ABI compatible with the lean on people's computers. /s You do understand how difficult it is to keep up with lean externally, don't you?
I have a tool waiting in the wings with several advantages over the existing system and a significant measurable performance improvement, but that's not enough for you. How can I even justify building an interop tool that needs this data if I know it's going to be blocked like this?
The main thing that seems like it could be a maintenance issue is keeping the json output for Package consistent with any new fields. (1) this isn't exactly necessary, you can just leave the existing fields and things will still basically work
Yes, this is the maintenance burden I mean. It is not acceptable to expose features that the core team does not intend to maintain. Others will depend on them and be confused if they do not work as expected. So even if it will basically work for your use case without maintenance, that is insufficient.
It's not that it's not being maintained, it is a valid behavior to keep the old format so as to avoid changing the schema. That is, you could either choose to keep the output stable, or put every new field in the output. Both approaches are relatively simple and I don't know what I can do or say to make it any easier for you.
@digama0
I want access to this information so I can do arbitrary other things with it, not just lean-cache
You can get this information already, though? A Lake script or target can run another program with this information. Or a Lean program can import Lake to get the data and export it. All this change does is add ToJson instances to a bunch of stuff in Lake along with a command to export a subset of the configuration. Why does that need to be maintained in core?
Despite that, I am happy to add the deriving ToJson to the data structures it works for by default (if that is helpful). My objection is to the Script, Package, and WorkspaceExport subset and the associated dump-config. Having something this implementation-detail focused as a public API would be costly to maintain.
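For concreteness, the kind of Lake script mentioned above might look roughly like the sketch below. The getWorkspace accessor and the packages/name/dir fields are assumptions about the script-side API and may not match the exact spelling in any given Lake version:

```lean
import Lake
open Lake DSL

package demo

/-- Hypothetical script that hands workspace information to another tool
(here it just prints it). `getWorkspace`, `ws.packages`, `pkg.name`, and
`pkg.dir` are assumed accessors; the real Lake API may differ. -/
script dumpDeps do
  let ws ← getWorkspace
  for pkg in ws.packages do
    IO.println s!"{pkg.name}: {pkg.dir}"
  return 0
```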
and having to write a protocol that changes with every version is very difficult to work with.
And this is why upstreaming the protocol would be a costly maintenance burden. I am also not sure how this would break any less. If fields are moved, renamed, reorganized, changed etc. any downstream package would break anyway.
Lean is supposed to be a hackable platform, so keeping this data under wraps seems like the wrong direction.
Lean is hackable in the sense that downstream projects can extend and inspect it in various ways (which can be done with Lake very easily). Its core is only extensible if there are internal people to maintain it, and that requires a persuasive reason to do so. This is why there is an RFC process.
I have a tool waiting in the wings with several advantages over the existing system and a significant measurable performance improvement,
I presume you mean lean-cache. Why is having a Lake script/target/etc. that invokes lean-cache not a sufficient solution?
That's just today's application. There are plenty of other tools that can make use of this information.
Do you have other concrete examples? Remember external PRs for new features are supposed to go through the RFC process, which includes "collecting examples and concerns".
It's not that it's not being maintained, it is a valid behavior to keep the old format so as to avoid changing the schema.
How would an arbitrary old schema be useful a year from now if the whole configuration structure has radically changed? In order for it to be useful it has to somehow correspond to the current configuration structure of Lake and thus needs to be maintained.
@digama0
I want access to this information so I can do arbitrary other things with it, not just lean-cache
You can get this information already, though? A Lake script or target can run another program with this information. Or a Lean program can import Lake to get the data and export it. All this change does is add ToJson instances to a bunch of stuff in Lake along with a command to export a subset of the configuration. Why does that need to be maintained in core?
Because it needs to be rebuilt with every version of lean, on pain of segfaults. What you are describing is what I'm already doing, and it's difficult to provide a stable interface to users if the binary interface keeps changing. This exposes a JSON interface instead, which can change much more slowly and in an incremental way.
Despite that, I am happy to add the deriving ToJson to the data structures it works for by default (if that is helpful). My objection is to the Script, Package, and WorkspaceExport subset and the associated dump-config. Having something this implementation-detail focused as a public API would be costly to maintain.
What about it is implementation-detail? It's just a subset of the fields, and the fields being exposed are all quite natural (the list of dependencies of a package, the names of scripts, etc). Are these things you expect to change in the near future? Even if so, the protocol was versioned for exactly this reason.
Lean is supposed to be a hackable platform, so keeping this data under wraps seems like the wrong direction.
Lean is hackable in the sense that downstream projects can extend and inspect it in various ways (which can be done with Lake very easily). Its core is only extensible if there are internal people to maintain it, and that requires a persuasive reason to do so. This is why there is an RFC process.
Tell me how I can get this information out of lean without having to rebuild my program every day, please. Currently no such method exists, which is why I'm proposing an MVP export that allows for cross process communication with lake.
I have a tool waiting in the wings with several advantages over the existing system and a significant measurable performance improvement,
I presume you mean lean-cache. Why is having a Lake script/target/etc. that invokes lean-cache not a sufficient solution?
Because that would involve modifying the user's lakefile. This is a tool that works on lakefiles and is intended to work for any arbitrary lake project.
That's just today's application. There are plenty of other tools that can make use of this information.
Do you have other concrete examples? Remember external PRs for new features are supposed to go through the RFC process, which includes "collecting examples and concerns".
Example projects that benefit from project config information:
- import graph visualization
- lake env
- #2456
- lake itself or a successor
- integration with external build systems (cargo, cmake)
The issue here is that lake has a significant startup cost. It's not huge, but it is large enough to make any kind of repeated query prohibitively expensive. This export is an MVP to allow tools to be able to avoid lots of back and forth with lake.
and having to write a protocol that changes with every version is very difficult to work with.
And this is why upstreaming the protocol would be a costly maintenance burden. I am also not sure how this would break any less. If fields are moved, renamed, reorganized, changed etc. any downstream package would break anyway.
No, many fields don't matter for the export, and additions don't break downstream packages. Since I have started maintaining this schema it has only seen additions, no removals or moves, but it has had lots of ABI breaking changes requiring a rebuild. JSON is well suited to schemas that have lots of additions and not many removals. If you want to move something, well okay, there's a version field for that. If you want me to maintain this I will. "Costly" is very much overstating the issue here, it's literally just an extraction function right next to the definition. I can write it in a way such that it will cause a compile error if any fields are changed, if you are worried about forgetting to update it.
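The trick alluded to here can be shown on a toy structure (a made-up stand-in, not a Lake type): matching on the anonymous constructor forces the export function to mention every field, so adding or removing a field becomes a compile error instead of silent schema drift.

```lean
import Lean
open Lean (Json toJson)

/-- Made-up stand-in for a configuration type. -/
structure LibConfig where
  name : String
  roots : Array String
  extraArgs : Array String

/-- Export via the anonymous-constructor pattern: if a field is added or
removed, the pattern's arity no longer matches and this definition fails
to compile, so the JSON export cannot fall out of sync unnoticed. -/
def LibConfig.toDumpJson : LibConfig → Json
  | ⟨name, roots, extraArgs⟩ =>
    Json.mkObj [
      ("name", toJson name),
      ("roots", toJson roots),
      ("extraArgs", toJson extraArgs)
    ]
```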
It's not that it's not being maintained, it is a valid behavior to keep the old format so as to avoid changing the schema.
How would an arbitrary old schema be useful a year from now if the whole configuration structure has radically changed? In order for it to be useful it has to somehow correspond to the current configuration structure of Lake and thus needs to be maintained.
The configuration structure can't change significantly without lakefile structure changing too. It currently is just a description of the core concepts of lake itself: projects, libs, exes, other targets. If that structure changes, obviously the config needs to change too, but this is still basically the same as it has been for the past year.
@digama0
My overarching concern here is that this seems like an XY problem. It works around issues that seem like they would be best fixed elsewhere.
Tell me how I can get this information out of lean without having to rebuild my program every day, please.
Why is rebuilding the program every day a problem? Build systems and CI exist to make this as painless as possible. It should not be hard for an end-user to fetch updates and/or rebuild Lean-related tooling. Tooling is expected to be toolchain-dependent and thus require rebuilds for each new supported toolchain. One main goal of Lake is to help facilitate this. If there is a problem here, that seems like the place a fix should occur.
Because that would involve modifying the user's lakefile. This is a tool that works on lakefiles and is intended to work for any arbitrary lake project.
Why can an arbitrary lake project not just add a require leanCache ... and then run cache through Lake's frontend? The end goal for cache is to be integrated with Lake as part of a Lake build, so this seems like the right direction.
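In other words, the suggestion is a lakefile along these lines (the dependency name, URL, and revision below are placeholders), after which the cache executable would be run through Lake's frontend, roughly the way mathlib's lake exe cache get works today:

```lean
import Lake
open Lake DSL

package myProject

-- Hypothetical dependency; the name, URL, and revision are placeholders.
require leanCache from git
  "https://github.com/example/lean-cache" @ "main"
```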
Example projects that benefit from project config information: [..] [Lake startup cost]'s not huge, but it is large enough to make any kind of repeated query prohibitively expensive. This export is an MVP to allow tools to be able to avoid lots of back and forth with lake.
Of the examples you gave, lake env seems to be the only one that could/should generally be repeatedly queried. An import graph visualization would either be a one-time generation or a UI where a 1s startup cost is not a major overhead (things like Office and Adobe products and often VSCode have much longer startups). For the server, we want to load the build functions as well and we want to do that once, so that is best solved via tighter integration between the LSP server and Lake. External build systems presumably also need to run the build functions so the static configuration is insufficient. Finally, lake env seems like it would be best fixed internally in Lake (by just caching the computed environment variables somewhere specific it can find them --- even potentially as a header of something like the lakefile.olean) and not be part of a public dump.
This is what I mean by this being an XY problem. This, as far as I can tell, does not appear to be the principled solution to any of the things it purports to solve at the moment. Furthermore, its absence does not make workarounds impossible, just not first-class (which is to be expected, as they are not the principled solution). However, it seems clear that you see otherwise. The point of the RFC process is to help elucidate the reasons behind a change in detail and discuss it with the wider community so as to avoid this kind of back-and-forth as we try to see things from each other's perspective with little insight into each other's designs.
If that structure changes, obviously the config needs to change too, but this is still basically the same as it has been for the past year.
Note that Lake has not changed much at all this past year, partly because I had responsibilities elsewhere. That will not be the case in the future.
What about it is implementation-detail? It's just a subset of the fields, and the fields being exposed are all quite natural (the list of dependencies of a package, the names of scripts, etc). Are these things you expect to change in the near future?
All of it; the dep/lib/exe/pkg configurations are all implementation detail. Only the user-facing lakefile interaction with the options documented in the README is public. Configuration types and the way Lake stores and aggregates them is internal and can change at a moment's notice (and did change just recently with leanLibs/leanExes now being ordered). It is even possible for a pre-processing step to be added between the external config users see and the internal config Lake sees (e.g., an InternLeanLibConfig or ExternLeanLibConfig). A number of upcoming TODOs could very easily require such changes.
Even if so, the protocol was versioned for exactly this reason.
And this creates maintenance burden, causes trouble for users when things break, and creates an expectation of continued compatibility. There has to be a strong reason to add such unforeseen burdens. I really suggest you create an RFC about this and discuss it with the community and then come back with a strong consensus. That is the best way to ensure a change like this gets made.
Why is rebuilding the program every day a problem? Build systems and CI exist to make this as painless as possible. It should not be hard for an end-user to fetch updates and/or rebuild Lean-related tooling. Tooling is expected to be toolchain-dependent and thus require rebuilds for each new supported toolchain. One main goal of Lake is to help facilitate this. If there is a problem here, that seems like the place a fix should occur.
Currently there is no way to distribute material to appear bundled with lean other than contributing to this repository. This is the most reliable way to ensure that things never go out of sync. Even if it was to be distributed separately, elan does not support distributing anything other than lean+lake right now, and elan is the only thing I can reliably assume exists on users' machines. Hence the best way to solve the distribution problem is to get this functionality integrated into lake.
Maybe someday we can have something like cargo install where you can download, compile, and install a third party tool and maybe keep it up to date somehow. But that is not today; right now we just have elan and a massive bundle handed to users as the only thing that is toolchain-tracked. Trying to track all the versions of lean in flight is practically infeasible, especially now that we are building toolchains for every PR. It's similar to trying to maintain a tightly-integrated mathlib fork; it's way too much for anyone externally to be able to handle. And for a program which is literally just a tiny JSON export? I will go mad.
If this turns out to be a bad idea and we want to drop it later, okay, it's not irreversible. I will stridently argue against removing it if I'm still using it, but if you are right and there are better ways to accomplish my goals then so be it. But I would like to ship lean-cache this year, and for the present this is an integral part of its operation.
Because that would involve modifying the user's lakefile. This is a tool that works on lakefiles and is intended to work for any arbitrary lake project.
Why can an arbitrary lake project not just add a require leanCache ... and then run cache through Lake's frontend? The end goal for cache is to be integrated with Lake as part of a Lake build, so this seems like the right direction.
No, the end goal is for this to be a tool that sits where lake sits, mediating the download and compile process. It is not itself part of the build pipeline, any more than you would write require lake .... This would lead to circularities because it is caching the build pipeline. Ultimately, it should be a part of lake, although I want to start by setting a bar for performance because I have a rather low opinion of what can be achieved within lake itself and am exploring alternatives.
An import graph visualization would either be a one-time generation or a UI where a 1s startup cost is not a major overhead (things like Office and Adobe products and often VSCode have much longer startups).
Suppose, for the sake of discussion, that this program was not written in lean. I realize this might be hard to believe but there are other languages in the world, and some of them are quite good at graphing. Now this program wants to know the project structure, so it asks lake, only it can't because lake isn't telling. So now you have to write the program twice and make an adhoc communication protocol to get the data across. Not good. (Oh right, and one side of that has to be rebuilt every day, so you can't just hand an executable to users.)
For the server, we want to load the build functions as well and we want to do that once, so that is best solved via tighter integration between the LSP server and Lake.
Right, the only way you can actually talk to lake effectively is to be in-process and that's a problem, especially because lean is multi-process by design.
External build systems presumably also need to run the build functions so the static configuration is insufficient.
Yes, they would be calling lake when they need to run the build functions. But the point of a build system is that it can determine that in many cases things don't have to be run because the inputs haven't changed, so in many cases the configuration alone (and the dependency graph from the last run) is sufficient.
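As a toy illustration of this point (nothing here is Lake-specific; the record type and hash scheme are made up), a tool that holds the dependency graph and the input hashes recorded on the last run can conclude a target is up to date without invoking any build functions:

```lean
/-- Made-up trace record: a target, its inputs, and the combined hash of
those inputs the last time the target was built. -/
structure TraceEntry where
  target : String
  inputs : List String
  inputHash : UInt64

/-- A target is up to date when re-hashing its current inputs reproduces
the recorded hash; only when this fails does the build function need to run. -/
def upToDate (hashOf : String → UInt64) (e : TraceEntry) : Bool :=
  e.inputs.foldl (fun h i => mixHash h (hashOf i)) 7 == e.inputHash
```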
Finally, lake env seems like it would be best fixed internally in Lake (by just caching the computed environment variables somewhere specific it can find them --- even potentially as a header of something like the lakefile.olean) and not be part of a public dump. This is what I mean by this being an XY problem. This, as far as I can tell, does not appear to be the principled solution to any of the things it purports to solve at the moment. Furthermore, its absence does not make workarounds impossible, just not first-class (which is to be expected, as they are not the principled solution).
I don't think you have thought through all of the things that would be required to create a competitor to lake. I have, because the performance issues drive me up the wall, but they are pretty deep issues.
1. Everyone is using lakefiles. Lakefiles use import Lake so you need the lake data structures and to some extent the lake data model.
2. Parsing lakefiles needs lean. Lean needs an expensive initialization, and every initialization has to happen in a separate process so you can't share work here either.
3. Lean changes every day. If you compile a program with one lean it won't work with any other, or at least trying it is playing Russian roulette. Your users are using a wide variety of lean installs, usually multiple on a given day.
4. Determining what version of the software is in use requires elan. Unless you are one of the tools delivered by elan, you need to have a separate distribution system, and elan won't deliver that either so you kind of just have to make your own way.
This all adds up to a pretty strong vendor lock-in effect for affected repositories. There is no way a third party tool can exist in this kind of environment unless it is literally a fork of lake, and even then it's stuck with (3): lake changes every day too, and god knows what the lakefiles are depending on because lake has no clearly demarcated stable API short of the whole program.
The easiest way to solve problem (3) is to get code bundled with the toolchain itself, because then whenever the user switches toolchains your code switches too. But then you have to deal with this soul-crushing PR process, or write an RFC, which I will remind you has not yet led to positive results in any case I have seen so far. RFCs are just another way for the core team to reject PRs from my perspective.
However, it seems clear that you see otherwise. The point of the RFC process is to help elucidate the reasons behind a change in detail and discuss it with the wider community so as to avoid this kind of back-and-forth as we try to see things from each other's perspective with little insight into each other's designs.
I want license to be able to use lake's data structures outside lean. I know, and have evidence to back up, that this significantly improves performance. The more that is done inside lean, the more I have to deal with these politics.
What about it is implementation-detail? It's just a subset of the fields, and the fields being exposed are all quite natural (the list of dependencies of a package, the names of scripts, etc). Are these things you expect to change in the near future?
All of it; the dep/lib/exe/pkg configurations are all implementation detail. Only the user-facing lakefile interaction with the options documented in the README is public.
The configuration is literally part of the lean_lib macro and is documented in the README, I don't understand how you can say it's not public. There is a list of requires and those have sources and there are package options and packages also recursively have more data. Is that something we can agree on as stable surface? You can access all of this directly using the DSL. I just want that info.
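For reference, all of this is information a lakefile author already writes down directly in the DSL; a minimal example (the names and URL below are hypothetical):

```lean
import Lake
open Lake DSL

package myPkg

-- A dependency with a source: the "requires have sources" data mentioned above.
require someDep from git
  "https://github.com/example/someDep" @ "main"

-- A library with an explicit `roots` field.
lean_lib MyLib where
  roots := #[`MyLib]
```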
Configuration types and the way Lake stores and aggregates them is internal and can change at a moment's notice (and did change just recently with leanLibs/leanExes now being ordered).
I'm not sure what you mean by configuration types. As for leanlib ordering, you were the one who told me you wanted to make sure it was ordered. I don't care what order they are in. As far as whether this is public API, well it's clearly accessible through the DSL...
It is even possible for a pre-processing step to be added between the external config users see and the internal config Lake sees (e.g., an InternLeanLibConfig or ExternLeanLibConfig).
This doesn't mean that the JSON schema needs to change. If some things become computed properties or fields that doesn't really matter. This is not supposed to be exposing lake's internals, it is exposing lake's public data model. It's just that because of the way lake is designed it's almost impossible to tell which is which, so you will have to tell me. Maybe I could write down the json schema and you can delete the things that you don't think should be there?
And this creates maintenance burden, causes trouble for users when things break, and creates an expectation of continued compatibility. There has to be a strong reason to add such unforeseen burdens. I really suggest you create an RFC about this and discuss it with the community and then come back with a strong consensus. That is the best way to ensure a change like this gets made.
Okay, so don't version it then. I couldn't care less. I'll have to adapt to it either way, but it's easier to handle multiple versions simultaneously if they are explicitly versioned.
RFCs on the lean repo don't work. I can make a zulip discussion, but creating a new issue will just be transplanting this conversation and will not convince anyone. We've been having this discussion for months and you already know my motivations. This is a backend detail anyway, the main user-facing upshot will be "cache gets 8x faster" which TBH isn't too hard to sell. There are no other users of the JSON interface that would be able to engage with the details of the proposal here.
No, the end goal is for this to be a tool that sits where lake sits, mediating the download and compile process. It is not itself part of the build pipeline, any more than you would write require lake ....
I do not understand why this is a requirement. Why can lean-cache not be used in the same way I used cache in mathlib4#6603? That would still provide the wonderful 8x speed-up of lean-cache. 🚀 Since this discussion has been a bit of a downer, I do want to emphasize that I do think lean-cache is a great accomplishment! 🎉 I just am really confused as to why this PR should be the way to integrate it.
and some of them are quite good at graphing. Now this program wants to know the project structure, so it asks lake, only it can't because lake isn't telling.
Would providing lake graph be sufficient then? That may get you much (maybe all?) of the information you want without relying on the specifics of Lake's data model. Would that be good? Essentially, my concern in the design of an export format is that I want the format to provide some generic and generally useful schema that does not rely on the internals of Lake's data model (i.e., if you compared the schema with some from another Lean build system or a build system for another language it would be approximately similar).
The configuration is literally part of the lean_lib macro and is documented in the README, I don't understand how you can say it's not public. [I]t is exposing lake's public data model. It's just that because of the way lake is designed it's almost impossible to tell which is which, so you will have to tell me
I think the confusion here is between Lake's data model and its DSL. The DSL is public. That is, the syntax with which you specify things is the public API. The data model this elaborates to (i.e., the configuration data model) is implementation detail. The other side of the data model that is semi-public is that used in scripts/targets (i.e., Package, LeanLib, LeanExe, etc.). However, that API is mediated through the utility definitions Lake provides (so the exact structure of those types is still implementation detail). Does that make sense? I could mark a lot of things as private in Lake to help clarify this, but previous discussions indicated that private is generally just annoying. Would you like me to specify this somewhere, either in the README or in the module documentation (or both)? If so, how do you think that should look (there is much similar in the Lean core or mathlib to go by for a style on this)?
I have, because the performance issues drive me up the wall.
I really do not understand this. Specifically, why your threshold seems to be ~0s. Most builds / tests for any project (in any language) take on the order of many minutes to complete (which is what Lake is primarily used for). I am not sure why a few-second performance overhead is that concerning. I certainly would like to improve the performance where possible as time goes on, but I do not understand why the current performance is such a deal-breaker / high priority.
But then you have to deal with this soul-crushing PR process, or write an RFC, which I will remind you has not yet led to positive results in any case I have seen so far. RFCs are just another way for the core team to reject PRs from my perspective.
I am very sorry this is how it has felt for you. You are a very valuable member of the community and contributor to Lean and you have a lot of good ideas. I think one problem is that you often have too many ideas, and we do not have the time or the resources to maintain every one, and there are also substantial questions about what we should prioritize and what we should invest time in. I remember you mentioning how you are often pulled away from working on one idea you had by 5 others you have had in the meantime. I sympathize with this (as I too have many ideas for Lean), but at the FRO, we have to be laser-focused on one at the expense of others (and that includes many of my own which are not priorities of the FRO).
Personally, I still really want to incorporate as many users' ideas and requests into Lake as possible. However, with this PR and #2455, my reason for disputing these changes is that we already have plans for how we would like to address Lake+LSP and integrating cache/lean-cache, and these PRs are not steps in those directions (as we have discussed previously). Thus, we would prefer not to pursue a line of changes and spend time maintaining them when we plan to go down a different road later (and thus cause trouble and inconsistency down the road when we have to shift users to the different approach). If our current plans are insufficient and need changing, that is something worth discussing, but that needs a very convincing argument to be made.
Also, in general, my impression of many of your suggestions for Lake is that they seem to be going down a different route than the one Lake is currently headed down. And your comments here seem to suggest you view them similarly (as you have talked about considering alternatives). The fact that you have a different vision is of course fine, but I think it should thus be obvious why such suggestions of a different vision do not often get traction.
There are no other users of the JSON interface that would be able to engage with the details of the proposal here.
This is part of the concern. An interface for exactly one product, even if it is a really good one like lean-cache 😀, is troublesome. However, we do want to figure out a solution to integrate lean-cache. My concern is still whether this is the right one (especially since the hope is to eventually integrate it via Lake, which would do away with all this). Regardless, I will discuss with the rest of the team next week and see what they think. They may think we should prioritize getting lean-cache running now even with a stop-gap solution.
@digama0
One other little minor note I forgot. lake env already has a dump format! If you run a bare lake env with no args, it will dump the environment variables it sets as VAR=val lines. You can then save this output to a file and then export them to save time if necessary (or just run lake env bash). For example, this would export all lake env variables to the current shell:
while IFS== read -r key value; do printf -v "$key" %s "$value" && export "$key"; done < <(lake env)
There are no other users of the JSON interface that would be able to engage with the details of the proposal here.
This is part of the concern. An interface for exactly one product, even if it is a really good one like lean-cache 😀, is troublesome. However, we do want to figure out a solution to integrate lean-cache. My concern is still whether this is the right one (especially since the hope is to eventually integrate it via Lake, which would do away with all this).
See, this is our point of disagreement. You think this is about lean-cache, but for me this is only the impetus and the main purpose is to have access to lake config from anywhere. If lean-cache was to be integrated in lake I would still want this data dump. One of the main things that makes lean4 so great and different from its predecessors is putting the tools in the hands of the users to do whatever they want. I don't want you to build lake graph as a solution to this problem, because then it isn't addressing this extensibility angle, it's just one more thing that lake can do rather than a general mechanism allowing anyone to create anything. A JSON export is just the beginning, but it allows tools written by unaffiliated parties to interact usefully with lean.
I suspect that using this data dump you can just implement lake graph in user code. And that is a good thing. (It might still be useful to have it in lake as a convenience, but it is no longer a necessary task for the core team. This kind of effect is why this is actually a high-impact feature for relatively low cost.)
No, the end goal is for this to be a tool that sits where lake sits, mediating the download and compile process. It is not itself part of the build pipeline, any more than you would write require lake ....
I do not understand why this is a requirement. Why can lean-cache not be used in the same way I used cache in mathlib4#6603?
Let's say that we were to set up lean-cache the way you suggest, where lake calls lean-cache instead of vice-versa. To do so, it would need to pass... A JSON data dump of the workspace config. Should I write the code for that into a lake script? That's exactly the code I'm trying to remove, I want it to be possible to use the cache with just a few lines of code (ideally zero but let's admit nonzero for your way), and I want it to not be mathlib-specific. The most obvious place to store the marshaling code for that config would be lake itself. Sound familiar?
Integrating lean-cache into lake itself is a separate project that I will look into once lean-cache is in the hands of users. That will involve many more PRs to core. The reason I initially abandoned that approach for lean-cache was because I didn't see how to hit those performance goals without radically re-architecting lake, which is difficult to do because of the unclear specification, as we've discussed previously.
Essentially, my concern in the design of an export format is that I want the format to provide some generic and generally useful schema that does not rely on the internals of Lake's data model (i.e., if you compared the schema with some from another Lean build system or a build system for another language it would be approximately similar).
Great, we agree. So help me do that. Tell me where you think that the export is overly specific. Surely we can come to some agreement about what should be in it which includes the basic necessities for knowing where things are.
I think the confusion here is between Lake's data model and its DSL. The DSL is public. That is, the syntax with which you specify things is the public API. The data model this elaborates to (i.e., the configuration data model) is implementation detail. The other side of the data model that is semi-public is that used in scripts/targets (i.e., Package, LeanLib, LeanExe, etc.). However, that API is mediated through the utility definitions Lake provides (so the exact structure of those types is still implementation detail). Does that make sense?
I understand this, but as a matter of necessity when writing this code the dump has to come from the elaborated data because that's where the data is. That most of the fields are just being read directly from the data structures is a coincidence (a bit more than a coincidence because there is a high correlation between the public view of the data model and the elaborated data structure).
Almost everything covered here is just the project config data, plus dependency information. If the DSL says that lean_lib has a roots field, then the roots field should be something we can publicly expose as part of this data dump. The list of scripts, dependencies, and targets are given as separate declarations but they are still all public concepts written directly in the DSL, and I don't think that "a package has a list of targets" is private information, although the exact data structure used to store the list of targets may be internal/unstable.
I could mark a lot of things as private in Lake to help clarify this, but previous discussions indicated that private is generally just annoying. Would you like me to specify this somewhere, either in the README or in the module documentation (or both)? If so, how do you think that should look (there is much similar in the Lean core or mathlib to go by for a style on this)?
Actually I think private in lake definitions would be quite helpful, or alternatively leading-underscore names and/or clear documentation comments saying that specific definitions are for internal use only. I don't think the general proscription against using private in Init applies here; that issue is about proving properties of functions, which isn't relevant for programs just intended for execution. For lake it is much more important that you delineate the public parts, because it functions as a library for scripts, so it is important to control what they can do to prevent breakage and ossification.
I have, because the performance issues drive me up the wall.
I really do not understand this. Specifically, why your threshold seems to be ~0s. Most builds / tests for any project (in any language) take on the order of many minutes to complete (which is what Lake is primarily used for).
I assure you that drives me up the wall too. I have worked with projects where this kind of thing can be done 2 orders of magnitude faster, and this changes so much about how you work. Compiling lean takes long enough that I context switch every single time, and sometimes forget why it was even running by the time it completes. Compiling mathlib takes so long we essentially can't do it locally at all anymore. Compiling std is on the long side, and it is not nearly as long as other projects I have worked with of comparable compile time. Of course it will only get worse over time without active efforts to counteract the growth of libraries with performance improvements.
I am not sure why a few-second performance overhead is that concerning. I certainly would like to improve the performance where possible as time goes on, but I do not understand why the current performance is such a deal-breaker / high priority.
A tool that takes 1.8s to start up is not something I am comfortable putting on the hot path to opening a new file. You will say "yeah but lake has to do that anyway", and I will say "I want to fix that too". Most people will probably tolerate it but if you don't have people like me trying to drive these numbers down to zero I don't see why they would ever go down at all. It shows up in popular accounts as a general sluggishness, and people will fondly recall using other software where these issues didn't exist. I had to play the lean apologist just last week at AITP explaining why lean takes so long to respond to keypresses.
@digama0
One other little minor note I forgot. lake env already has a dump format! If you run a bare lake env with no args, it will dump the environment variables it sets as VAR=val lines. You can then save this output to a file and then export them to save time if necessary (or just run lake env bash). For example, this would export all lake env variables to the current shell:
while IFS== read -r key value; do printf -v "$key" %s "$value" && export "$key"; done < <(lake env)
This is great for speeding up lake env, but unfortunately not much else. In particular it lacks information about the existing libs and dependency structure, which is required for one of the more basic questions lake exe cache asks: Are we in mathlib, or in a dependent project? (Of course there are ways to approximate this question by making educated guesses about where lake-packages is, but you can fool this by setting some config args).
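With a config dump in hand, that question becomes a small JSON lookup rather than a guess about where lake-packages lives. A sketch, assuming the dump exposes the root package's name and its dependency names under hypothetical root and dependencies keys:

```lean
import Lean
open Lean (Json)

/-- Decide from a parsed dump whether we are building mathlib itself or a
project that merely depends on it. The `root` and `dependencies` keys are
hypothetical; a real dump-config schema may name them differently. -/
def mathlibRole (dump : Json) : Except String String := do
  let root ← dump.getObjValAs? String "root"
  if root == "mathlib" then
    return "root"
  let deps ← dump.getObjValAs? (Array String) "dependencies"
  return (if deps.contains "mathlib" then "dependent" else "unrelated")
```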
Hello Mario (@digama0) and Mac (@tydeu),
Firstly, I want to thank all of you for the passionate work and ideas you bring to the Lean project.
Mario, I still recall when we met many years ago when you were a student working on MetaMath. Your contributions to Lean since that time have been invaluable, and I'm so glad I decided to give a talk at that small conference where we met. It saddens me to read that you're considering forking Lake. Such an action would create a rift in our community and could seriously impede the adoption of Lean. Over the years, I've tried to find projects that you would be excited about and that would add immense value to Lean. I remember spending an afternoon discussing the cbv tactic with you last year. I would love to see that idea come to fruition, as it could be a game-changer for the project. You're a brilliant contributor, and I believe that if you could focus your energies on key areas that are aligned with our roadmap, your impact would be immense.
However, I can see that emotions are running high and frustrations are surfacing. It's completely natural to feel this way, especially when dealing with complex software and limited resources. But let's remember the importance of constructive dialogue and mutual respect in making this community a welcoming place for everyone.
To Mario's concerns: I hear your frustration with the PR process and the limitations it poses. These are systemic issues that need to be addressed, not just for you but for every contributor who feels similarly stymied. Open-source development can be slow and bureaucratic, but it's often a necessary trade-off to maintain code quality and project direction. However, your point about a more agile approach is well taken.
To Mac's point: I appreciate your articulation of the resource constraints and the need for focus within the FRO. Deciding on project priorities is a complex task, and it's true that not every good idea can be pursued in the time and manner we'd all prefer.
I want to propose a more structured way forward that will address many of the concerns raised:
- Monthly Community Meetings: We will be restarting our monthly meetings with the community starting next month. These meetings will serve as a platform to discuss and align on the overall direction of the Lean project. I believe this will help avoid redundant work and create a space where we can collectively decide on priorities.
- RFC Process: During the first of these meetings, we will discuss how to potentially improve the RFC process to make it more responsive and effective. I am aware that the process has room for improvement, and your feedback will be invaluable here. I believe the missing piece is alignment with the Lean FRO roadmap and vision.
- Project Roadmap: I share Mac's sentiment about the importance of a laser-focused approach for software development. Lean has secured funding for the next 5 years, and there are clear milestones that Sebastian and I aim to achieve during this period. This focus is crucial for securing long-term funding for the project. We will outline this vision clearly in our first monthly meeting to ensure everyone is on the same page.
- Scaling Lake and Mathlib: In line with our long-term goals, we will also cover plans for Lake and discuss what is needed to scale Mathlib to 10M and even 100M lines of code. This is an ambitious but necessary goal, and we welcome all constructive input on how to achieve it.
It saddens me to read that you're considering forking Lake. Such an action would create a rift in our community and could seriously impede the adoption of Lean.
I think this is the wrong perspective. I think we should embrace a diversity of implementation strategies, and consider thinking beyond the local maximum we happen to find ourselves in. It is possible that lake may not be the best package manager long-term for lean, and we should permit investigation of alternative strategies. Honestly I have been a little unsure on how to even broach this conversation, because we never really did a major design discussion on lake / package management in general with the community and right now the perspective is significantly colored by the way lake has chosen to do things.
An alternative package manager won't cause a rift in the community until/unless it is deployed widely, and hopefully by the time any alternative gets to that stage we can have the conversation and come together on what fits best for the needs of the community, rather than just shutting down the conversation and forcing a decision through the current distribution system.
However, I can see that emotions are running high and frustrations are surfacing. It's completely natural to feel this way, especially when dealing with complex software and limited resources. But let's remember the importance of constructive dialogue and mutual respect in making this community a welcoming place for everyone.
To Mario's concerns: I hear your frustration with the PR process and the limitations it poses. These are systemic issues that need to be addressed, not just for you but for every contributor who feels similarly stymied. Open-source development can be slow and bureaucratic, but it's often a necessary trade-off to maintain code quality and project direction. However, your point about a more agile approach is well taken.
Yes, I am frustrated. Keep in mind that while I am frustrated and do my best to argue my case, others may simply avoid lean 4 core altogether as a result, and I think we are as a whole worse off for it. I do hope we can find some kind of approach here that still allows space for drive-by contributions, without simply stone-walling anything that isn't on the top-priority roadmap. Because quite frankly, I don't think external contributors can even meaningfully contribute to those high-priority areas most of the time, because they are usually also high-context and probably can't be safely outsourced. Little paper cuts on the edges are where external contributors do best.
I want to propose a more structured way forward that will address many of the concerns raised:
I am in favor of all of the points listed. I look forward to the discussion re: the RFC process and hope we can find something that works for everyone.
Dear Mario (@digama0),
Thank you for your insightful response. I appreciate your openness to dialogue on these pressing matters.
Regarding your thoughts on embracing diverse implementation strategies, I understand the perspective, but I believe it's critical to consider the potential pitfalls of fragmentation. We have ambitious plans for Lake at the Lean FRO, including features like fine-grain cloud builds for all Lean packages, Reservoir (our equivalent to crates.io), and a daemon-based build system capable of scaling to tens of millions of lines of code and hundreds of packages. It's also worth noting the amount of positive feedback we've received about Lake. I have many colleagues at AWS who have praised how smooth Lake is. While it isn't perfect, Mac is diligently working on its improvement. Having initially built the system as a volunteer, he now works full-time on it as an employee at the Lean FRO. I should also remind you that Sebastian and I reached out to many community members, including you, for help building a package manager for Lean 4. Mac stepped up to the challenge, delivering a system that works remarkably well under difficult conditions. I recall many people assumed Lake would not be able to build Mathlib, but it succeeded. Given the long-term commitment and coordination required, I genuinely believe that working on an alternative build system would be an inefficient use of your skills and expertise. I didn't forget about the cbv tactic that we discussed last year and that you seemed so excited about ;) I believe Lake is better suited for someone who is working on it full-time, is passionate about the project, and is being paid to contribute.
Regarding your frustrations, they're entirely understandable. Your points about the challenges external contributors face, especially when it comes to "paper cuts," are well-taken. However, I can't help but notice that some of your PRs appear to be in limbo, tagged as "awaiting author" after we've requested modifications, and without any further action on your part.
I'm pleased to hear you support the structured path forward that has been proposed. I believe a well-defined RFC process could serve as an effective channel for the community's creative energy, ensuring that we stay focused on our key goals.
I look forward to engaging in more in-depth discussions about the RFC process and am hopeful that we can find a solution that suits everyone. Your ongoing contributions to the Lean community are invaluable, and I'm optimistic that through constructive dialogue, we can resolve the issues at hand.
While it isn't perfect, Mac is diligently working on its improvement. Having initially built the system as a volunteer, he now works full-time on it as an employee at the Lean FRO. I should also remind you that Sebastian and I reached out to many community members, including you, for help building a package manager for Lean 4. Mac stepped up to the challenge, delivering a system that works remarkably well under difficult conditions. I recall many people assumed Lake would not be able to build Mathlib, but it succeeded. Given the long-term commitment and coordination required, I genuinely believe that working on an alternative build system would be an inefficient use of your skills and expertise.
I was not and still am not in a position to build and maintain a build system for lean. I'm definitely glad that Mac stepped up and did this, and it has been a huge help for everyone involved. At the same time, there are some specific technical issues that I can identify solutions for, as well as some larger directional discussion I would like to have at some point, based on discussions I have had with others in public and private regarding what we need from a build system. Forking lake and trying to compete with it is definitely not a good direction, but neither is considering the discussion over before it has begun. If lake2 is also developed at the FRO, I'm happy with that. Maybe lake will just get some internal rearchitecture and no user visible changes will happen. Or maybe the frontend (the lakefiles) will change but lake will retain its overall structure. I don't want to consider any option as off the table for the purpose of discussing what the issues are, what the goals are, and what we can do to address them.
I didn't forget about the cbv tactic that we discussed last year, and you seemed so excited about ;) I believe Lake is better suited for someone who is working on it full-time, is passionate about the project, and is being paid to contribute.
Right now, my lean cycles are mostly occupied maintaining std, which is seeing an uptick in PRs and has a growing backlog. I'm hopeful that we can find another maintainer soon.
However, I can't help but notice that some of your PRs appear to be in limbo, tagged as "awaiting author" after we've requested modifications, and without any further action on your part.
Yes, I have been remiss in not addressing issues on my other PRs to lean core. I don't blame you or the team for any of that, it is on me to address the recommendations. (If you or anyone else wants to prod me to finish up a specific PR, feel free to ping me about it.)
Thank you for elucidating your thoughts on the future of Lake and the need for ongoing discussion. I agree that dialogue should continue, and I also believe that it's essential to give credit where it's due. One of the guiding philosophies that I adhere to in software development is the concept of code ownership. Mac has done an incredible job building Lake from scratch, initially as a volunteer and now as a full-time member of the Lean FRO. This kind of sustained commitment and excellence is rare, and I have immense respect for his work. In my view, Lake is essentially Mac's creation, and as such, he should have the final say in its technical direction. That's not to shut down discussions or suggestions for improvement; quite the opposite. Everyone's input is valuable, especially when it comes to identifying potential issues or proposing constructive solutions. However, it's also important to respect the vision and decisions of the person who has taken the primary responsibility for a project, and in the case of Lake, that person is Mac. I've been extremely pleased with how he has been addressing issues and shaping the future of Lake. His vision aligns closely with what I believe will best serve the needs of the Lean community, and I trust his judgment in making the right technical decisions. I promise you that I will address this point again during our first monthly community meeting.
Right now, my lean cycles are mostly occupied maintaining std, which is seeing an uptick in PRs and has a growing backlog. I'm hopeful that we can find another maintainer soon.
Thank you for letting me know about your current focus on maintaining Std and the increasing workload it involves. Your dedication as a volunteer in this area is invaluable to the community. Rest assured, Sebastian and I are taking steps to find a full-time maintainer to help manage the growing demands of the project.