flit icon indicating copy to clipboard operation
flit copied to clipboard

Switching to flit - a lot of unneeded stuff like .github ends up in the sdist

Open pfmoore opened this issue 3 years ago • 12 comments

I have a project which until now has been managed with setuptools. I'm considering switching to flit. However, if I try to do so, I end up with a lot of files I don't want in the sdist.

What I did was as follows:

  1. git rm the files setup.py, setup.cfg, pyproject.toml
  2. flit init, answering the various questions
  3. git add pyproject.toml
  4. Remove some stuff from the local directory because flit build was getting cross about untracked files
  5. flit build

Note that I did not commit these changes - I want to test the conversion before doing that.

But the sdist includes files like .github/workflows/*.yml, and .gitignore. I checked the configuration for the packaging==20.5 project (when it used flit), and it didn't seem to do anything special, and yet its sdist didn't have these files.

I know I can put includes and excludes in the sdist section of pyproject.toml, but I'd have expected the default behaviour to exclude commonly-used admin files like these.

Am I doing something wrong here?

(Update: I just checked, and even with a new project I get these files 🙁)

pfmoore avatar Mar 31 '21 15:03 pfmoore

Update: It looks like flit build gives different results than the PEP 517 build_sdist hook (if I run py -m build .)

pfmoore avatar Mar 31 '21 15:03 pfmoore

Yup, you're right that flit build does something a bit different from the PEP 517 hook. And this is definitely a bit awkward. A brief bit of history:

The first versions of Flit didn't make sdists at all, only wheels. For pure Python code, a wheel is still 'source', and if you wanted the docs & support code at all, you could clone the source repo. That upset people, so I added sdists, doing something similar to git archive as a snapshot of whatever was in the repository. I thought that only the maintainer would need to generate these at release time, so relying on git (or another VCS) to decide what to include seemed reasonable.

Then PEP 517 came along, and said that making sdists is something that 'backends' should be able to do whenever frontend tools ask, not just something developers do 'manually'. This didn't really fit with what I'd developed, and I pushed for a concession that backends could decide it's unsupported, but it still required a rethink.

The result of that rethink was that the PEP 517 build_sdist hook will make a minimal sdist without using git - it should always be enough to work for installation, but may not be what you want to upload. flit build and flit publish continue to work like they did before PEP 517, getting a list of files from git, and that's what I'd normally upload. The include/exclude patterns specified in pyproject.toml are respected by both.

This probably isn't the last time this setup will evolve. :slightly_smiling_face:

takluyver avatar Apr 01 '21 09:04 takluyver

Thanks for the explanation. To be honest, I still think having two ways to build a sdist that give different results is a bad choice (and I'm afraid I'm very much in the "sdists are fundamental, you must publish them" camp), but I understand how things ended up the way they are. (And of course, there's no reason you have to agree with my views 🙂)

I'm inclined to say that at a minimum, flit build should ignore "obvious" things like .github, .gitignore and .pre-commit-config.yaml. But I can appreciate that there would be a lot of guesswork getting that right.

I'll see if I can make things work the way I want using include/exclude patterns.

pfmoore avatar Apr 01 '21 09:04 pfmoore

I haven't, and probably won't, hardcoded any default exclusions. There might be useful clues about how the project is built or tested in something like a CI workflow definition, and the impact of including these files in sdists is typically low, both because they're usually small files and because the majority of users will probably download the wheel. I'd rather include everything in an sdist by default and let the developer decide what to exclude.

I still think having two ways to build a sdist that give different results is a bad choice

Yeah, I'm not very happy with it. But I didn't want to break how flit build worked, and I didn't want the PEP 517 hook to depend on being able to run git/hg. This was the least bad option I could find at the time.

takluyver avatar Apr 01 '21 09:04 takluyver

I spent some time thinking about this recently, and here's my proposal for how to handle the inconsistency around how sdists are generated between the two build-backends and bring them into alignment:

  • When there's a .git or .hg folder in the source tree (root), attempt to use the ignore information from git/hg (+ user excludes) and fail if the executable is not present.
  • Don't bundle these folders into the sdist perhaps obviously.
  • When there's no such folder in the source tree (root), bundle all the files (respecting user excludes).

This would mean that both the backends would depend on VCS being present when building from a VCS checkout (which is a safe assumption) and would automatically ignore files that are in the VCS.

I think #625 uses the wrong model, since it's not a choice being made when building, but rather a choice based on what information is available to the build logic -- I think Flit should have VCS-related enabled by default (when we have VCS metadata) and also handle things gracefully when that metadata is not available. I don't think it's a case of "don't use vcs metadata always" or "use vcs metadata always", which is what the two backends do today.

By removing the ability to build something that you might not want to upload, we eliminate the issue being flagged here and make sdists serve a single concept: being a redistributable source archive that contains exactly what the git repository contains and nothing more by default, with additional exclude rules excluding more files as necessary.

With this, I'd also say that we should have the two backends move to having the same behaviours, and basically the flit backend becoming a re-export of flit-core.

pradyunsg avatar Feb 28 '23 12:02 pradyunsg

FWIW, the "has a .git or .hg directory" case is exactly how flit build works already. So I'm not sure what

Yeah, I'm not very happy with it. But I didn't want to break how flit build worked, and I didn't want the PEP 517 hook to depend on being able to run git/hg. This was the least bad option I could find at the time.

means. Is this trying to accommodate for the case where the source tree is a VCS clone, but the VCS program is not installed? If so, how did you get the clone? Copy it via USB drive, .git/ or .hg/ directory intact, from another machine that did have the VCS program installed?

eli-schwartz avatar Feb 28 '23 12:02 eli-schwartz

I'm inclined to say that at a minimum, flit build should ignore "obvious" things like .github, .gitignore and .pre-commit-config.yaml. But I can appreciate that there would be a lot of guesswork getting that right.

Incidentally, meson solves this by ignoring anything that has been ignored by a gitattributes (git check-attr) "export-ignore" rule.

eli-schwartz avatar Feb 28 '23 12:02 eli-schwartz

I spent some time thinking about this recently, and here's my proposal for how to handle the inconsistency around how sdists are generated between the two build-backends and bring them into alignment:

I'm confused here, as there aren't two backends, there's just flit_core, as far as I know. What there are, is two frontends - flit build and the PEP 517 build_sdist hook.

I remain strongly of the opinion that the key thing here is for those two mechanisms to build the same sdist, as otherwise it's way too easy to accidentally publish a wrong sdist just by using the incorrect command to build it. But that's just my preference as a user, and in my personal workflow, "what's in VCS" is not necessarily related to "what I want in my Python project".

Having said that, though, my current position is that flit_core is a perfectly OK build backend, but I won't use the flit frontend command, and any public projects of mine that ever switch to flit_core will explicitly note that using the flit frontend is not supported. I'm slightly sad that I have to take this position, because it feels like I'm somehow saying the flit frontend is bad, but it seems like a better solution than badgering the flit developers to take their frontend in directions that they don't want to go.

pfmoore avatar Feb 28 '23 13:02 pfmoore

s/build backends/build mechanisms/ 😅

pradyunsg avatar Feb 28 '23 13:02 pradyunsg

When there's a .git or .hg folder in the source tree (root), attempt to use the ignore information from git/hg (+ user excludes) and fail if the executable is not present.

I'd ruled this option out in the past, and I still don't much like it, because I think it would fail in realistic scenarios that we should support, e.g. bind-mounting a git repo into a Docker container without git and building there, or local development by someone who uses git/hg through a GUI interface and doesn't have the command on PATH.

Then the next idea that inevitably comes to mind is that rather than failing without git (or hg) on PATH, it should go into a fallback mode, like flit build currently does. But then the sdist that it makes depends on a barely-visible external factor, which is bad. I was just about happy with that trade-off for the flit command line tool, meant to be used directly by developers and with control of its own output to show warning messages. I don't think it's a good fit at all for a backend which is called by other tools.

Yet another possibility is to bypass the executables and read the VCS metadata directly. But it still means doing something different depending on the presence of hidden files, which I don't much like. And accurately parsing gitignore/hgignore files is bound to be fiddlier than it looks. E.g. do we apply the user's global gitignore file in addition to the one in the repo? What if part of the package is in a submodule?

When there's no such folder in the source tree (root), bundle all the files (respecting user excludes).

This is a possibility I hadn't considered - when we're not using VCS metadata, we could default to including everything in the folder where pyproject.toml lives and let people exclude things they don't want. If we separate this from the point above, we'd presumably have to special case .git and .hg directories to avoid packaging the project's history.

On the one hand, people wouldn't need to open issues asking for docs or tests to be included in the sdist. On the other hand, I'd definitely end up accidentally packaging random scripts I have lying around, built HTML docs, etc. And it very much wouldn't solve @pfmoore's original complaint in this issue that the sdist includes unnecessary files.

I'm inclined to say the minimal-sdist-plus-includes is still a better starting point than everything-minus-excludes. Do you think this is worth discussing further?

By removing the ability to build something that you might not want to upload, we eliminate the issue being flagged here and make sdists serve a single concept: being a redistributable source archive...

This aligns with my thinking about sdists before PEP 517. But PEP 517 formalised that sdists have another purpose: we build them as an intermediate step on the way to wheels, even if we're not intending to distribute them. I argued against this at the time, and the UnsupportedOperation caveat was added to appease me, but even with that, there are unavoidably two purposes for sdists.

So my stance until recently was that you should use flit build to generate the 'distribution' sdist, and flit_core when e.g. making an sdist as a step of pip installing from source. But that has gradually become untenable, as people expect e.g. python -m build to make an sdist for distribution.

badgering the flit developers to take their frontend in directions that they don't want to go.

To be clear, I still think #625, and later making --no-use-vcs the default, is the best option I can see for this.

takluyver avatar Mar 01 '23 12:03 takluyver

Yet another possibility is to bypass the executables and read the VCS metadata directly. But it still means doing something different depending on the presence of hidden files, which I don't much like. And accurately parsing gitignore/hgignore files is bound to be fiddlier than it looks. E.g. do we apply the user's global gitignore file in addition to the one in the repo? What if part of the package is in a submodule?

As a general build backends thing, you don't want .gitignore files anyway, since their only purpose is to list files which shall not be checked into git, and those are usually exactly the files you do want to distribute publicly (build artifacts). It's useless for files which are checked into git (and thus by definition not ignored, so to check them into git you actually have to write a more specific gitignore rule that cancels out the more general ignore).

The VCS metadata you do want to parse is the repository object packfiles to find the tree object for the current commit hash from HEAD that lists all files contained in that tree, which umm, yeah, you very much do not want to have to parse manually. :D

On the one hand, people wouldn't need to open issues asking for docs or tests to be included in the sdist. On the other hand, I'd definitely end up accidentally packaging random scripts I have lying around, built HTML docs, etc. And it very much wouldn't solve @pfmoore's original complaint in this issue that the sdist includes unnecessary files.

I think it would solve the original complaint, because people who cared would simply keep git installed. Alternatively...

I argued against this at the time, and the UnsupportedOperation caveat was added to appease me, but even with that, there are unavoidably two purposes for sdists.

... if flit_core detects that git is needed for this particular pyproject.toml project, it could raise UnsupportedOperation('git needed but not available') and tools that build wheels should fall back to building them directly, and tools that specifically build sdists just have to deal with that by installing git.

Question:

I think it would fail in realistic scenarios that we should support, e.g. bind-mounting a git repo into a Docker container without git and building there, or local development by someone who uses git/hg through a GUI interface and doesn't have the command on PATH.

Won't the GUI interface inevitably have a real git or hg installed, though? Either by a linux package manager dependency, or by git for Windows making it easy to stick the git binary in PATH by default?

For docker containers, maybe people should simply not do that. What's the temptation to do that, and then also build specifically an sdist?

eli-schwartz avatar Mar 01 '23 12:03 eli-schwartz

even with that, there are unavoidably two purposes for sdists.

I agree, there's a definite tension here over what people expect a sdist to be "for". I'm somewhat intermediate between the two cases you describe - I don't think of a sdist as "just" what's needed to build a wheel, but neither do I think it should be a bundle of everything in the development tree. I think "a buildable set of source" is a viable concept in its own right (even though what people mean by that may differ) and trying to force it to match "the project development tree" is just as bad as making it "the bare minimum to build a wheel".

I guess that given the above, I think that as it's the project maintainer who decides what fits in their concept of a "source distribution", reasonable defaults plus the ability to define what you want is the right choice here. My instincts are for a minimal default with anything else added explicitly, but I can see the attraction for some users of a maximal default with explicit exclusion.

To be clear, I still think https://github.com/pypa/flit/pull/625, and later making --no-use-vcs the default, is the best option I can see for this.

Agreed. That gives two "reasonable defaults" to start from, with the "default default" being the one that doesn't need additional tools available.

pfmoore avatar Mar 01 '23 12:03 pfmoore