BizHawk icon indicating copy to clipboard operation
BizHawk copied to clipboard

Organisation of subprojects

Open YoshiRulz opened this issue 8 months ago • 17 comments

tl;dr: I'm calling for a moratorium on copying third-party code into the repo, and I'm proposing that we move a few dirs around.

We all hate code duplication. So why is it okay for us to needlessly copy entire codebases? ~~snip~~ (Cut out a bunch of back-and-forth here since I feel none of you need to be convinced this is a good idea, only that it's worth the effort to do it. edit: Apparently natt did need convincing. Scroll down for arguments.)

So let's assess the damage:

  • /ExternalCoreProjects/Virtu, Virtu core, < 11 kLOC, upstream is https://github.com/digital-jellyfish/Virtu/tree/master/Virtu
  • /ExternalProjects/NLua, Lua host, < 10 kLOC, upstream is https://github.com/NLua/NLua
  • /ExternalProjects/iso-parser, disc system, < 2 kLOC, upstream was on Google Code
  • /blip_buf, multiple cores, < 2 kLOC, upstream was on Google Code
  • /libmupen64plus, Mupen64Plus and plugins, < 340 kLOC, upstreams are under https://github.com/mupen64plus
  • /lynx, Handy core, < 12 kLOC, upstream is in tarball (or https://github.com/TASEmulators/mednafen/tree/master/src/lynx)
  • /psx, Octoshock core, < 68 kLOC, upstream is in tarball (or https://github.com/TASEmulators/mednafen/tree/master/src/psx)
  • ~~/quicknes~~ replaced with submodule
  • /waterbox/ares64/ares, Ares64 core, < 55 kLOC, upstream is https://github.com/ares-emulator/ares
  • /waterbox/bsnescore/bsnes, (new) BSNES core, < 67 kLOC, upstream is https://github.com/bsnes-emu/bsnes
  • /waterbox/gpgx, Genplus-gx core, < 88 kLOC, upstream is https://github.com/ekeeke/Genesis-Plus-GX
  • /waterbox/libsnes, (old) BSNES core, < 63 kLOC, upstream is https://github.com/bsnes-emu/bsnes
  • /waterbox/nyma/zlib, zlib for Nyma cores, < 19 kLOC, upstream is https://github.com/madler/zlib
  • /waterbox/picodrive, PicoDrive core, < 79 kLOC, upstream is https://github.com/notaz/picodrive
  • /waterbox/tic80, TIC-80 core, < 17 kLOC, upstream is https://github.com/nesbox/TIC-80
  • /waterbox/uzem, Uzem core, < 5 kLOC, upstream is https://github.com/Uzebox/uzebox/tree/master/tools/uzem
  • /waterbox/virtualjaguar/src, Virtual Jaguar core, < 52 kLOC, upstream was on a self-hosted Git server
  • /wonderswan, Cygne core, < 8 kLOC, upstream is in tarball (or https://github.com/TASEmulators/mednafen/tree/master/src/wswan)

In total, the upper bound is 917 kLOC of source code, severed from its trunks and waiting to bitrot. (For reference, just the BizHawk solution is ~450 kLOC, and Mesen is ~325 kLOC.) Not listed are several binaries in /Assets/dll which are rundeps for unmanaged cores. I'll be looking at these as part of my Nix experiments. I also discounted dependencies of these projects whenever I noticed e.g. .../vendor near the bottom of the filesize list—but they are nonetheless checked-in to this repo.

I propose that from today we only use Git submodules for this purpose:

  • if possible, pulling in the upstream and only checking-in build scripts and patches, or else having a mirror/fork in the TASEmulators org (TODO copy and expand https://github.com/TASEmulators/BizHawk/commit/2af6bf3c34f8a31690fe7e9d8500e54bd01c671d#commitcomment-140199736); and
  • placing submodules in /submodules, and supplemental files like Makefiles or shell scripts in one of /ExternalProjects, /ExternalCoreProjects, or /waterbox when those are necessary.

I also propose we migrate all subprojects, starting with those listed above, to the same scheme. Adding a submodule and removing the checked-in copy can be squished into a single commit.

Moved here from my personal notes #118; see also #2423, my personal notes #11, and #2312.

YoshiRulz avatar Oct 31 '23 18:10 YoshiRulz

Bonus points if you manage to not lose our change history to those projects when moving them to submodule. Splitting folder history is possible in git, but re-applying changes from a different tree (even if the code matches) may be harder.

vadosnaprimer avatar Oct 31 '23 19:10 vadosnaprimer

For many of those cores/projects, they're checked in as they have been forked and modified, some fairly lightly (e.g. tic80, which mostly just has waterbox interfacing added and minor adjustments and unneeded files excised) and some very heavily (e.g. virtualjaguar, much of which is unrecognizable compared to upstream), not to mention just ones extremely outdated from upstream (e.g. gpgx). Most of them will require forking in any case and can't just have upstream submoduled.

fork in the TASEmulators org

To be clear this is also mainly the reason I don't go immediately fork in TASEmulators org for projects image And getting someone to move a repo to the org is annoying. Way eaiser to just clone into the repo and call it a day, and no downsides (besides BizHawk clone size) if upstream is dead.

CasualPokePlayer avatar Nov 01 '23 03:11 CasualPokePlayer

I didn't know lack of permissions was a roadblock for more cool work. I think that can be arranged.

vadosnaprimer avatar Nov 01 '23 17:11 vadosnaprimer

To be clear this is also mainly the reason I don't go immediately fork in TASEmulators org for projects [image showing insufficient permissions]

This would also be a blocker for me. I do see the advantage in terms of organization if more stuff was in submodules, but it does require all maintainers to have access to all the relevant projects.

Cleaning up the root folder of the bizhawk repo is independent from that though, and I'd definitely approve moving stuff to a unified folder. There is ExternalCoreProjects, why isn't it used more?

Morilli avatar Nov 02 '23 12:11 Morilli

If people tell me what to fork under tasemus, I can do that and invite you guys as admins to those forks. I may also discuss adding more admins to the org itself...

vadosnaprimer avatar Nov 03 '23 15:11 vadosnaprimer

Well, most of the repos I linked should be forked, but pragmatically the larger ones are far more important, and as CPP said some projects have had more extensive changes than others. IMO forks can be made "as needed" i.e. the next time someone wants to update a core.

YoshiRulz avatar Nov 03 '23 20:11 YoshiRulz

What's the objective here; making it easier to merge upstream changes? In that case, the difficulty comes chiefly from how much the code has diverged from upstream, not the mechanism by which they're stored.

It's a tradeoff we've made different ways different times: Heavy modifications to source lets us get off the ground with something that works for us more quickly. Light modifications to source mean we have to do work that's sometimes more awkward and difficult to integrate with the existing code better, but we can take upstream changes more easily.

I don't think there's one clear answer. The cores that sync with upstream seem to be the better long-term bet, and we can see that especially in cases where we've done it both ways and can compare the two (e.g., early Mednafen efforts vs the Nyma system.) But we also have to consider that some of these ported cores might not even exist if we hadn't taken the "easy" way out.

I'm against any specific requirements here, because we're all volunteer and the most important thing is being useful for the person who will actually work on the core. I think core porters should consider the value of various approaches, but if they want to get hacking, and they're the only one who will do it, let them.

nattthebear avatar Nov 11 '23 22:11 nattthebear

To be clear, this issue is not arguing against the heavily-modified path. I'm just against adding hundreds of files directly to this repo.

YoshiRulz avatar Nov 11 '23 22:11 YoshiRulz

I don't understand why that's worth discussion. Apart from concerns on implementing the cores, and concerns about keeping the cores synced with upstream, why does it matter?

nattthebear avatar Nov 11 '23 22:11 nattthebear

For one, having an extra million LOC in the repo drastically increases the time it takes to clone. That affects both humans and CI. It also makes grep -r, find, etc. take longer and give more false positives.

YoshiRulz avatar Nov 11 '23 22:11 YoshiRulz

I guess; aren't most devs going to check out most sobmodules anyway?

nattthebear avatar Nov 11 '23 23:11 nattthebear

aren't most devs going to check out most submodules anyway?

I for one only ever checkout the submodules I require. That said, I don't think the C/C++ cores in the repo contribute much to the clone size; it's probably going to be the assets folder for the most part.

Morilli avatar Nov 11 '23 23:11 Morilli

I guess; aren't most devs going to check out most sobmodules anyway?

Seeing as you don't need the submodules cloned to build the solution, and we don't really advertise their existence, I'd imagine no, most devs aren't cloning the submodules. Like Morilli, I never bother to outside of the rare occasions I need to work with them.

That said, I don't think the C/C++ cores in the repo contribute much to the clone size; it's probably going to be the assets folder for the most part.

Downloading edf5f1513 as a zip (not the same as cloning, but similar, and easier to measure): Screenshot_20231112_183427 So your guess was broadly correct. (IMO that is also a problem; Citra and MAME together make up more than half of /Assets' filesize. See #3505.) But if you omit /Assets, the remaining filesize is split at about a quarter /waterbox, a quarter /src, a quarter /libmupen64plus, and the long tail. Notably, some of those largest white segments are .exes, > 18 MiB combined? Those sound like easy wins.

YoshiRulz avatar Nov 12 '23 08:11 YoshiRulz

Well if we're doing this, here's a probably at least somewhat accurate and representative list of what contributes most to the .git size:

Git object size list
Root folder Total object size
Assets 868.6MB
output 749.5MB
BizHawk.MultiClient 524.1MB
BizHawk.Client.EmuHawk 418.5MB
src 253.2MB
BizHawk.Emulation.Cores 172.9MB
ppsspp 108.4MB
waterbox 75.3MB
BizHawk.Emulation 66.3MB
References 61.4MB
BizHawk.Client.Common 56.8MB
libmupen64plus 54.7MB
output64 48.4MB
BizHawk.Client.EtoHawk 45.7MB
mednafen 30.4MB
23.2MB
psx 13.5MB
yabause 13.4MB
libgambatte 10.9MB
genplus-gx 10.6MB
citra 10.1MB
Dist 9.7MB
vbanext 8.5MB
LuaInterface 8.4MB
BizHawk.Emulation.Common 7.9MB
libHawk 7.7MB
Tools 7.3MB
BizHawk.Emulation.DiscSystem 5.0MB
libsnes 4.8MB
libmednahawk 4.2MB
Bizware 4.1MB
ExternalCoreProjects 4.1MB
BizHawk.Common 3.8MB
BizHawk.Client.MultiHawk 3.1MB
ExternalToolProjects 2.1MB
Version 1.9MB
genplus-gx32 1.6MB
CpuCoreGenerator 1.6MB
attic 1.5MB
BizHawk.Client.ApiHawk 1.4MB
quicknes 1.3MB
lynx 1.2MB
ExternalProjects 1.2MB
wonderswan 1.1MB

Initial list was generated using git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | sed -n 's/^blob //p' | sort --numeric-sort --key=2 | cut -c 1-12,41- | $(command -v gnumfmt || echo numfmt) --field=2 --round=nearest > objects.txt and then grouped by root folder.

Morilli avatar Nov 12 '23 14:11 Morilli

we don't really advertise their existence

They're in .gitmodules, what other advertising would be needed?

Anyway, that download size is awful, but the problem there is mostly unrelated to submodules and this ticket, as it's about prebuilt binaries. We'll need to get all of our crusty binary build setups into an easily reproducible build system.

nattthebear avatar Nov 12 '23 14:11 nattthebear

placing submodules in /submodules, and supplemental files like Makefiles or shell scripts in one of /ExternalProjects, /ExternalCoreProjects, or /waterbox when those are necessary.

A minor problem with doing this specifically is cmake being used in places, and cmake does not allow including stuff in parent directories, you can only include stuff in the current directory and subdirectories. So unless you end up also having the cmakelists also be within the /submodules folder alongside the submodule, or possibly in the BizHawk root directory itself, you're screwed.

CasualPokePlayer avatar Dec 09 '23 10:12 CasualPokePlayer

That's dumb. And I assume you need CMake on Windows too, so you can't just use a symlink, bind mount, etc. to work around it.

YoshiRulz avatar Dec 09 '23 20:12 YoshiRulz