[doc] Setting up PerlIO callbacks when embedding a Perl interpreter using C (before any `*.pm` modules were ever loaded or any Perl code executed)

Open vadimkantorov opened this issue 1 year ago • 5 comments

Hi!

I managed to do a fully hermetic single-file static build of perl via building all modules statically (followed https://perldoc.perl.org/perlembed) and providing my own implementations of open/fopen/read/seek to serve *.pm system files from memory.

Is there a way to hook into Perl's own PerlIO layer system to make sure that Perl calls only these functions (including for module/*.pm discovery and loading) and never falls back to libc's IO functions or makes IO syscalls directly? This would be a much cleaner and more robust solution.

It would be nice if setting up PerlIO in a perlembed scenario were covered in the docs.

I also wonder how the diamond operator is implemented and which functions from https://github.com/Perl/perl5/blob/blead/perlio.c it calls, and in what sequence (e.g. for perl -e 'open(f,"<","my.txt");print(<f>);' and for perl -e 'open(f,"<","my.txt");$line=<f>;print($line);').

Thanks!


If anyone's curious to see what my hack looks like - https://github.com/vadimkantorov/perlpack, but it's very much a WIP

My current problem: overriding open / close / read / stat / lseek / access / fopen / fileno was sufficient for perl -e 'use Cwd;print(Cwd::cwd(),"\n");', so Perl can successfully discover and load Cwd.pm from my virtual read-only FS. But perl -e 'open(F,"<","/mnt/perlpack/.../Cwd.pm");print(<F>);' does not work, probably because Perl tries fcntl, ioctl or some other variant of the stat call that I do not implement. In any case, some failure along the way means my read function is never invoked when I use the diamond operator.

Which libc/stdio calls does Perl use when opening and reading a file in the typical case? strace shows open -> fcntl -> ioctl -> lseek -> fstat -> mmap -> read, but these are raw syscalls. I'd like to know the concrete libc/stdio functions (stat, for example, has many variants) so that I can override them. I imagine they are somewhere in perlio.c or doio.c, but there are quite a few layers of indirection, which makes the code hard to follow for a novice in Perl's codebase.
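For readers following along, the kind of prefix-dispatching, read-only file table described above can be sketched in plain C. This is a minimal sketch under stated assumptions, not the perlpack implementation: every `vfs_*` name, the descriptor base of 1000 and the embedded `Hello.pm` entry are hypothetical, and a real override would still have to forward paths outside the prefix to libc.

```c
#include <stddef.h>
#include <stdio.h>   /* SEEK_SET, SEEK_CUR, SEEK_END */
#include <string.h>

/* Hypothetical names throughout; a real table would be generated at build time. */
#define VFS_PREFIX   "/mnt/perlpack/"
#define VFS_FD_BASE  1000   /* keeps virtual descriptors apart from real ones */
#define VFS_MAX_OPEN 16
#define VFS_ENTRY(path, data) { path, data, sizeof(data) - 1 }

typedef struct { const char *path; const char *data; size_t size; } vfs_file;
typedef struct { const vfs_file *file; size_t pos; } vfs_handle;

static const vfs_file vfs_files[] = {
    VFS_ENTRY("/mnt/perlpack/lib/Hello.pm", "package Hello;\n1;\n"),
};
static vfs_handle vfs_open_table[VFS_MAX_OPEN];

/* Returns a virtual descriptor, or -1 if the path is not served from memory. */
int vfs_open(const char *path)
{
    size_t i;
    int h;
    if (strncmp(path, VFS_PREFIX, strlen(VFS_PREFIX)) != 0)
        return -1;   /* outside the prefix: the caller should use the real open() */
    for (i = 0; i < sizeof vfs_files / sizeof vfs_files[0]; i++) {
        if (strcmp(vfs_files[i].path, path) != 0)
            continue;
        for (h = 0; h < VFS_MAX_OPEN; h++) {
            if (vfs_open_table[h].file == NULL) {
                vfs_open_table[h].file = &vfs_files[i];
                vfs_open_table[h].pos = 0;
                return VFS_FD_BASE + h;
            }
        }
        return -1;   /* handle table is full */
    }
    return -1;       /* no such embedded file */
}

static vfs_handle *vfs_lookup(int fd)
{
    int h = fd - VFS_FD_BASE;
    if (h < 0 || h >= VFS_MAX_OPEN || vfs_open_table[h].file == NULL)
        return NULL;
    return &vfs_open_table[h];
}

/* Copies up to count bytes from the current position; returns 0 at end of file. */
long vfs_read(int fd, void *buf, size_t count)
{
    vfs_handle *h = vfs_lookup(fd);
    size_t left;
    if (h == NULL)
        return -1;
    left = h->pos < h->file->size ? h->file->size - h->pos : 0;
    if (count > left)
        count = left;
    memcpy(buf, h->file->data + h->pos, count);
    h->pos += count;
    return (long)count;
}

/* Repositions the handle; returns the new offset or -1 on error. */
long vfs_lseek(int fd, long offset, int whence)
{
    vfs_handle *h = vfs_lookup(fd);
    long base;
    if (h == NULL)
        return -1;
    if (whence == SEEK_SET)      base = 0;
    else if (whence == SEEK_CUR) base = (long)h->pos;
    else if (whence == SEEK_END) base = (long)h->file->size;
    else return -1;
    if (base + offset < 0)
        return -1;
    h->pos = (size_t)(base + offset);
    return (long)h->pos;
}

/* Releases the handle slot; returns 0 on success. */
int vfs_close(int fd)
{
    vfs_handle *h = vfs_lookup(fd);
    if (h == NULL)
        return -1;
    h->file = NULL;
    return 0;
}
```

A wrapper for `open` would call `vfs_open` first and fall back to the real function when it returns -1 for a path outside the prefix. The calls Perl makes between open and read (the fcntl, ioctl and fstat steps in the strace output) would need equivalent handling for the virtual descriptors, which this sketch does not attempt.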

vadimkantorov, Sep 04 '24 17:09

There is PERL_IMPLICIT_SYS, but that replaces all I/O (not just module loading) and only has a host implementation on Windows.

If you just want modules to be loaded from memory you can add a hook to @INC that checks for a known name and loads that module from memory, see perldoc -f require.

tonycoz, Sep 05 '24 01:09

I'll check out PERL_IMPLICIT_SYS. Replacing all I/O is fine for my use case, as my custom I/O functions only serve from memory for some special prefixes like /mnt/perl. Are there any docs / examples of using PERL_IMPLICIT_SYS to override I/O, and of which functions need to be overridden to cover both module loading and perl -e 'open(f,"<","my.txt");print(<f>);'? I'm only concerned with compiling/running on Linux for now.

> If you just want modules to be loaded from memory you can add a hook to @INC that checks for a known name and loads that module from memory, see perldoc -f require.

Actually, I'm interested in both modules and regular, basic file reads. For modules, can such an INC hook be added via the C perlembed interface (without executing Perl code)?

> and only has a host implementation on Windows.

And regarding the PerlIO infrastructure: is it relevant for my use case (module / *.pm loads and regular basic file reads)? Can it be configured via the C perlembed interface? Or would you recommend using PERL_IMPLICIT_SYS? Or is using PERL_IMPLICIT_SYS on Linux impossible?

Thank you!

vadimkantorov, Sep 05 '24 09:09

> I'll check out PERL_IMPLICIT_SYS. Replacing all I/O is fine for my use case, as my custom I/O functions only serve from memory for some special prefixes like /mnt/perl. Are there any docs / examples of using PERL_IMPLICIT_SYS to override I/O, and of which functions need to be overridden to cover both module loading and perl -e 'open(f,"<","my.txt");print(<f>);'? I'm only concerned with compiling/running on Linux for now.

It's only ever been done before for Windows but there's no reason it would be impossible on Linux. See perlhost.h, win32.c and perllib.c in win32/ for prior art.

Leont, Sep 05 '24 17:09

Thanks for the pointers! I'll look into what using PERL_IMPLICIT_SYS on Linux entails.

And maybe one last question: do you know whether PerlIO can also be used for this I/O override goal? And if so, can it be configured via a C API before any Perl code gets executed?

vadimkantorov, Sep 05 '24 22:09

> For modules, can such an INC hook be added via the C perlembed interface (without executing Perl code)?

After perl_construct() something like:

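/* newXS takes a C function pointer, so no Perl-style \& here */
CV *hook = newXS("MyPackage::my_hook", xs_my_hook_xs, __FILE__);
AV *inc = get_av("INC", GV_ADD);
av_unshift(inc, 1);
/* newRV_inc: the symbol table already owns the CV's original reference */
av_store(inc, 0, newRV_inc((SV *)hook));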

You could also define the hook sub in perl with eval_pv()/eval_sv().

> And maybe one last question: do you know whether PerlIO can also be used for this I/O override goal? And if so, can it be configured via a C API before any Perl code gets executed?

You might be able to do it by modifying PL_def_layerlist or via PERLIO in the environment, but I've never tried it.

It also won't allow you to hook operations like stat() and fcntl().

tonycoz, Sep 06 '24 00:09

> You might be able to do it by modifying PL_def_layerlist or via PERLIO in the environment, but I've never tried it.

It would be interesting if this worked, and it could be enough for just embedding *.pm files into the binary! It seems that stat() is not used on the *.pm search / loading path...

Basically, I was looking for a portable way to hook module loading so that embedded *.pm files get loaded in a fully static https://perldoc.perl.org/perlembed setup... I achieved this using static musl libc symbol renaming (https://github.com/vadimkantorov/perlpack), but doing this with PerlIO hooks would be a nicer and more portable way.

vadimkantorov, Jan 02 '25 16:01

> CV *hook = newXS("MyPackage::my_hook", xs_my_hook_xs, __FILE__);

I would be grateful if anyone could suggest an example of what this void xs_my_hook_xs(pTHX_ CV* cv) {} C-land INC hook should look like. If I understand correctly, it should pop from cv the reference to itself and the requested file name and then somehow push a return value?

From https://perldoc.perl.org/functions/require:

INC hook will be called with two parameters, the first a reference to itself, and the second the name of the file to be included (e.g., Foo/Bar.pm). The subroutine should return either nothing or else a list of up to four values in the following order:

  1. A reference to a scalar, containing any initial source code to prepend to the file or generator output.

  2. A filehandle, from which the file will be read.

  3. A reference to a subroutine. If there is no filehandle (previous item), then this subroutine is expected to generate one line of source code per call, writing the line into [$_](https://perldoc.perl.org/perlvar#%24_) and returning 1, then finally at end of file returning 0. If there is a filehandle, then the subroutine will be called to act as a simple source filter, with the line as read in [$_](https://perldoc.perl.org/perlvar#%24_). Again, return 1 for each valid line, and 0 after all lines have been returned. For historical reasons the subroutine will receive a meaningless argument (in fact always the numeric value zero) as $_[0].

  4. Optional state for the subroutine. The state is passed in as $_[1].

vadimkantorov, Jan 07 '25 15:01

I don't think overriding PerlIO is the way to go here. That is, unless you want the modules to be open()able as if they really existed.

Pushing your own handler (either a coderef or an object) into @INC is probably the best approach; in either case you need to add some code to core to handle that. An alternative approach might be to override the require opcode.

Leont, Jan 07 '25 20:01

Yeah, I think figuring out both avenues, installing an INC hook from C land (for embedding modules' pm/pl files) and PerlIO (for embedding data files installed into the Perl prefix), would be very useful for making basic embedded, static Perls more practical, simple and transparent.

It would also allow simpler compilation of Perl to a WASI target (which might not have a built-in virtual FS feature, unlike Emscripten).

vadimkantorov, Jan 07 '25 22:01

> It would also allow simpler compilation of Perl to a WASI target (which might not have a built-in virtual FS feature, unlike Emscripten).

That really sounds like you should want to go the implicit sys route. That way you can virtualize all system/io interactions.

Leont, Jan 08 '25 13:01

It probably would help a lot if we'd first write a generic implicit sys implementation, because you probably don't want to redefine all of it.

Leont, Jan 08 '25 17:01

Yeah, I think ideally examples of both would be very useful:

  1. Generic PerlIO impls with example of override from C land
  2. INC hook installation from C land (for simpler and more specific cases of embedding only pm/pl module files)

vadimkantorov, Jan 08 '25 17:01

Regarding avenue (2) (the INC hook): another way could be to populate a path => module source code dictionary in the Perl interpreter state from C land, and then add an INC hook written in Perl that reads from this dictionary. How can one do that nicely without having to prepare a Perl statement like "mymodulesources["..."] = "..."? That is problematic because the string would need escaping, so ideally we'd create the string variable from C land and push it directly into the mymodulesources dictionary.

Could I somehow use HeSVKEY_set (from https://perldoc.perl.org/perlapi) to this end? I'd just need to create a string variable as an SV somehow... Maybe via newSVpv?

vadimkantorov, Jan 09 '25 16:01

As a full example, here is what I am currently doing to override the loading of *.pm files (and some data files) from an embedded prefix tree: https://github.com/busytex/busytex/blob/988f0a3337f461ab5cada8d410581d449edc032b/.github/workflows/build-biber.yml#L175-L669 for https://github.com/plk/biber, instead of using PAR / shared library linking.

Defining custom IO functions as a PerlIO layer or via PERL_IMPLICIT_SYS would be more explicit and robust than using the linker's --wrap switch and rolling custom fd implementations, etc.

vadimkantorov, Jan 13 '25 13:01