
How does PythonTeX work precisely “under the hood”? [Haskell support]

Open Delanii opened this issue 4 years ago • 8 comments

I would like to ask the same question that I posted earlier on TeX.SE (since I got no answer there, and the question is mainly about the internal workings of PythonTeX, I hope it is OK to ask here directly):

https://tex.stackexchange.com/questions/548737/how-does-pythontex-work-precisely-under-the-hood?noredirect=1#comment1386126_548737

To typeset code and its output correctly, PythonTeX requires at least a three-step compilation:

(lua/pdf/xe)latex
pythontex
(lua/pdf/xe)latex

On Windows 10, the first run creates (among others) the file \jobname.pytxcode, the pythontex run then creates (among others) files with the .stdout extension, and the third run reads the content of these files and typesets it.

Is that (at least remotely) correct? And how does it work precisely (on the LaTeX side, but also on the Python side)? I have been working with PythonTeX for some time, but I have found that I am using it mostly as a "black box."

Motivation: I would like to create support for Haskell in PythonTeX. But Haskell has pretty specific IO, and it would be very helpful to me (and hopefully also to others) to know precisely how PythonTeX works.

I would like to add that I am a Haskell newbie/hobbyist, but given that writing support for another language in PythonTeX shouldn't be complicated, I have started a small prototype (based on the support for R). Right now I am also considering whether to use ghc, ghci, or runghc.

Delanii avatar Jun 11 '20 08:06 Delanii

During the first latex run, all of the code and associated settings are saved to \jobname.pytxcode. When pythontex runs, it extracts the code from \jobname.pytxcode, uses templates to assemble code into temp files, executes the temp files, and then converts the output into files or LaTeX macros that LaTeX can use (.stdout and .pytxmcr). The second latex run imports the files and macros, and typesets everything.

To add another language, you will want to look at pythontex_engines.py. Looking at the R template is a fine place to start. You might want to look at the CodeEngine for Rust, since it is compiled.

For a really basic example, you might also look at bash:

bash_template = '''
    cd "{workingdir}"
    {body}
    echo "{dependencies_delim}"
    echo "{created_delim}"
    '''

bash_wrapper = '''
    echo "{stdoutdelim}"
    >&2 echo "{stderrdelim}"
    {code}
    '''

bash_sub = '''echo "{field_delim}"\necho {field}\n'''

CodeEngine('bash', 'bash', '.sh',
           '{bash} "{file}.sh"',
           bash_template, bash_wrapper, '{code}', bash_sub,
           ['error', 'Error'], ['warning', 'Warning'],
           'line {number}')

Basically, you need a whole-program/script template that changes to the working directory, includes the body of the code, and then writes some delimiters at the end. Each chunk of code needs a wrapper that writes a delimiter to stdout and to stderr at the very beginning, before the code. If you want the substitution environments to work (not required), you need a template for that too that writes a delim to stdout followed by a string representation of whatever corresponds to the substitution template field. The CodeEngine specifies the name, name of the language, extension, command to run (this can also be a list, see Rust), templates, what errors look like, what warnings look like, and a template for what references to line numbers in errors/warnings look like.
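As a rough illustration of how these pieces fit together, here is a minimal Python sketch that wraps each chunk and drops the result into the script template with str.format. It is not the actual PythonTeX implementation: the function assemble_script and all of the delimiter strings are made up for this example, and PythonTeX generates its own delimiters.

```python
import textwrap

# Hypothetical delimiter strings for this sketch only.
STDOUT_DELIM = "=>SKETCH#STDOUT#"
STDERR_DELIM = "=>SKETCH#STDERR#"

bash_template = textwrap.dedent('''\
    cd "{workingdir}"
    {body}
    echo "{dependencies_delim}"
    echo "{created_delim}"
    ''')

bash_wrapper = textwrap.dedent('''\
    echo "{stdoutdelim}"
    >&2 echo "{stderrdelim}"
    {code}
    ''')


def assemble_script(chunks, workingdir="."):
    """Wrap each chunk, then place the wrapped chunks into the script template."""
    wrapped = [
        bash_wrapper.format(
            stdoutdelim=STDOUT_DELIM, stderrdelim=STDERR_DELIM, code=chunk
        )
        for chunk in chunks
    ]
    return bash_template.format(
        workingdir=workingdir,
        body="".join(wrapped).rstrip("\n"),
        dependencies_delim="=>SKETCH#DEPENDENCIES#",
        created_delim="=>SKETCH#CREATED#",
    )


script = assemble_script(['echo "first"', 'echo "second"'])
print(script)
```

The printed script starts with the cd line, repeats the two delimiter lines before each chunk, and ends with the dependencies and created delimiters. That is the general shape of the temp file that gets executed.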

Let me know if you have questions. Also, I think you have seen https://github.com/gpoore/codebraid, my other project that is like PythonTeX for Pandoc Markdown. If using that is an option for you (you can typically just use LaTeX mixed in with Markdown), I believe there are Jupyter kernels for Haskell that already exist.

gpoore avatar Jun 16 '20 16:06 gpoore

The part about "assembling temporary files" is a little arcane to me. In the bash support there are echo commands, but as far as I know, in this form they only print information to the console. For writing from stdout to a file I have found only this:

https://stackoverflow.com/questions/418896/how-to-redirect-output-to-a-file-and-stdout

That shows commands using the pipe operator (if I am correct) to write the otherwise only displayed output to a file. Am I misunderstanding this interaction? Or does the CodeEngine class do some behind-the-scenes operation to grab console output and pipe it to file.stdout?

Motivation: If basic support for a language is mostly about "printing information to the console" and not dealing with IO, then support for Haskell could simply be rewritten with Haskell console-printing commands (with conversion to a printable type), something along these lines (I have yet to test that):

haskell_template = '''
    setCurrentDirectory "{workingdir}"
    {body}
    putStrLn show {dependencies_delim}
    putStrLn show {created_delim}
    '''

etc ... But if there is "actual" file writing, then Haskell is a little more complicated to approach. I hope I am not making this overly complicated ...

Thank you for the suggestion about Rust; I will take a very close look at it. At a glance, rust_tex_utils looks pretty complicated, but I guess you meant the part inside the main function. The "Haskell Platform" comes with ghc, which allows execution in three modes:

  1. "standard" ghc compiler - similar to Rust, but I would like to avoid the limitation of the Rust support, which prohibits a main function in user code.
  2. ghci REPL - much like Julia, I guess.
  3. runghc - allows Haskell code to be executed as a "script", but the code must still contain a main function (so it eventually comes down to the same limitation as with ghc)

runghc might be the simplest case, but I am trying to think about how to implement ghc as such.

I have looked at Codebraid and certainly want to try it out, but I have an ongoing project already written in LaTeX, and I am not aware of any pandoc processing that would allow having part of a document in .tex format and part as .md. Is that possible? Actually, as far as I know, pandoc by itself should allow Haskell code to be integrated (and I think also executed), but I have never tried that. As for Jupyter, I have tried to use the IHaskell kernel, but as a Windows user I had to use WSL to be able to run it, and even then the kernel did not load into Jupyter.

I should add that programming is a hobby for me (which might be outright obvious), so I might be lacking some "common knowledge." Also, I am a Windows user (considering moving to Linux), so that also affects the magnitude of the issues I am dealing with.

Delanii avatar Jun 27 '20 15:06 Delanii

All of the printing in the templates is to the console (to stdout and to stderr), and then PythonTeX handles capturing those (currently, they are redirected to a file, but that is all handled by PythonTeX). I don't have any significant experience with Haskell, but it looks like putStrLn and hPutStrLn stderr will do what is needed.
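To make the capture step more concrete, here is a minimal Python sketch of the idea: run the assembled script as a subprocess, collect stdout and stderr, and split each stream on its delimiter so that output can be matched back to chunks. This is only an illustration under assumptions, not what PythonTeX does internally. The function run_and_split and the delimiter strings are hypothetical, and a small Python child process stands in for the temp file so the sketch runs without bash or ghc.

```python
import subprocess
import sys

# Hypothetical delimiter strings for this sketch only.
STDOUT_DELIM = "=>SKETCH#STDOUT#"
STDERR_DELIM = "=>SKETCH#STDERR#"

# Stand-in for an assembled temp file: two chunks, each of which
# writes its delimiters to stdout and stderr before running its code.
child_source = f'''
import sys
print("{STDOUT_DELIM}"); print("{STDERR_DELIM}", file=sys.stderr)
print("a")
print("{STDOUT_DELIM}"); print("{STDERR_DELIM}", file=sys.stderr)
print("b")
print("oops", file=sys.stderr)
'''


def run_and_split(source):
    """Run the source, capture both streams, and split them per chunk."""
    result = subprocess.run(
        [sys.executable, "-c", source], capture_output=True, text=True
    )
    # Everything before the first delimiter is discarded.
    out_chunks = result.stdout.split(STDOUT_DELIM + "\n")[1:]
    err_chunks = result.stderr.split(STDERR_DELIM + "\n")[1:]
    return out_chunks, err_chunks


out_chunks, err_chunks = run_and_split(child_source)
print(out_chunks)
print(err_chunks)
```

The user code only ever prints to the console. Deciding where that output ends up (files, macros) is left to the tool that launched the process, which is why a language only needs console printing for basic support.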

For using Rust as an example: You can ignore all of the utils code. That is only required if you want dependency tracking and other more advanced features. None of that is needed for basic functioning.

Currently, PythonTeX doesn't support running code through REPLs, so ghci probably won't be an option. There are REPL/console style modes for Python and Julia, but that is because there are special systems for running code that emulate REPL/console execution. It's not really using REPL/console. Actually using REPL/console is technically possible, but difficult to get right. I made some progress with basics a few years back, but never got very far and ran across many issues. I will probably get some limited REPL/console support in Codebraid soon.

Given how PythonTeX works, there probably isn't a way to avoid the prohibition on main() in user code. The start and end of main() will probably have to be in the overall template, as opposed to being entered by the user. Codebraid has a better system that allows you to set outside_main=true and handle everything yourself.

Depending on how complex your project is, it might be possible to send the .tex through pandoc to Markdown and then use that with Codebraid. Actually, another option might be to write some things in Markdown with Codebraid, then convert that to .tex with pandoc, and \input that into your current document.

Since Pandoc can input LaTeX, it should be possible to get Codebraid to work with LaTeX in addition to Markdown. I just haven't gotten that far yet.

Everything should work with Windows. I'm usually working under Windows myself.

gpoore avatar Jun 29 '20 20:06 gpoore

I have managed to put together basic support for Haskell as follows:

haskell_template = '''
    import System.Directory
    import System.IO

    main = do
         setCurrentDirectory "{workingdir}"
         {body}
         putStrLn "{dependencies_delim}"
         putStrLn "{created_delim}"
    '''

haskell_wrapper = '''
    putStrLn "{stdoutdelim}"
    hPutStrLn stderr "{stderrdelim}"
    {code}
    '''

haskell_sub = '''
    putStrLn "{field_delim}"
    putStrLn "{field}"
    '''

CodeEngine('haskell', 'haskell', '.hs',
           '{ghc} --make "{file}.hs"',
           haskell_template, haskell_wrapper, 'putStrLn {code}', haskell_sub,
           ['error', 'Error'], ['warning', 'Warning'],
           'line {number}')

SubCodeEngine('haskell', 'hs')

I also added the following to pythontex.sty at line 1377:

\ifstrequal{#1}{haskell}{\makepythontexfamily[pyglexer=haskell]{haskell}}{}%

With these settings, I very often get parse errors from ghc. They are usually caused by imports placed anywhere other than at the beginning of the script, or by incorrect indentation.

Consider almost the simplest Haskell code:

putStrLn "a"

With the settings above, what exactly is the content of the source file passed to ghc?

Something like:

import System.Directory
import System.IO

main = do
             setCurrentDirectory "{workingdir}"
             putStrLn "{stdoutdelim}"
             hPutStrLn stderr "{stderrdelim}"
             putStrLn "a"
             putStrLn "{dependencies_delim}"
             putStrLn "{created_delim}"

or something else? Does whitespace gobbling happen anywhere in the process? The errors I am getting suggest so, but I am not sure.

I am also looking to work more with pandoc (although it has limitations, since it supports only a subset of LaTeX) or ConTeXt, which could be more suitable for future projects. Still, I would like to put some more time into adding this support to PythonTeX as well, if I am up to it.

Delanii avatar Jul 19 '20 14:07 Delanii

You can use \usepackage[keeptemps]{pythontex} to keep all temp files in the pythontex-files-* directory. That way, you can see exactly what is being executed.

Just looking at this, one issue seems to be indentation. You want everything under main = do to be indented, but the template code (wrapper) and the code you are supplying yourself are not indented. To fix this, we would need to add an option to add indentation to all template and user code. That should be a straightforward process. I can look into adding that feature.
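For illustration, an indentation option along these lines could be sketched in Python with textwrap.indent. This is an assumption about how it might look, not an existing PythonTeX feature: the function assemble_haskell, the HEADER constant, and the delimiter strings are all hypothetical.

```python
import textwrap

# Hypothetical delimiter strings for this sketch only.
STDOUT_DELIM = "=>SKETCH#STDOUT#"
STDERR_DELIM = "=>SKETCH#STDERR#"

# Imports and the opening of main stay at column 0.
HEADER = "import System.Directory\nimport System.IO\n\nmain = do\n"


def assemble_haskell(chunks, workingdir=".", indent="    "):
    """Build a Haskell source in which everything under `main = do`
    shares one indentation level."""
    lines = [f'setCurrentDirectory "{workingdir}"']
    for chunk in chunks:
        lines.append(f'putStrLn "{STDOUT_DELIM}"')
        lines.append(f'hPutStrLn stderr "{STDERR_DELIM}"')
        # Normalize the user's own indentation before re-indenting.
        lines.append(textwrap.dedent(chunk).strip("\n"))
    lines.append('putStrLn "=>SKETCH#DEPENDENCIES#"')
    lines.append('putStrLn "=>SKETCH#CREATED#"')
    return HEADER + textwrap.indent("\n".join(lines), indent) + "\n"


source = assemble_haskell(['putStrLn "a"'])
print(source)
```

Because the wrapper lines and the user code are indented in a single pass, they end up in the same column, which is what the do block layout rule requires.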

If I am understanding correctly, import must always be used before function definitions, so import will never work within user code because that will be inside main. There are a few ways to try to work around that.

  • You could add more imports to the template. But that will always limit things somewhat.
  • Otherwise, the overall code execution system would need some sort of modification to allow code to be inserted before main.
    • It might be possible to look at the first chunk of code and relocate every line that starts with import to before main. I did something similar with Python to deal with imports from __future__. Although at that point it seems like we're starting to create something that isn't exactly Haskell anymore.
    • Another option would be having a way to insert code into the template before main. Codebraid has limited capabilities for this sort of thing, but PythonTeX doesn't at this point. That is definitely doable, it would just require more work.
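The import relocation idea from the list above can be sketched in a few lines of Python. This is a naive, line-based heuristic for illustration only; hoist_imports is a hypothetical helper, not part of PythonTeX, and it would misfire on, for example, a multi-line string that happens to contain a line starting with import.

```python
def hoist_imports(chunk):
    """Split a chunk into import lines (to be placed before main)
    and the remaining lines (to stay inside main)."""
    imports, body = [], []
    for line in chunk.splitlines():
        if line.lstrip().startswith("import "):
            imports.append(line.strip())
        else:
            body.append(line)
    return imports, body


imports, body = hoist_imports(
    "import Data.List (sort)\nprint (sort [3, 1, 2])"
)
print(imports)
print(body)
```

The returned imports would be appended to the imports already in the template, and only the remaining lines would be indented under main.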

Regarding Pandoc: Recent versions now have raw inlines and raw blocks, so I believe that removes most LaTeX limitations. For example, `\LaTeX`{=latex} gets passed straight through to LaTeX without modification, and the same thing is possible for code blocks by starting with ```{=latex}. With what you have already put together here, I could probably add Haskell support for Codebraid (run Haskell in Pandoc Markdown) relatively easily if that would be helpful.

gpoore avatar Jul 19 '20 15:07 gpoore

Pandoc already supports literate Haskell by means of an extension. I think there is no substantial difference between executing or compiling literate Haskell and "normal" Haskell.

My question was motivated mostly by the option of adding Haskell support to PythonTeX. As you have written, that seems sensible only if there is an option to put code outside of the main function. Adding indentation should be doable, as should moving import declarations around, but in Haskell most coding happens outside of the main function, I believe.

Are you interested in adding a feature to allow code outside of main? I know you have already mentioned that this would be desirable for Rust as well. How could I help with that?

That said, Codebraid and pandoc itself are definitely an option, even though they are harder to automate (with PythonTeX, I can now simply leave compilation to arara and check the result after an hour ...)

Delanii avatar Jul 20 '20 14:07 Delanii

Sorry for the delay in responding. I'm back to in-person teaching combined with hybrid/online in some cases to handle quarantined students, and that's severely limiting my time for software projects. If you can come up with a way to get Haskell working with PythonTeX without too many changes, I'm happy to accept a pull request. Otherwise, I may be able to think about this again in a few months. My eventual goal is to replace the code execution part of PythonTeX with the code execution part of Codebraid, and if I ever have some time to do that, then supporting Haskell should be trivial.

gpoore avatar Sep 02 '20 02:09 gpoore

It's all right, these are weird and complicated times. I had actually taken this issue as postponed after your previous comment. With your latest reply, I am even more inclined to leave this issue as it is for now and, when I need Haskell, to use Codebraid until PythonTeX has the same capabilities. I have actually started to migrate my new projects to pandoc processing (and with that, in time, to use Codebraid over PythonTeX when suitable), but it takes some time ... From time to time I also look at the Codebraid issue page, and there are some wonderful things in motion, so for the time being I might even use Codebraid more than PythonTeX. I will watch both projects closely and try to help whenever possible (and able), and after the PythonTeX update I will look into adding Haskell to its family of supported languages.

Delanii avatar Sep 03 '20 18:09 Delanii