Improve picking main .tex file from a directory

Open brienna opened this issue 6 years ago • 7 comments

Some submissions such as 0706.2986 fail to render, because engrafo currently cannot pick the right .tex file to use as the main .tex file. Its current criteria are contained in src/converter/io.js:

// Pick a main .tex file from a directory
async function pickLatexFile(dir) {
  if (dir.endsWith(".tex")) {
    return dir;
  }
  const files = await fs.readdir(dir);
  if (files.includes("ms.tex")) {
    return path.join(dir, "ms.tex");
  }
  if (files.includes("main.tex")) {
    return path.join(dir, "main.tex");
  }
  const texPaths = files.filter(f => f.endsWith(".tex"));
  if (texPaths.length === 0) {
    throw new Error("No .tex files found");
  }
  if (texPaths.length === 1) {
    return path.join(dir, texPaths[0]);
  }
  let docCandidates = [];
  for (let p of texPaths) {
    let data = await fs.readFile(path.join(dir, p));
    if (data && data.includes("\\documentclass")) {
      docCandidates.push(p);
    }
  }
  if (docCandidates.length === 0) {
    throw new Error("No .tex files with \\documentclass or \\documentstyle found");
  }

  if (docCandidates.length === 1) {
    return path.join(dir, docCandidates[0]);
  }

  let bblCandidates = [];
  for (let p of docCandidates) {
    let bbl = p.replace(".tex", ".bbl");
    if (await fs.pathExists(path.join(dir, bbl))) {
      bblCandidates.push(p);
    }
  }

  if (bblCandidates.length > 1) {
    throw new Error(
      `Ambiguous LaTeX path (${bblCandidates.length} candidates)`
    );
  }
  return bblCandidates[0];
}

0706.2986 has two .tex files. The first .tex file, psfig.tex, is not the main .tex file, but it contains the following line:

% To use with LaTeX, use \documentstyle[psfig,...]{...}

Engrafo will flag this as a potential candidate, along with the second .tex file townes_arXiv.tex (which is the real main .tex file). Since this submission contains no .bbl file to help the code clarify which candidate is the main .tex file, the render fails.

I propose that we use a regex that matches a line containing \documentclass or \documentstyle, but not when that line begins with the comment character %. Such a regex might look like (?m)^(?!%)(?:.*\\\\document(?:class|style).*).

0902.1226, another submission that fails to render, has a similar problem: an incorrect candidate is chosen because it contains a \documentclass tag, even though that tag is not at the beginning of the line. A better criterion might be to match only a \documentclass or \documentstyle tag that begins a line. This would take care of both submissions.
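
To make the line-start criterion concrete, here is a minimal sketch in JavaScript. The names DOC_DECLARATION and hasDocumentDeclaration are hypothetical and not part of engrafo. It assumes that leading spaces or tabs before the tag are acceptable, which goes slightly beyond "begins the line". Note that JavaScript regular expressions take multiline mode as the m flag rather than an inline (?m) prefix.

```javascript
// Hypothetical helper, not existing engrafo code.
// Matches \documentclass or \documentstyle only when it is the first
// non-whitespace token on a line. A line starting with % cannot match,
// and neither can a tag that appears in the middle of a line.
const DOC_DECLARATION = /^[ \t]*\\document(?:class|style)\b/m;

function hasDocumentDeclaration(source) {
  // The regex has no g flag, so test() carries no state between calls.
  return DOC_DECLARATION.test(source);
}

// The commented-out mention from psfig.tex is rejected.
const psfigLine = "% To use with LaTeX, use \\documentstyle[psfig,...]{...}";
console.log(hasDocumentDeclaration(psfigLine));

// A real preamble is accepted.
const preamble = "% my paper\n\\documentclass[12pt]{article}\n\\begin{document}";
console.log(hasDocumentDeclaration(preamble));
```

In pickLatexFile, fs.readFile is called without an encoding and so returns a Buffer; the contents would need to be converted first, for example with data.toString("utf8"), before being passed to a helper like this.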

brienna avatar Feb 24 '19 06:02 brienna

Sounds good to me! LaTeXML also has some similar logic, which is probably more battle-hardened than ours. I forget where it is, but perhaps we could switch to using that, or just copy their logic.

bfirsh avatar Feb 24 '19 15:02 bfirsh

The LaTeXML logic for finding the main .tex file is in this file, in the unpack_source subroutine. One interesting condition is that it vetoes candidates that are arguments of \input macros. It currently only unpacks ZIP archives, though.
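
The \input veto could be sketched in JavaScript roughly as follows. This is not a port of LaTeXML's code, only an illustration of the idea. The function names (stripComments, inputTargets, vetoInputTargets) are hypothetical. It assumes file contents have already been read into a Map of file name to string, and it only handles the braced form \input{name}.

```javascript
// Hypothetical helpers, not taken from LaTeXML or engrafo.

// Drop everything from an unescaped % to the end of each line, so that
// commented-out \input lines are ignored.
function stripComments(source) {
  return source.replace(/(?<!\\)%.*$/gm, "");
}

// Collect the file names referenced by \input{...}, adding the .tex
// extension when the argument omits it.
function inputTargets(source) {
  const targets = new Set();
  const pattern = /\\input\s*\{([^}]+)\}/g;
  for (const match of stripComments(source).matchAll(pattern)) {
    const name = match[1].trim();
    targets.add(name.endsWith(".tex") ? name : name + ".tex");
  }
  return targets;
}

// Given a Map of file name -> contents, return the names of the files
// that no other file pulls in via \input.
function vetoInputTargets(sources) {
  const included = new Set();
  for (const text of sources.values()) {
    for (const target of inputTargets(text)) {
      included.add(target);
    }
  }
  return [...sources.keys()].filter(name => !included.has(name));
}

const sources = new Map([
  ["main.tex", "\\documentclass{article}\n\\input{intro}\n% \\input{old}\n"],
  ["intro.tex", "\\documentclass{article} leftover preamble"],
  ["old.tex", "unused text"],
]);
console.log(vetoInputTargets(sources));
```

A veto like this would run on docCandidates before the .bbl check. The sketch compares names literally, so \input arguments containing subdirectories or the unbraced form \input name would need extra handling.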

brienna avatar Feb 24 '19 16:02 brienna

Can you assign this work to me? I would like to work on it. @brienna

Aryansingh0103 avatar Mar 03 '23 05:03 Aryansingh0103

Is anyone working on this issue? Can you assign it to me?

siddiksawani avatar May 27 '23 15:05 siddiksawani

I would guess it's still open. Perhaps @bfirsh has the ability to assign people to the issue.

brienna avatar Jun 14 '23 15:06 brienna

Hi. Is this issue still unresolved? I am new to Open Source but would love to give this a try.

MohamedAITALLA avatar Sep 07 '23 23:09 MohamedAITALLA