
Collect tokens from ArXiv LaTeX

Open rodneykinney opened this issue 1 year ago • 3 comments

RedPajama includes data sourced from ArXiv

An alternative is unArXive

S2's LaTex dumps from ArXiv are in s3://ai2-s2-scholarphi-pipeline-prod/daq/arxiv-source-data/bymonth/

@kyleclo @soldni

rodneykinney commented May 08 '23 20:05

RedPajama's code produces raw LaTeX. It does some cleaning, but the output is mostly un-parsed, and the bibliography is discarded.
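
For a sense of what that cleaning looks like, here is an illustrative sketch of that kind of pass (not RedPajama's actual code):

import re

def light_clean(latex):
    """Illustrative RedPajama-style pass: strip comments, drop the bibliography."""
    # Remove unescaped % comments through end of line.
    latex = re.sub(r"(?<!\\)%.*", "", latex)
    # Discard the bibliography environment wholesale.
    latex = re.sub(
        r"\\begin\{thebibliography\}.*?\\end\{thebibliography\}",
        "",
        latex,
        flags=re.DOTALL,
    )
    return latex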

UnArXive uses tralics, a third-party C++ tool that translates LaTeX into XML. The unArXive code parses the XML into an S2ORC-like format. The bibliography is included. Math gets converted into a mixture of MathML and TeX expressions:

<formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mi>&#x1D433;</mi><mo>&#x02208;</mo><mi>&#x1D4B5;</mi></mrow></math><texmath>{\mathbf {z}}\in \mathcal {Z}</texmath></formula>

It looks like the math expressions are given in both MathML and TeX formats, so you can choose either one.
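
For example, pulling out the TeX form takes only a couple of lines with stdlib XML parsing; a quick sketch against the fragment above:

import xml.etree.ElementTree as ET

fragment = (
    "<formula type='inline'>"
    "<math xmlns='http://www.w3.org/1998/Math/MathML'>"
    "<mrow><mi>&#x1D433;</mi><mo>&#x02208;</mo><mi>&#x1D4B5;</mi></mrow>"
    "</math>"
    "<texmath>{\\mathbf {z}}\\in \\mathcal {Z}</texmath>"
    "</formula>"
)

formula = ET.fromstring(fragment)
# Take the TeX form; formula.find("{*}math") would give the MathML instead.
print(formula.find("texmath").text)  # {\mathbf {z}}\in \mathcal {Z}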

rodneykinney commented May 08 '23 20:05

The math processing feels like a wash to me, but the XML format seems more useful if you want to produce natural language. You also get control over what to do with figures, etc.

RedPajama example:

Finally, in the \emph{Multi-Task Aggregation} stage, the different policies are integrated into a multi-task controller that can be directed using language commands to perform a specific task using a desired skill.

\begin{figure}
    \centering
    \includegraphics[width=0.9\columnwidth]{figures/overview_v2.png}
    \caption{The PADL framework consists of three stages. 1) In the Skill Embedding stage, a dataset of motion clips and corresponding text captions are used to learn a joint embedding of motions and captions. 2) In the Policy Training stage, the learned skill embedding is used to train a collection of policies to perform various tasks, while imitating behaviors in the dataset. 3) Finally, in the Multi-Task Aggregation stage, policies trained for different tasks are combined into a multi-task controller that can be directed to perform different tasks and skills via language commands.}
    \vspace{-0.4cm}
    \label{fig:overview}
\end{figure}

\section{Skill Embedding}
\label{sec:skill-embedding}

In the Skill Embedding stage, our objective is to construct an embedding space that aligns motions with their corresponding natural language descriptions. To do this, we follow a similar procedure as MotionCLIP \citep{MotionClipTevet2022}, where a transformer autoencoder is trained to encode motion sequences into a latent representation that ``aligns'' with the language embedding from a pre-trained CLIP text encoder \citep{ClipRadford2021}. Given a motion clip $\hat{{\mathbf{m}}} = (\hat{{\mathbf{q}}}_1, ..., \hat{{\mathbf{q}}}_n)$ and its caption $c$, a motion encoder ${\mathbf{z}} = \mathrm{Enc}_m(\hat{{\mathbf{m}}})$ maps the motion to an embedding ${\mathbf{z}}$. The embedding is normalized to lie on a unit sphere $||{\mathbf{z}}|| = 1$. Following~\citet{MotionClipTevet2022}, $\mathrm{Enc}_m\left({\mathbf{m}} \right)$ is modeled by a bidirectional transformer \citep{bert2018}. A motion decoder is jointly trained with the encoder to produce a reconstruction sequence ${\mathbf{m}} = ({\mathbf{q}}_1, ..., {\mathbf{q}}_n)$ to recover $\hat{{\mathbf{m}}}$ from ${\mathbf{z}}$. The decoder is also modelled as a birectional transformer ${\mathbf{m}} = \mathrm{Dec}({\mathbf{z}}, {\mathbf{U}})$, which decodes all frames of in parallel using a learned constant query sequence ${\mathbf{U}} = ({\mathbf{u}}_1, ..., {\mathbf{u}}_n)$, similar to the final layer of \citet{detr}. The autoencoder is trained with the loss:


\begin{align}
\mc{L}_{\text{auto}} = \mc{L}_{\text{recon}} + 0.1\mc{L}_{\text{align}} .
\end{align}

Equivalent unArXive example:

Finally, in the <hi rend='it'>Multi-Task Aggregation</hi> stage, the different policies are integrated into a multi-task controller that can be directed using language commands to perform a specific task using a desired skill.</p>
<figure width='384.2974pt' file='figures/overview_v2' extension='png' id-text='1' id='uid6'><head>The PADL framework consists of three stages. 1) In the Skill Embedding stage, a dataset of motion clips and corresponding text captions are used to learn a joint embedding of motions and captions. 2) In the Policy Training stage, the learned skill embedding is used to train a collection of policies to perform various tasks, while imitating behaviors in the dataset. 3) Finally, in the Multi-Task Aggregation stage, policies trained for different tasks are combined into a multi-task controller that can be directed to perform different tasks and skills via language commands.</head>
</figure>
</div0>
<div0 id-text='5' id='cid5'><head>Skill Embedding</head>
<p>In the Skill Embedding stage, our objective is to construct an embedding space that aligns motions with their corresponding natural language descriptions. To do this, we follow a similar procedure as MotionCLIP MotionClipTevet2022, where a transformer autoencoder is trained to encode motion sequences into a latent representation that “aligns” with the language embedding from a pre-trained CLIP text encoder ClipRadford2021. Given a motion clip <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mover accent='true'><mi>&#x1D426;</mi> <mo>&#x5E;</mo></mover><mo>=</mo><mrow><mo>(</mo><msub><mover accent='true'><mi>&#x1D42A;</mi> <mo>&#x5E;</mo></mover> <mn>1</mn> </msub><mo>,</mo><mo>.</mo><mo>.</mo><mo>.</mo><mo>,</mo><msub><mover accent='true'><mi>&#x1D42A;</mi> <mo>&#x5E;</mo></mover> <mi>n</mi> </msub><mo>)</mo></mrow></mrow></math><texmath>\hat{{\mathbf {m}}} = (\hat{{\mathbf {q}}}_1, ..., \hat{{\mathbf {q}}}_n)</texmath></formula> and its caption <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mi>c</mi></math><texmath>c</texmath></formula>, a motion encoder <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mi>&#x1D433;</mi><mo>=</mo><msub><mi> Enc </mi> <mi>m</mi> </msub><mrow><mo>(</mo><mover accent='true'><mi>&#x1D426;</mi> <mo>&#x5E;</mo></mover><mo>)</mo></mrow></mrow></math><texmath>{\mathbf {z}}= \mathrm {Enc}_m(\hat{{\mathbf {m}}})</texmath></formula> maps the motion to an embedding <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mi>&#x1D433;</mi></math><texmath>{\mathbf {z}}</texmath></formula>. The embedding is normalized to lie on a unit sphere <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mo>|</mo><mo>|</mo><mi>&#x1D433;</mi><mo>|</mo><mo>|</mo><mo>=</mo><mn>1</mn></mrow></math><texmath>||{\mathbf {z}}|| = 1</texmath></formula>. Following MotionClipTevet2022, <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><msub><mi> Enc </mi> <mi>m</mi> </msub><mfenced open='(' close=')'><mi>&#x1D426;</mi></mfenced></mrow></math><texmath>\mathrm {Enc}_m\left({\mathbf {m}}\right)</texmath></formula> is modeled by a bidirectional transformer bert2018. A motion decoder is jointly trained with the encoder to produce a reconstruction sequence <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mi>&#x1D426;</mi><mo>=</mo><mo>(</mo><msub><mi>&#x1D42A;</mi> <mn>1</mn> </msub><mo>,</mo><mo>.</mo><mo>.</mo><mo>.</mo><mo>,</mo><msub><mi>&#x1D42A;</mi> <mi>n</mi> </msub><mo>)</mo></mrow></math><texmath>{\mathbf {m}}= ({\mathbf {q}}_1, ..., {\mathbf {q}}_n)</texmath></formula> to recover <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mover accent='true'><mi>&#x1D426;</mi> <mo>&#x5E;</mo></mover></math><texmath>\hat{{\mathbf {m}}}</texmath></formula> from <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mi>&#x1D433;</mi></math><texmath>{\mathbf {z}}</texmath></formula>. 
The decoder is also modelled as a birectional transformer <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mi>&#x1D426;</mi><mo>=</mo><mi> Dec </mi><mo>(</mo><mi>&#x1D433;</mi><mo>,</mo><mi>&#x1D414;</mi><mo>)</mo></mrow></math><texmath>{\mathbf {m}}= \mathrm {Dec}({\mathbf {z}}, {\mathbf {U}})</texmath></formula>, which decodes all frames of in parallel using a learned constant query sequence <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mi>&#x1D414;</mi><mo>=</mo><mo>(</mo><msub><mi>&#x1D42E;</mi> <mn>1</mn> </msub><mo>,</mo><mo>.</mo><mo>.</mo><mo>.</mo><mo>,</mo><msub><mi>&#x1D42E;</mi> <mi>n</mi> </msub><mo>)</mo></mrow></math><texmath>{\mathbf {U}}= ({\mathbf {u}}_1, ..., {\mathbf {u}}_n)</texmath></formula>, similar to the final layer of detr. The autoencoder is trained with the loss:</p>
<formula id-text='2' id='uid7' textype='align' type='display'><math mode='display' xmlns='http://www.w3.org/1998/Math/MathML'><mtable displaystyle='true'><mtr><mtd columnalign='right'><mrow><msub><mi>&#x2112;</mi> <mtext>auto</mtext> </msub><mo>=</mo><msub><mi>&#x2112;</mi> <mtext>recon</mtext> </msub><mo>+</mo><mn>0</mn><mo>.</mo><mn>1</mn><msub><mi>&#x2112;</mi> <mtext>align</mtext> </msub><mo>.</mo></mrow></mtd></mtr></mtable></math><texmath>
\mathcal {L}_{\text{auto}} = \mathcal {L}_{\text{recon}} + 0.1\mathcal {L}_{\text{align}} .
</texmath></formula>
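
Flattening that XML into natural language is then fairly mechanical. A rough sketch (it assumes a well-formed document rather than the truncated fragment above, and is not the unArXive code):

from lxml import etree

def flatten(root):
    """Flatten tralics-style XML: inline the TeX for formulas, drop figures."""
    for formula in list(root.iter("formula")):
        # Swap each <formula> for its TeX source, keeping any trailing text.
        # <span> is just a text carrier, not part of the tralics schema.
        stub = etree.Element("span")
        stub.text = (formula.findtext("texmath") or "").strip()
        stub.tail = formula.tail
        formula.getparent().replace(formula, stub)
    # Figures are simply discarded here; one could keep <head> captions instead.
    etree.strip_elements(root, "figure", with_tail=False)
    return " ".join("".join(root.itertext()).split())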

rodneykinney commented May 08 '23 21:05

Another third-party tool, pandoc, gives similar results:

  Finally, in the <italic>Multi-Task Aggregation</italic> stage, the
  different policies are integrated into a multi-task controller that
  can be directed using language commands to perform a specific task
  using a desired skill.</p>
  <fig id="fig:overview">
    <caption><p>The PADL framework consists of three stages. 1) In the
    Skill Embedding stage, a dataset of motion clips and corresponding
    text captions are used to learn a joint embedding of motions and
    captions. 2) In the Policy Training stage, the learned skill
    embedding is used to train a collection of policies to perform
    various tasks, while imitating behaviors in the dataset. 3) Finally,
    in the Multi-Task Aggregation stage, policies trained for different
    tasks are combined into a multi-task controller that can be directed
    to perform different tasks and skills via language
    commands.</p></caption>
    <graphic mimetype="image" mime-subtype="png" xlink:href="figures/overview_v2.png" xlink:title="" />
  </fig>
  <p><milestone-start id="fig:overview" />[fig:overview]<milestone-end /></p>
</sec>
<sec id="sec:skill-embedding">
  <title>Skill Embedding</title>
  <p>In the Skill Embedding stage, our objective is to construct an
  embedding space that aligns motions with their corresponding natural
  language descriptions. To do this, we follow a similar procedure as
  MotionCLIP , where a transformer autoencoder is trained to encode
  motion sequences into a latent representation that “aligns” with the
  language embedding from a pre-trained CLIP text encoder . Given a
  motion clip <inline-formula><alternatives>
  <tex-math><![CDATA[\hat{{\mathbf{m}}} = (\hat{{\mathbf{q}}}_1, ..., \hat{{\mathbf{q}}}_n)]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mover><mml:mstyle mathvariant="bold"><mml:mi>𝐦</mml:mi></mml:mstyle><mml:mo accent="true">̂</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mo stretchy="false" form="prefix">(</mml:mo><mml:msub><mml:mover><mml:mstyle mathvariant="bold"><mml:mi>𝐪</mml:mi></mml:mstyle><mml:mo accent="true">̂</mml:mo></mml:mover><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi>.</mml:mi><mml:mi>.</mml:mi><mml:mi>.</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mover><mml:mstyle mathvariant="bold"><mml:mi>𝐪</mml:mi></mml:mstyle><mml:mo accent="true">̂</mml:mo></mml:mover><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false" form="postfix">)</mml:mo></mml:mrow></mml:math></alternatives></inline-formula>
  and its caption <inline-formula><alternatives>
  <tex-math><![CDATA[c]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mi>c</mml:mi></mml:math></alternatives></inline-formula>,
  a motion encoder <inline-formula><alternatives>
  <tex-math><![CDATA[{\mathbf{z}}= \mathrm{Enc}_m(\hat{{\mathbf{m}}})]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>𝐳</mml:mi></mml:mstyle><mml:mo>=</mml:mo><mml:msub><mml:mstyle mathvariant="normal"><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi></mml:mstyle><mml:mi>m</mml:mi></mml:msub><mml:mo stretchy="false" form="prefix">(</mml:mo><mml:mover><mml:mstyle mathvariant="bold"><mml:mi>𝐦</mml:mi></mml:mstyle><mml:mo accent="true">̂</mml:mo></mml:mover><mml:mo stretchy="false" form="postfix">)</mml:mo></mml:mrow></mml:math></alternatives></inline-formula>
  maps the motion to an embedding <inline-formula><alternatives>
  <tex-math><![CDATA[{\mathbf{z}}]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mstyle mathvariant="bold"><mml:mi>𝐳</mml:mi></mml:mstyle></mml:math></alternatives></inline-formula>.
  The embedding is normalized to lie on a unit sphere
  <inline-formula><alternatives>
  <tex-math><![CDATA[||{\mathbf{z}}|| = 1]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mo stretchy="false" form="prefix">|</mml:mo><mml:mo stretchy="false" form="prefix">|</mml:mo><mml:mstyle mathvariant="bold"><mml:mi>𝐳</mml:mi></mml:mstyle><mml:mo stretchy="false" form="prefix">|</mml:mo><mml:mo stretchy="false" form="prefix">|</mml:mo><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math></alternatives></inline-formula>.
  Following , <inline-formula><alternatives>
  <tex-math><![CDATA[\mathrm{Enc}_m\left({\mathbf{m}}\right)]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:msub><mml:mstyle mathvariant="normal"><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi></mml:mstyle><mml:mi>m</mml:mi></mml:msub><mml:mrow><mml:mo stretchy="true" form="prefix">(</mml:mo><mml:mstyle mathvariant="bold"><mml:mi>𝐦</mml:mi></mml:mstyle><mml:mo stretchy="true" form="postfix">)</mml:mo></mml:mrow></mml:mrow></mml:math></alternatives></inline-formula>
  is modeled by a bidirectional transformer . A motion decoder is
  jointly trained with the encoder to produce a reconstruction sequence
  <inline-formula><alternatives>
  <tex-math><![CDATA[{\mathbf{m}}= ({\mathbf{q}}_1, ..., {\mathbf{q}}_n)]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>𝐦</mml:mi></mml:mstyle><mml:mo>=</mml:mo><mml:mo stretchy="false" form="prefix">(</mml:mo><mml:msub><mml:mstyle mathvariant="bold"><mml:mi>𝐪</mml:mi></mml:mstyle><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi>.</mml:mi><mml:mi>.</mml:mi><mml:mi>.</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant="bold"><mml:mi>𝐪</mml:mi></mml:mstyle><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false" form="postfix">)</mml:mo></mml:mrow></mml:math></alternatives></inline-formula>
  to recover <inline-formula><alternatives>
  <tex-math><![CDATA[\hat{{\mathbf{m}}}]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mover><mml:mstyle mathvariant="bold"><mml:mi>𝐦</mml:mi></mml:mstyle><mml:mo accent="true">̂</mml:mo></mml:mover></mml:math></alternatives></inline-formula>
  from <inline-formula><alternatives>
  <tex-math><![CDATA[{\mathbf{z}}]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mstyle mathvariant="bold"><mml:mi>𝐳</mml:mi></mml:mstyle></mml:math></alternatives></inline-formula>.
  The decoder is also modelled as a birectional transformer
  <inline-formula><alternatives>
  <tex-math><![CDATA[{\mathbf{m}}= \mathrm{Dec}({\mathbf{z}}, {\mathbf{U}})]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>𝐦</mml:mi></mml:mstyle><mml:mo>=</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>D</mml:mi><mml:mi>e</mml:mi><mml:mi>c</mml:mi></mml:mstyle><mml:mo stretchy="false" form="prefix">(</mml:mo><mml:mstyle mathvariant="bold"><mml:mi>𝐳</mml:mi></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mi>𝐔</mml:mi></mml:mstyle><mml:mo stretchy="false" form="postfix">)</mml:mo></mml:mrow></mml:math></alternatives></inline-formula>,
  which decodes all frames of in parallel using a learned constant query
  sequence <inline-formula><alternatives>
  <tex-math><![CDATA[{\mathbf{U}}= ({\mathbf{u}}_1, ..., {\mathbf{u}}_n)]]></tex-math>
  <mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>𝐔</mml:mi></mml:mstyle><mml:mo>=</mml:mo><mml:mo stretchy="false" form="prefix">(</mml:mo><mml:msub><mml:mstyle mathvariant="bold"><mml:mi>𝐮</mml:mi></mml:mstyle><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi>.</mml:mi><mml:mi>.</mml:mi><mml:mi>.</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant="bold"><mml:mi>𝐮</mml:mi></mml:mstyle><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false" form="postfix">)</mml:mo></mml:mrow></mml:math></alternatives></inline-formula>,
  similar to the final layer of . The autoencoder is trained with the
  loss:</p>
  <p><disp-formula><alternatives>
  <tex-math><![CDATA[\begin{aligned}
  \mathcal{L}_{\text{auto}} = \mathcal{L}_{\text{recon}} + 0.1\mathcal{L}_{\text{align}} .\end{aligned}]]></tex-math>
  <mml:math display="block" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mtable><mml:mtr><mml:mtd columnalign="right"><mml:msub><mml:mstyle mathvariant="script"><mml:mi>ℒ</mml:mi></mml:mstyle><mml:mtext mathvariant="normal">auto</mml:mtext></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mstyle mathvariant="script"><mml:mi>ℒ</mml:mi></mml:mstyle><mml:mtext mathvariant="normal">recon</mml:mtext></mml:msub><mml:mo>+</mml:mo><mml:mn>0.1</mml:mn><mml:msub><mml:mstyle mathvariant="script"><mml:mi>ℒ</mml:mi></mml:mstyle><mml:mtext mathvariant="normal">align</mml:mtext></mml:msub><mml:mi>.</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></alternatives></disp-formula>

The main difference I see between pandoc and tralics is the handling of inline citations (\cite{abc}, etc.). tralics inserts the reference ID ("modeled by a bidirectional transformer bert2018"), while pandoc drops it entirely ("modeled by a bidirectional transformer ."). pandoc produces XML in the JATS schema, while tralics emits what appears to be a custom format.
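
The pandoc JATS output above comes from something like the following invocation (paper.tex is a placeholder, and the exact flags are a guess):

import subprocess

# paper.tex is a placeholder input; -t jats selects pandoc's JATS writer.
subprocess.run(
    ["pandoc", "paper.tex", "-f", "latex", "-t", "jats", "-o", "paper.xml"],
    check=True,
)

Passing --citeproc along with a bibliography file should make pandoc render citations inline instead of dropping them, at the cost of baking a particular citation style into the text.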

rodneykinney commented May 08 '23 21:05

Marking the items prior to Feb 29th as "closed".

dumitrac commented Apr 30 '24 20:04