OLMo
Collect tokens from ArXiv LaTeX
RedPajama includes data sourced from ArXiv
An alternative is unArXive
S2's LaTeX dumps from ArXiv are in s3://ai2-s2-scholarphi-pipeline-prod/daq/arxiv-source-data/bymonth/
@kyleclo @soldni
RedPajama's code produces raw LaTeX. It does some cleaning, but the text is mostly un-parsed, and the bibliography is discarded.
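For a sense of what that kind of light cleaning looks like, here's a minimal sketch. This is my own approximation, not RedPajama's actual code; the regexes and the directory layout are assumptions.

import re
from pathlib import Path

def clean_latex(source: str) -> str:
    """Light LaTeX cleaning: strip comments and the bibliography, keep the rest raw."""
    # Drop line comments (rough approximation: unescaped % to end of line).
    source = re.sub(r"(?<!\\)%.*", "", source)
    # Drop the bibliography environment entirely.
    source = re.sub(r"\\begin\{thebibliography\}.*?\\end\{thebibliography\}",
                    "", source, flags=re.DOTALL)
    # Drop \bibliography{...} / \bibliographystyle{...} commands.
    source = re.sub(r"\\bibliography(style)?\{[^}]*\}", "", source)
    return source

if __name__ == "__main__":
    # "arxiv-source" is a hypothetical directory of unpacked submissions.
    for tex_file in Path("arxiv-source").rglob("*.tex"):
        print(clean_latex(tex_file.read_text(errors="ignore"))[:500])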
unArXive uses tralics, a third-party C++ tool that translates LaTeX into XML. The unArXive code then parses the XML into an S2ORC-like format. The bibliography is included. Math gets converted into a mixture of MathML and TeX expressions, for example:
<formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mi>𝐳</mi><mo>∈</mo><mi>𝒵</mi></mrow></math><texmath>{\mathbf {z}}\in \mathcal {Z}</texmath></formula>
It looks like the math expressions are given in both MathML and TeX formats, so you can choose either one.
The math processing feels like a wash to me, but the XML format seems more useful if you want to produce natural language. You also get control over what to do with figures, etc.
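Since the formula elements carry both representations, picking one is a small XML-parsing step. A sketch with xml.etree, using the snippet above as input (the element names match what tralics produces here; I haven't checked the rest of its schema):

import xml.etree.ElementTree as ET

MATHML_NS = "{http://www.w3.org/1998/Math/MathML}"

snippet = (
    "<formula type='inline'>"
    "<math xmlns='http://www.w3.org/1998/Math/MathML'>"
    "<mrow><mi>𝐳</mi><mo>∈</mo><mi>𝒵</mi></mrow></math>"
    "<texmath>{\\mathbf {z}}\\in \\mathcal {Z}</texmath></formula>"
)

formula = ET.fromstring(snippet)

# Option 1: keep the original TeX expression.
tex = formula.findtext("texmath")

# Option 2: keep the MathML subtree instead.
mathml = ET.tostring(formula.find(f"{MATHML_NS}math"), encoding="unicode")

print(tex)     # {\mathbf {z}}\in \mathcal {Z}
print(mathml)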
RedPajama example:
Finally, in the \emph{Multi-Task Aggregation} stage, the different policies are integrated into a multi-task controller that can be directed using language commands to perform a specific task using a desired skill.
\begin{figure}
\centering
\includegraphics[width=0.9\columnwidth]{figures/overview_v2.png}
\caption{The PADL framework consists of three stages. 1) In the Skill Embedding stage, a dataset of motion clips and corresponding text captions are used to learn a joint embedding of motions and captions. 2) In the Policy Training stage, the learned skill embedding is used to train a collection of policies to perform various tasks, while imitating behaviors in the dataset. 3) Finally, in the Multi-Task Aggregation stage, policies trained for different tasks are combined into a multi-task controller that can be directed to perform different tasks and skills via language commands.}
\vspace{-0.4cm}
\label{fig:overview}
\end{figure}
\section{Skill Embedding}
\label{sec:skill-embedding}
In the Skill Embedding stage, our objective is to construct an embedding space that aligns motions with their corresponding natural language descriptions. To do this, we follow a similar procedure as MotionCLIP \citep{MotionClipTevet2022}, where a transformer autoencoder is trained to encode motion sequences into a latent representation that ``aligns'' with the language embedding from a pre-trained CLIP text encoder \citep{ClipRadford2021}. Given a motion clip $\hat{{\mathbf{m}}} = (\hat{{\mathbf{q}}}_1, ..., \hat{{\mathbf{q}}}_n)$ and its caption $c$, a motion encoder ${\mathbf{z}} = \mathrm{Enc}_m(\hat{{\mathbf{m}}})$ maps the motion to an embedding ${\mathbf{z}}$. The embedding is normalized to lie on a unit sphere $||{\mathbf{z}}|| = 1$. Following~\citet{MotionClipTevet2022}, $\mathrm{Enc}_m\left({\mathbf{m}} \right)$ is modeled by a bidirectional transformer \citep{bert2018}. A motion decoder is jointly trained with the encoder to produce a reconstruction sequence ${\mathbf{m}} = ({\mathbf{q}}_1, ..., {\mathbf{q}}_n)$ to recover $\hat{{\mathbf{m}}}$ from ${\mathbf{z}}$. The decoder is also modelled as a birectional transformer ${\mathbf{m}} = \mathrm{Dec}({\mathbf{z}}, {\mathbf{U}})$, which decodes all frames of in parallel using a learned constant query sequence ${\mathbf{U}} = ({\mathbf{u}}_1, ..., {\mathbf{u}}_n)$, similar to the final layer of \citet{detr}. The autoencoder is trained with the loss:
\begin{align}
\mc{L}_{\text{auto}} = \mc{L}_{\text{recon}} + 0.1\mc{L}_{\text{align}} .
\end{align}
Equivalent unArXive example:
Finally, in the <hi rend='it'>Multi-Task Aggregation</hi> stage, the different policies are integrated into a multi-task controller that can be directed using language commands to perform a specific task using a desired skill.</p>
<figure width='384.2974pt' file='figures/overview_v2' extension='png' id-text='1' id='uid6'><head>The PADL framework consists of three stages. 1) In the Skill Embedding stage, a dataset of motion clips and corresponding text captions are used to learn a joint embedding of motions and captions. 2) In the Policy Training stage, the learned skill embedding is used to train a collection of policies to perform various tasks, while imitating behaviors in the dataset. 3) Finally, in the Multi-Task Aggregation stage, policies trained for different tasks are combined into a multi-task controller that can be directed to perform different tasks and skills via language commands.</head>
</figure>
</div0>
<div0 id-text='5' id='cid5'><head>Skill Embedding</head>
<p>In the Skill Embedding stage, our objective is to construct an embedding space that aligns motions with their corresponding natural language descriptions. To do this, we follow a similar procedure as MotionCLIP MotionClipTevet2022, where a transformer autoencoder is trained to encode motion sequences into a latent representation that “aligns” with the language embedding from a pre-trained CLIP text encoder ClipRadford2021. Given a motion clip <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mover accent='true'><mi>𝐦</mi> <mo>^</mo></mover><mo>=</mo><mrow><mo>(</mo><msub><mover accent='true'><mi>𝐪</mi> <mo>^</mo></mover> <mn>1</mn> </msub><mo>,</mo><mo>.</mo><mo>.</mo><mo>.</mo><mo>,</mo><msub><mover accent='true'><mi>𝐪</mi> <mo>^</mo></mover> <mi>n</mi> </msub><mo>)</mo></mrow></mrow></math><texmath>\hat{{\mathbf {m}}} = (\hat{{\mathbf {q}}}_1, ..., \hat{{\mathbf {q}}}_n)</texmath></formula> and its caption <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mi>c</mi></math><texmath>c</texmath></formula>, a motion encoder <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mi>𝐳</mi><mo>=</mo><msub><mi> Enc </mi> <mi>m</mi> </msub><mrow><mo>(</mo><mover accent='true'><mi>𝐦</mi> <mo>^</mo></mover><mo>)</mo></mrow></mrow></math><texmath>{\mathbf {z}}= \mathrm {Enc}_m(\hat{{\mathbf {m}}})</texmath></formula> maps the motion to an embedding <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mi>𝐳</mi></math><texmath>{\mathbf {z}}</texmath></formula>. The embedding is normalized to lie on a unit sphere <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mo>|</mo><mo>|</mo><mi>𝐳</mi><mo>|</mo><mo>|</mo><mo>=</mo><mn>1</mn></mrow></math><texmath>||{\mathbf {z}}|| = 1</texmath></formula>. Following MotionClipTevet2022, <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><msub><mi> Enc </mi> <mi>m</mi> </msub><mfenced open='(' close=')'><mi>𝐦</mi></mfenced></mrow></math><texmath>\mathrm {Enc}_m\left({\mathbf {m}}\right)</texmath></formula> is modeled by a bidirectional transformer bert2018. A motion decoder is jointly trained with the encoder to produce a reconstruction sequence <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mi>𝐦</mi><mo>=</mo><mo>(</mo><msub><mi>𝐪</mi> <mn>1</mn> </msub><mo>,</mo><mo>.</mo><mo>.</mo><mo>.</mo><mo>,</mo><msub><mi>𝐪</mi> <mi>n</mi> </msub><mo>)</mo></mrow></math><texmath>{\mathbf {m}}= ({\mathbf {q}}_1, ..., {\mathbf {q}}_n)</texmath></formula> to recover <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mover accent='true'><mi>𝐦</mi> <mo>^</mo></mover></math><texmath>\hat{{\mathbf {m}}}</texmath></formula> from <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mi>𝐳</mi></math><texmath>{\mathbf {z}}</texmath></formula>. 
The decoder is also modelled as a birectional transformer <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mi>𝐦</mi><mo>=</mo><mi> Dec </mi><mo>(</mo><mi>𝐳</mi><mo>,</mo><mi>𝐔</mi><mo>)</mo></mrow></math><texmath>{\mathbf {m}}= \mathrm {Dec}({\mathbf {z}}, {\mathbf {U}})</texmath></formula>, which decodes all frames of in parallel using a learned constant query sequence <formula type='inline'><math xmlns='http://www.w3.org/1998/Math/MathML'><mrow><mi>𝐔</mi><mo>=</mo><mo>(</mo><msub><mi>𝐮</mi> <mn>1</mn> </msub><mo>,</mo><mo>.</mo><mo>.</mo><mo>.</mo><mo>,</mo><msub><mi>𝐮</mi> <mi>n</mi> </msub><mo>)</mo></mrow></math><texmath>{\mathbf {U}}= ({\mathbf {u}}_1, ..., {\mathbf {u}}_n)</texmath></formula>, similar to the final layer of detr. The autoencoder is trained with the loss:</p>
<formula id-text='2' id='uid7' textype='align' type='display'><math mode='display' xmlns='http://www.w3.org/1998/Math/MathML'><mtable displaystyle='true'><mtr><mtd columnalign='right'><mrow><msub><mi>ℒ</mi> <mtext>auto</mtext> </msub><mo>=</mo><msub><mi>ℒ</mi> <mtext>recon</mtext> </msub><mo>+</mo><mn>0</mn><mo>.</mo><mn>1</mn><msub><mi>ℒ</mi> <mtext>align</mtext> </msub><mo>.</mo></mrow></mtd></mtr></mtable></math><texmath>
\mathcal {L}_{\text{auto}} = \mathcal {L}_{\text{recon}} + 0.1\mathcal {L}_{\text{align}} .
</texmath></formula>
Another third-party tool, pandoc, gives similar results:
Finally, in the <italic>Multi-Task Aggregation</italic> stage, the
different policies are integrated into a multi-task controller that
can be directed using language commands to perform a specific task
using a desired skill.</p>
<fig id="fig:overview">
<caption><p>The PADL framework consists of three stages. 1) In the
Skill Embedding stage, a dataset of motion clips and corresponding
text captions are used to learn a joint embedding of motions and
captions. 2) In the Policy Training stage, the learned skill
embedding is used to train a collection of policies to perform
various tasks, while imitating behaviors in the dataset. 3) Finally,
in the Multi-Task Aggregation stage, policies trained for different
tasks are combined into a multi-task controller that can be directed
to perform different tasks and skills via language
commands.</p></caption>
<graphic mimetype="image" mime-subtype="png" xlink:href="figures/overview_v2.png" xlink:title="" />
</fig>
<p><milestone-start id="fig:overview" />[fig:overview]<milestone-end /></p>
</sec>
<sec id="sec:skill-embedding">
<title>Skill Embedding</title>
<p>In the Skill Embedding stage, our objective is to construct an
embedding space that aligns motions with their corresponding natural
language descriptions. To do this, we follow a similar procedure as
MotionCLIP , where a transformer autoencoder is trained to encode
motion sequences into a latent representation that “aligns” with the
language embedding from a pre-trained CLIP text encoder . Given a
motion clip <inline-formula><alternatives>
<tex-math><![CDATA[\hat{{\mathbf{m}}} = (\hat{{\mathbf{q}}}_1, ..., \hat{{\mathbf{q}}}_n)]]></tex-math>
<mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mover><mml:mstyle mathvariant="bold"><mml:mi>�</mml:mi></mml:mstyle><mml:mo accent="true">̂</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mo stretchy
="false" form="prefix">(</mml:mo><mml:msub><mml:mover><mml:mstyle mathvariant="bold"><mml:mi>�</mml:mi></mml:mstyle><mml:mo accent="true">̂</mml:mo></mml:mover><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi>.</mml:mi><mml:mi>.</m
ml:mi><mml:mi>.</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mover><mml:mstyle mathvariant="bold"><mml:mi>�</mml:mi></mml:mstyle><mml:mo accent="true">̂</mml:mo></mml:mover><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false" form="postfix"
>)</mml:mo></mml:mrow></mml:math></alternatives></inline-formula>
and its caption <inline-formula><alternatives>
<tex-math><![CDATA[c]]></tex-math>
<mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mi>c</mml:mi></mml:math></alternatives></inline-formula>,
a motion encoder <inline-formula><alternatives>
<tex-math><![CDATA[{\mathbf{z}}= \mathrm{Enc}_m(\hat{{\mathbf{m}}})]]></tex-math>
<mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>�</mml:mi></mml:mstyle><mml:mo>=</mml:mo><mml:msub><mml:mstyle mathvariant="normal"><mml:mi>E</mml:mi><mml:mi>n
</mml:mi><mml:mi>c</mml:mi></mml:mstyle><mml:mi>m</mml:mi></mml:msub><mml:mo stretchy="false" form="prefix">(</mml:mo><mml:mover><mml:mstyle mathvariant="bold"><mml:mi>�</mml:mi></mml:mstyle><mml:mo accent="true">̂</mml:mo></mml:mover><m
ml:mo stretchy="false" form="postfix">)</mml:mo></mml:mrow></mml:math></alternatives></inline-formula>
maps the motion to an embedding <inline-formula><alternatives>
<tex-math><![CDATA[{\mathbf{z}}]]></tex-math>
<mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mstyle mathvariant="bold"><mml:mi>�</mml:mi></mml:mstyle></mml:math></alternatives></inline-formula>.
The embedding is normalized to lie on a unit sphere
<inline-formula><alternatives>
<tex-math><![CDATA[||{\mathbf{z}}|| = 1]]></tex-math>
<mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mo stretchy="false" form="prefix">|</mml:mo><mml:mo stretchy="false" form="prefix">|</mml:mo><mml:mstyle mathvariant="bold"><mml:mi>�</mml:mi></m
ml:mstyle><mml:mo stretchy="false" form="prefix">|</mml:mo><mml:mo stretchy="false" form="prefix">|</mml:mo><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math></alternatives></inline-formula>.
Following , <inline-formula><alternatives>
<tex-math><![CDATA[\mathrm{Enc}_m\left({\mathbf{m}}\right)]]></tex-math>
<mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:msub><mml:mstyle mathvariant="normal"><mml:mi>E</mml:mi><mml:mi>n</mml:mi><mml:mi>c</mml:mi></mml:mstyle><mml:mi>m</mml:mi></mml:msub><mml:mrow><
mml:mo stretchy="true" form="prefix">(</mml:mo><mml:mstyle mathvariant="bold"><mml:mi>�</mml:mi></mml:mstyle><mml:mo stretchy="true" form="postfix">)</mml:mo></mml:mrow></mml:mrow></mml:math></alternatives></inline-formula>
is modeled by a bidirectional transformer . A motion decoder is
jointly trained with the encoder to produce a reconstruction sequence
<inline-formula><alternatives>
<tex-math><![CDATA[{\mathbf{m}}= ({\mathbf{q}}_1, ..., {\mathbf{q}}_n)]]></tex-math>
<mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>�</mml:mi></mml:mstyle><mml:mo>=</mml:mo><mml:mo stretchy="false" form="prefix">(</mml:mo><mml:msub><mml:mstyle
mathvariant="bold"><mml:mi>�</mml:mi></mml:mstyle><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi>.</mml:mi><mml:mi>.</mml:mi><mml:mi>.</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant="bold"><mml:mi>�</mml:mi></mml:m
style><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false" form="postfix">)</mml:mo></mml:mrow></mml:math></alternatives></inline-formula>
to recover <inline-formula><alternatives>
<tex-math><![CDATA[\hat{{\mathbf{m}}}]]></tex-math>
<mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mover><mml:mstyle mathvariant="bold"><mml:mi>�</mml:mi></mml:mstyle><mml:mo accent="true">̂</mml:mo></mml:mover></mml:math></alternatives></inline-formula>
from <inline-formula><alternatives>
<tex-math><![CDATA[{\mathbf{z}}]]></tex-math>
<mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mstyle mathvariant="bold"><mml:mi>�</mml:mi></mml:mstyle></mml:math></alternatives></inline-formula>.
The decoder is also modelled as a birectional transformer
<inline-formula><alternatives>
<tex-math><![CDATA[{\mathbf{m}}= \mathrm{Dec}({\mathbf{z}}, {\mathbf{U}})]]></tex-math>
<mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>�</mml:mi></mml:mstyle><mml:mo>=</mml:mo><mml:mstyle mathvariant="normal"><mml:mi>D</mml:mi><mml:mi>e</mml:mi><
mml:mi>c</mml:mi></mml:mstyle><mml:mo stretchy="false" form="prefix">(</mml:mo><mml:mstyle mathvariant="bold"><mml:mi>�</mml:mi></mml:mstyle><mml:mo>,</mml:mo><mml:mstyle mathvariant="bold"><mml:mi>�</mml:mi></mml:mstyle><mml:mo stretch
y="false" form="postfix">)</mml:mo></mml:mrow></mml:math></alternatives></inline-formula>,
which decodes all frames of in parallel using a learned constant query
sequence <inline-formula><alternatives>
<tex-math><![CDATA[{\mathbf{U}}= ({\mathbf{u}}_1, ..., {\mathbf{u}}_n)]]></tex-math>
<mml:math display="inline" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mi>�</mml:mi></mml:mstyle><mml:mo>=</mml:mo><mml:mo stretchy="false" form="prefix">(</mml:mo><mml:msub><mml:mstyle
mathvariant="bold"><mml:mi>�</mml:mi></mml:mstyle><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi>.</mml:mi><mml:mi>.</mml:mi><mml:mi>.</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mstyle mathvariant="bold"><mml:mi>�</mml:mi></mml:m
style><mml:mi>n</mml:mi></mml:msub><mml:mo stretchy="false" form="postfix">)</mml:mo></mml:mrow></mml:math></alternatives></inline-formula>,
similar to the final layer of . The autoencoder is trained with the
loss:</p>
<p><disp-formula><alternatives>
<tex-math><![CDATA[\begin{aligned}
\mathcal{L}_{\text{auto}} = \mathcal{L}_{\text{recon}} + 0.1\mathcal{L}_{\text{align}} .\end{aligned}]]></tex-math>
<mml:math display="block" xmlns:mml="http://www.w3.org/1998/Math/MathML"><mml:mtable><mml:mtr><mml:mtd columnalign="right"><mml:msub><mml:mstyle mathvariant="script"><mml:mi>ℒ</mml:mi></mml:mstyle><mml:mtext mathvariant="normal">auto</mml:mtext></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mstyle mathvariant="script"><mml:mi>ℒ</mml:mi></mml:mstyle><mml:mtext mathvariant="normal">recon</mml:mtext></mml:msub><mml:mo>+</mml:mo><mml:mn>0.1</mml:mn><mml:msub><mml:mstyle mathvariant="script"><mml:mi>ℒ</mml:mi></mml:mstyle><mml:mtext mathvariant="normal">align</mml:mtext></mml:msub><mml:mi>.</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></alternatives></disp-formula>
The main difference I see between pandoc and tralics is the handling of inline citations (\cite{abc}, etc.). tralics inserts the reference ID ("modeled by a bidirectional transformer bert2018"), while pandoc drops it entirely ("modeled by a bidirectional transformer ."). pandoc produces XML in the JATS schema, while tralics emits what appears to be a custom format.
Marking the items prior to Feb 29th as "closed".