AIM Cyber Infrastructure Workshop
ar5iv
arXiv
LaTeXML
https://creativecommons.org/licenses/by/4.0/
An invited talk for AIM Cyber Infrastructure Workshop, covering HTML for arXiv, ar5iv, LaTeXML and MathML.
2023-12-07
class: center, middle

### LaTeXML à la carte
*Aim at Cyber Infrastructure*
.footnote[**Deyan Ginev**] .remark-venue[AIM Cyber Infrastructure Workshop
December 7, 2023]
---

# Topics

Technologies and spice:

1. **A quick on-ramp**
2. arXiv and HTML - history and sustainability
3. LaTeXML and LaTeX - get boring, stay better
4. MathML in 2023 - it's time!

Q&A or coffee break:

- NLP for arXiv, parsing for math syntax, scholarly HTML

---
class: center

## arXiv's 2023 call to action

https://info.arxiv.org/about/accessible_HTML.html

---
class: center

## arXiv submissions: where are we?

90% include LaTeX sources

---
class: center

## TLDR: What is ar5iv?

An official preview site for arXiv.org as HTML5 (with MathML)
2.1 million e-print documents
97% of all LaTeX sources, 74.5% error-free
available at [ar5iv.labs.arxiv.org](https://ar5iv.labs.arxiv.org)

---

## LaTeXML

https://dlmf.nist.gov/LaTeXML

* "free, public domain software, which converts LaTeX documents to XML, HTML, EPUB, JATS and TEI" [(wiki)](https://en.wikipedia.org/wiki/LaTeXML)
* Developed for, and maintained by
  - NIST Digital Library of Mathematical Functions
  - [dlmf.nist.gov](https://dlmf.nist.gov/)
* version **0.8.7**, a production-ready Perl application
  * new version coming up shortly
* implements a variant of the TeX typesetting engine
* and a small part of the CTAN package ecosystem
  * over 500 supported LaTeX packages
  * with another >50 experimental

---

# Topics

Technologies and spice:

1. A quick on-ramp
2. arXiv and HTML - **history and sustainability**
3. LaTeXML and LaTeX - get boring, stay better
4. MathML in 2023 - it's time!

Q&A or coffee break:

- NLP for arXiv, parsing for math syntax, scholarly HTML

---

## An idea in 2006

Several one-on-one conversations between:

* Michael Kohlhase, KWARC
* Bruce Miller, NIST
* Robert Miner, Design Science

Common needs:

- "It would be great to modernize arXiv to XHTML."
- "It would be great to have enough MathML for Math Search."
- "It would be great to solve LaTeX conversion to XHTML."

*Why 2006?* W3C group is re-chartered for work on MathML 3.

---

## A group of enthusiasts in 2007

- arXMLiv project umbrella is created in KWARC
- I join, age 19, and stick around - age 36 today.
- together with 15+ Jacobs University undergraduates
- enable more LaTeX from arXiv by extending LaTeXML
- LaTeXML 0.5.99 released

---

## A first result in 2008
"Transforming the arXiv to XML"
Heinrich Stamerjohanns and Michael Kohlhase
CICM 2008
- 58% success claimed, of ~400,000 documents.
- LaTeXML 0.6.0 released

---

## A second result in 2010
"Transforming Large Collections of Scientific Publications to XML"
Heinrich Stamerjohanns, Michael Kohlhase, Deyan Ginev, Catalin David and Bruce Miller
Mathematics in Computer Science (MCS)
- 61% success claimed, of ~400,000 documents.
- at LaTeXML 0.7.0

---

## Sustainability? 15 years in hindsight
---

## Trough of Disillusionment, 2011-2015

- most arXMLiv members graduated onward
  - I join a startup in 2014
- available hardware changed every couple of years
  - original build system, now unmaintained
  - a new build system took until 2016 to stabilize
- incremental progress becomes demotivating
  - each new batch of arXiv sources lowers success rates
- and yet LaTeXML calmly advances to 0.8.1

---

## Slope of Enlightenment, 2016-2018

- 2015, latexml introduces OmniBus.cls.ltxml
  - new trend for graceful fallback behaviors
- The new Rust build system stabilizes
  - leveraging ZeroMQ, PostgreSQL, Redis
  - deployed at [corpora.mathweb.org](https://corpora.mathweb.org)
- Rebound in conversion success
  - 68.04% success on the original 400,000 docs, up from 61%
- But! we regress to 59.83% on the full arXiv corpus
  - 1.07 million documents, until 08.2016.
- return to 64.17% overall success only in 08.2018, latexml 0.8.3

---

## Plateau of Productivity, 2018-2023

- We continue to test LaTeXML with arXiv data every year
- Bruce tends to add a dozen LaTeX packages per release
- Michael continues to provide hardware availability at KWARC
- I start working full-time with Bruce on LaTeXML in 2018.
- incremental progress becomes motivating.

---

## The vanity slide, 2017-2021

Surprise!

- arxiv-vanity.org is independently created
  - by Ben Firshman and Andreas Jansson
  - a live preview site for arXiv, initially based on pandoc
  - mobile-friendly responsive design
- we touch base, vanity switches to LaTeXML
- arXiv reaches out to all in 2017, interested in having native HTML
  - a number of preliminary steps, arXiv-NG effort
  - ... but internal transition leads to a pause
- Time for arxiv-vanity maintenance dries up in early 2021

---

## The need for ar5iv, 2021

- new need and opportunity for a (maintained) preview site
  - a lot of community interest was developed by -vanity.
- In September, I embark on building ar5iv
  - 3 months of heavy CSS work.
  - on good days reviewing 100 articles.
- we already have the HTML5 data
  - public dataset releases since 2017, part of SIGMathLing
  - https://sigmathling.kwarc.info/resources/arxmliv/
- sprinkle in some marketing tricks I picked up in NYC
  - take any arXiv URL and replace "X" with "5" to get the HTML
  - ask academic Twitter to spread the word

---

## The ar5iv moment 📢 Jan 31, 2022
People... got excited?
Half a million recipients in a week.
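
The URL trick behind the announcement (take an arXiv URL and replace "X" with "5") can be sketched in a few lines. This is a hypothetical helper, not part of ar5iv itself: the name `to_ar5iv` is an assumption, and the sketch only swaps the host name, leaving the rest of the URL untouched.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ar5iv(url: str) -> str:
    """Rewrite an arxiv.org URL to its ar5iv counterpart by swapping the host."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host not in ("arxiv.org", "www.arxiv.org"):
        raise ValueError(f"not an arXiv URL: {url}")
    # "arxiv.org" -> "ar5iv.org": the only change is X -> 5 in the host name.
    return urlunsplit(parts._replace(netloc="ar5iv.org"))


print(to_ar5iv("https://arxiv.org/abs/2308.13418"))
```

The path and identifier are preserved, so the same rewrite applies to old-style and new-style arXiv identifiers alike.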
---

## The ar5iv Lab - Feb 21, 2022

Now at [ar5iv.labs.arxiv.org](https://ar5iv.labs.arxiv.org)

---

## Lessons learned

- only possible with long-term NIST support for LaTeXML
  - and with long-term commitment of core members
- incremental progress is still progress
  - at +2% a year, we hit 100% in 13 years *(if only...)*
- arXMLiv and ar5iv had **zero** dedicated funding
  - borrowed hardware, borrowed time
- we cannot wish away sticky needs
  - "PDF is all you need." - Wrong!
  - arxiv-vanity is proof.
- project burnout is real
  - it takes fresh perspectives to heal

---
class: center

## 2023: Official arXiv.org HTML

Any day now...

sneak peek at https://browse.arxiv.org/

---

# Topics

Technologies and spice:

1. A quick on-ramp
2. arXiv and HTML - history and sustainability
3. **LaTeXML and LaTeX** - get boring, stay better
4. MathML in 2023 - it's time!

Q&A or coffee break:

- NLP for arXiv, parsing for math syntax, scholarly HTML

---
class: tight-list

## A taste of LaTeXML's ecosystem

* NIST: DLMF
* Research
  * Search: [ARQMath](https://www.cs.rit.edu/~dprl/ARQMath/2021/) and NTCIR Math Task
  * Data: [arXMLiv](https://sigmathling.kwarc.info/resources/arxmliv-dataset-2020/)
  * OCR: Nougat by Meta AI, [arXiv:2308.13418](https://arxiv.org/abs/2308.13418)
* Contributors and tools:
  * [BookML](https://vlmantova.github.io/bookml/) extension
  * ResearchGate contributed JATS output
  * GROBID produces TEI from LaTeX
* Books and docs:
  * European Space Agency: [GAIA Data Docs](https://gea.esac.esa.int/archive/documentation/GDR3/)
  * [forall x: Calgary](https://forallx.openlogicproject.org/)
  * [Neuronal Dynamics](https://neuronaldynamics.epfl.ch/online/index.html)
  * [Artificial Intelligence: Foundations of Computational Agents, 3rd Edition](https://artint.info/3e/html/ArtInt3e.html)

---

## Why isn't LaTeX to HTML solved yet? (1)

Ten+ mature conversion tools
* filter: released pre-2020 with active development post-2020
* healthy: 10 projects in 8 prog. languages

| Tool | Language |
|---|---|
| latex2html | perl |
| latexml | perl |
| lwarp | TeX, lua |
| pandoc | haskell, lua |
| plastex | python |
| tex2page | scheme |
| tex4ht | TeX, lua |
| tralics | C++ |
| HeVeA | OCaml |
| TTH | C |
---

## Why isn't LaTeX to HTML solved yet? (2)

Answer 1: Of course it is.

- "Markdown subset" is well-supported by all tools
- so is basic macro expansion

Answer 2: It is impossible.

- a Turing-complete moving target
  - LaTeX 3 is taking over
  - CTAN packages continue to evolve
- data model mismatch: trees are not boxes

Answer 3: Bonsai garden

- possible for some trees to match with some boxes
- graceful fallback to layout directives for the rest

---

## Why isn't LaTeX to HTML solved yet? (3)

The Web is a moving target

*Dark mode papers on smartphones?!*

1. Executable documents
   - reproducible models, interactive visualizations
   - Authorea, Curvenote, JupyterBooks, ...
2. Science movies
   - Arcadia Science is [now using](https://research.arcadiascience.com/pub/resource-microchamber-slide#the-microchamber-apparatus-is-effective-at-confining-cells-for-imaging-over-several-hours) the HTML `<video>` tag
3. Scholarly HTML?
   - schema.org microdata for scientific documents
   - standard markup for frontmatter, figures, tables, listings, calculi ...
   - **interoperable** UI components
4. HTML evolves
   - CSS flexbox, grid, @media
   - MathML 4, ARIA

---

## The LaTeX Social Contract

- arXiv authors write for their human readers
  - If the PDF can be read, the job is done
- authors do **not** write to please markup developers
  - we all inevitably take shortcuts
  - the exception proves the rule
- So far, adoption favors messy systems

---

## Markup in a hurry, narrative

1. hand-made abstract (arXiv:2208.XXXXX)

   ```latex
   \maketitle
   \begin{quote}
   \textbf{Abstract}
   \end{quote}
   ```

2. hand-made section (arXiv:hep-ph/95XXXXX)

   ```latex
   \clearpage
   {\bf Introduction.}
   ```

3. statements used as list items (arXiv:math/02XXXXX)

   ```latex
   \newtheorem*{intropar}{\addtocounter{thm}{1}\arabic{thm}}
   \begin{intropar}
   item 1 content
   \end{intropar}
   \begin{intropar}
   item 2 content
   \end{intropar}
   ...
   ```

---

## Markup in a hurry, math mode

1. hand-made footnote marks (arXiv:1012.XXXX)

   ```
   ... Author Name${}^3$ and Other Name${}^4$
   ```

2. hand-made list item decorations (arXiv:hep-ph/95XXXXX)

   ```latex
   $\bullet$ item one content
   $\bullet$ item two content
   ```

3. mixed-mode constructions (arXiv:astro-ph/95XXXXX)

   ```
   observed abundances of $^4$He, $^3$He, and D
   ```

---

## Markup of necessity: glyphs

- kerning dots into a diagonal ellipsis (arXiv:1502.XXXXX)

  ```tex
  \def\qdots{\mathinner{
    \mkern1mu\raise1pt\vbox{\kern7pt\hbox{.}}
    \mkern2mu\raise4pt\hbox{.}
    \mkern2mu\raise7pt\hbox{.}\mkern1mu}}
  ```

- old style QED square

  ```latex
  \fbox{\phantom{\rule{.7ex}{.7ex}}}
  ```

We now have Unicode, U+22F0 (⋰), U+25A1 (□). How can we connect them?
- A general problem of connecting absolute positioning to structural counterparts
  - XY/tikz/custom diagrams to SVG
  - hand-crafted title pages to a narrative tree
  - geometric arrangements of custom "mini pages"
  - meaningful borders and lines (QED square, inference rules)

---

## Math expressions are trees

- a layout tree and an operator tree

---
count: false

## Math expressions ~~are~~ have trees

- a layout tree and an operator ~~tree~~ graph.

---
count: false

## Math expressions ~~are~~ have trees

- a layout tree and an operator ~~tree~~ graph.
(arXiv:hep-th/00XXXXX)
(arXiv:gr-qc/00XXXXX)
(arXiv:1409.XXXX)
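
As a toy illustration of the layout-versus-operator distinction (an assumed example, not one of the arXiv formulas cited above), take the chained relation `a < b < c`. Its layout is a flat row of five tokens. One reading of its meaning, `(a < b) and (b < c)`, uses `b` twice, so the operator structure shares a node and is a graph rather than a tree. The tuple encoding and the helper `count_parents` are hypothetical.

```python
# Layout tree: what is drawn. A single row with five children, left to right.
layout = ("row", "a", "<", "b", "<", "c")

# Operator graph: one reading of what is meant, (a < b) and (b < c).
# The operand b is a single object referenced by both comparisons.
a, b, c = ("id", "a"), ("id", "b"), ("id", "c")
left = ("lt", a, b)
right = ("lt", b, c)
operator = ("and", left, right)


def count_parents(node, target):
    """Count edges pointing at `target` (compared by identity) below `node`."""
    if not isinstance(node, tuple) or node[0] == "id":
        return 0  # leaves have no children
    total = 0
    for child in node[1:]:
        if child is target:
            total += 1
        total += count_parents(child, target)
    return total


print("layout tokens:", len(layout) - 1)
print("parents of a:", count_parents(operator, a))
print("parents of b:", count_parents(operator, b))
```

In a tree every node has at most one parent. Here `b` has two, which is the sense in which the operator structure stops being a tree.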
---

## LaTeXML over arXiv: what is missing?

**Coverage:** "brutal long tail"

* over 10,000 missing packages and classes
* 18.4% of documents encounter unknown macros
* over 60,000 *distinct* unknown macros, 1.5 million total

**Most needed packages**

* tikz-cd, epic, biblatex, arydshln, mhchem, tabu, mdframed, eepic
* about 5% put together

**Math syntax**

* ~1 **B**illion math formulas
* 1.25 million documents have 1 or more unparsable formulas
* more than 23 million formulas overall cannot yet be parsed.

---

# Topics

Technologies and spice:

1. A quick on-ramp
2. arXiv and HTML - history and sustainability
3. LaTeXML and LaTeX - get boring, stay better
4. **MathML in 2023** - it's time!

---

# The sticky need for MathML

We need a standard representation for math layout. Why?

- Interoperable JS services
  - visual math editors,
  - math manipulation and interactivity,
  - search
- Performance!
- Interoperable Accessibility
  - avoid single point of vendor failure
- AI is ingesting all WWW data
  - mostly without math syntax properly handled
  - LaTeX source gets us off the ground
  - but clean data lands the plane

---

# Formula to MathML Tools
| Tool | Language |
|---|---|
| temml | javascript |
| TeXzilla | javascript |
| texmath | haskell |
| Space Math | javascript |
| latex2mathml | python |
many others...
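
For a sense of the kind of markup such tools target, here is a hand-built sketch of presentation MathML for `x^2`, using only Python's standard library. It is not the output of any tool listed above. The helper name `superscript` is hypothetical, and a real converter handles far more than this one construct.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def superscript(base: str, exponent: str) -> str:
    """Serialize <math><msup><mi>base</mi><mn>exponent</mn></msup></math>."""
    math = ET.Element("math", {"xmlns": MATHML_NS})
    msup = ET.SubElement(math, "msup")          # base with a superscript
    ET.SubElement(msup, "mi").text = base       # identifier
    ET.SubElement(msup, "mn").text = exponent   # number
    return ET.tostring(math, encoding="unicode")


print(superscript("x", "2"))
```

The `math`, `msup`, `mi` and `mn` elements used here are all part of MathML Core, so the resulting string can be pasted directly into an HTML page.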
---
class: mid-list

# MathML History

- First edition: 1998
- Full browser support: a single week in 2013
  - then Chrome yanks its implementation
- MathML winter in 2013-2021
- MathML Core published in 2021
  - adds algorithms for browser rendering
  - down to **32 elements**! (from 195)
- Igalia deliver Chromium support in January 2023.
- MathML 4 is in active development today
  - MathML Intent for advanced (speech) accessibility
  - MathML Core continually refined, now down to 30 elements

---

# MathML Today

https://developer.mozilla.org/en-US/docs/Web/MathML#browser_compatibility

---
class: center

## Aside: Some LaTeX bits

A day in the life with arXiv data
---
class: center

## Aside: Some LaTeX 3 (in jest)

---
class: center

## Aside: Some LaTeX 3 (in jest)

---
class: center

## Aside: Some LaTeX 2 (in jest)

---
class: center

## Aside: Some TeX (in jest)

https://ctan.org/pkg/xii

---
class: center

## Lessons learned?
"objects in the rear view mirror may appear closer than they are"