AITP16
Math NLP
Corpora
arXiv
LaTeXML
https://creativecommons.org/licenses/by/4.0/
Launching a preview installation for 1.797 million arXiv preprints, in HTML5. Goal: reintegrate into arXiv.org.
2022-02-07
class: center, middle

### Welcome to

*a preview platform for the arXMLiv dataset*

.footnote[**Deyan Ginev**]
.remark-venue[SIGMathLing Seminar
Feb 7, 2022]

---
class: center, middle

### Welcome to the

Lab!

*an arXiv-endorsed platform for HTML5 e-prints*

.footnote[**Deyan Ginev**]
.remark-venue[SIGMathLing Seminar
March 21, 2022]
---
## Agenda

* Quick History of arXMLiv
* Technology Stacks (latexml, cortex, ar5iv)
* Welcome to ar5iv.org
* Future: "X" marks the spot
* Social reception / Thank you!

---
## Quick History of arXMLiv

* KWARC+NIST since 2006, converting arXiv to HTML via LaTeXML.
* LaTeXML: main workhorse of DLMF, [arxiv-vanity.com](https://www.arxiv-vanity.com), [ARQMath](https://www.cs.rit.edu/~dprl/ARQMath/)
* arXiv growth is accelerating: a moving target
---
## Quick History of arXMLiv, 2022

* SIGMathLing annual releases: [dataset, embeddings, statements](https://sigmathling.kwarc.info/resources/arxmliv/)
* 97.6% of articles can be converted, 74.59% error-free (**+2.5%** in 2021)

---
## Why convert to HTML?

We want to do better than PDF or TeX.

* Machine-readable
  * searchable
  * better data for AI models
* Semantics-preserving
  * capture author intent in the output markup
  * "Active Document Paradigm" (Planetary)
* Accessible
  * Michael's "Math for Blind" (2008) talk is [on YouTube](https://www.youtube.com/watch?v=cDzIpFPNPpI).

---
## Agenda

* Quick History of arXMLiv
* Technology Stacks (latexml, cortex, ar5iv)
* Welcome to ar5iv.org
* Future: "X" marks the spot
* Social reception / Thank you!

---
# LaTeXML

https://dlmf.nist.gov/LaTeXML/

* "a free, public domain software, which converts LaTeX documents to XML, HTML, EPUB, JATS and TEI" [(wiki)](https://en.wikipedia.org/wiki/LaTeXML)
* version **0.8.6**, a production-ready Perl application
* implements a variant of the TeX typesetting engine
* and a small part of the CTAN package ecosystem

---
class: center

# LaTeXML Architecture
---
# LaTeXML to CorTeX

### Problem 1: Speed

* sadly, latexml is slower than pdflatex
* an average arXiv article takes ~2 minutes to convert
  * (mean, not median)
* arXiv up to 2022 would take almost 7 CPU years
* 💡 We need to distribute work
  * last run used 132 workers

### Problem 2: Scale

* What went wrong? How often? How severe?
* 💡 We need to aggregate reports

---
class: center

# CorTeX build system
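---
# Sketch: the conversion budget

The speed figures on the "LaTeXML to CorTeX" slide can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, not CorTeX code: it takes the 1.797 million articles, the ~2 minute mean and the 132 workers from the slides, and assumes (hypothetically) that each worker occupies one CPU and that the work parallelizes perfectly.

```python
# Figures quoted in the slides.
ARTICLES = 1_797_000         # preprints in the preview installation
MINUTES_PER_ARTICLE = 2.0    # mean conversion time (mean, not median)
WORKERS = 132                # workers used in the last run

MINUTES_PER_DAY = 60 * 24
MINUTES_PER_YEAR = MINUTES_PER_DAY * 365


def cpu_years(articles: int, minutes_per_article: float) -> float:
    """Total sequential conversion time, in CPU years."""
    return articles * minutes_per_article / MINUTES_PER_YEAR


def wall_clock_days(articles: int, minutes_per_article: float, workers: int) -> float:
    """Elapsed days, assuming one CPU per worker and perfect parallelism."""
    return articles * minutes_per_article / workers / MINUTES_PER_DAY


total_cpu_years = cpu_years(ARTICLES, MINUTES_PER_ARTICLE)
total_days = wall_clock_days(ARTICLES, MINUTES_PER_ARTICLE, WORKERS)
print(f"{total_cpu_years:.1f} CPU years, about {total_days:.0f} days on {WORKERS} workers")
```

Under these assumptions the total lands just under 7 CPU years, and 132 workers bring a full run down to roughly three weeks. Real runs will differ, since the mean hides a long tail of slow articles.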
---
# CorTeX data model

* seeded from [arXiv S3 sources](https://arxiv.org/help/bulk_data_s3)
* preprint source:
  * `/data/arxmliv/1910/1910.06709/1910.06709.zip`
* latexml result
  * `/data/arxmliv/1910/1910.06709/tex_to_html.zip`
    * HTML5, images, other assets
    * `cortex.log` conversion log

---
class: center, padded-table

# CorTeX tech stack

component | task
:-- |:--
Rust | codebase
PostgreSQL | corpus, service, task, history management
ZeroMQ | protocol for distributed dispatch
[LaTeXML::Plugin::Cortex](https://github.com/dginev/LaTeXML-Plugin-Cortex) | dockerized workers for latexml
Rocket.rs | web frontend for runs and reports
Redis | cache for slow reports
*elbow grease* | ensure stability

---
# CorTeX alternative

[texmlbus](https://github.com/stamer/texmlbus) is a similar system

* written in PHP
* revival of the original arXMLiv build system,
  * I wish it existed back in 2010!
* author: Dr. Heinrich Stamerjohanns
  * a founding arXMLiv member
* supported by Overleaf, [TUG 2021 talk](https://www.tug.org/tug2021/assets/pdf/Heinrich-Stamerjohanns-slides.pdf)

---
# Some CorTeX Insights (latexml + arXiv)

**Coverage:** "brutal long tail"

* 10,100 missing packages and classes
* 21.3% of documents encounter unknown macros
* 74,000 *distinct* unknown macros, 1.6 million total

**Most needed**

* xy, tikz-cd, epic, overpic, arydshln, biblatex
* about 8% put together

**Math syntax**

* 1.25 million documents have 1 or more unparsable formulas
* more than 23 million formulas overall cannot yet be parsed.

---
# CorTeX to ar5iv

* **2.2 TB** of arXiv source `zip` files
* **3.8 TB** of `tex_to_html.zip` HTML5 data
  * cross-referenced MathML is [not free](https://twitter.com/dginev/status/1462863352741384198)!
  * neither is the SVG from computing TikZ, pstricks...
  * recently improved PNG quality
* Can we rapidly develop improvements over the pages?
* Can we share this enormous resource publicly?
* Can it be *useful* to arXiv and its millions of readers?
* Sep 26, 2021: **Let us find out!**

---
class: center, padded-table

# ar5iv tech stack

component | task
:-- |:--
[ar5iv-css](https://github.com/dginev/ar5iv-css) | brand new e-Journal styling
Rust | codebase
Rocket.rs | web service used for static assets and redirects
Redis | cache articles to avoid disk bottleneck
arXMLiv | seed HTML5 data, also in SIGMathLing dataset
*elbow grease* | ensure high-quality rendering

---
## Agenda

* Quick History of arXMLiv
* Technology Stacks (latexml, cortex, ar5iv)
* Welcome to ar5iv.org
* Future: "X" marks the spot
* Social reception / Thank you!

---
class: middle

### ar5iv.org Announcement (1)

Announced on Jan 31, 2022.

People... [got excited](https://twitter.com/dginev/status/1488157927001268231)?

As of March 20:
---
### ar5iv.org Announcement (2)

* Overwhelmingly loved (Thank you!!!)
* PDF vs HTML debates
* Licensing debates
* Be mindful of harm!
  * article versions,
  * author lists,
  * todo notes,
  * author control,
  * reputation damage from bad rendering

---
# ar5iv.org Features

* 1.797 million HTML5 preprints via LaTeXML, with e-Journal styling
* *mini* article viewer
  * inline citations
  * figure zoom
  * margin notes with highlights
* ar5iv footer
  * dark theme
  * "Feeling lucky?", adjacent article navigation
  * conversion log, link to arXiv original
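---
# Sketch: linking to an ar5iv page

Every example link in this deck has the shape `https://ar5iv.org/html/<arXiv id>`, for both new-style ids (`2105.10386`) and old-style ones (`hep-ph/0008011`). The sketch below builds such a link from an id. It is an illustration, not ar5iv code: the function name and the two id patterns are assumptions, and version suffixes such as `v2` are deliberately rejected rather than guessed at.

```python
import re

# URL prefix taken from the example links in this deck.
AR5IV_HTML_BASE = "https://ar5iv.org/html/"

# Assumed id shapes (hypothetical, simplified):
#   new style: YYMM.NNNN or YYMM.NNNNN, e.g. 0802.1189, 2105.10386
#   old style: archive[.SC]/YYMMNNN,    e.g. hep-ph/0008011, physics/0512034
NEW_STYLE = re.compile(r"^\d{4}\.\d{4,5}$")
OLD_STYLE = re.compile(r"^[a-z]+(-[a-z]+)*(\.[A-Z]{2})?/\d{7}$")


def ar5iv_url(arxiv_id: str) -> str:
    """Return the ar5iv HTML link for a bare arXiv id (no version suffix)."""
    arxiv_id = arxiv_id.strip()
    if not (NEW_STYLE.match(arxiv_id) or OLD_STYLE.match(arxiv_id)):
        raise ValueError(f"not a recognized arXiv id: {arxiv_id!r}")
    return AR5IV_HTML_BASE + arxiv_id


print(ar5iv_url("2105.10386"))
print(ar5iv_url("hep-ph/0008011"))
```

Validating the id before building the link keeps malformed input from turning into a broken URL; the patterns can be relaxed if version suffixes should be accepted.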
---
class: center, padded-table

# ar5iv examples (1)

By size | link
:-- | :--
7 pages | [1801.04274](https://ar5iv.org/html/1801.04274)
12 pages | [0802.1189](https://ar5iv.org/html/0802.1189)
15 pages | [2006.00995](https://ar5iv.org/html/2006.00995)
48 pages | [0907.2021](https://ar5iv.org/html/0907.2021)
92 pages | [2101.05331](https://ar5iv.org/html/2101.05331)
419 pages(!) | [2105.10386](https://ar5iv.org/html/2105.10386)

---
class: center, padded-table

# ar5iv examples (2)

By feature | link
:-- | :--
sub-figures | [1602.07188](https://ar5iv.org/html/1602.07188)
sub-listings | [1307.6769](https://ar5iv.org/html/1307.6769)
SVG from TikZ | [2105.04026](https://ar5iv.org/html/2105.04026#S1.F1.pic1)
calculi | [1408.6367](https://ar5iv.org/html/1408.6367)
... | ...

---
class: center, padded-table

# ar5iv examples (3)

By language | link
:-- | :--
Bulgarian (section VI) | [1602.06114](https://ar5iv.org/html/1602.06114)
Czech | [1608.03295](https://ar5iv.org/html/1608.03295)
Estonian | [2010.02636](https://ar5iv.org/html/2010.02636)
Finnish | [1702.00277](https://ar5iv.org/html/1702.00277)
French | [1710.09322](https://ar5iv.org/html/1710.09322)
German | [physics/0512034](https://ar5iv.org/html/physics/0512034)
Hungarian | [hep-ph/0008011](https://ar5iv.org/html/hep-ph/0008011)
Italian | [2105.04227](https://ar5iv.org/html/2105.04227)
Japanese | [2102.10993](https://ar5iv.org/html/2102.10993)
Russian | [1911.02370](https://ar5iv.org/html/1911.02370)
Spanish | [1909.12119](https://ar5iv.org/html/1909.12119)
Swedish | [1303.0939](https://ar5iv.org/html/1303.0939)

---
## Agenda

* Quick History of arXMLiv
* Technology Stacks (latexml, cortex, ar5iv)
* Welcome to ar5iv.org
* Future: "X" marks the spot
* Social reception / Thank you!

---
## Agenda

* Quick History of arXMLiv
* Technology Stacks (latexml, cortex, ar5iv)
* Welcome to ar5iv.labs.arxiv.org
* ~~Future~~ Present: "X" marks the spot
* Social reception / Thank you!

---
class: middle, center

## "X" marks the spot - Feb 21, 2022
---
## Agenda

* Quick History of arXMLiv
* Technology Stacks (latexml, cortex, ar5iv)
* Welcome to ar5iv.labs.arxiv.org
* ~~Future~~ Present: "X" marks the spot
* Future directions
* Social reception / Thank you!

---
## Future directions (1)

**arXMLiv dataset (02.2022 release)**

- 1.797 million articles
- **2.5 TB** of HTML5 (279 GB on download).
- 877,606,415 ` ` elements
- 358 ZIP bundles, grouped by year-month (YYMM)
- 3 severities, per latexml:
  - no problems (0.31 M)
  - warnings (1.06 M)
  - errors (0.41 M)

---
## Future directions (2)

**ar5iv Goals**

* **Reintegrate** with arXiv.org ✅
  * we need the stars to align ✅
* "good enough" LaTeXML **coverage**
  * aim: 80% or more (`xy.sty`, `tikz-cd.sty`)
  * we need raw interpretation
* "good enough" LaTeXML **fidelity**
  * we need a "Scholarly" HTML
  * we need community feedback
* "**MathML**-native": correct, compact trees
  * we need browser support (Chrome!)
  * we (I?) need research breakthroughs in parsing

---
## Future directions (3)

**More ar5iv Goals**

* arXiv-native
  * use latest article versions
  * provide preview on submission to authors
* a modern "Article Viewer"
  * competitive reading experience
  * "get inspired" by eLife, Springer Open, Authorea...
* strategic auxiliary extensions
  * math search
  * other arXivLabs projects
  * rich NLP applications
  * the world has started making [ar5iv plugins](https://twitter.com/polymonyrks/status/1489990675903180800)

---
## Future directions (4)

**Ecosystem Goals**

* "Take what we do *well* and do it **great**"
* LaTeXML is "secretly" undergoing a Rust rewrite (`rtx`)
  * 32,000 lines of Rust already written, passing **10%** of core tests.
  * Expected speedup of 1 to 2 orders of magnitude
  * Ambitious new math grammar embracing ambiguity and underspecification
* We are still not *inviting* enough.
  - More and better documentation
  - More and better error reporting, fallback behavior
* We are still *overwhelmed*.
  - Find simpler data models
  - Do not encourage advanced TeX use
    - Exception: helpful *together with* advanced latexml use

---
## Agenda

* Quick History of arXMLiv
* Technology Stacks (latexml, cortex, ar5iv)
* Welcome to ar5iv.labs.arxiv.org
* ~~Future~~ Present: "X" marks the spot
* Future directions
* Social reception / Thank you!
---
class: middle

### "How it started"
*Thank you?*
2010, public [suckless.org](https://en.wikipedia.org/wiki/Suckless.org) mailing list.
In [response](http://lists.suckless.org/dev/1007/5029.html) to Catalin David, a KWARCie during 2008-2012:
---
"How it's going": Thank you!
---
"How it's going": Thank you!!