Event: AITP16
Keywords: Math NLP, Corpora, arXiv, LaTeXML
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Research overview of large-scale natural language processing on math-rich scientific documents, as carried out by the KWARC research group.
Date: 2016-04-05
class: center, middle

# Math-rich NLP on Billion Token Corpora
## AITP 2016

.footnote[**Deyan Ginev**]
---

# Agenda

1. Math-rich NLP
2. Datasets and resources
3. Toolkits and best practices
4. Example MathNLP Tasks
5. Large-scale processing

---

# Math-rich NLP

* _"Natural Language Processing"_ - umbrella term for computer reading and understanding of human language.
* _"Math-rich NLP"_ - focus on the interplay between the math and language modalities.
* In practice, it analyzes discourse in the hard sciences and engineering.

---

## Academic state of MathNLP

* Niche domain - in both funding and manpower
* Linguistic phenomena of math expressions are largely unstudied at scale
* Not even exhaustive accounts of notations, symbols, operators
* Various aspects in infancy (datasets, representations, methods)
* MathIR is the slightly more popular cousin

---

class: middle

# Challenge: We need a solid foundation

1. Datasets (shared, large, high quality)
2. Best practices ("Open Science")
3. Toolkits (free, open source)

---

# Agenda

1. Math-rich NLP
   * Challenge: Establish a solid foundation
2. **Datasets and resources**
3. Toolkits and best practices
4. Example MathNLP Tasks
5. Large-scale processing

---

## MathNLP resources, by genre (1/2)

- encyclopedic repositories (largely open)
  * Wikipedia, PlanetMath, MathWorld, nLab, Encyclopedia of Math, DLMF, MathHub
  * indexed ~100k relevant pages (NNexus)
- scholarly reviews (largely closed)
  * ZBMath, MathSciNet
  * indexed ~3 million reviews (MathWebSearch)
- collaborative e-learning and solving
  * blogs - dig them out of https://commoncrawl.org/
  * MathOverflow
  * https://archive.org/download/stackexchange
  * 100k forum posts, 400k comments

---

## MathNLP resources, by genre (2/2)

### Conference and journal articles

- Preprints
  * arXiv.org - Cornell's preprint archive
  * http://arxiv.org/help/bulk_data_s3
  * 1.03 million TeX sources (up to 02.2016)
  * bioRxiv (PDF only)
- Proceedings
  * behind publisher paywalls
  * research isn't immediately "fair use"
  * also applies to the textbook, reference manual, etc. genres

---

class: center, middle
Fig. 3: Percentage of papers published by the five major publishers, by discipline in the Natural and Medical Sciences, 1973–2013.

**"The Oligopoly of Academic Publishers in the Digital Era"**, _Vincent Larivière et al._

---

## MathNLP resources, by domain

* Narrative techniques and structures are often specific to a scientific domain
* Examples:
  1. Theorems and proofs are almost missing from experimental domains
     * e.g. laser and astro physics
  2. Methods and discussion sections (data-driven) are missing from theoretical domains
     * e.g. category theory and string theory
  3. While K12 notation is (almost) universal, domain-specific math notation is also common
     * e.g. bra-ket in quantum physics $\langle\phi\mid\psi\rangle$

---

## MathNLP resources, by representation

1. PDF
   * most common for archival and distribution
   * often requires OCR, reconstruction work
2. LaTeX sources
   * most common for the hard sciences
   * require a Turing-complete typesetting pass
3. Word sources
   * common in the life sciences
   * are there any leads for potential datasets?
4. HTML5
   * de-facto standard for any modern web-authored resource
   * often with LaTeX equations
5. Older artifacts, other?
   * e.g. entire documents as bitmap images

---

## Challenge: Can we standardize our datasets?

- My suggestion: **Let's use arXiv.org**
- The NTCIR Math task: based on 100,000 arXiv HTML5 sources since 2013
- There is more to do
  * solve copyright concerns
  * publish a well-documented dataset with a DOI (semi-annually)
  * invite everyone to join (this talk!)
  * keep improving [conversion success rates](http://cortex.mathweb.org/corpus/arXMLiv/tex_to_html)

We're at 61.71% error-free conversions, and 92.74% have HTML5 results.

---

## An HTML5 dataset of arXiv.org (preliminary)

|         count | item             |
| -------------:|:-----------------|
|       966,000 | HTML5 arXiv docs |
|    58,500,000 | paragraphs       |
|   234,200,000 | sentences        |
|    30,700,000 | inline citations |
|   351,500,000 | formulas         |

---

## A plain-text token model of arXiv.org (preliminary)

|         count | item                      |
| -------------:|:--------------------------|
| 4,260,000,000 | word tokens               |
|   351,500,000 | 'mathformula' tokens      |
|     3,390,000 | unique tokens             |
|       820,000 | unique with >5 occurrences |

In comparison (as bundled by the GloVe project):

- Common Crawl (840B tokens, 2.2M vocab, cased)
- Wikipedia 2014 + Gigaword 5 (1.6B+4.3B tokens, 400K vocab, uncased)
- Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)
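---

## Sketch: deriving the plain-text token model

The table above collapses every formula into a single `mathformula` token. A minimal sketch of that reduction, assuming the HTML5 documents expose formulas as `<math>` elements; the input file name is hypothetical, and this is not the exact arXMLiv pipeline code.

```python
from lxml import html

MATH_TOKEN = "mathformula"  # the placeholder token counted in the table above

def tokenize_document(path):
    """Return word tokens with every formula collapsed to a single token."""
    root = html.parse(path).getroot()
    # Materialize the list first, since we mutate the tree while walking it.
    for math in list(root.iter("math")):
        placeholder = math.makeelement("span", {})
        placeholder.text = " %s " % MATH_TOKEN
        placeholder.tail = math.tail
        math.getparent().replace(math, placeholder)
    return root.text_content().split()

tokens = tokenize_document("sample_arxiv_article.html")  # hypothetical file
```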
---

## Demo model: GloVe (preliminary)

**Global Vectors for Word Representation**

> GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

* Project: http://nlp.stanford.edu/projects/glove/
* Source: https://github.com/stanfordnlp/GloVe
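---

## Sketch: querying the demo model

A minimal sketch of the kind of demo run on the trained vectors: load GloVe's text output (one `word v1 v2 ...` line per word) and list nearest neighbours by cosine similarity. The file name follows GloVe's default output; the query word is an arbitrary example.

```python
import numpy as np

def load_vectors(path):
    """Load a GloVe text-format model into a word list and a vector matrix."""
    words, rows = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            rows.append(np.array(parts[1:], dtype=np.float32))
    matrix = np.vstack(rows)
    # Normalize rows so a dot product equals cosine similarity.
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    return words, matrix

def nearest(word, words, matrix, k=5):
    """Return the k words most similar to `word`, skipping the word itself."""
    sims = matrix @ matrix[words.index(word)]
    return [words[i] for i in np.argsort(-sims)[1 : k + 1]]

words, matrix = load_vectors("vectors.txt")  # GloVe's default output name
print(nearest("manifold", words, matrix))
```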
---

# Agenda

1. Math-rich NLP
   * Challenge: Establish a solid foundation
2. Datasets and resources
   * Challenge: Standardize community datasets
3. **Toolkits and best practices**
4. Example MathNLP Tasks
5. Large-scale processing

---

## Warning!

Badly evaluated methods are daydreaming. To meaningfully compare, we need:

1. The same dataset
2. The same preprocessing
3. The same evaluation techniques
4. Ideally, both methods tried together
5. And compared to / aligned with the literature

**Challenge:** Engineer an academic arena for MathNLP to grow a community and escape infancy.

---

## Reuse and build on prior work

Embrace "Open Science" principles:

1. Open Access publishing
   * Can others find and access our results?
2. Open Data experiments
   * Can others scrutinize and reuse our datasets?
3. Open Source research software
   * Can others reproduce and verify the experimental setup and methods?

**Challenge 1:** Package results so that any grad student can reproduce them with a week of effort.

**Challenge 2:** Publish as a means to open new research avenues for the community, not to ensure "vendor lock-in".

---

## Provide toolkits back to the ecosystem

Our KWARC group is actively working on a few:

1. [CorTeX](https://github.com/dginev/CorTeX), build system
   * Automates representation transitions
   * Poor man's Apache Spark
   * Last resort if the HTML5 dataset is impractical to share
2. [llamapun](https://github.com/KWARC/llamapun), MathNLP toolkit
   * common language and mathematics processing algorithms
   * Based on the interplay of DOM and normalized text
3. [KAT](https://github.com/KWARC/KAT), an annotation tool for STEM documents
4. [NNexus](https://github.com/dginev/nnexus), concept discovery and auto-linking
   * Gazetteer of encyclopedic concepts from 7 resources
   * Naive NER for English math texts ("concept discovery")
   * Multi-link annotations to concept definitions

---

# Agenda

1. Math-rich NLP
   * Challenge: Establish a solid foundation
2. Datasets and resources
   * Challenge: Standardize community datasets
3. Toolkits and best practices
   * Challenge: Reproducible, open science
4. **Example MathNLP Tasks**
5. Large-scale processing

---

## Statistical and Formal Semantics

* usually shallow/statistical and deep/formal matchups
* the space is open for hybrids - a hot topic at AITP16!

---

# Math NLP Tasks, examples

* Language detection
* Tokenization
* Named entity recognition (NER)
* Part-of-speech tagging (POST)
* Phrase structure parsing
* Automatic summarization
* Sentiment analysis

.footnote[minimal code sketches for several of these tasks follow after the task slides]

---

## Task: Tokenization

> "The process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements."

* Is $x^\prime$ a complex variable name, or a first derivative?
* Is $\frac{d}{dx}$ an embellished differential operator, or a fraction?
* We can instantly disambiguate this for calculus.
* Is that true for the other [>450 subfields](https://en.wikipedia.org/wiki/Glossary_of_areas_of_mathematics)?

---

## Task: Named Entity Recognition (NER)

> "Locate and classify elements in text into predefined categories, such as names of persons, locations, mathematical concepts."

* Extract the definiendum, possibly with its definiens, from a definition
* Extract all mathematical tokens (often compound) for mathematical constants, e.g.
  * $\mathbb{N}, \mathbb{R}$ in mathematics,
  * $c, h$ in physics,
  * $s, m, \mathrm{sol}$ in measurement units.

---

## Task: Part-of-speech tagging (POST)

> Obtain and tag each lexeme with a grammatical tag (nouns, verbs, adjectives, adverbs, etc.)

* As an example, is a formula's grammatical role a:
  * sentence: _"This is the case when $x>0$."_
  * modified noun: _"For $x>0$, we know that..."_
  * numeral adjective: _"Take the first $2k+1$ prime numbers."_

---

## Task: Phrase structure/formula parsing

* _Phrase structure parsing_ in NLP - uncover the "deep structure" of a sentence
  * Semantic tree with underspecified symbols, usually 'typed' by POS tags
* Layout trees
  * "How does it look?"
  * Largely specified in LaTeX - fractions, root arguments, scripts, etc.
* Operator trees
  * "What does it mean?"
  * build applicative trees of operators and operands, as implied by fixity, precedence, argument types, in the current notation(s).
* Curious: LaTeXML reports it failed to parse 15.97 million out of 351.5 million formulas from arXMLiv, a success rate of 95.45%.

---

## Task: Formula parsing - operator tree subtasks

* represent vagueness
  - ellipses: $x_1, \ldots, x_n$
  - elisions: $\displaystyle\sum_x x_i$
* deal with syntactic and semantic ambiguity:

| example | meaning | fixity |
|:-------------:|:-------------|-----:|
| $x \mid 3$ | divisible by | infix |
| $\\{x \mid x\textrm{ is prime}\\}$ | set predicate | circumfix |
| $\left.\frac{1}{y}\right\vert^4_2$ | evaluated at | postfix |

* fragments such as _"is prime"_ need to remain **underspecified** until handled correctly.
* Many interesting phenomena and notations. Starting to go deep.

---

## Task: Going Deeper

1. Domain pragmatics
   * how many papers will define $\mathbb{R}$?
   * the expected background knowledge of an early-stage graduate student
2. Papers as theories. An author can:
   * define new terms, symbols, operators, notations
   * recall existing knowledge via a recap or citation
   * conjecture and/or derive new propositions
     * via formal proof
     * via experimental data (figures/tables)
   * offer "axiomatic" summaries of central results
3. Deep tasks intuitively come after (or at least together with) the shallow tasks.
4. We're calling this incremental view of enrichment **"The Flexiformalist Manifesto"**.
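---

## Sketch: a math-aware tokenizer

A minimal sketch of the tokenization concern above, assuming formulas have already been collapsed to `mathformula` placeholders: a naive word tokenizer must at least keep such placeholders, words, and punctuation apart. The regular expression is illustrative, not the llamapun implementation.

```python
import re

# One token: a formula placeholder, a word (with optional ' suffix), or punctuation.
TOKEN_RE = re.compile(r"mathformula|[A-Za-z]+(?:'[A-Za-z]+)?|[.,;:!?()]")

def tokenize(sentence):
    """Split a plain-text sentence into word-level tokens."""
    return TOKEN_RE.findall(sentence)

print(tokenize("If mathformula is prime, then mathformula divides mathformula."))
# ['If', 'mathformula', 'is', 'prime', ',', 'then', 'mathformula', ...]
```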
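---

## Sketch: gazetteer-based concept discovery

A minimal sketch of the "concept discovery" flavour of NER, in the spirit of NNexus: scan tokenized text for the longest matches against a gazetteer of known concept names. The three-entry gazetteer is a toy assumption; a real one holds on the order of 100k names harvested from encyclopedic resources.

```python
GAZETTEER = {  # toy entries, mapping concept names to definition links
    ("prime", "number"): "https://en.wikipedia.org/wiki/Prime_number",
    ("group",): "https://en.wikipedia.org/wiki/Group_(mathematics)",
    ("banach", "space"): "https://en.wikipedia.org/wiki/Banach_space",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def discover(tokens):
    """Greedy longest-match lookup of concept mentions in a token list."""
    i, found = 0, []
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i : i + n])
            if key in GAZETTEER:
                found.append((" ".join(tokens[i : i + n]), GAZETTEER[key]))
                i += n
                break
        else:
            i += 1
    return found

print(discover("Every Banach space is a group under addition".split()))
```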
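---

## Sketch: POS tagging with a formula placeholder

A minimal sketch of the POST question above, using NLTK's off-the-shelf English tagger on a sentence whose formula was replaced by `mathformula`. Whether a generic tagger assigns the placeholder a sensible role (noun? numeral?) is exactly the open problem; the output shown is only what one would typically expect.

```python
import nltk

# One-time model download for NLTK's default English tagger.
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = ["For", "mathformula", ",", "we", "know", "that",
            "the", "series", "converges", "."]
print(nltk.pos_tag(sentence))
# e.g. [('For', 'IN'), ('mathformula', 'NN'), (',', ','), ...]
```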
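---

## Sketch: building an operator tree

A minimal sketch of the operator-tree construction described above, via classic precedence climbing over a toy grammar of infix operators. Real formula parsing must also handle the prefix, postfix, and circumfix notations in the fixity table, plus elisions and underspecified fragments.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}

def parse(tokens, min_prec=1):
    """Precedence climbing: build an applicative (op, left, right) tree."""
    node = tokens.pop(0)  # an operand
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # '^' is right-associative; the other operators are left-associative.
        next_min = PRECEDENCE[op] if op == "^" else PRECEDENCE[op] + 1
        node = (op, node, parse(tokens, next_min))
    return node

print(parse("a + b * c ^ d ^ e".split()))
# ('+', 'a', ('*', 'b', ('^', 'c', ('^', 'd', 'e'))))
```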
---

## Meta-task: Gold Standards for training and evaluation

1. Nothing is really ready
2. One idea: use annotations from LaTeX sources
   - "abstract" for text summarization tasks
   - "definitions", "theorems", etc. for classification tasks (a sketch follows below)
3. Alternative: invest in annotation
   - the "KAT" toolkit for web-based annotation of math-rich documents
   - Stand-off annotations require some ontology and storage solutions

---

# Agenda

1. Math-rich NLP
   * Challenge: Establish a solid foundation
2. Datasets and resources
   * Challenge: Standardize community datasets
3. Toolkits and best practices
   * Challenge: Reproducible, open science
4. Example MathNLP Tasks
   * Challenge: Reach parity with NLP state-of-art, adapt methods
5. **Large-scale processing**
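---

## Sketch: harvesting gold labels from LaTeX sources

Back to the gold-standards idea: a minimal sketch of harvesting labeled snippets from LaTeX sources, assuming vanilla `\begin{theorem}`-style environments. Real arXiv markup varies widely per document class and author macros, which is why this is only a starting point.

```python
import re

# Matches \begin{definition} ... \end{definition} and friends, non-greedily.
ENV_RE = re.compile(
    r"\\begin\{(definition|theorem|lemma|proof)\}(.*?)\\end\{\1\}",
    re.DOTALL,
)

def harvest(tex_source):
    """Collect (label, snippet) pairs usable as classification training data."""
    return [(env, body.strip()) for env, body in ENV_RE.findall(tex_source)]

sample = r"""
\begin{definition}A \emph{prime} has exactly two divisors.\end{definition}
\begin{theorem}There are infinitely many primes.\end{theorem}
"""
print(harvest(sample))
```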
---

# Large-scale processing

1. Certain corpus-level tasks are intractable on a single CPU
2. Solutions for scaling up depend on a number of constraints: monetary, hardware, admin regulations
3. Any resource is a potential bottleneck - CPU/GPUs, RAM, HDD I/O

---

## Two Examples at Scale

1. Processing arXiv with LaTeXML
   - takes just under **5 CPU years**
   - the CorTeX setup uses ~420 CPUs to finish in 100 hours (4-5 days)
   - 1.52% of jobs timed out after 20 minutes
2. GloVe model on arXiv
   - token model generation (~1 day, single-threaded)
   - GloVe model generation (~6 hours, 10 threads)

---

# Future challenges for MathNLP

Tried to motivate:

1. Establish a solid foundation
2. Standardize community datasets
3. Reproducible, open science
4. Reach parity with NLP state-of-art, adapt methods
5. Smart computing under real-world constraints; minimize turnaround

---

class: center, middle

## Thank you!

**Questions?**