Event: AITP16
Keywords: Math NLP, Corpora, arXiv, LaTeXML
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Research overview of large-scale natural language processing on math-rich scientific documents, as carried out by the KWARC research group.
Date: 2016-04-05
class: center, middle

# Math-rich NLP on Billion Token Corpora
## AITP 2016

.footnote[**Deyan Ginev**]
---

# Agenda

1. Math-rich NLP
2. Datasets and resources
3. Toolkits and best practices
4. Example MathNLP Tasks
5. Large-scale processing

---

# Math-rich NLP

* _"Natural Language Processing"_ - umbrella term for computer reading and understanding of human language.
* _"Math-rich NLP"_ - focus on the interplay between the math and language modalities.
* In practice, it analyzes discourse in the hard sciences and engineering.

---

## Academic state of MathNLP

* Niche domain - in both funding and manpower
* Linguistic phenomena of math expressions are largely unstudied at scale
* Not even exhaustive accounts of notations, symbols, operators
* Various aspects in infancy (datasets, representations, methods)
* MathIR is the slightly more popular cousin

---

class: middle

# Challenge: We need a solid foundation

1. Datasets (shared, large, high quality)
2. Best practices ("Open Science")
3. Toolkits (free, open source)

---

# Agenda

1. Math-rich NLP
   * Challenge: Establish a solid foundation
2. **Datasets and resources**
3. Toolkits and best practices
4. Example MathNLP Tasks
5. Large-scale processing

---

## MathNLP resources, by genre (1/2)

- encyclopedic repositories (largely open)
  * Wikipedia, PlanetMath, MathWorld, nLab, Encyclopedia of Math, DLMF, MathHub
  * indexed ~100k relevant pages (NNexus)
- scholarly reviews (largely closed)
  * ZBMath, MathSciNet
  * indexed ~3 million reviews (MathWebSearch)
- collaborative e-learning and solving
  * blogs - dig them out of https://commoncrawl.org/
  * MathOverflow
  * https://archive.org/download/stackexchange
  * 100k forum posts, 400k comments

---

## MathNLP resources, by genre (2/2)

### Conference and journal articles

- Preprints
  * arXiv.org - Cornell's preprint archive
  * http://arxiv.org/help/bulk_data_s3
  * 1.03 million TeX sources (up to 02.2016)
  * bioRxiv (PDF only)
- Proceedings
  * behind publisher paywalls
  * research isn't immediately "fair use"
  * also applies to the textbook, reference manual, etc. genres

---

class: center, middle
Fig. 3: Percentage of papers published by the five major publishers, by discipline in the Natural and Medical Sciences, 1973–2013.

**"The Oligopoly of Academic Publishers in the Digital Era"**, _Vincent Larivière et al._

---

## MathNLP resources, by domain

* Narrative techniques and structures are often specific to a scientific domain
* Examples:
  1. Theorems and proofs are almost missing from experimental domains
     * e.g. laser and astro physics
  2. Methods and discussion sections (data-driven) are missing from theoretical domains
     * e.g. category theory and string theory
  3. While K12 notation is (almost) universal, domain-specific math notation is also common
     * e.g. bra-ket in quantum physics $\langle\phi\mid\psi\rangle$

---

## MathNLP resources, by representation

1. PDF
   * most common for archival and distribution
   * often requires OCR, reconstruction work
2. LaTeX sources
   * most common for the hard sciences
   * require a Turing-complete typesetting pass
3. Word sources
   * common in the life sciences
   * are there any leads for potential datasets?
4. HTML5
   * de-facto standard for any modern web-authored resource
   * often with LaTeX equations
5. Older artifacts, other?
   * e.g. entire documents as bitmap images

---

## Challenge: Can we standardize our datasets?

- My suggestion: **Let's use arXiv.org**
- The NTCIR Math task: based on 100,000 arXiv HTML5 sources since 2013
- There is more to do
  * solve copyright concerns
  * publish a well-documented dataset with a DOI (semi-annually)
  * invite everyone to join (this talk!)
  * keep improving [conversion success rates](http://cortex.mathweb.org/corpus/arXMLiv/tex_to_html)

We're at 61.71% error-free conversions, and 92.74% have HTML5 results.

---

## An HTML5 dataset of arXiv.org (preliminary)

|         count | item             |
| -------------:|:-----------------|
|       966,000 | HTML5 arXiv docs |
|    58,500,000 | paragraphs       |
|   234,200,000 | sentences        |
|    30,700,000 | inline citations |
|   351,500,000 | formulas         |

---

## A plain-text token model of arXiv.org (preliminary)

|         count | item                      |
| -------------:|:--------------------------|
| 4,260,000,000 | word tokens               |
|   351,500,000 | 'mathformula' tokens      |
|     3,390,000 | unique tokens             |
|       820,000 | unique with >5 occurrences |

In comparison (as bundled by the GloVe project):

- Common Crawl (840B tokens, 2.2M vocab, cased)
- Wikipedia 2014 + Gigaword 5 (1.6B+4.3B tokens, 400K vocab, uncased)
- Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased)
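---

## Sketch: deriving the plain-text token model

The table above collapses every formula into a single `mathformula` token. A minimal sketch of that reduction, assuming the HTML5 documents expose formulas as `<math>` elements; the input file name is hypothetical, and this is not the exact arXMLiv pipeline code.

```python
from lxml import html

MATH_TOKEN = "mathformula"  # the placeholder token counted in the table above

def tokenize_document(path):
    """Return word tokens with every formula collapsed to a single token."""
    root = html.parse(path).getroot()
    # Materialize the list first, since we mutate the tree while walking it.
    for math in list(root.iter("math")):
        placeholder = math.makeelement("span", {})
        placeholder.text = " %s " % MATH_TOKEN
        placeholder.tail = math.tail
        math.getparent().replace(math, placeholder)
    return root.text_content().split()

tokens = tokenize_document("sample_arxiv_article.html")  # hypothetical file
```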
---

## Demo model: GloVe (preliminary)

**Global Vectors for Word Representation**

> GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

* Project: http://nlp.stanford.edu/projects/glove/
* Source: https://github.com/stanfordnlp/GloVe
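---

## Sketch: querying the demo model

A minimal sketch of the kind of demo run on the trained vectors: load GloVe's text output (one `word v1 v2 ...` line per word) and list nearest neighbours by cosine similarity. The file name follows GloVe's default output; the query word is an arbitrary example.

```python
import numpy as np

def load_vectors(path):
    """Load a GloVe text-format model into a word list and a vector matrix."""
    words, rows = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            rows.append(np.array(parts[1:], dtype=np.float32))
    matrix = np.vstack(rows)
    # Normalize rows so a dot product equals cosine similarity.
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    return words, matrix

def nearest(word, words, matrix, k=5):
    """Return the k words most similar to `word`, skipping the word itself."""
    sims = matrix @ matrix[words.index(word)]
    return [words[i] for i in np.argsort(-sims)[1 : k + 1]]

words, matrix = load_vectors("vectors.txt")  # GloVe's default output name
print(nearest("manifold", words, matrix))
```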
---

# Agenda

1. Math-rich NLP
   * Challenge: Establish a solid foundation
2. Datasets and resources
   * Challenge: Standardize community datasets
3. **Toolkits and best practices**
4. Example MathNLP Tasks
5. Large-scale processing

---

## Warning!

Badly evaluated methods are daydreaming. To meaningfully compare, we need:

1. The same dataset
2. The same preprocessing
3. The same evaluation techniques
4. Ideally, both methods tried together
5. And compared to / aligned with the literature

**Challenge:** Engineer an academic arena for MathNLP to grow a community and escape infancy.

---

## Reuse and build on prior work

Embrace "Open Science" principles:

1. Open Access publishing
   * Can others find and access our results?
2. Open Data experiments
   * Can others scrutinize and reuse our datasets?
3. Open Source research software
   * Can others reproduce and verify the experimental setup and methods?

**Challenge 1:** Package results so that any grad student can reproduce them with a week of effort.

**Challenge 2:** Publish as a means to open new research avenues for the community, not to ensure "vendor lock-in".

---

## Provide toolkits back to the ecosystem

Our KWARC group is actively working on a few:

1. [CorTeX](https://github.com/dginev/CorTeX), build system
   * Automates representation transitions
   * Poor man's Apache Spark
   * Last resort if the HTML5 dataset is impractical to share
2. [llamapun](https://github.com/KWARC/llamapun), MathNLP toolkit
   * common language and mathematics processing algorithms
   * Based on the interplay of DOM and normalized text
3. [KAT](https://github.com/KWARC/KAT), an annotation tool for STEM documents
4. [NNexus](https://github.com/dginev/nnexus), concept discovery and auto-linking
   * Gazetteer of encyclopedic concepts from 7 resources
   * Naive NER for English math texts ("concept discovery")
   * Multi-link annotations to concept definitions

---

# Agenda

1. Math-rich NLP
   * Challenge: Establish a solid foundation
2. Datasets and resources
   * Challenge: Standardize community datasets
3. Toolkits and best practices
   * Challenge: Reproducible, open science
4. **Example MathNLP Tasks**
5. Large-scale processing

---

## Statistical and Formal Semantics

* usually shallow/statistical and deep/formal matchups
* the space is open for hybrids - a hot topic at AITP16!

---

# Math NLP Tasks, examples

* Language detection
* Tokenization
* Named entity recognition (NER)
* Part-of-speech tagging (POST)
* Phrase structure parsing
* Automatic summarization
* Sentiment analysis

.footnote[minimal code sketches for several of these tasks follow after the task slides]

---

## Task: Tokenization

> "The process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements."

* Is $x^\prime$ a complex variable name, or a first derivative?
* Is $\frac{d}{dx}$ an embellished differential operator, or a fraction?
* We can instantly disambiguate this for calculus.
* Is that true for the other [>450 subfields](https://en.wikipedia.org/wiki/Glossary_of_areas_of_mathematics)?

---

## Task: Named Entity Recognition (NER)

> "Locate and classify elements in text into predefined categories, such as names of persons, locations, mathematical concepts."

* Extract the definiendum, possibly with its definiens, from a definition
* Extract all mathematical tokens (often compound) for mathematical constants, e.g.
  * $\mathbb{N}, \mathbb{R}$ in mathematics,
  * $c, h$ in physics,
  * $s, m, \mathrm{sol}$ in measurement units.

---

## Task: Part-of-speech tagging (POST)

> Obtain and tag each lexeme with a grammatical tag (nouns, verbs, adjectives, adverbs, etc.)

* As an example, is a formula's grammatical role a:
  * sentence: _"This is the case when $x>0$."_
  * modified noun: _"For $x>0$, we know that..."_
  * numeral adjective: _"Take the first $2k+1$ prime numbers."_

---

## Task: Phrase structure/formula parsing

* _Phrase structure parsing_ in NLP - uncover the "deep structure" of a sentence
  * Semantic tree with underspecified symbols, usually 'typed' by POS tags
* Layout trees
  * "How does it look?"
  * Largely specified in LaTeX - fractions, root arguments, scripts, etc.
* Operator trees
  * "What does it mean?"
  * build applicative trees of operators and operands, as implied by fixity, precedence, argument types, in the current notation(s).
* Curious: LaTeXML reports it failed to parse 15.97 million out of 351.5 million formulas from arXMLiv, a success rate of 95.45%.

---

## Task: Formula parsing - operator tree subtasks

* represent vagueness
  - ellipses: $x_1, \ldots, x_n$
  - elisions: $\displaystyle\sum_x x_i$
* deal with syntactic and semantic ambiguity:

| example | meaning | fixity |
|:-------------:|:-------------|-----:|
| $x \mid 3$ | divisible by | infix |
| $\\{x \mid x\textrm{ is prime}\\}$ | set predicate | circumfix |
| $\left.\frac{1}{y}\right\vert^4_2$ | evaluated at | postfix |

* fragments such as _"is prime"_ need to remain **underspecified** until handled correctly.
* Many interesting phenomena and notations. Starting to go deep.

---

## Task: Going Deeper

1. Domain pragmatics
   * how many papers will define $\mathbb{R}$?
   * the expected background knowledge of an early-stage graduate student
2. Papers as theories. An author can:
   * define new terms, symbols, operators, notations
   * recall existing knowledge via a recap or citation
   * conjecture and/or derive new propositions
     * via formal proof
     * via experimental data (figures/tables)
   * offer "axiomatic" summaries of central results
3. Deep tasks intuitively come after (or at least together with) the shallow tasks.
4. We're calling this incremental view of enrichment **"The Flexiformalist Manifesto"**.
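---

## Sketch: a math-aware tokenizer

A minimal sketch of the tokenization concern above, assuming formulas have already been collapsed to `mathformula` placeholders: a naive word tokenizer must at least keep such placeholders, words, and punctuation apart. The regular expression is illustrative, not the llamapun implementation.

```python
import re

# One token: a formula placeholder, a word (with optional ' suffix), or punctuation.
TOKEN_RE = re.compile(r"mathformula|[A-Za-z]+(?:'[A-Za-z]+)?|[.,;:!?()]")

def tokenize(sentence):
    """Split a plain-text sentence into word-level tokens."""
    return TOKEN_RE.findall(sentence)

print(tokenize("If mathformula is prime, then mathformula divides mathformula."))
# ['If', 'mathformula', 'is', 'prime', ',', 'then', 'mathformula', ...]
```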
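---

## Sketch: gazetteer-based concept discovery

A minimal sketch of the "concept discovery" flavour of NER, in the spirit of NNexus: scan tokenized text for the longest matches against a gazetteer of known concept names. The three-entry gazetteer is a toy assumption; a real one holds on the order of 100k names harvested from encyclopedic resources.

```python
GAZETTEER = {  # toy entries, mapping concept names to definition links
    ("prime", "number"): "https://en.wikipedia.org/wiki/Prime_number",
    ("group",): "https://en.wikipedia.org/wiki/Group_(mathematics)",
    ("banach", "space"): "https://en.wikipedia.org/wiki/Banach_space",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def discover(tokens):
    """Greedy longest-match lookup of concept mentions in a token list."""
    i, found = 0, []
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i : i + n])
            if key in GAZETTEER:
                found.append((" ".join(tokens[i : i + n]), GAZETTEER[key]))
                i += n
                break
        else:
            i += 1
    return found

print(discover("Every Banach space is a group under addition".split()))
```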
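---

## Sketch: POS tagging with a formula placeholder

A minimal sketch of the POST question above, using NLTK's off-the-shelf English tagger on a sentence whose formula was replaced by `mathformula`. Whether a generic tagger assigns the placeholder a sensible role (noun? numeral?) is exactly the open problem; the output shown is only what one would typically expect.

```python
import nltk

# One-time model download for NLTK's default English tagger.
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = ["For", "mathformula", ",", "we", "know", "that",
            "the", "series", "converges", "."]
print(nltk.pos_tag(sentence))
# e.g. [('For', 'IN'), ('mathformula', 'NN'), (',', ','), ...]
```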
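---

## Sketch: building an operator tree

A minimal sketch of the operator-tree construction described above, via classic precedence climbing over a toy grammar of infix operators. Real formula parsing must also handle the prefix, postfix, and circumfix notations in the fixity table, plus elisions and underspecified fragments.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}

def parse(tokens, min_prec=1):
    """Precedence climbing: build an applicative (op, left, right) tree."""
    node = tokens.pop(0)  # an operand
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # '^' is right-associative; the other operators are left-associative.
        next_min = PRECEDENCE[op] if op == "^" else PRECEDENCE[op] + 1
        node = (op, node, parse(tokens, next_min))
    return node

print(parse("a + b * c ^ d ^ e".split()))
# ('+', 'a', ('*', 'b', ('^', 'c', ('^', 'd', 'e'))))
```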
---

## Meta-task: Gold Standards for training and evaluation

1. Nothing is really ready
2. One idea: use annotations from LaTeX sources
   - "abstract" for text summarization tasks
   - "definitions", "theorems", etc. for classification tasks (a sketch follows below)
3. Alternative: invest in annotation
   - the "KAT" toolkit for web-based annotation of math-rich documents
   - Stand-off annotations require some ontology and storage solutions

---

# Agenda

1. Math-rich NLP
   * Challenge: Establish a solid foundation
2. Datasets and resources
   * Challenge: Standardize community datasets
3. Toolkits and best practices
   * Challenge: Reproducible, open science
4. Example MathNLP Tasks
   * Challenge: Reach parity with NLP state-of-art, adapt methods
5. **Large-scale processing**
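---

## Sketch: harvesting gold labels from LaTeX sources

Back to the gold-standards idea: a minimal sketch of harvesting labeled snippets from LaTeX sources, assuming vanilla `\begin{theorem}`-style environments. Real arXiv markup varies widely per document class and author macros, which is why this is only a starting point.

```python
import re

# Matches \begin{definition} ... \end{definition} and friends, non-greedily.
ENV_RE = re.compile(
    r"\\begin\{(definition|theorem|lemma|proof)\}(.*?)\\end\{\1\}",
    re.DOTALL,
)

def harvest(tex_source):
    """Collect (label, snippet) pairs usable as classification training data."""
    return [(env, body.strip()) for env, body in ENV_RE.findall(tex_source)]

sample = r"""
\begin{definition}A \emph{prime} has exactly two divisors.\end{definition}
\begin{theorem}There are infinitely many primes.\end{theorem}
"""
print(harvest(sample))
```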
---

# Large-scale processing

1. Certain corpus-level tasks are intractable on a single CPU
2. Solutions for scaling up depend on a number of constraints: monetary, hardware, admin regulations
3. Any resource is a potential bottleneck - CPU/GPUs, RAM, HDD I/O

---

## Two Examples at Scale

1. Processing arXiv with LaTeXML
   - takes just under **5 CPU years**
   - the CorTeX setup uses ~420 CPUs to finish in 100 hours (4-5 days)
   - 1.52% of jobs timed out after 20 minutes
2. GloVe model on arXiv
   - token model generation (~1 day, single-threaded)
   - GloVe model generation (~6 hours, 10 threads)

---

# Future challenges for MathNLP

Tried to motivate:

1. Establish a solid foundation
2. Standardize community datasets
3. Reproducible, open science
4. Reach parity with NLP state-of-art, adapt methods
5. Smart computing under real-world constraints; minimize turnaround

---

class: center, middle

## Thank you!

**Questions?**