AITP16
Math NLP
Corpora
arXiv
LaTeXML
https://creativecommons.org/licenses/by/4.0/
Scientific statements annotated in arXiv.org papers, posed as a classification task. Baselines, ablations, and a live showcase.
2021-04-12
class: center, middle

## Scientific Statement Classification over arXiv.org

.footnote[**Deyan Ginev**]
.remark-venue[SIGMathLing Seminar, April 11, 2021]
---

## Quick History of arXMLiv

* KWARC+NIST since 2006, converting arXiv to HTML via LaTeXML.
* LaTeXML: main workhorse of DLMF, [arxiv-vanity.com](https://arxiv-vanity.com), [ARQMath](https://www.cs.rit.edu/~dprl/ARQMath/)
* arXiv has accelerating growth / moving target
---

## Quick History of arXMLiv, 2020

* SIGMathLing annual releases: [dataset, embeddings, statements](https://sigmathling.kwarc.info/resources/arxmliv/)
* \>96% of sources can be converted, now \>70% error-free
---

### Increase in using arXiv for Neural NLP

1. KWARC/NIST: [Scientific Statement Classification](https://www.aclweb.org/anthology/2020.lrec-1.153/)
2. AllenAI: [SciREX](https://github.com/allenai/SciREX) (uses latexml), [s2orc](https://github.com/allenai/s2orc), [SciTLDR](https://arxiv.org/abs/2004.15011)
3. OpenAI: [ATP for Metamath](https://arxiv.org/abs/2009.03393)
4. Google: [Autoformalization](https://link.springer.com/chapter/10.1007/978-3-030-53518-6_1) effort
5. Montréal: [Abstractive Neural Document Summarization](https://arxiv.org/abs/1909.03186)
6. Uni Maryland: [Logical structure](https://arxiv.org/abs/1709.00770)
7. a surprising variety of other incremental results

### arXiv is Openly Available as TeX and PDF

- [s3](https://arxiv.org/help/bulk_data_s3) or [kaggle](https://www.kaggle.com/Cornell-University/arxiv) downloads (see the boto3 sketch below)
- but you save a month+ of work by reusing our [html data](https://sigmathling.kwarc.info/resources/arxmliv/)

---
count: false

### Increase in using arXiv for Neural NLP

1. KWARC/NIST: [Scientific Statement Classification](https://www.aclweb.org/anthology/2020.lrec-1.153/)
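---

### Aside: bulk download via boto3

Not part of the original pipeline, just a minimal sketch of the S3 bulk-download route linked above, assuming the requester-pays bucket layout (`arxiv` bucket, `src/` prefix, manifest key) described on the arXiv help page; reusing the prepared HTML data remains the faster option.

```python
# Illustrative sketch: fetch the arXiv source manifest from the requester-pays
# S3 bucket (the requester is billed for the transfer). Bucket and key names
# follow https://arxiv.org/help/bulk_data_s3 and are assumptions here.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

response = s3.get_object(
    Bucket="arxiv",                    # arXiv's requester-pays bucket
    Key="src/arXiv_src_manifest.xml",  # manifest listing the monthly source tarballs
    RequestPayer="requester",          # required flag: requester pays for transfer
)

manifest_xml = response["Body"].read().decode("utf-8")
print(manifest_xml[:400])              # peek at the first manifest entries
```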
---

## Agenda

1. **arXiv statements**
2. Posing a classification task
3. Baselines and ablations
4. Live showcase

---

## arXiv statements - how many?

- Whitelisted AMS environments
  - defined via `\newtheorem`
  - filter via latexml+llamapun
  - manually curate for e.g. `{mainthm}`, `{xthm}`, ...
- Select only **leading paragraph**
  - 96.46% of which fit in a 480 token window
  - multiple paragraphs have noisier signal
- collected a total of **10.5 million** paragraphs
  - in 50 classes

---

## arXiv statements - 50 classes

.center.padded-table.boxed[

| | | | | |
|:--------|:----------|:---------|:------|:-----------|
| abstract | condition | exercise | lemma | proposition |
| acknowledgement | conjecture | expansion | method | question |
| affirmation | constraint | expectation | notation | related work |
| answer | convention | experiment | note | remark |
| assumption | corollary | explanation | notice | result |
| bound | criterion | fact | observation | rule |
| case | definition | hint | overview | solution |
| claim | demonstration | introduction | principle | step |
| comment | discussion | issue | problem | summary |
| conclusion | example | keywords | proof | theorem |

]

---

## Agenda

1. arXiv statements
2. **Posing a classification task**
3. Baselines and ablations
4. Live showcase

---

## Classification setup

- A rare case of existing large-scale labeled samples (10.5m)
- Use standard 80/20 train/test split
- Annotation of author *intention* rather than reader *description*.
- Questions:
  - which classes are linguistically separable?
  - does context aid classification?
  - does math syntax aid classification?

---

## arXiv statements - 50 classes

.center.padded-table.boxed[

| | | | | |
|:--------|:----------|:---------|:------|:-----------|
| abstract | condition | exercise | lemma | proposition |
| acknowledgement | conjecture | expansion | method | question |
| affirmation | constraint | expectation | notation | related work |
| answer | convention | experiment | note | remark |
| assumption | corollary | explanation | notice | result |
| bound | criterion | fact | observation | rule |
| case | definition | hint | overview | solution |
| claim | demonstration | introduction | principle | step |
| comment | discussion | issue | problem | summary |
| conclusion | example | keywords | proof | theorem |

]

---
count: false

## 50 classes, BiLSTM baseline, F1 > 0.45

.center.boxed[

| | | | | |
|:--------|:----------|:---------|:------|:-----------|
| abstract (0.95) | | | lemma (0.72) | |
| acknowledgement (1.00) | | | method (0.49) | question (0.84) |
| | | | | related work (0.63) |
| | | | | remark (0.74) |
| | | | | result (0.74) |
| | definition (0.91) | | | |
| | | introduction (0.90) | | |
| | | | problem (0.47) | |
| conclusion (0.73) | example (0.62) | keywords (0.84) | proof (0.90) | theorem (0.62) |

]

---
class: center

## Tease apart: regroup into 13 "nests"
Data at: https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/

---

## Agenda

1. arXiv statements
2. Posing a classification task
3. **Baselines and ablations**
4. Live showcase

---

## GloVe embeddings (work from 2018)

Compute GloVe ("Global Vectors for Word Representation") embeddings over arXiv:

- 300 dimensional,
- uncased, except for math syntax,
- words with frequency 5+ in arXMLiv
- final vocabulary of 1 million words, over 11 billion tokens
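Not the project's own tooling, just a minimal loading sketch assuming the standard GloVe text format (one token followed by 300 floats per line); the file name is a placeholder.

```python
# Minimal loader for GloVe-format vectors: "<token> <f1> ... <f300>" per line.
import numpy as np

def load_glove(path, dim=300):
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:      # skip any malformed lines defensively
                continue
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# glove = load_glove("glove.arxmliv.300d.txt")  # placeholder file name
# ordinary words are stored uncased, so lower-case them before lookup:
# vec = glove.get("theorem".lower())
```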
.footnote.left[Available at: https://sigmathling.kwarc.info/resources/arxmliv-embeddings-082018/]

---
class: center

## Baselines
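As a concrete reference point, a minimal PyTorch sketch of a BiLSTM paragraph classifier of the kind reported as baseline: pretrained GloVe vectors feed a bidirectional LSTM whose pooled states go through a linear layer over the label set (50 classes, or 13 nests). The hyperparameters, pooling choice and vocabulary size below are placeholders, not the exact experimental setup.

```python
# Illustrative BiLSTM baseline sketch (not the exact experimental setup).
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, embedding_matrix, num_classes=13, hidden_size=128):
        super().__init__()
        # embedding_matrix: (vocab_size, 300) tensor of pretrained GloVe vectors
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=True)
        self.encoder = nn.LSTM(
            input_size=embedding_matrix.size(1),
            hidden_size=hidden_size,
            batch_first=True,
            bidirectional=True,
        )
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len), with seq_len capped at the 480-token window
        embedded = self.embedding(token_ids)
        states, _ = self.encoder(embedded)
        pooled = states.mean(dim=1)       # average BiLSTM states over the sequence
        return self.classifier(pooled)    # unnormalized scores over the label set

# Example with a small placeholder vocabulary (the real one has ~1M words):
# model = BiLSTMClassifier(torch.randn(10_000, 300))
# logits = model(torch.randint(0, 10_000, (2, 480)))   # two 480-token paragraphs
```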
---
count: false
class: center

## Baselines
.footnote[Best: ]

---
class: middle

## Ablations - GloVe
.footnote.right[ [source](https://prodg.org/blog/deep_expertise/) ]

---
class: middle

## Ablations - training data size
.footnote.right[ [source](https://prodg.org/blog/deep_expertise/) ]

---
class: middle

## Future data outlook (arXMLiv 2019)
.footnote.right[ [source](https://prodg.org/blog/arxiv_headings/) ]

---

## Agenda

1. arXiv statements
2. Posing a classification task
3. Baselines and ablations
4. **Live showcase**

https://corpora.mathweb.org/classify_paragraph