AITP16
Math NLP
Corpora
arXiv
LaTeXML
https://creativecommons.org/licenses/by/4.0/
Scientific statements annotated in arXiv.org papers, posed as a classification task. Baselines, ablations, and a live showcase.
2021-04-12
class: center, middle

## Scientific Statement Classification over arXiv.org

.footnote[**Deyan Ginev**]
.remark-venue[SIGMathLing Seminar, April 11, 2021]
---

## Quick History of arXMLiv

* KWARC+NIST since 2006, converting arXiv to HTML via LaTeXML.
* LaTeXML: main workhorse of DLMF, [arxiv-vanity.com](https://arxiv-vanity.com), [ARQMath](https://www.cs.rit.edu/~dprl/ARQMath/)
* arXiv has accelerating growth / moving target
---

## Quick History of arXMLiv, 2020

* SIGMathLing annual releases: [dataset, embeddings, statements](https://sigmathling.kwarc.info/resources/arxmliv/)
* \>96% of sources can be converted, now \>70% error-free
---

### Increase in using arXiv for Neural NLP

1. KWARC/NIST: [Scientific Statement Classification](https://www.aclweb.org/anthology/2020.lrec-1.153/)
2. AllenAI: [SciREX](https://github.com/allenai/SciREX) (uses latexml), [s2orc](https://github.com/allenai/s2orc), [SciTLDR](https://arxiv.org/abs/2004.15011)
3. OpenAI: [ATP for Metamath](https://arxiv.org/abs/2009.03393)
4. Google: [Autoformalization](https://link.springer.com/chapter/10.1007/978-3-030-53518-6_1) effort
5. Montréal: [Abstractive Neural Document Summarization](https://arxiv.org/abs/1909.03186)
6. Uni Maryland: [Logical structure](https://arxiv.org/abs/1709.00770)
7. a surprising variety of other incremental results

### arXiv is Openly Available as TeX and PDF

- [s3](https://arxiv.org/help/bulk_data_s3) or [kaggle](https://www.kaggle.com/Cornell-University/arxiv) downloads (see the boto3 sketch below)
- but you save a month+ of work by reusing our [html data](https://sigmathling.kwarc.info/resources/arxmliv/)

---
count: false

### Increase in using arXiv for Neural NLP

1. KWARC/NIST: [Scientific Statement Classification](https://www.aclweb.org/anthology/2020.lrec-1.153/)
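---

### Aside: bulk download via boto3

Not part of the original pipeline, just a minimal sketch of the S3 bulk-download route linked above, assuming the requester-pays bucket layout (`arxiv` bucket, `src/` prefix, manifest key) described on the arXiv help page; reusing the prepared HTML data remains the faster option.

```python
# Illustrative sketch: fetch the arXiv source manifest from the requester-pays
# S3 bucket (the requester is billed for the transfer). Bucket and key names
# follow https://arxiv.org/help/bulk_data_s3 and are assumptions here.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

response = s3.get_object(
    Bucket="arxiv",                    # arXiv's requester-pays bucket
    Key="src/arXiv_src_manifest.xml",  # manifest listing the monthly source tarballs
    RequestPayer="requester",          # required flag: requester pays for transfer
)

manifest_xml = response["Body"].read().decode("utf-8")
print(manifest_xml[:400])              # peek at the first manifest entries
```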
---

## Agenda

1. **arXiv statements**
2. Posing a classification task
3. Baselines and ablations
4. Live showcase

---

## arXiv statements - how many?

- Whitelisted AMS environments
  - defined via `\newtheorem`
  - filter via latexml+llamapun
  - manually curate for e.g. `{mainthm}`, `{xthm}`, ...
- Select only **leading paragraph**
  - 96.46% of which fit in a 480 token window
  - multiple paragraphs have noisier signal
- collected a total of **10.5 million** paragraphs
  - in 50 classes

---

## arXiv statements - 50 classes

.center.padded-table.boxed[

| | | | | |
|:--------|:----------|:---------|:------|:-----------|
| abstract | condition | exercise | lemma | proposition |
| acknowledgement | conjecture | expansion | method | question |
| affirmation | constraint | expectation | notation | related work |
| answer | convention | experiment | note | remark |
| assumption | corollary | explanation | notice | result |
| bound | criterion | fact | observation | rule |
| case | definition | hint | overview | solution |
| claim | demonstration | introduction | principle | step |
| comment | discussion | issue | problem | summary |
| conclusion | example | keywords | proof | theorem |

]

---

## Agenda

1. arXiv statements
2. **Posing a classification task**
3. Baselines and ablations
4. Live showcase

---

## Classification setup

- A rare case of existing large-scale labeled samples (10.5m)
- Use standard 80/20 train/test split
- Annotation of author *intention* rather than reader *description*.
- Questions:
  - which classes are linguistically separable?
  - does context aid classification?
  - does math syntax aid classification?

---

## arXiv statements - 50 classes

.center.padded-table.boxed[

| | | | | |
|:--------|:----------|:---------|:------|:-----------|
| abstract | condition | exercise | lemma | proposition |
| acknowledgement | conjecture | expansion | method | question |
| affirmation | constraint | expectation | notation | related work |
| answer | convention | experiment | note | remark |
| assumption | corollary | explanation | notice | result |
| bound | criterion | fact | observation | rule |
| case | definition | hint | overview | solution |
| claim | demonstration | introduction | principle | step |
| comment | discussion | issue | problem | summary |
| conclusion | example | keywords | proof | theorem |

]

---
count: false

## 50 classes, BiLSTM baseline, F1 > 0.45

.center.boxed[

| | | | | |
|:--------|:----------|:---------|:------|:-----------|
| abstract (0.95) | | | lemma (0.72) | |
| acknowledgement (1.00) | | | method (0.49) | question (0.84) |
| | | | | related work (0.63) |
| | | | | remark (0.74) |
| | | | | result (0.74) |
| | definition (0.91) | | | |
| | | introduction (0.90) | | |
| | | | problem (0.47) | |
| conclusion (0.73) | example (0.62) | keywords (0.84) | proof (0.90) | theorem (0.62) |

]

---
class: center

## Tease apart: regroup into 13 "nests"
Data at: https://sigmathling.kwarc.info/resources/arxmliv-statements-082018/

---

## Agenda

1. arXiv statements
2. Posing a classification task
3. **Baselines and ablations**
4. Live showcase

---

## GloVe embeddings (work from 2018)

Compute GloVe ("Global Vectors for Word Representation") embeddings over arXiv:

- 300 dimensional,
- uncased, except for math syntax,
- words with frequency 5+ in arXMLiv
- final vocabulary of 1 million words, over 11 billion tokens
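Not the project's own tooling, just a minimal loading sketch assuming the standard GloVe text format (one token followed by 300 floats per line); the file name is a placeholder.

```python
# Minimal loader for GloVe-format vectors: "<token> <f1> ... <f300>" per line.
import numpy as np

def load_glove(path, dim=300):
    vectors = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:      # skip any malformed lines defensively
                continue
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# glove = load_glove("glove.arxmliv.300d.txt")  # placeholder file name
# ordinary words are stored uncased, so lower-case them before lookup:
# vec = glove.get("theorem".lower())
```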
.footnote.left[Available at: https://sigmathling.kwarc.info/resources/arxmliv-embeddings-082018/]

---
class: center

## Baselines
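As a concrete reference point, a minimal PyTorch sketch of a BiLSTM paragraph classifier of the kind reported as baseline: pretrained GloVe vectors feed a bidirectional LSTM whose pooled states go through a linear layer over the label set (50 classes, or 13 nests). The hyperparameters, pooling choice and vocabulary size below are placeholders, not the exact experimental setup.

```python
# Illustrative BiLSTM baseline sketch (not the exact experimental setup).
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, embedding_matrix, num_classes=13, hidden_size=128):
        super().__init__()
        # embedding_matrix: (vocab_size, 300) tensor of pretrained GloVe vectors
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=True)
        self.encoder = nn.LSTM(
            input_size=embedding_matrix.size(1),
            hidden_size=hidden_size,
            batch_first=True,
            bidirectional=True,
        )
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len), with seq_len capped at the 480-token window
        embedded = self.embedding(token_ids)
        states, _ = self.encoder(embedded)
        pooled = states.mean(dim=1)       # average BiLSTM states over the sequence
        return self.classifier(pooled)    # unnormalized scores over the label set

# Example with a small placeholder vocabulary (the real one has ~1M words):
# model = BiLSTMClassifier(torch.randn(10_000, 300))
# logits = model(torch.randint(0, 10_000, (2, 480)))   # two 480-token paragraphs
```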
---
count: false
class: center

## Baselines
.footnote[Best: ]

---
class: middle

## Ablations - GloVe
.footnote.right[ [source](https://prodg.org/blog/deep_expertise/) ]

---
class: middle

## Ablations - training data size
.footnote.right[ [source](https://prodg.org/blog/deep_expertise/) ]

---
class: middle

## Future data outlook (arXMLiv 2019)
.footnote.right[ [source](https://prodg.org/blog/arxiv_headings/) ]

---

## Agenda

1. arXiv statements
2. Posing a classification task
3. Baselines and ablations
4. **Live showcase**

https://corpora.mathweb.org/classify_paragraph