arXiv’s Headings: a preprocessing dive into arXMLiv 08.2019
Recent News
This post contains some follow-up work to our preprint introducing “Scientific Statement Classification”, a supervised learning task over arXiv. We released
- a dataset of 10.5 million annotated paragraphs and
- just this week: a new and updated arXMLiv 08.2019 collection of HTML5 articles from the latest arXiv sources
Here, I outline how to repeat and extend the statement preprocessing task on our new data, using our homegrown llamapun toolkit.
Overview
The goal is to regenerate the statement task dataset, using the new llamapun v0.3.3 tokenization rules, with the aim of gaining more tokens per paragraph and over 10% more paragraph volume, proportional to the increase in data. We also want to extend the class list (previously at 50) with all high frequency heading names of arXiv, as well as figure and table captions.
- Pre: get llamapun up and running;
- Task 1: survey the headings of the new corpus release, assemble the class whitelist;
- Task 2: extract the annotated statement resource, and organize it for redistribution.
Installing llamapun
If you do not have a Rust environment installed, consult rustup.
    $ sudo apt-get install libxml2-dev
    $ git clone https://github.com/kwarc/llamapun
    $ cd llamapun
    $ cargo test --release
Task 1: Survey arXiv’s headings
We prepare the heading collection as an example file at corpus_heading_stats.rs. Here is a simplified overview snippet:
    let corpus = Corpus::new(corpus_path);
    let catalog = corpus.catalog_with_parallel_walk(|document| {
        /* each document is processed in its own thread from a jwalk/rayon thread pool
           we keep a per-document heading frequency catalog,
           which will be reduced with all other document threads into the global corpus catalog */
        let mut thread_counts = HashMap::new();
        // reuse an XPath context for querying
        let mut context = Context::new(&document.dom).unwrap();
        // iterate over all headings via the provided llamapun iterator
        'headings: for mut heading in document.heading_iter() {
            let mut heading_buffer = String::new();
            // collect the heading's words, with data cleaning
            for word in heading.word_iter() {
                // word range normalization (e.g. punctuation removal, math equation representation choices)
                let word_string = data_helpers::ams_normalize_word_range(&word.range, &mut context, false);
                // finally, rebuild a plaintext heading from the valid word strings
                heading_buffer.push_str(&word_string);
                heading_buffer.push(' ');
            }
            /* specific to heading task: normalize similar headings into canonical classes,
               to maximize volume for classification
               e.g. "Definition 2.4" -> "definition"; or "the proof of lemma ref" -> "proof"
               This normalization is inevitably noisy/partially wrong, as large volume brings unexpected cases.
               e.g. is "introduction and related work" to be binned as "introduction", or "related work"
               or ignored entirely? The answer is inevitably pragmatic and carries trade-offs. */
            heading_buffer = data_helpers::normalize_heading_title(&heading_buffer);
            // once we have the heading title, count it as seen in the local thread catalog
            let this_heading_counter = thread_counts.entry(heading_buffer).or_insert(0);
            *this_heading_counter += 1;
        }
        // finally, return this document's heading statistics, to be unified with the rest of the corpus
        thread_counts
    });
    $ echo "Takes 3.5 hours x32 threads, please enjoy a movie/walk..."
    $ cargo run --release --example corpus_heading_stats /data/datasets/dataset-arXMLiv-08-2019/ arxmliv_headings.csv
Running the outlined reporter over the arXMLiv 08.2019 corpus tells us that 1,374,539 documents were traversed, in which 27,948,652 individual headings were examined. 249 were discarded due to errors, and the process took three and a half hours on 32 logical threads.
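The reduction of the per-document catalogs into one corpus catalog happens inside llamapun, so the snippet above never shows it. As a minimal sketch of what such a reduction amounts to, assuming plain `HashMap<String, u64>` counters (the helper names `merge_catalogs` and `reduce_catalogs` are hypothetical and not part of llamapun):

```rust
use std::collections::HashMap;

/// Fold one per-document heading catalog into the running global catalog.
/// Hypothetical helper for illustration; llamapun does its own reduction.
fn merge_catalogs(
    mut global: HashMap<String, u64>,
    local: HashMap<String, u64>,
) -> HashMap<String, u64> {
    for (heading, count) in local {
        // add this document's count to whatever was seen so far
        *global.entry(heading).or_insert(0) += count;
    }
    global
}

/// Reduce any number of per-document catalogs into a single corpus catalog.
fn reduce_catalogs<I>(catalogs: I) -> HashMap<String, u64>
where
    I: IntoIterator<Item = HashMap<String, u64>>,
{
    catalogs.into_iter().fold(HashMap::new(), merge_catalogs)
}
```

Since addition is associative and commutative, the order in which document catalogs arrive from the thread pool does not change the final counts.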
About 23% of those headings were distinct. The other 77% of data volume allows us to claim there is enough repetition for certain headings to be treated as “standard” in the genre(s) of text particular to arXiv. Taking a closer look, we can break down the frequency report by magnitudes:
corpus frequency | distinct headings | example heading title |
---|---|---|
more than 1,000,000 | 7 | proof |
more than 100,000 | 22 | problem |
more than 10,000 | 44 | case |
more than 1,000 | 239 | protocol |
more than 100 | 3,299 | limitation |
more than 10 | 42,286 | threats to external validity |
1 or more | 6,374,003 | heegaard genus of small manifolds |
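A breakdown like this can be computed directly from the heading catalog. The following is a rough sketch, assuming the catalog is a `HashMap<String, u64>` and that the thresholds are powers of ten; the function name `distinct_above` is a hypothetical helper, not something the llamapun example provides:

```rust
use std::collections::HashMap;

/// For each threshold, count the distinct headings whose corpus frequency
/// strictly exceeds it. The counts are cumulative: a heading that passes a
/// high threshold is also counted for every lower one.
fn distinct_above(catalog: &HashMap<String, u64>, thresholds: &[u64]) -> Vec<(u64, usize)> {
    thresholds
        .iter()
        .map(|&t| (t, catalog.values().filter(|&&count| count > t).count()))
        .collect()
}
```

Calling it with thresholds such as `[1_000_000, 100_000, 10_000, 1_000, 100, 10, 0]` yields one row per magnitude.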
Only 3,299 headings have a frequency of 101 or more, a list short enough to share in a single CSV file. You can find it linked as a separate gist here.
The 7 dominant heading classes that pass a million instances in 1.32 million documents are (in order of frequency): proof, lemma, theorem, references, abstract, introduction, proposition.
However, while that is indeed an expected core of a scientific article, we can notice it is heavily skewed to mathematics in particular, with only references, abstract and introduction being “general science”. An experimental article in astrophysics could share those, but would not include any lemmas for example. There could be various factors at play here. For instance, a single mathematical article could introduce tens of theorems, while a typical experimental paper could have just one or two results, as a matter of convention.
We could also interpret the numbers as a sign of success for our normalization step. I only hinted at the normalize_heading_title piece of logic, which is in essence a long list of handcrafted rules such as:
heading if heading.starts_with("theorem") || heading.ends_with(" theorem") => "theorem",
This rule would have redirected a lot of freestyle headings such as “Main theorem” into the single “theorem” entry. Since I heuristically add the rules while studying the reports, so far they only cover about 30 common cases, and only a couple of patterns. So we can still see an entry for “sketch of proof of theorem ref” with frequency 324, even with the mentioned theorem rule. Ideally, we want a new normalization rule that maps this “sketch of proof” to “proof”. Similarly, even after normalizing, we also see “result”-like entries:
alternatives | frequency |
---|---|
result | 420,189 |
simulation results and analysis | 133 |
numerical results and analysis | 112 |
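To make the idea of additional rules concrete, here is a small sketch in the same match-with-guards style. It is a hypothetical, much-reduced stand-in for `normalize_heading_title`, not the llamapun implementation; the rule set, the rule order, and the name `normalize_heading_sketch` are all assumptions made for illustration:

```rust
/// Map a free-form heading onto a canonical class name, or return it
/// unchanged (lowercased and trimmed) when no rule applies.
fn normalize_heading_sketch(heading: &str) -> String {
    let heading = heading.trim().to_lowercase();
    let class = match heading.as_str() {
        // "proof" is checked first, so that "sketch of proof of theorem ref"
        // lands in "proof" before the theorem rule can claim it
        h if h.starts_with("proof") || h.contains(" proof") => "proof",
        h if h.starts_with("theorem") || h.ends_with(" theorem") => "theorem",
        // assumed rule: fold "... results and analysis" into "result"
        h if h.ends_with("results and analysis") => "result",
        // no rule matched: keep the heading as its own entry
        other => other,
    };
    class.to_string()
}
```

Rule order matters here: swapping the first two arms would not change the outcome for this particular heading, but a looser theorem rule (say, `contains("theorem")`) would, which is one reason such lists are best grown alongside the frequency reports.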
In contrast, a much more freehand case – for which we have no dedicated normalization rules (only at time of writing; more rules incoming soon) – is “example”:
alternatives | frequency |
---|---|
example | 302,048 |
examples | 26,911 |
numerical examples | 4,363 |
some examples | 1,233 |
illustrative example | 1,055 |
example citationelement | 1,041 |
motivating example | 790 |
counterexample | 782 |
illustrative examples | 724 |
simple example | 671 |
examples and application | 524 |
counterexamples | 379 |
further examples | 377 |
example continued | 364 |
two examples | 356 |
other examples | 345 |
applications and examples | 339 |
first example | 323 |
toy example | 319 |
numerical example | 306 |
motivating examples | 303 |
example example ref continued | 296 |
running example | 275 |
example mathformula | 247 |
explicit example | 247 |
example application | 245 |
explicit examples | 243 |
basic examples | 240 |
counter example | 236 |
simple examples | 226 |
second example | 225 |
real data example | 211 |
more examples | 201 |
simulation examples | 157 |
application examples | 155 |
simulation example | 149 |
examples of application | 143 |
real data examples | 142 |
example ref continued | 136 |
specific examples | 129 |
example continuation of example ref | 126 |
example cont | 119 |
example see citationelement | 117 |
example ref | 114 |
concrete example | 111 |
first examples | 106 |
worked example | 106 |
example italic_n RELOP_equals NUM | 102 |
examples and counterexamples | 101 |
three examples | 101 |
That makes 49 alternatives with over one hundred instances in our data, with a total volume of 46,511. And arguably, for the purpose of classifying the core statement, each alternative is a contextual variation over the same “example” class. Adding new normalization rules for these cases will thus provide an additional 15% of data to our supervised task in this class.
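The 15% figure is simply the combined volume of the alternatives relative to the base class, 46,511 over 302,048. A tiny sketch of that arithmetic, with a hypothetical helper name `relative_gain`:

```rust
/// Relative volume gained by folding `alternatives` into a class that
/// currently holds `base` instances.
fn relative_gain(base: u64, alternatives: &[u64]) -> f64 {
    let extra: u64 = alternatives.iter().sum();
    extra as f64 / base as f64
}
```

For example, `relative_gain(302_048, &[46_511])` evaluates to roughly 0.154.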
What I think this illustrates is how valuable the data groundwork ultimately is. Careful normalization adds volume and variety, which in turn leads to models that generalize better. To showcase an easy win, currently not a single “simulation” entry passes the 10,000 frequency threshold, but their combined volume is a satisfactory 34,622, making them likely candidates for inclusion in the new set. (Some cases are hard: is “simulation example” an “example” or a “simulation”? Can we be certain, or should this data be excluded?) The alternatives are:
alternatives | frequency |
---|---|
simulations | 9,692 |
numerical simulations | 5,087 |
simulation | 3,887 |
simulation study | 2,598 |
simulation setup | 1,827 |
monte carlo simulations | 1,590 |
simulation details | 1,526 |
numerical simulation | 1,250 |
simulation studies | 1,180 |
monte carlo simulation | 1,115 |
simulation parameters | 569 |
body simulations | 294 |
comparison with simulations | 271 |
molecular dynamics simulations | 264 |
simulation settings | 241 |
simulation set up | 235 |
simulation procedure | 207 |
computer simulations | 191 |
comparison with numerical simulations | 186 |
italic_N body simulations | 184 |
simulation algorithm | 158 |
hydrodynamical simulations | 158 |
simulation design | 157 |
simulation examples | 157 |
cosmological simulations | 153 |
simulation example | 149 |
detector simulation | 149 |
event simulation | 148 |
simulation methodology | 143 |
simulation results and analysis | 133 |
hydrodynamic simulations | 128 |
simulation data | 125 |
simulation framework | 123 |
simulation setting | 121 |
simulation code | 114 |
simulation environment | 112 |
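A combined volume like the one above can be estimated before writing any normalization rule, by summing every catalog entry that mentions a keyword. This is only a sketch under the assumption that a plain substring match is a good enough first pass; `combined_volume` is a hypothetical helper:

```rust
use std::collections::HashMap;

/// Sum the frequencies of all headings that contain `keyword` as a substring.
/// A substring match is deliberately crude: "simulation" also matches
/// "simulations" and ambiguous entries such as "simulation example".
fn combined_volume(catalog: &HashMap<String, u64>, keyword: &str) -> u64 {
    catalog
        .iter()
        .filter(|(heading, _)| heading.contains(keyword))
        .map(|(_, count)| *count)
        .sum()
}
```

Comparing the result against a threshold such as 10,000 gives a quick signal of whether a family of headings is worth a dedicated rule.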
If indeed 10,000 entries is a good threshold for reliable modeling, good normalization could be the difference between having 45 and 90 usable classes. If it turns out that 100,000 is a more appropriate data threshold for e.g. deeper models, normalization plays a slightly smaller part, as the volume barrier is hard to breach with a 15% jump in our case – only “notation” would be a viable candidate.
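The “notation” remark can be checked with a one-line test: a class is a candidate if it sits below the threshold today but would reach it after a given relative jump. A minimal sketch, where the 15% gain is borrowed from the “example” case and assumed to transfer to other classes, and `crosses_with_gain` is a hypothetical name:

```rust
/// True if a class currently below `threshold` would reach it after
/// growing by the relative `gain` (e.g. 0.15 for a 15% jump).
fn crosses_with_gain(volume: u64, gain: f64, threshold: u64) -> bool {
    volume < threshold && (volume as f64) * (1.0 + gain) >= threshold as f64
}
```

With the frequencies from the summary table below, “notation” (87,040) passes this check at a 100,000 threshold, while “observation” (79,453) does not.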
Summary
So, let us wrap up the first task towards an expanded statement dataset. I can now provide a summary table, with 44 of the original 50 classes present (as originally, some classes are only available through a different selector route via AMS markup).
Heading | Frequency |
---|---|
proof | 2,930,621 |
lemma | 1,706,821 |
theorem | 1,700,430 |
abstract | 1,193,933 |
introduction | 1,117,555 |
proposition | 1,059,776 |
definition | 972,999 |
remark | 888,243 |
acknowledgement | 600,981 |
conclusion | 586,157 |
corollary | 553,531 |
result | 420,189 |
example | 302,048 |
discussion | 207,109 |
keywords | 194,630 |
method | 160,941 |
experiment | 159,707 |
problem | 141,774 |
notation | 87,040 |
observation | 79,453 |
summary | 77,628 |
conjecture | 63,173 |
claim | 61,609 |
related work | 58,824 |
assumption | 37,132 |
demonstration | 36,426 |
question | 33,697 |
fact | 19,169 |
step | 9,704 |
overview | 9,034 |
exercise | 7,207 |
note | 5,387 |
condition | 4,563 |
convention | 2,887 |
solution | 1,729 |
constraints | 1,279 |
principle | 598 |
rule | 542 |
criterion | 465 |
issue | 391 |
expansion | 249 |
explanation | 153 |
expectation | 128 |
answer | 121 |
Total | 15,496,033 |
We can indeed see a big improvement in per-class volume, in fact a near 50% increase, from 10.5 to 15.5 million paragraphs, which is exciting! The reason for the unexpected jump becomes clear when we take into consideration how the old dataset was collected. The highlighted expert statements, such as “proof”, were collected only via the AMS markup, while the respective sectioning headings were not extracted at all. As we see here, including all sources nearly triples the size of the “proof” class, from 1 to 3 million paragraphs!
Separately, there are new avenues to consider with high volume headings that did not participate in the first release. The most promising new ones may be model, description and application. There is also more volume to be gained by adding new normalization rules and factoring in sibling categories (“summary” to “conclusion”, “background” to “preliminaries”, etc.). This is tedious and repetitive work, but at the same time finite and quite valuable – we have already observed that “Data is King” with this task.
Once the whitelist is finalized, I will walk through the steps of extracting a statement dataset for each heading category and preparing it for modeling work and redistribution. More on that in Part 2 of this series, coming soon!