arXiv’s Headings: a preprocessing dive into arXMLiv 08.2019

Sep 22, 2019
LLamapun
arXiv
Preprocessing
Big Data
https://creativecommons.org/licenses/by/4.0/
We explore the sectioning headings of arXiv, collecting all “standard” titles, as deposited by the authors.

Recent News

This post contains some follow-up work to our preprint introducing “Scientific Statement Classification”, a supervised learning task over arXiv. We released:

  • a dataset of 10.5 million annotated paragraphs and

  • just this week: a new and updated arXMLiv 08.2019 collection of HTML5 articles from the latest arXiv sources

Here, I outline how to repeat and extend the statement preprocessing task on our new data, using our homegrown llamapun toolkit.

Overview

The goal is to regenerate the statement task dataset using the new llamapun v0.3.3 tokenization rules. The aim is to gain additional tokens per paragraph and over 10% more paragraph volume, proportional to the increase in data. We also want to extend the class list (previously at 50) with all high-frequency heading names of arXiv, as well as figure and table captions.

  1. Pre: get llamapun up and running;

  2. Task 1: survey the headings of the new corpus release, and assemble the class whitelist;

  3. Task 2: extract the annotated statement resource, and organize it for redistribution.

Installing llamapun

If you do not have a Rust environment installed, consult rustup.

  $ sudo apt-get install libxml2-dev
  $ git clone https://github.com/kwarc/llamapun
  $ cd llamapun
  $ cargo test --release

Task 1: Survey arXiv’s headings

We prepare the heading collection as an example file at corpus_heading_stats.rs. Here is a simplified overview snippet:

let corpus = Corpus::new(corpus_path);
let catalog = corpus.catalog_with_parallel_walk(|document| {
  /* each document is processed in its own thread from a jwalk/rayon thread pool
     we keep a per-document heading frequency catalog, which will be reduced with
     all other document threads into the global corpus catalog */
  let mut thread_counts = HashMap::new();
  // reuse an XPath context for querying
  let mut context = Context::new(&document.dom).unwrap();
  // iterate over all headings via the provided llamapun iterator
  'headings: for mut heading in document.heading_iter() {
    let mut heading_buffer = String::new();
    // collect the heading's words, with data cleaning
    for word in heading.word_iter() {
      // word range normalization (e.g. punctuation removal, math equation representation choices)
      let word_string = data_helpers::ams_normalize_word_range(&word.range, &mut context, false);
      // finally, rebuild a plaintext heading from the valid word strings
      heading_buffer.push_str(&word_string);
      heading_buffer.push(' ');
    }
    /* specific to heading task: normalize similar headings into canonical classes,
       to maximize volume for classification
       e.g. "Definition 2.4" -> "definition"; or "the proof of lemma ref" -> "proof"
       This normalization is inevitably noisy/partially wrong, as large volume brings unexpected cases.
       e.g. is "introduction and related work" to be binned as "introduction", or "related work" or ignored entirely?
       answer is inevitably pragmatic and carries trade-offs. */
    heading_buffer = data_helpers::normalize_heading_title(&heading_buffer);

    // once we have the heading title, count it as seen in the local thread catalog
    let this_heading_counter = thread_counts.entry(heading_buffer).or_insert(0);
    *this_heading_counter += 1;
  }
  // finally, return this document’s heading statistics, to be unified with the rest of the corpus
  thread_counts
});

$ echo "Takes 3.5 hours x32 threads, please enjoy a movie/walk..."
$ cargo run --release --example corpus_heading_stats /data/datasets/dataset-arXMLiv-08-2019/ arxmliv_headings.csv
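
The reduction step mentioned in the snippet's first comment, where per-document catalogs are unified into one corpus catalog, is not shown in the post. A minimal sketch of what such a merge could look like, using only the standard library: the names `Catalog`, `merge_catalogs` and `reduce_catalogs` are hypothetical and not part of llamapun's API.

```rust
use std::collections::HashMap;

/// A heading frequency catalog: normalized heading title -> number of occurrences.
/// (Hypothetical alias for illustration, not a llamapun type.)
type Catalog = HashMap<String, u64>;

/// Fold one per-document catalog into the running global catalog,
/// summing the counts of headings that appear in both.
fn merge_catalogs(mut global: Catalog, local: Catalog) -> Catalog {
  for (heading, count) in local {
    *global.entry(heading).or_insert(0) += count;
  }
  global
}

/// Reduce any number of per-document catalogs into a single corpus-wide catalog.
fn reduce_catalogs<I: IntoIterator<Item = Catalog>>(catalogs: I) -> Catalog {
  catalogs.into_iter().fold(Catalog::new(), merge_catalogs)
}
```

Because the merge only sums counts, the order in which document catalogs arrive does not affect the result, which is what makes the walk safe to run in parallel.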

Running the outlined reporter over the arXMLiv 08.2019 corpus tells us that 1,374,539 documents were traversed, in which 27,948,652 individual headings were examined. 249 were discarded due to errors, and the process took three and a half hours on 32 logical threads.

About 22% of those headings were distinct. The other 78% of data volume allows us to claim there is enough repetition for certain headings to be treated as “standard” in the genre(s) of text particular to arXiv. Taking a closer look, we can break down the frequency report by magnitudes:

corpus frequency | distinct headings | example heading title
> 1,000,000 | 7 | proof
> 100,000 | 22 | problem
> 10,000 | 44 | case
> 1,000 | 239 | protocol
> 100 | 3,299 | limitation
> 10 | 42,286 | threats to external validity
> 0 | 6,374,003 | heegaard genus of small manifolds
Table 1: Distinct headings per frequency magnitude
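
A breakdown like Table 1 can be derived from the final catalog in a few lines. This is a sketch under the assumption that the catalog is a plain `HashMap<String, u64>` and that the rows are cumulative (a heading above 1,000,000 also counts as above 100,000); the function names are hypothetical.

```rust
use std::collections::HashMap;

/// For each threshold, count the distinct headings whose frequency is strictly above it.
/// Counts are cumulative across thresholds.
fn distinct_above(catalog: &HashMap<String, u64>, thresholds: &[u64]) -> Vec<(u64, usize)> {
  thresholds
    .iter()
    .map(|&t| (t, catalog.values().filter(|&&freq| freq > t).count()))
    .collect()
}

/// Share of distinct headings among all heading occurrences, as a percentage.
fn distinct_share_percent(distinct: u64, total: u64) -> f64 {
  100.0 * distinct as f64 / total as f64
}
```

With the reported figures, `distinct_share_percent(6_374_003, 27_948_652)` lands between 22 and 23, in line with the "about 22%" quoted above.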

Only 3,299 headings have a frequency of 101 or more, a list short enough to share in a single CSV file. You can find it linked as a separate gist here.

The 7 dominant heading classes that pass a million instances in 1.32 million documents are (in order of frequency): proof, lemma, theorem, references, abstract, introduction, proposition.

However, while that is indeed an expected core of a scientific article, we can notice it is heavily skewed to mathematics in particular, with only references, abstract and introduction being “general science”. An experimental article in astrophysics could share those, but would not include any lemmas for example. There could be various factors at play here. For instance, a single mathematical article could introduce tens of theorems, while a typical experimental paper could have just one or two results, as a matter of convention.

We could also interpret the numbers as a sign of success for our normalization step. I only hinted at the normalize_heading_title piece of logic, which is in essence a long list of handcrafted rules such as:

  heading if heading.starts_with("theorem") || heading.ends_with(" theorem") => "theorem",

This rule would have redirected a lot of freestyle headings such as “Main theorem” into the single “theorem” entry. Since I heuristically add the rules while studying the reports, so far they only cover about 30 common cases, and only a couple of patterns. So we can still see an entry for “sketch of proof of theorem ref” with frequency 324, even with the mentioned theorem rule. Ideally, we want a new normalization rule that maps this “sketch of proof” to “proof”. Similarly, even after normalizing, we also see “result”-like entries:

alternatives frequency
result 420,189
simulation results and analysis 133
numerical results and analysis 112
Table 2: Frequent (>100) alternatives for “result”
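
To make the rule mechanics concrete, here is a heavily reduced sketch of a rule-based normalizer. It is not llamapun's `normalize_heading_title`: the function name `normalize_heading_sketch` is hypothetical, the theorem rule is the one quoted above, and the proof rule is an assumed example of how “sketch of proof of theorem ref” could be redirected to “proof”.

```rust
/// Hypothetical, heavily reduced stand-in for a rule-based heading normalizer.
/// Rules are tried top to bottom, so the more specific ones must come first.
fn normalize_heading_sketch(raw: &str) -> String {
  let heading = raw.trim().to_lowercase();
  let canonical = match heading.as_str() {
    // assumed rule: any "proof of ..." phrasing collapses to "proof"
    h if h.starts_with("proof") || h.contains("proof of ") => "proof",
    // the rule quoted in the post
    h if h.starts_with("theorem") || h.ends_with(" theorem") => "theorem",
    // no rule applies: keep the cleaned heading as its own entry
    other => other,
  };
  canonical.to_string()
}
```

Rule order is a design choice with consequences: placing the proof rule first means “proof of the main theorem” is binned as “proof” rather than “theorem”, which is one of the pragmatic trade-offs the post describes.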

In contrast, a much more freehand case, for which we have no dedicated normalization rules (only at time of writing, more rules incoming soon), is “example”:

alternatives frequency
example 302,048
examples 26,911
numerical examples 4,363
some examples 1,233
illustrative example 1,055
example citationelement 1,041
motivating example 790
counterexample 782
illustrative examples 724
simple example 671
examples and application 524
counterexamples 379
further examples 377
example continued 364
two examples 356
other examples 345
applications and examples 339
first example 323
toy example 319
numerical example 306
motivating examples 303
example example ref continued 296
running example 275
example mathformula 247
explicit example 247
example application 245
explicit examples 243
basic examples 240
counter example 236
simple examples 226
second example 225
real data example 211
more examples 201
simulation examples 157
application examples 155
simulation example 149
examples of application 143
real data examples 142
example ref continued 136
specific examples 129
example continuation of example ref 126
example cont 119
example see citationelement 117
example ref 114
concrete example 111
first examples 106
worked example 106
example italic_n RELOP_equals NUM 102
examples and counterexamples 101
three examples 101
Table 3: Frequent (>100) alternatives for “example”

That makes 49 alternatives with over one hundred instances in our data, with a total volume of 46,511. And arguably, for the purpose of classifying the core statement, each alternative is a contextual variation over the same “example” class. Adding new normalization rules for these cases will thus provide an additional 15% of data to our supervised task in this class.
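
The 15% figure is simple arithmetic over Table 3: the combined volume of the alternatives divided by the volume of the base “example” entry. A small sketch, with hypothetical helper names:

```rust
/// Sum the frequencies of alternative headings that would be folded into a base class.
fn folded_volume(alternatives: &[(&str, u64)]) -> u64 {
  alternatives.iter().map(|(_, freq)| freq).sum()
}

/// Relative gain, in percent, from adding `extra` instances to a class of size `base`.
fn relative_gain_percent(base: u64, extra: u64) -> f64 {
  100.0 * extra as f64 / base as f64
}
```

Using the numbers from the table, `relative_gain_percent(302_048, 46_511)` comes out at roughly 15.4.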

What I think this illustrates is how valuable the data groundwork ultimately is. Careful normalization adds volume and variety, which in turn leads to models that generalize better. To showcase an easy win, currently not a single “simulation” entry passes the 10,000 frequency threshold, but their combined volume (with some hard cases: is “simulation example” an “example” or a “simulation”? Can we be certain, or should this data be excluded?) is a satisfactory 34,622, making them likely candidates for inclusion in the new set:

alternatives frequency
simulations 9,692
numerical simulations 5,087
simulation 3,887
simulation study 2,598
simulation setup 1,827
monte carlo simulations 1,590
simulation details 1,526
numerical simulation 1,250
simulation studies 1,180
monte carlo simulation 1,115
simulation parameters 569
body simulations 294
comparison with simulations 271
molecular dynamics simulations 264
simulation settings 241
simulation set up 235
simulation procedure 207
computer simulations 191
comparison with numerical simulations 186
italic_N body simulations 184
simulation algorithm 158
hydrodynamical simulations 158
simulation design 157
simulation examples 157
cosmological simulations 153
simulation example 149
detector simulation 149
event simulation 148
simulation methodology 143
simulation results and analysis 133
hydrodynamic simulations 128
simulation data 125
simulation framework 123
simulation setting 121
simulation code 114
simulation environment 112
Table 4: Frequent (>100) alternatives for “simulation”
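
One way to spot candidates like “simulation” is to group catalog entries by a keyword and compare their combined volume to the class threshold. The sketch below assumes a plain `HashMap<String, u64>` catalog and uses naive substring matching; `keyword_volume` is a hypothetical helper, not part of llamapun.

```rust
use std::collections::HashMap;

/// Collect every heading that mentions `keyword` with a frequency above `min_frequency`,
/// returning the number of such alternatives and their combined volume.
fn keyword_volume(
  catalog: &HashMap<String, u64>,
  keyword: &str,
  min_frequency: u64,
) -> (usize, u64) {
  let mut alternatives = 0usize;
  let mut volume = 0u64;
  for (heading, &freq) in catalog {
    if heading.contains(keyword) && freq > min_frequency {
      alternatives += 1;
      volume += freq;
    }
  }
  (alternatives, volume)
}
```

Substring matching is deliberately crude: an entry such as “simulation example” would be counted for both the “simulation” and the “example” keyword, so ambiguous headings still need a manual decision before they enter a whitelist.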

If indeed 10,000 entries is a good threshold for reliable modeling, good normalization could be the difference between having 45 and 90 usable classes. If it turns out that 100,000 is a more appropriate data threshold for e.g. deeper models, normalization plays a slightly smaller part, as the volume barrier is hard to breach with a 15% jump in our case – only “notation” would be a viable candidate.

Summary

So, let us wrap up the first task towards an expanded statement dataset. I can now provide a summary table, with 44 of the original 50 classes present (as originally, some classes are only available through a different selector route via AMS markup).

Heading Frequency
proof 2,930,621
lemma 1,706,821
theorem 1,700,430
abstract 1,193,933
introduction 1,117,555
proposition 1,059,776
definition 972,999
remark 888,243
acknowledgement 600,981
conclusion 586,157
corollary 553,531
result 420,189
example 302,048
discussion 207,109
keywords 194,630
method 160,941
experiment 159,707
problem 141,774
notation 87,040
observation 79,453
summary 77,628
conjecture 63,173
claim 61,609
related work 58,824
assumption 37,132
demonstration 36,426
question 33,697
fact 19,169
step 9,704
overview 9,034
exercise 7,207
note 5,387
condition 4,563
convention 2,887
solution 1,729
constraints 1,279
principle 598
rule 542
criterion 465
issue 391
expansion 249
explanation 153
expectation 128
answer 121
Total 15,496,033
Table 5: Frequencies (partial) of the 50-class task classes, in arXMLiv 08.2019

We can indeed see a big improvement in per-class volume, in fact a near 50% increase, from 10.5 to 15.5 million paragraphs, which is exciting! The reason for the unexpected jump becomes clear when we take into consideration how the old dataset was collected. The highlighted expert statements, such as “proof”, were collected only via the AMS markup, while the respective sectioning headings were not extracted at all. As we see here, including all sources nearly triples the size of the “proof” class, from 1 to 3 million paragraphs!

Separately, there are new avenues to consider with high-volume headings that did not participate in the first release. The most promising new ones may be model, description and application. There is also more volume to be gained by adding new normalization rules and factoring in sibling categories (“summary” to “conclusion”, “background” to “preliminaries”, etc.). This is tedious and repetitive work, but at the same time finite and quite valuable: we have already observed that “Data is King” with this task.

Once the whitelist is finalized, I will walk through the steps of extracting a statement dataset for each heading category and preparing it for modeling work and redistribution. More on that in Part 2 of this series, coming soon!
