arXiv’s Headings: a preprocessing dive into arXMLiv 08.2019

Sep 22, 2019

LLamapun

arXiv

Preprocessing

Big Data

https://creativecommons.org/licenses/by/4.0/

We explore the sectioning headings of arXiv, collecting all “standard” titles, as deposited by the authors.


Overview

The goal is to regenerate the statement task dataset, using the new llamapun v0.3.3 tokenization rules, with the intention of gaining more tokens per paragraph and over 10% more paragraph volume, proportional to the increase of data. We also want to extend the class list (previously at 50) with all high-frequency heading names of arXiv, as well as figure and table captions.

Pre: get llamapun up and running;

Task 1: survey the headings of the new corpus release, assemble the class whitelist;

Task 2: extract the annotated statement resource, and organize it for redistribution.

Installing llamapun

If you do not have a Rust environment installed, consult rustup.

  $ sudo apt-get install libxml2-dev
  $ git clone https://github.com/kwarc/llamapun
  $ cd llamapun
  $ cargo test --release

Task 1: Survey arXiv’s headings

We prepare the heading collection as an example file at corpus_heading_stats.rs. Here is a simplified overview snippet:

let corpus = Corpus::new(corpus_path);
let catalog = corpus.catalog_with_parallel_walk(|document| {
  /* each document is processed in its own thread from a jwalk/rayon thread pool
     we keep a per-document heading frequency catalog, which will be reduced with
     all other document threads into the global corpus catalog */
  let mut thread_counts = HashMap::new();
  // reuse an XPath context for querying
  let mut context = Context::new(&document.dom).unwrap();
  // iterate over all headings via the provided llamapun iterator
  'headings: for mut heading in document.heading_iter() {
    let mut heading_buffer = String::new();
    // collect the heading’s words, with data cleaning
    for word in heading.word_iter() {
      // word range normalization (e.g. punctuation removal, math equation representation choices)
      let word_string = data_helpers::ams_normalize_word_range(&word.range, &mut context, false);
      // finally, rebuild a plaintext heading from the valid word strings
      heading_buffer.push_str(&word_string);
      heading_buffer.push(' ');
    }
    /* specific to heading task: normalize similar headings into canonical classes,
       to maximize volume for classification
       e.g. "Definition 2.4" -> "definition"; or "the proof of lemma ref" -> "proof"
       This normalization is inevitably noisy/partially wrong, as large volume brings unexpected cases.
       e.g. is "introduction and related work" to be binned as "introduction", or "related work" or ignored entirely?
       answer is inevitably pragmatic and carries trade-offs. */
    heading_buffer = data_helpers::normalize_heading_title(&heading_buffer);

    // once we have the heading title, count it as seen in the local thread catalog
    let this_heading_counter = thread_counts.entry(heading_buffer).or_insert(0);
    *this_heading_counter += 1;
  }
  // finally, return this document’s heading statistics, to be unified with the rest of the corpus
  thread_counts
});
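The comment in the snippet mentions that each per-document catalog is reduced into the global corpus catalog. llamapun performs that step internally; the following is only a minimal standard-library sketch of what such a merge could look like. The function name `merge_counts` and the `u64` counter type are my assumptions for illustration, not the library's API.

```rust
use std::collections::HashMap;

/// Hypothetical helper: fold one document's heading counts into the
/// running corpus-wide catalog by summing the counters per heading.
fn merge_counts(global: &mut HashMap<String, u64>, local: HashMap<String, u64>) {
    for (heading, count) in local {
        *global.entry(heading).or_insert(0) += count;
    }
}

fn main() {
    // Two toy per-document catalogs, standing in for `thread_counts` above.
    let doc_a = HashMap::from([("proof".to_string(), 3), ("introduction".to_string(), 1)]);
    let doc_b = HashMap::from([("proof".to_string(), 2), ("lemma".to_string(), 4)]);

    let mut catalog = HashMap::new();
    merge_counts(&mut catalog, doc_a);
    merge_counts(&mut catalog, doc_b);
    println!("{:?}", catalog.get("proof"));
}
```

Because the merge only adds counters, the order in which documents finish does not affect the final catalog, which is what makes the parallel walk safe to reduce.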

$ echo "Takes 3.5 hours x32 threads, please enjoy a movie/walk..."
$ cargo run --release --example corpus_heading_stats /data/datasets/dataset-arXMLiv-08-2019/ arxmliv_headings.csv

Running the outlined reporter over the arXMLiv 08.2019 corpus traversed 1,374,539 documents, in which 27,948,652 individual headings were examined. 249 were discarded due to errors encountered along the way, and the process took three and a half hours on 32 logical threads.

About $22\%$ of those headings were distinct. The other $78\%$ of data volume allows us to claim there is enough repetition for certain headings to be treated as “standard” in the genre(s) of text particular to arXiv. Taking a closer look, we can break down the frequency report by magnitudes:

corpus frequency	distinct headings	example heading title
$>1,000,000$	7	proof
$>100,000$	22	problem
$>10,000$	44	case
$>1,000$	239	protocol
$>100$	3,299	limitation
$>10$	42,286	threats to external validity
$>0$	6,374,003	heegaard genus of small manifolds

Table 1: Distinct headings per frequency magnitude
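A breakdown like Table 1 can be derived from the heading catalog with a few lines of Rust. The sketch below assumes the rows are disjoint bins (each heading is counted under the largest threshold it exceeds), which is my reading of the table and not something the report states explicitly. The names `magnitude_bin` and `distinct_per_magnitude` are hypothetical.

```rust
use std::collections::{BTreeMap, HashMap};

/// Thresholds of Table 1, largest first.
const THRESHOLDS: [u64; 7] = [1_000_000, 100_000, 10_000, 1_000, 100, 10, 0];

/// Largest threshold that `count` strictly exceeds, or None for a zero count.
fn magnitude_bin(count: u64) -> Option<u64> {
    THRESHOLDS.iter().copied().find(|&t| count > t)
}

/// Number of distinct headings per magnitude bin (assumed disjoint).
fn distinct_per_magnitude(catalog: &HashMap<String, u64>) -> BTreeMap<u64, usize> {
    let mut bins = BTreeMap::new();
    for &count in catalog.values() {
        if let Some(bin) = magnitude_bin(count) {
            *bins.entry(bin).or_insert(0) += 1;
        }
    }
    bins
}

fn main() {
    // Three frequencies taken from the tables in this post.
    let catalog = HashMap::from([
        ("proof".to_string(), 2_930_621),
        ("result".to_string(), 420_189),
        ("toy example".to_string(), 319),
    ]);
    // Print the bins from the highest magnitude down, as in Table 1.
    for (threshold, distinct) in distinct_per_magnitude(&catalog).iter().rev() {
        println!(">{}\t{}", threshold, distinct);
    }
}
```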

Only 3,299 headings have a frequency of 101 or more, a list short enough to share in a single CSV file. You can find it linked as a separate gist here.

The 7 dominant heading classes that pass a million instances in 1.32 million documents are (in order of frequency): proof, lemma, theorem, references, abstract, introduction, proposition.

However, while that is indeed an expected core of a scientific article, we can notice it is heavily skewed to mathematics in particular, with only references, abstract and introduction being “general science”. An experimental article in astrophysics could share those, but would not include any lemmas for example. There could be various factors at play here. For instance, a single mathematical article could introduce tens of theorems, while a typical experimental paper could have just one or two results, as a matter of convention.

We could also interpret the numbers as a sign of success for our normalization step. I only hinted at the normalize_heading_title piece of logic, which is in essence a long list of handcrafted rules such as:

  heading if heading.starts_with("theorem") || heading.ends_with(" theorem") => "theorem",

This rule would have redirected a lot of freestyle headings such as “Main theorem” into the single “theorem” entry. Since I heuristically add the rules while studying the reports, so far they only cover about 30 common cases, and only a couple of patterns. So we can still see an entry for “sketch of proof of theorem ref” with frequency 324, even with the mentioned theorem rule. Ideally, we want a new normalization rule that maps this “sketch of proof” to “proof”. Similarly, even after normalizing, we also see “result”-like entries:

alternatives	frequency
result	420,189
simulation results and analysis	133
numerical results and analysis	112

Table 2: Frequent ($>100$) alternatives for “result”
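To make the rule mechanism concrete, here is a minimal sketch of how a handful of such handcrafted rules could be chained in one `match`. It is not the actual `normalize_heading_title` from llamapun: the function name `normalize_heading_sketch`, the "sketch of proof" rule and the "results and analysis" rule are assumptions added for illustration, and the last one in particular is a judgment call.

```rust
/// Hypothetical, simplified stand-in for a rule-based heading normalizer.
/// Rules are tried top to bottom, so their order matters.
fn normalize_heading_sketch(heading: &str) -> String {
    let heading = heading.trim().to_lowercase();
    let class = match heading.as_str() {
        // Proof-like headings come first, so that "proof of the main theorem"
        // is binned as "proof" before the theorem rule can claim it.
        h if h.starts_with("proof") || h.starts_with("sketch of proof") => "proof",
        // The rule quoted in the text.
        h if h.starts_with("theorem") || h.ends_with(" theorem") => "theorem",
        // Assumed rule: fold "... results and analysis" variants into "result".
        h if h.ends_with(" results and analysis") => "result",
        // No rule applies: keep the cleaned heading as its own entry.
        other => other,
    };
    class.to_string()
}

fn main() {
    for heading in [
        "Main theorem",
        "sketch of proof of theorem ref",
        "numerical results and analysis",
    ] {
        println!("{} -> {}", heading, normalize_heading_sketch(heading));
    }
}
```

Ordering the rules from most to least specific keeps the list readable, at the price of having to re-check earlier rules whenever a new one is added.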

In contrast, a much more freehand case, for which we have no dedicated normalization rules (only at time of writing, more rules incoming soon), is “example”:

alternatives	frequency
example	302,048
examples	26,911
numerical examples	4,363
some examples	1,233
illustrative example	1,055
example citationelement	1,041
motivating example	790
counterexample	782
illustrative examples	724
simple example	671
examples and application	524
counterexamples	379
further examples	377
example continued	364
two examples	356
other examples	345
applications and examples	339
first example	323
toy example	319
numerical example	306
motivating examples	303
example example ref continued	296
running example	275
example mathformula	247
explicit example	247
example application	245
explicit examples	243
basic examples	240
counter example	236
simple examples	226
second example	225
real data example	211
more examples	201
simulation examples	157
application examples	155
simulation example	149
examples of application	143
real data examples	142
example ref continued	136
specific examples	129
example continuation of example ref	126
example cont	119
example see citationelement	117
example ref	114
concrete example	111
first examples	106
worked example	106
example italic_n RELOP_equals NUM	102
examples and counterexamples	101
three examples	101

Table 3: Frequent ($>100$) alternatives for “example”

That makes 49 alternatives with over one hundred instances in our data, with a total volume of 46,511. And arguably, for the purpose of classifying the core statement, each alternative is a contextual variation over the same “example” class. Adding new normalization rules for these cases will thus provide an extra 15% of data to our supervised task in this class.
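The 15% figure follows directly from the two volumes quoted above: 46,511 instances spread over the alternatives, against 302,048 for the plain “example” entry. As a quick sanity check (the helper name `relative_gain` is just for illustration):

```rust
/// Fraction of extra data gained by folding `added` instances into a class
/// that currently holds `base` instances.
fn relative_gain(base: u64, added: u64) -> f64 {
    added as f64 / base as f64
}

fn main() {
    // Volumes quoted in the text: the "example" entry and its alternatives.
    let gain = relative_gain(302_048, 46_511);
    println!("{:.1}%", gain * 100.0); // prints "15.4%"
}
```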

What I think this illustrates is how valuable the data groundwork ultimately is. Careful normalization adds volume and variety, which in turn leads to models that generalize better. To showcase an easy win, currently not a single “simulation” entry passes the 10,000 frequency threshold, but their combined volume is a satisfactory 34,622 (some cases are hard: is “simulation example” an “example” or a “simulation”? Can we be certain, or should this data be excluded?), making them likely candidates for inclusion in the new set:

alternatives	frequency
simulations	9,692
numerical simulations	5,087
simulation	3,887
simulation study	2,598
simulation setup	1,827
monte carlo simulations	1,590
simulation details	1,526
numerical simulation	1,250
simulation studies	1,180
monte carlo simulation	1,115
simulation parameters	569
body simulations	294
comparison with simulations	271
molecular dynamics simulations	264
simulation settings	241
simulation set up	235
simulation procedure	207
computer simulations	191
comparison with numerical simulations	186
italic_N body simulations	184
simulation algorithm	158
hydrodynamical simulations	158
simulation design	157
simulation examples	157
cosmological simulations	153
simulation example	149
detector simulation	149
event simulation	148
simulation methodology	143
simulation results and analysis	133
hydrodynamic simulations	128
simulation data	125
simulation framework	123
simulation setting	121
simulation code	114
simulation environment	112

Table 4: Frequent ($>100$) alternatives for “simulation”
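A combined volume like the one reported for “simulation” can be estimated straight from the heading catalog, before any normalization rule is written. The sketch below simply sums every entry whose title contains a keyword. The function name `combined_volume` is hypothetical, and plain substring matching is an assumption that will also pick up the ambiguous cases mentioned above.

```rust
use std::collections::HashMap;

/// Sum the frequencies of all headings that contain `keyword` as a substring.
fn combined_volume(catalog: &HashMap<String, u64>, keyword: &str) -> u64 {
    catalog
        .iter()
        .filter(|(heading, _)| heading.contains(keyword))
        .map(|(_, &count)| count)
        .sum()
}

fn main() {
    // The first three rows of Table 4, plus one unrelated heading.
    let catalog: HashMap<String, u64> = vec![
        ("simulations", 9_692),
        ("numerical simulations", 5_087),
        ("simulation", 3_887),
        ("proof", 2_930_621),
    ]
    .into_iter()
    .map(|(heading, count)| (heading.to_string(), count))
    .collect();

    println!("{}", combined_volume(&catalog, "simulation"));
}
```

Running the same query with other keywords is a cheap way to spot which candidate classes would cross a volume threshold once their variants are merged.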

If indeed 10,000 entries is a good threshold for reliable modeling, good normalization could be the difference between having 45 and 90 usable classes. If it turns out that 100,000 is a more appropriate data threshold for e.g. deeper models, normalization plays a slightly smaller part, as the volume barrier is hard to breach with a 15% jump in our case – only “notation” would be a viable candidate.

Summary

So, let us wrap up the first task towards an expanded statement dataset. I can now provide a summary table, with 44 of the original 50 classes present (as originally, some classes are only available through a different selector route via AMS markup).

Heading	Frequency
proof	2,930,621
lemma	1,706,821
theorem	1,700,430
abstract	1,193,933
introduction	1,117,555
proposition	1,059,776
definition	972,999
remark	888,243
acknowledgement	600,981
conclusion	586,157
corollary	553,531
result	420,189
example	302,048
discussion	207,109
keywords	194,630
method	160,941
experiment	159,707
problem	141,774
notation	87,040
observation	79,453
summary	77,628
conjecture	63,173
claim	61,609
related work	58,824
assumption	37,132
demonstration	36,426
question	33,697
fact	19,169
step	9,704
overview	9,034
exercise	7,207
note	5,387
condition	4,563
convention	2,887
solution	1,729
constraints	1,279
principle	598
rule	542
criterion	465
issue	391
expansion	249
explanation	153
expectation	128
answer	121
Total	15,496,033

Table 5: Frequencies (partial) of the 50-class task classes, in arXMLiv 08.2019

We can indeed see a big improvement in per-class volume, in fact a near 50% increase, from 10.5 to 15.5 million paragraphs, which is exciting! The reason for the unexpected jump becomes clear when we take into consideration how the old dataset was collected. The highlighted expert statements, such as “proof”, were collected only via the AMS markup, while the respective sectioning headings were not extracted at all. As we see here, including all sources nearly triples the size of the “proof” class, from 1 to 3 million paragraphs!

Separately, there are new avenues to consider with high volume headings that did not participate in the first release. The most promising new ones may be model, description and application. There is also more volume to be gained by adding new normalization rules and factoring in sibling categories (“summary” to “conclusion”, “background” to “preliminaries”, etc). This is tedious and repetitive work, but at the same time finite and quite valuable – we have already observed that “Data is King” with this task.

Once the whitelist is finalized, I will walk through the steps of extracting a statement dataset for each heading category and preparing it for modeling work and redistribution. More on that in Part 2 of this series, coming soon!
