# arXiv’s Headings: a preprocessing dive into arXMLiv 08.2019

Sep 22, 2019
LLamapun
arXiv
Preprocessing
Big Data
We explore the sectioning headings of arXiv, collecting all “standard” titles, as deposited by the authors.

## Recent News

This post contains some follow-up work to our preprint introducing “Scientific Statement Classification”, a supervised learning task over arXiv. We released

• a dataset of 10.5 million annotated paragraphs and

• just this week: a new and updated arXMLiv 08.2019 collection of HTML5 articles from the latest arXiv sources

Here, I outline how to repeat and extend the statement preprocessing task on our new data, using our homegrown llamapun toolkit.

## Overview

The goal is to regenerate the statement task dataset using the new llamapun v0.3.3 tokenization rules, with the intention of gaining more tokens per paragraph and over 10% more paragraph volume, proportional to the increase in data. We also want to extend the class list (previously at 50) with all high-frequency heading names of arXiv, as well as figure and table captions.

1. Pre: get llamapun up and running;

2. survey the headings of the new corpus release, assemble the class whitelist;

3. extract the annotated statement resource, and organize it for redistribution.

## Installing llamapun

If you do not have a Rust environment installed, consult rustup.

    $ sudo apt-get install libxml2-dev
    $ git clone https://github.com/kwarc/llamapun
    $ cd llamapun
    $ cargo test --release


We prepare the heading collection as an example file at corpus_heading_stats.rs. Here is a simplified overview snippet:

    let corpus = Corpus::new(corpus_path);
    let catalog = corpus.catalog_with_parallel_walk(|document| {
      /* each document is processed in its own thread from a jwalk/rayon thread pool;
         we keep a per-document heading frequency catalog, which will be reduced with
         all other document threads into the global corpus catalog */
      // reuse an XPath context for querying
      let mut context = Context::new(&document.dom).unwrap();
      // iterate over all headings via the provided llamapun iterator
      for heading in ... {
        // collect the heading’s words, with data cleaning
        for word in ... {
          // word range normalization (e.g. punctuation removal, math equation representation choices)
          let word_string = data_helpers::ams_normalize_word_range(&word.range, &mut context, false);
          // finally, rebuild a plaintext heading from the valid word strings
        }
        /* normalize the heading title (normalize_heading_title) to maximize volume for classification,
           e.g. "Definition 2.4" -> "definition"; or "the proof of lemma ref" -> "proof".
           This normalization is inevitably noisy/partially wrong, as large volume brings unexpected cases,
           e.g. is "introduction and related work" to be binned as "introduction", or "related work", or ignored entirely? */

        // once we have the heading title, count it as seen in the local thread catalog
      }
      // finally, return this document’s heading statistics, to be unified with the rest of the corpus
    });

    $ echo "Takes 3.5 hours x32 threads, please enjoy a movie/walk..."
    $ cargo run --release --example corpus_heading_stats /data/datasets/dataset-arXMLiv-08-2019/ arxmliv_headings.csv


Running the outlined reporter over the arXMLiv 08.2019 corpus tells us that 1,374,539 documents were traversed, in which 27,948,652 individual headings were examined. Of these, 249 were discarded due to errors, and the process took three and a half hours on 32 logical threads.
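The reduce step described in the snippet's comments can be illustrated without llamapun at all. Below is a minimal, sequential sketch using only the Rust standard library. The names `Catalog`, `document_catalog`, `merge_into` and `corpus_catalog` are hypothetical and not part of the llamapun API, and the real walk runs in parallel rather than in a plain loop.

```rust
use std::collections::HashMap;

/// A heading frequency catalog: normalized heading -> number of occurrences.
type Catalog = HashMap<String, u64>;

/// Count the headings of a single document (hypothetical helper; assumes the
/// headings were already extracted and normalized to plain strings).
fn document_catalog(headings: &[&str]) -> Catalog {
    let mut catalog = Catalog::new();
    for &heading in headings {
        *catalog.entry(heading.to_string()).or_insert(0) += 1;
    }
    catalog
}

/// Reduce step: fold one per-document catalog into the global corpus catalog.
fn merge_into(global: &mut Catalog, local: Catalog) {
    for (heading, count) in local {
        *global.entry(heading).or_insert(0) += count;
    }
}

/// Sequential stand-in for the parallel walk: map every document to its own
/// catalog, then reduce all of them into a single corpus-wide catalog.
fn corpus_catalog(documents: &[Vec<&str>]) -> Catalog {
    let mut global = Catalog::new();
    for headings in documents {
        merge_into(&mut global, document_catalog(headings));
    }
    global
}
```

Because merging two catalogs is associative and commutative, the per-document results can be combined in any order, which is what makes this kind of counting a good fit for a thread pool.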

About 22% of those headings were distinct. The other 78% of the data volume allows us to claim there is enough repetition for certain headings to be treated as “standard” in the genre(s) of text particular to arXiv. Taking a closer look, we can break down the frequency report by magnitudes:

Only 3,299 headings have a frequency of 101 or more, a list short enough to share in a single CSV file. You can find it linked as a separate gist here.
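For readers who want to compute a breakdown of this kind from their own catalog, here is a small sketch. It assumes the catalog is a map from heading to frequency and bins frequencies by powers of ten (1, 2 to 10, 11 to 100, 101 to 1000, and so on). That binning is consistent with the “101 or more” cut-off but is otherwise an assumption, and all function names are illustrative.

```rust
use std::collections::{BTreeMap, HashMap};

/// Share of distinct headings among all examined headings, in [0, 1].
fn distinct_share(catalog: &HashMap<String, u64>) -> f64 {
    let total: u64 = catalog.values().sum();
    if total == 0 {
        return 0.0;
    }
    catalog.len() as f64 / total as f64
}

/// Smallest power of ten that is >= `frequency`:
/// 1 -> 1, 2..=10 -> 10, 11..=100 -> 100, 101..=1000 -> 1000, ...
fn magnitude_ceiling(frequency: u64) -> u64 {
    let mut ceiling: u64 = 1;
    while ceiling < frequency {
        // saturate instead of overflowing for very large frequencies
        ceiling = ceiling.saturating_mul(10);
    }
    ceiling
}

/// For each magnitude bucket, report (distinct headings, total volume).
fn magnitude_report(catalog: &HashMap<String, u64>) -> BTreeMap<u64, (usize, u64)> {
    let mut report = BTreeMap::new();
    for &frequency in catalog.values() {
        let entry = report
            .entry(magnitude_ceiling(frequency))
            .or_insert((0usize, 0u64));
        entry.0 += 1;
        entry.1 += frequency;
    }
    report
}
```

Summing the distinct counts of all buckets from 1000 upward gives the number of headings with frequency 101 or more.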

The 7 dominant heading classes that pass a million instances in 1.32 million documents are (in order of frequency): proof, lemma, theorem, references, abstract, introduction, proposition.

However, while that is indeed an expected core of a scientific article, we can notice it is heavily skewed to mathematics in particular, with only references, abstract and introduction being “general science”. An experimental article in astrophysics could share those, but would not include any lemmas for example. There could be various factors at play here. For instance, a single mathematical article could introduce tens of theorems, while a typical experimental paper could have just one or two results, as a matter of convention.

We could also interpret the numbers as a sign of success for our normalization step. I only hinted at the normalize_heading_title piece of logic, which is in essence a long list of handcrafted rules such as:

  heading if heading.starts_with("theorem") || heading.ends_with(" theorem") => "theorem",


This rule would have redirected a lot of freestyle headings such as “Main theorem” into the single “theorem” entry. Since I heuristically add the rules while studying the reports, so far they only cover about 30 common cases, and only a couple of patterns. So we can still see an entry for “sketch of proof of theorem ref” with frequency 324, even with the mentioned theorem rule. Ideally, we want a new normalization rule that maps this “sketch of proof” to “proof”. Similarly, even after normalizing, we also see “result”-like entries:

In contrast, a much more freehand case – for which we have no dedicated normalization rules (only at time of writing, more rules incoming soon) – is “example”:

That makes 48 alternatives with over one hundred instances in our data, with a total volume of 46,511. And arguably, for the purpose of classifying the core statement, each alternative is a contextual variation over the same “example” class. Adding new normalization rules for these cases will thus provide an additional 15% of data to our supervised task in this class.
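To make the idea concrete, here is a hedged sketch of what such additional rules could look like, written in the same match-guard style as the theorem rule quoted earlier. The function name `normalize_heading_sketch`, the rule order, and the “proof” and “example” patterns are illustrative assumptions, not the rules shipped with llamapun.

```rust
/// Hypothetical extension of the handcrafted rule list. Rules are tried top to
/// bottom, so the most specific patterns have to come first.
fn normalize_heading_sketch(heading: &str) -> &str {
    match heading {
        // assumed rule: "sketch of proof of theorem ref" should count as a proof,
        // so proof-like headings are matched before the theorem rule
        h if h.starts_with("proof") || h.contains(" proof") => "proof",
        // the theorem rule from the text
        h if h.starts_with("theorem") || h.ends_with(" theorem") => "theorem",
        // assumed rule: fold contextual variations into the "example" class;
        // note this also captures ambiguous cases such as "simulation example"
        h if h.starts_with("example")
            || h.ends_with(" example")
            || h.ends_with(" examples") => "example",
        // anything else is kept as-is
        other => other,
    }
}
```

The ordering matters: with the theorem rule first, a heading like “proof of main theorem” would be binned as “theorem”, since it ends with “ theorem”.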

What I think this illustrates is how valuable the data groundwork ultimately is. Careful normalization adds volume and variety, which in turn leads to models that generalize better. To showcase an easy win, currently not a single “simulation” entry passes the 10,000 frequency threshold, but their combined volume is a satisfactory 34,622, making them likely candidates for inclusion in the new set. Some cases are hard, though: is “simulation example” an “example” or a “simulation”? Can we be certain, or should this data be excluded?

If indeed 10,000 entries is a good threshold for reliable modeling, good normalization could be the difference between having 45 and 90 usable classes. If it turns out that 100,000 is a more appropriate data threshold for e.g. deeper models, normalization plays a slightly smaller part, as the volume barrier is hard to breach with a 15% jump in our case – only “notation” would be a viable candidate.
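The threshold reasoning can be written down as a small check. This is a sketch under the assumption that normalization adds the same relative boost to every class, which is a simplification of the per-class estimate above. The function name `viable_classes` and its parameters are hypothetical.

```rust
/// Return the classes whose volume reaches `threshold` once it grows by
/// `boost` (e.g. 0.15 for a 15% gain from additional normalization rules).
fn viable_classes<'a>(
    volumes: &[(&'a str, u64)],
    boost: f64,
    threshold: u64,
) -> Vec<&'a str> {
    volumes
        .iter()
        .filter(|(_, volume)| (*volume as f64) * (1.0 + boost) >= threshold as f64)
        .map(|(name, _)| *name)
        .collect()
}
```

With a 15% boost, a class has to hold roughly 87,000 paragraphs already (100,000 / 1.15) to cross a 100,000 threshold, which is why so few classes benefit at that scale.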

## Summary

So, let us wrap up the first task towards an expanded statement dataset. I can now provide a summary table, with 44 of the original 50 classes present (as originally, some classes are only available through a different selector route via AMS markup).

We can indeed see a big improvement in per-class volume, in fact a near 50% increase, from 10.5 to 15.5 million paragraphs, which is exciting! The reason for the unexpected jump becomes clear when we take into consideration how the old dataset was collected. The highlighted expert statements, such as “proof”, were collected only via the AMS markup, while the respective sectioning headings were not extracted at all. As we see here, including all sources nearly triples the size of the “proof” class, from 1 to 3 million paragraphs!

Separately, there are new avenues to consider with high volume headings that did not participate in the first release. The most promising new ones may be model, description and application. There is also more volume to be gained by adding new normalization rules and factoring in sibling categories (“summary” to “conclusion”, “background” to “preliminaries”, etc). This is tedious and repetitive work, but at the same time finite and quite valuable – we have already observed that “Data is King” with this task.
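Factoring in sibling categories amounts to re-binning the catalog under a canonical class name. A minimal sketch follows, covering only the two pairs named above; the function names `canonical_class` and `merge_siblings` are hypothetical, and any further pairs would need to be decided case by case.

```rust
use std::collections::HashMap;

/// Map a heading class to its canonical sibling class. Only the two pairs
/// mentioned in the text are listed; everything else is left unchanged.
fn canonical_class(class: &str) -> &str {
    match class {
        "summary" => "conclusion",
        "background" => "preliminaries",
        other => other,
    }
}

/// Re-bin a frequency catalog under the canonical class names, summing the
/// volumes of sibling classes.
fn merge_siblings(catalog: &HashMap<String, u64>) -> HashMap<String, u64> {
    let mut merged = HashMap::new();
    for (class, &count) in catalog {
        *merged
            .entry(canonical_class(class).to_string())
            .or_insert(0) += count;
    }
    merged
}
```

Keeping the sibling pairs in one small lookup makes the tedious part of the work easy to review and extend.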

Once the whitelist is finalized, I will walk through the steps of extracting a statement dataset for each heading category and preparing it for modeling work and redistribution. More on that in Part 2 of this series, coming soon!