# arXiv’s Statements: a preprocessing dive into arXMLiv 08.2019

Oct 1, 2019
LLamapun
arXiv
Data pre-processing
Text corpus
We explore the labeled statements of arXiv, as marked up by the authors, and extract a dataset for supervised training.

## Overview

This is part 2 in a blog series going through the practical steps of extracting a statement classification dataset from arXiv.org. The first part was a tour of arXiv’s headings, and you can jump into the formal task description at our paper preprint.

This post: extract the annotated statement resource, and organize it for redistribution.

## Tools

I am using our own homegrown llamapun toolkit for the data wrangling, allowing for the full preprocessing and extraction to take place in 3.5 hours on 32 logical threads, for our dataset of 1.37 million HTML5 documents. The performance is achieved via the jwalk and rayon parallel processing crates, and the generally low overhead of using Rust.

It took a new minor release 0.3.4 to get a number of small pieces in place:

• added an extra control on paragraph length: between 4 and 1024 words,

• got closer to best practice on a couple of tokenization issues (related to apostrophes/possessives and formula lexemes),

• made certain XPath selectors more robust, to ensure statement coverage is as exhaustive as possible,

• extended the selection scope from the 2018 release; very notably captions were added, which provided a lot of new volume,

• fine-tuned heading normalization rules for extra precision, although they are still heuristic and prone to edge case issues,

• regenerated the token model and GloVe embeddings we release with the data,

• and naturally, it finalized the statement class whitelist, using a rough threshold of 10,000 paragraphs as a bare minimum for inclusion.
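
The frequency cut-off in the last item can be pictured with a small sketch. Only the 10,000 threshold comes from this post; the helper name and the shape of the input map are assumptions made for illustration, not llamapun code.

```rust
use std::collections::BTreeMap;

/// Minimum number of paragraphs a class needs in order to be whitelisted
/// (the rough threshold quoted in the post).
const MIN_PARAGRAPHS: usize = 10_000;

/// Keep only the classes whose paragraph count reaches the threshold.
/// `counts` maps a normalized class name to its paragraph frequency.
/// Hypothetical helper, not part of llamapun.
fn whitelist_classes(counts: &BTreeMap<String, usize>) -> Vec<String> {
    counts
        .iter()
        .filter(|(_, n)| **n >= MIN_PARAGRAPHS)
        .map(|(name, _)| name.clone())
        .collect()
}
```

Because a `BTreeMap` iterates in key order, the resulting whitelist comes out alphabetically sorted, which is convenient for naming the class subdirectories.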

## The final class list

In arXiv’s headings I ended on a cliffhanger: we did all the work to set up the tools, extract summaries from the data and spot deficiencies, yet we did not actually arrive at the final participants in the updated statement task. Which statements made the cut? After going through a full run to quantify the volume, and surveying some examples, we arrive at the following 46 categories:

 abstract, acknowledgement, analysis, application, assumption, background, caption, case, claim, conclusion, condition, conjecture, contribution, corollary, data, dataset, definition, demonstration, description, discussion, example, experiment, fact, future work, implementation, introduction, lemma, methods, model, motivation, notation, observation, preliminaries, problem, proof, property, proposition, question, related work, remark, result, simulation, step, summary, theorem, theory

It may appear ironic that even though we cast a wide net over all section headings, we ended up with 46 classes, four fewer than the 2018 set of 50. Yet only 26 of the original 50 classes contained more than 10,000 entries, so the comparison is quite ill-posed. In fact, even though we keep tightening the quality control measures in data collection (e.g. by constraining the paragraph size), the 26 original high-volume classes are still available to be extracted from the 2019 data, with a reliably higher volume of $+10\%$ or more. (Always an exception to the rule: overview is now excluded, outshone by a much more voluminous summary.)

A full table with the class frequencies will be available at the end, after we pass through the extraction steps.

### The Exclusions

What did we ignore? We skip over the remaining 6+ million entries from our tentative report on heading volume (GitHub gist), which do not meet our frequency requirement. Also ignored are thousands of low-frequency \newtheorem declarations made via the amsthm LaTeX package, for the same reason of low volume. Both data streams remain readily available in the arXMLiv 08.2019 release and can be extracted if/when needed for experiments that focus on breadth, rather than volume. I also excluded a few classes that usually do not contain narrative statements, but cover metadata or structured content, such as: references, appendix, algorithm and keywords.

Importantly, over a dozen low-volume/out-of-scope entries that were part of the officially released 2018 task definition are no longer extracted. They are:

 affirmation, answer, bound, comment, condition, constraint, convention, criterion, exercise, expansion, expectation, explanation, hint, issue, keywords, note, notice, principle, rule, solution, overview

All excluded classes may return in future versions of the data, or in other task formulations, as they are certainly valid aspects of scientific discourse.

## Extracting the dataset

We’ll walk through the highlights of the statement extraction example in llamapun.

The example takes as input a path to a corpus directory, containing HTML files generated by latexml, and a target filename for a Tar archive.

    $ cargo run --release --example corpus_statement_paragraphs_model \
        /data/datasets/dataset-arXMLiv-08-2019/ /var/local/statement_paragraphs_arxmliv_08_2019.tar


We use a single builder to write the archive, held in a thread-friendly mutex, so that threads can keep adding data to the same target archive without any race conditions.

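    use std::collections::HashSet;
    use std::fs::File;
    use std::sync::{Arc, Mutex};
    use tar::{Builder, Header};

    struct TarBuilder {
        builder: Builder<File>,   // the single archive writer
        names: HashSet<String>,   // entry names already written
    }

    let tar_builder = Arc::new(Mutex::new(TarBuilder { // ...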


Our builder also keeps a HashSet of previously seen paragraph SHA digests, to ensure each entry added to the resource is distinct and there is no overlap between classes.
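
To make the deduplication idea concrete, here is a minimal std-only sketch. It is not the llamapun implementation: `DedupSink` and `collect_in_parallel` are hypothetical names, an in-memory `Vec` stands in for the tar archive, and `DefaultHasher` is a placeholder for the SHA-256 digest. Keying the set on the digest alone (rather than class plus digest) is my assumption about how overlap between classes is avoided.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for the tar-backed builder: it remembers which digests were
/// already written and keeps the accepted entries in memory.
#[derive(Default)]
struct DedupSink {
    names: HashSet<String>,
    entries: Vec<(String, String)>, // (entry name, paragraph text)
}

impl DedupSink {
    /// Store the paragraph unless identical content was already seen.
    /// Returns `true` when the entry was new.
    fn save(&mut self, class: &str, paragraph: &str) -> bool {
        // NOTE: `DefaultHasher` is only a std-only placeholder here;
        // the post names entries after a SHA-256 digest instead.
        let mut hasher = DefaultHasher::new();
        paragraph.hash(&mut hasher);
        let digest = format!("{:016x}", hasher.finish());
        // Key on the digest alone, so the same text cannot enter two classes.
        if !self.names.insert(digest.clone()) {
            return false;
        }
        self.entries
            .push((format!("{class}/{digest}"), paragraph.to_string()));
        true
    }
}

/// Let several threads push paragraphs into one shared, mutex-guarded sink.
fn collect_in_parallel(batches: Vec<Vec<(&'static str, &'static str)>>) -> DedupSink {
    let sink = Arc::new(Mutex::new(DedupSink::default()));
    let handles: Vec<_> = batches
        .into_iter()
        .map(|batch| {
            let sink = Arc::clone(&sink);
            thread::spawn(move || {
                for (class, paragraph) in batch {
                    sink.lock().unwrap().save(class, paragraph);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // All worker clones are dropped by now, so unwrapping the Arc succeeds.
    Arc::try_unwrap(sink).ok().unwrap().into_inner().unwrap()
}
```

When the same text arrives under two classes, whichever thread takes the lock first wins and the later copy is silently dropped, so the archive never holds one paragraph under two labels.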

The main work is done in the parallel traversal via jwalk, yielding each document locally to a thread and executing the statement extraction code. It is implemented as part of llamapun’s parallel_data::Corpus.

    let mut corpus = Corpus::new(corpus_path);
    let catalog = corpus.catalog_with_parallel_walk(|doc| {
        // [skip] per-document extraction logic
    });
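
For readers unfamiliar with this pattern, the following std-only sketch shows the general shape of such a traversal: every worker handles whole documents, counts locally, and the partial counters are merged into one catalog at the end. It uses `std::thread::scope` rather than jwalk or rayon, and `parallel_catalog` is a hypothetical name, so treat it as an illustration of the idea and not as llamapun’s code.

```rust
use std::collections::HashMap;
use std::thread;

/// Run `per_doc` over every document on up to `workers` threads and merge
/// the per-document counters into one catalog.
/// A std-only stand-in for the jwalk/rayon traversal (hypothetical helper).
fn parallel_catalog<F>(docs: &[String], workers: usize, per_doc: F) -> HashMap<String, u64>
where
    F: Fn(&str) -> HashMap<String, u64> + Sync,
{
    let workers = workers.max(1);
    // Ceiling division, and never a zero chunk size (`chunks(0)` panics).
    let chunk_size = ((docs.len() + workers - 1) / workers).max(1);
    let per_doc = &per_doc;
    let partials: Vec<HashMap<String, u64>> = thread::scope(|scope| {
        let handles: Vec<_> = docs
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    // Each thread keeps its own counters, so no locking is needed here.
                    let mut local: HashMap<String, u64> = HashMap::new();
                    for doc in chunk {
                        for (key, n) in per_doc(doc.as_str()) {
                            *local.entry(key).or_insert(0) += n;
                        }
                    }
                    local
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    // Merge the thread-local counters into the final catalog.
    let mut catalog: HashMap<String, u64> = HashMap::new();
    for partial in partials {
        for (key, n) in partial {
            *catalog.entry(key).or_insert(0) += n;
        }
    }
    catalog
}
```

Keeping the counters thread-local and merging once at the end means the only shared, locked resource in the whole run is the archive writer.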


On to the extraction logic. Each document is inspected as follows:

    // [skip] some document-level context variables and checks
    // `extended_paragraph_iter` covers narrative paragraphs, abstracts, captions,
    'paragraphs: for mut paragraph in document.extended_paragraph_iter() {
        let para = paragraph.dnm.root_node; // the underlying XML node
        // when the prior sibling of the paragraph is a heading title
        // we ignore all other paragraphs, except for the specially marked up cases of acknowledgement and caption, e.g.
        let special_marker = if para_class.contains("ltx_acknowledgement") {
            Some(StructuralEnv::Acknowledgement)
        } else if para_class.contains("ltx_caption") {
            Some(StructuralEnv::Caption)
        }
        // [skip] the rest of this expression is not shown in the excerpt
        // Before we go into tokenization, ensure this is an English paragraph
        if data_helpers::invalid_for_english_latin(&paragraph.dnm) {
            continue 'paragraphs;
        }


So far we have checked that a paragraph has special markup or is preceded by a heading, and that it is identified as English, skipping over all others. Next, we can extract the precise label, and check it is in our whitelist of 46 classes.

    // I. Determine the class for this paragraph entry, so that we can iterate over its content after
    // if no markup at all, ignore the paragraph, as we don't have reliable classification information
    let class_directory = if let Some(env) = special_marker {
        // case 1: special markup for caption and acknowledgement
        env.to_string()
    } else {
        // case 2: AMS markup + accepted AMS class
        let ams_class = if has_ams_markup {
            let parent_class = para_parent.get_attribute("class").unwrap_or_default();
            ams::class_to_env(&parent_class)
        } else {
            None
        };
        if let Some(env) = ams_class {
            match env {
                // Other and other-like entities that are too noisy to include
                // New for 2019: ignore the low-volume cases as well
                AmsEnv::Affirmation
                | AmsEnv::Algorithm
                // |.. [skip] 19 other variants
                | AmsEnv::Other => continue 'paragraphs,
                whitelisted => whitelisted.to_string(),
            }
        // case 3: structural heading markup
        // [skip] the opening of this branch is not shown in the excerpt
            &document.corpus.tokenizer,
            &mut context,
        ) {
            if env == StructuralEnv::Other {
                // any of the other 6+ million headings that are not whitelisted, ignore
                continue 'paragraphs;
            }
            // otherwise, any of the `StructuralEnv` enum variants are accepted classes
            env.to_string()
        }
        //... skip other cases


There is a lot going on behind the scenes in this snippet. The ams::class_to_env helper performs a rather ambitious lookup, mapping LaTeX-defined environment names, cleanly and reliably preserved in the HTML attributes, down to their canonical statement classes. The work behind that mapping was part of a survey that went over 20,000 of the author-provided AMS classes, retaining the ones with clear and robust intent.
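
As a rough picture of what such a lookup involves, here is a toy version. Everything in it is an assumption for illustration: the `ltx_theorem_` prefix reflects how I understand latexml to encode the environment name in the class attribute, the alias lists are invented examples, and `SketchAmsEnv` / `sketch_class_to_env` are hypothetical names. The real mapping covers far more variants.

```rust
/// A tiny subset of canonical statement classes (hypothetical enum).
#[derive(Debug, PartialEq, Clone, Copy)]
enum SketchAmsEnv {
    Lemma,
    Proof,
    Theorem,
    Other,
}

/// Map an HTML `class` attribute to a canonical environment.
/// Assumes values shaped like "ltx_theorem ltx_theorem_lem"; returns `None`
/// when no environment name is present at all.
fn sketch_class_to_env(class_attr: &str) -> Option<SketchAmsEnv> {
    // Find the token that carries the author-chosen environment name.
    let name = class_attr
        .split_whitespace()
        .find_map(|token| token.strip_prefix("ltx_theorem_"))?;
    // Collapse author-specific spellings onto one canonical class.
    let env = match name.to_lowercase().as_str() {
        "lemma" | "lem" | "lemm" => SketchAmsEnv::Lemma,
        "proof" | "pf" => SketchAmsEnv::Proof,
        "theorem" | "thm" | "theo" => SketchAmsEnv::Theorem,
        // Anything without a clear intent is recast as `Other`.
        _ => SketchAmsEnv::Other,
    };
    Some(env)
}
```

Separating "no AMS markup" (`None`) from "markup with unclear intent" (`Other`) mirrors the snippet above, where `Other` paragraphs are skipped outright while unmarked ones fall through to the heading check.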

The data_helpers::heading_from_node_aux helper hides quite a significant amount of logic as well. It uses llamapun’s “document narrative map” abstractions (DNM) to obtain a robust plain text version of the heading element, discarding e.g. tag markup for section numbers and cleanly stripping away styling information. It then performs normalization on the plain text, reducing a shortlist of known compound headings such as “Proof of theorem ref” down to proof. Finally, this normalized heading string is mapped into a StructuralEnv enum variant, checking it against the whitelist we experimentally defined, recasting anything outside it as an “Other” label.
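
The normalize-then-map flow can be sketched in a few lines. This is a simplified illustration under my own assumptions: the cleanup rule (keep letters only), the single compound-heading rule and the three-variant `HeadingEnv` enum are all hypothetical, and the real helper works on DNM nodes rather than raw strings.

```rust
/// A tiny subset of heading-derived classes (hypothetical enum).
#[derive(Debug, PartialEq)]
enum HeadingEnv {
    Introduction,
    Proof,
    RelatedWork,
    Other,
}

/// Reduce a raw heading to a normalized lowercase phrase.
fn normalize_heading(raw: &str) -> String {
    // Lowercase and keep only letters, which drops section numbers
    // and punctuation such as "2.1" or ":".
    let cleaned: String = raw
        .to_lowercase()
        .chars()
        .map(|c| if c.is_alphabetic() { c } else { ' ' })
        .collect();
    let words: Vec<&str> = cleaned.split_whitespace().collect();
    // Compound headings: "proof of theorem ref" is reduced to "proof".
    if words.len() >= 2 && words[0] == "proof" && words[1] == "of" {
        return "proof".to_string();
    }
    words.join(" ")
}

/// Check the normalized heading against a (toy) whitelist.
fn heading_to_env(raw: &str) -> HeadingEnv {
    match normalize_heading(raw).as_str() {
        "introduction" => HeadingEnv::Introduction,
        "proof" => HeadingEnv::Proof,
        "related work" | "related works" => HeadingEnv::RelatedWork,
        // Everything outside the whitelist is recast as `Other`.
        _ => HeadingEnv::Other,
    }
}
```

For instance, `heading_to_env("1. Introduction")` yields `HeadingEnv::Introduction`, while an unlisted heading such as "Numerical curiosities" falls into `HeadingEnv::Other` and would be skipped.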

At this point, we have skipped all paragraphs without a whitelisted statement class. We have retained special markup, whitelisted AMS markup, and whitelisted structural heading markup. Thus, knowing this is a paragraph to retain, we need to normalize it to a plain text form, derived from its HTML node:

    // II. We have a labeled statement. Extract content of current paragraph, validating basic data quality
    let mut word_count = 0;
    let mut invalid_paragraph = false;
    let mut paragraph_buffer = String::new();
    'words: for word in paragraph.word_and_punct_iter() {
        let word_string = match data_helpers::ams_normalize_word_range(//...
        {
            Ok(w) => w,
            Err(_) => {
                invalid_count += 1;
                invalid_paragraph = true;
                break 'words;
            }
        };
        if !word_string.is_empty() {
            word_count += 1;
            paragraph_buffer.push_str(&word_string);
            paragraph_buffer.push(' ');
        }
    }
    // Discard paragraphs outside of a reasonable [4,1024] word count range
    if word_count < 4 || word_count > 1024 {
        invalid_count += 1;
        invalid_paragraph = true;
    }

    // If paragraph was valid and contains text, record it
    if !invalid_paragraph {
        paragraph_buffer.push('\n');
        paragraph_count += 1;
        // precompute sha inside the thread, to do more in parallel
        let paragraph_filename = hash_file_path(&class_directory, &paragraph_buffer);
    }


This is a bit more direct. We iterate over a paragraph’s words and punctuation, and collect words for our specific use case. The ams_normalize_word_range helper lets us pass in a set of configuration options and choose whether to e.g. keep or discard math, punctuation, letter case. It also internally handles substituting the MathML representation of formulas with their sub-formula lexemes (a special feature of this dataset, provided via latexml’s tokenization of math expressions). Valid paragraphs are collected with their on-archive name prepared, for followup serialization to disk.
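
The same collect-and-validate step can be written as a small standalone function, which makes the two rejection paths easy to see. This is a sketch under assumptions: `paragraph_to_line` and `lowercase_no_punct` are hypothetical names, and the toy normalizer only approximates the kind of options the real helper exposes. The [4, 1024] bounds are the ones from the post.

```rust
/// Word-count bounds from the post: paragraphs outside [4, 1024] are dropped.
const MIN_WORDS: usize = 4;
const MAX_WORDS: usize = 1024;

/// Join normalized words into one line, or return `None` when a word fails
/// to normalize or the paragraph length is out of range.
/// `normalize` stands in for llamapun's word normalization helper.
fn paragraph_to_line<F>(words: &[&str], normalize: F) -> Option<String>
where
    F: Fn(&str) -> Result<String, ()>,
{
    let mut buffer = String::new();
    let mut word_count = 0;
    for word in words.iter().copied() {
        // A single invalid word invalidates the whole paragraph.
        let normalized = normalize(word).ok()?;
        if !normalized.is_empty() {
            word_count += 1;
            buffer.push_str(&normalized);
            buffer.push(' ');
        }
    }
    if !(MIN_WORDS..=MAX_WORDS).contains(&word_count) {
        return None;
    }
    buffer.push('\n');
    Some(buffer)
}

/// Toy normalization: lowercase words, discard pure punctuation,
/// and reject words that contain control characters.
fn lowercase_no_punct(word: &str) -> Result<String, ()> {
    if word.chars().any(|c| c.is_control()) {
        return Err(());
    }
    if word.chars().all(|c| c.is_ascii_punctuation()) {
        return Ok(String::new());
    }
    Ok(word.to_lowercase())
}
```

Note that discarded tokens (here, punctuation) do not count towards the word total, so a paragraph of three words and a full stop is still rejected as too short.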

Lastly, having collected all appropriate paragraphs for this document, we can lock the tar builder and write the data to disk and deallocate it, keeping the RAM footprint of the traversal contained.

    // III. Record valid entries into archive target, having collected all labeled samples for this document
    let mut builder_lock = tar_builder.lock().unwrap();
    for (paragraph_buffer, paragraph_filename) in thread_data.into_iter() {
        builder_lock
            .save(&paragraph_buffer, &paragraph_filename)
            .expect("Tar builder should always succeed.")
    }
    // IV. Bookkeep counts for final report and finish this document
    }


Each statement entry is named after the SHA-256 of its contents, and is added to one of 46 subdirectories named after the statement class.

Three and a half hours later, a 40 GB tar file containing 22.1 million statement paragraphs is ready for experimentation!

## The arXMLiv Statement Classification Dataset, 2019

While the numbers and scope in this post are still tentative, I can report a very promising look into the new extraction run over the 1.37 million arXiv articles, up to 08.2019.

With 22.1 million paragraphs collected out of a total of 97.6 million, as defined by our “extended paragraph” iterator, we are retaining about 22% of the total paragraph volume available in the dataset. As a very loose estimate, given that our embedding statistics show $\approx 15.2$ billion tokens across all paragraphs, we could estimate that this statement set contains $\approx 3$ billion tokens.
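
Spelling out the arithmetic behind that loose estimate, under the simplifying assumption that statement paragraphs have the same average token length as paragraphs in general (which has not been measured here):

$$\frac{22.1}{97.6} \approx 0.226, \qquad 0.226 \times 15.2\ \text{billion} \approx 3.4\ \text{billion tokens}$$

Proportional scaling therefore lands a little above 3 billion, consistent with the rounded figure at the level of precision intended.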

Thus, for our 46 classes of choice, the distinct paragraphs extracted are ranked as follows:

For convenience, here is the same table in alphabetical order:

## Note on Reuse

At its worst, the final tar file contains 7 million files in a single subdirectory. Unpacking it as-is can cause outright errors on your local filesystem, or extreme slowness in operations that were not written with large directories in mind. So instead, the tar is best used by walking it directly, and re-mapping the data into another resource. Commonly, I would use the word embeddings to map each token string to its embedding id, do the same for the class and label id, and transfer that now model-specific data to an HDF5 file, ready to be used in a Jupyter notebook workflow.
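
The re-mapping step itself is simple once an entry has been read from the archive. The sketch below covers only that part, with neither the tar walking nor the HDF5 writing, and it rests on assumptions of mine: ids start at 1, id 0 is reserved for out-of-vocabulary tokens, and the entry path begins with the class directory. All names are hypothetical.

```rust
use std::collections::HashMap;

/// Id reserved for tokens missing from the vocabulary (an assumption of this sketch).
const UNKNOWN_ID: u32 = 0;

/// Build a token -> id lookup from the vocabulary order; ids start at 1.
fn build_vocab(tokens: &[&str]) -> HashMap<String, u32> {
    tokens
        .iter()
        .enumerate()
        .map(|(i, token)| ((*token).to_string(), i as u32 + 1))
        .collect()
}

/// Turn one archive entry (its path plus its plain text) into a
/// (label id, token ids) pair, ready for a numeric container.
/// Returns `None` when the class directory is not among `labels`.
fn encode_entry(
    entry_path: &str,
    text: &str,
    vocab: &HashMap<String, u32>,
    labels: &[&str],
) -> Option<(u32, Vec<u32>)> {
    // The class is the leading directory of the entry path.
    let class = entry_path.split('/').next()?;
    let label = labels.iter().position(|l| *l == class)? as u32;
    let ids = text
        .split_whitespace()
        .map(|token| vocab.get(token).copied().unwrap_or(UNKNOWN_ID))
        .collect();
    Some((label, ids))
}
```

Processing entries one at a time like this keeps memory use flat, and nothing ever has to be unpacked into a 7-million-file directory.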

I haven’t made this new statement set public yet, but certainly intend to do so shortly, after another couple of integrity checks. Until then I warmly recommend getting started with the 2018 statement classification set if modeling scientific discourse piques your interest!