arXiv’s Statements: a preprocessing dive into arXMLiv 08.2019

Oct 1, 2019

LLamapun

arXiv

Data pre-processing

Text corpus

https://creativecommons.org/licenses/by/4.0/

We explore the labeled statements of arXiv, as marked up by the authors, and extract a dataset for supervised training.

Overview

This is part 2 in a blog series going through the practical steps of extracting a statement classification dataset from arXiv.org. The first part covered a tour of arXiv’s headings, and you can jump into the formal task description at our paper preprint.

This post: extract the annotated statement resource, and organize it for redistribution.

Tools

I am using our own homegrown llamapun toolkit for the data wrangling, allowing for the full preprocessing and extraction to take place in 3.5 hours on 32 logical threads, for our dataset of 1.37 million HTML5 documents. The performance is achieved via the jwalk and rayon parallel processing crates, and the generally low overhead of using Rust.

It took a new minor release 0.3.4 to get a number of small pieces in place:

• added an extra control on paragraph length: between 4 and 1024 words,
• got closer to best practice on a couple of tokenization issues (related to apostrophes/possessives and formula lexemes),
• made certain XPath selectors more robust, to ensure statement coverage that is as exhaustive as possible,
• extended the selection scope from the 2018 release; very notably, captions were added, which provided a lot of new volume,
• fine-tuned heading normalization rules for extra precision, although they remain heuristic and prone to edge-case issues,
• regenerated the token model and GloVe embeddings we release with the data,
• and naturally, finalized the statement class whitelist, using a rough threshold of 10,000 paragraphs as a bare minimum for inclusion.

The final class list

In arXiv’s headings I ended on a cliffhanger: we did all the work to set up the tools, extract summaries from the data and spot deficiencies, yet we did not actually arrive at the final participants in the updated statement task. Which statements made the cut? After going through a full run to quantify the volume, and surveying some examples, we arrive at the following 46 categories:

abstract	acknowledgement	analysis	application	assumption	background	caption
case	claim	conclusion	condition	conjecture	contribution	corollary
data	dataset	definition	demonstration	description	discussion	example
experiment	fact	future work	implementation	introduction	lemma	methods
model	motivation	notation	observation	preliminaries	problem	proof
property	proposition	question	related work	remark	result	simulation
step	summary	theorem	theory

It may appear ironic that even though we cast a wide net over all section headings, we ended up with 46 classes, four fewer than the 2018 set of 50. Yet only 26 of the original 50 classes contained more than 10,000 entries, so the comparison is quite ill-posed. In fact, even though we keep enhancing the quality control measures in data collection (e.g. constraining the paragraph size), the 26 original high-volume classes are still available to be extracted from the 2019 data, with a reliably higher volume of $+10\%$ or more. (Always an exception to the rule: overview is now excluded, outshone by a much more voluminous summary.)

A full table with the class frequencies will be available at the end, after we pass through the extraction steps.

The Exclusions

What did we ignore? We skip over all other 6+ million entries from our tentative report on heading volume (GitHub gist), which did not meet our frequency requirement. Also ignored are thousands of low-frequency \newtheorem declarations via the amsthm LaTeX package, for the same reason of low volume. Both data streams remain readily available in the arXMLiv 08.2019 release and can be extracted if/when needed for experiments that focus on breadth, rather than volume. I also excluded a few classes that usually do not contain narrative statements, but cover metadata or structured content, such as: references, appendix, algorithm and keywords.

Importantly, over a dozen low-volume/out-of-scope entries that were part of the officially released 2018 task definition are no longer extracted. They are:

affirmation	answer	bound	comment	condition	constraint	convention	criterion
exercise	expansion	expectation	explanation	hint	issue	keywords	note
notice	principle	rule	solution	overview

All excluded classes may return in future versions of the data, or in other task formulations, as they are certainly valid aspects of scientific discourse.

Extracting the dataset

We’ll walk through the highlights of the statement extraction example in llamapun.

The example takes as input a path to a corpus directory, containing HTML files generated by latexml, and a target filename for a Tar archive.

  $ cargo run --release --example corpus_statement_paragraphs_model \
      /data/datasets/dataset-arXMLiv-08-2019/ /var/local/statement_paragraphs_arxmliv_08_2019.tar

We use a single builder to write the archive, held in a thread-friendly mutex, so that threads can keep adding data to the same target archive without any race conditions.

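  use std::collections::HashSet;
  use std::fs::File;
  use std::sync::{Arc, Mutex};
  use tar::{Builder, Header};
  // the archive writer, plus the names of the entries already written
  struct TarBuilder {
    builder: Builder<File>,
    names: HashSet<String>,
  }
  let tar_builder = Arc::new(Mutex::new(TarBuilder { // ...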

Our builder also bookkeeps a HashSet with previously seen paragraph shas, to ensure each entry added to the resource is distinct and there is no overlap between classes.
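
As a rough illustration of that bookkeeping, here is a minimal, standard-library-only sketch. It is not llamapun's implementation: `DedupStore`, `content_name` and `save_concurrently` are hypothetical names, a `Vec` stands in for the tar archive, and `DefaultHasher` is only a placeholder for the real content digest.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for the archive writer: entries land in a Vec, not a tar file.
struct DedupStore {
    entries: Vec<(String, String)>, // (entry name, content)
    names: HashSet<String>,         // names already written
}

impl DedupStore {
    fn new() -> Self {
        DedupStore { entries: Vec::new(), names: HashSet::new() }
    }

    /// Records the entry unless its name was seen before; returns true when written.
    fn save(&mut self, content: &str, name: &str) -> bool {
        // HashSet::insert returns false if the value was already present
        if !self.names.insert(name.to_string()) {
            return false;
        }
        self.entries.push((name.to_string(), content.to_string()));
        true
    }
}

/// Content-derived name; DefaultHasher is a placeholder for a real digest.
fn content_name(content: &str) -> String {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// Saves every paragraph from several threads; returns how many distinct entries were kept.
fn save_concurrently(batches: Vec<Vec<String>>) -> usize {
    let store = Arc::new(Mutex::new(DedupStore::new()));
    let handles: Vec<_> = batches
        .into_iter()
        .map(|batch| {
            let store = Arc::clone(&store);
            thread::spawn(move || {
                for paragraph in batch {
                    let name = content_name(&paragraph);
                    // the lock is held only for the duration of one write
                    store.lock().unwrap().save(&paragraph, &name);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let kept = store.lock().unwrap().entries.len();
    kept
}
```

Because the name is derived from the content alone, a paragraph that shows up under two different labels is only written once, whichever thread gets there first.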

The main work is done in the parallel traversal via jwalk, yielding each document locally to a thread and executing the statement extraction code. It is implemented as part of llamapun’s parallel_data::Corpus.

  let mut corpus = Corpus::new(corpus_path);
  let catalog = corpus.catalog_with_parallel_walk(|doc| {
    extract_document_statements(doc, tar_builder.clone(), discard_math_flag)
  });
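
To make the shape of that call more concrete, here is a small sketch of a parallel walk that merges per-document counters into one catalog. It assumes plain `std::thread::scope` workers over a slice of document paths instead of jwalk and rayon, and `catalog_with_threads` is a hypothetical name, not part of llamapun.

```rust
use std::collections::HashMap;
use std::thread;

/// Runs `per_doc` over all documents on up to `workers` threads and sums the
/// per-document counters into a single catalog.
fn catalog_with_threads<F>(docs: &[String], workers: usize, per_doc: F) -> HashMap<String, u64>
where
    F: Fn(&str) -> HashMap<String, u64> + Sync,
{
    let workers = workers.max(1);
    // ceiling division, and never a zero-sized chunk
    let chunk_size = ((docs.len() + workers - 1) / workers).max(1);
    let per_doc = &per_doc;
    thread::scope(|scope| {
        let handles: Vec<_> = docs
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    // each worker accumulates locally, without any locking
                    let mut local: HashMap<String, u64> = HashMap::new();
                    for doc in chunk {
                        for (key, value) in per_doc(doc.as_str()) {
                            *local.entry(key).or_insert(0) += value;
                        }
                    }
                    local
                })
            })
            .collect();
        // merge the per-worker counters once all threads are done
        let mut catalog: HashMap<String, u64> = HashMap::new();
        for handle in handles {
            for (key, value) in handle.join().unwrap() {
                *catalog.entry(key).or_insert(0) += value;
            }
        }
        catalog
    })
}
```

The static chunking here is the simplest possible scheduling; a work-stealing pool would likely balance uneven document sizes better.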

On to the extraction logic. Each document is inspected as follows:

  // [skip] some document-level context variables and checks
  // `extended_paragraph_iter` covers narrative paragraphs, abstracts, captions,
  'paragraphs: for mut paragraph in document.extended_paragraph_iter() {
    let para = paragraph.dnm.root_node; // the underlying XML node
    // ...[skip]... setup for prev_heading_opt which contains Some(heading_node)
    // when the prior sibling of the paragraph is a heading title
    // we ignore all other paragraphs, except for the specially marked up cases of acknowledgement and caption, e.g.
    let special_marker = if para_class.contains("ltx_acknowledgement") {
      Some(StructuralEnv::Acknowledgement)
    } else if para_class.contains("ltx_caption") {
      Some(StructuralEnv::Caption)
    } else {
      None
    };
    // Before we go into tokenization, ensure this is an English paragraph
    if data_helpers::invalid_for_english_latin(&paragraph.dnm) {
      continue 'paragraphs;
    }

So far we have checked that a paragraph has special markup or is preceded by a heading, and that it is identified as English, skipping over all others. Next, we can extract the precise label, and check that it is in our whitelist of 46 classes.

    // I. Determine the class for this paragraph entry, so that we can iterate over its content after
    // if no markup at all, ignore the paragraph, as we don’t have reliable classification information
    let class_directory = if let Some(env) = special_marker {
      // case 1: special markup for caption and acknowledgement
      env.to_string()
    } else {
      // case 2: AMS markup + accepted AMS class
      let ams_class = if has_ams_markup {
        let parent_class = para_parent.get_attribute("class").unwrap_or_default();
        ams::class_to_env(&parent_class)
      } else {
        None
      };
      if let Some(env) = ams_class {
        match env {
          // Other and other-like entities that are too noisy to include
          // New for 2019: ignore the low-volume cases as well
          AmsEnv::Affirmation
          | AmsEnv::Algorithm
          // |.. [skip] 19 other variants
          | AmsEnv::Other => continue 'paragraphs,
          whitelisted => whitelisted.to_string(),
        }
      } else if let Some(heading_node) = prev_heading_opt {
        // case 3: structural heading markup
        if let Some(heading_text) = data_helpers::heading_from_node_aux(
          heading_node,
          &document.corpus.tokenizer,
          &mut context,
        ) {
          let env: StructuralEnv = heading_text.as_str().into();
          if env == StructuralEnv::Other {
            // any of the other 6+ million headings that are not whitelisted, ignore
            continue 'paragraphs;
          }
          // otherwise, any of the `StructuralEnv` enum variants are accepted classes
          env.to_string()
        }
    //... skip other cases

There is a lot going on behind the scenes in this snippet. The ams::class_to_env performs a rather ambitious lookup, mapping LaTeX-defined environment names, cleanly and reliably preserved in the HTML attributes, down to their canonical statement classes. The work behind that mapping was part of a survey that went over 20,000 of the author-provided AMS classes, retaining the ones with clear and robust intent.
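
To give a feel for such a lookup, here is a toy sketch. It is not the llamapun implementation: the three-entry table is invented for illustration, and the assumed convention that the author's environment name appears in a class token of the form `ltx_theorem_<name>` should be checked against the actual HTML before relying on it.

```rust
/// A few canonical statement classes; the real enum has many more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AmsEnv {
    Definition,
    Lemma,
    Theorem,
    Other,
}

/// Maps an HTML class attribute to a canonical environment.
/// Returns None when no theorem-like token is present at all.
fn class_to_env(class_attr: &str) -> Option<AmsEnv> {
    // Assumed convention: a token such as "ltx_theorem_lem" carries the author's name.
    let name = class_attr
        .split_whitespace()
        .find_map(|token| token.strip_prefix("ltx_theorem_"))?;
    // Illustrative table only: many author-chosen spellings collapse into one class.
    let env = match name.to_lowercase().as_str() {
        "def" | "defn" | "definition" => AmsEnv::Definition,
        "lem" | "lemma" => AmsEnv::Lemma,
        "thm" | "theorem" => AmsEnv::Theorem,
        _ => AmsEnv::Other,
    };
    Some(env)
}
```

Anything recognized as theorem-like but not in the table falls into `Other`, which the extraction loop then skips.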

The data_helpers::heading_from_node_aux hides quite a significant amount of logic as well. It uses llamapun’s “document narrative map” abstractions (DNM) to obtain a robust plain text version of the heading element, discarding e.g. tag markup for section numbers and cleanly stripping away styling information. It then performs normalization on the plain text, reducing a shortlist of known compound headings such as “Proof of theorem ref” down to proof. Finally, this normalized heading string is mapped into a StructuralEnv enum, checking it against the whitelist we experimentally defined, and recasting anything outside it as an “Other” label.
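
A heavily simplified sketch of the last two steps, normalization and mapping, could look as follows. The single "proof of ..." rule, the four-variant enum and the `normalize_heading` name are all illustrative assumptions; the real normalization rules and whitelist are considerably larger.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StructuralEnv {
    Introduction,
    Proof,
    RelatedWork,
    Other,
}

/// Lowercases a heading, drops digits and punctuation, and collapses one known compound form.
fn normalize_heading(raw: &str) -> String {
    let cleaned: String = raw
        .to_lowercase()
        .chars()
        .map(|c| if c.is_alphabetic() || c.is_whitespace() { c } else { ' ' })
        .collect();
    let words: Vec<&str> = cleaned.split_whitespace().collect();
    match words.as_slice() {
        // compound headings such as "proof of theorem ref" reduce to "proof"
        ["proof", "of", ..] => "proof".to_string(),
        _ => words.join(" "),
    }
}

impl From<&str> for StructuralEnv {
    /// Anything outside the (here tiny) whitelist is recast as Other.
    fn from(heading: &str) -> Self {
        match heading {
            "introduction" => StructuralEnv::Introduction,
            "proof" => StructuralEnv::Proof,
            "related work" | "related works" => StructuralEnv::RelatedWork,
            _ => StructuralEnv::Other,
        }
    }
}
```

With the `From<&str>` implementation in place, a call site can write `let env: StructuralEnv = heading_text.as_str().into();`, as in the extraction snippet above.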

At this point, we have skipped all paragraphs without a whitelisted statement class. We have retained special markup, whitelisted AMS markup, and whitelisted structural heading markup. Thus, knowing this is a paragraph to retain, we need to normalize it to a plain text form, derived from its HTML node:

  // II. We have a labeled statement. Extract content of current paragraph, validating basic data quality
  let mut word_count = 0;
  let mut invalid_paragraph = false;
  let mut paragraph_buffer = String::new();
  'words: for word in paragraph.word_and_punct_iter() {
    let word_string = match data_helpers::ams_normalize_word_range(//...
    {
      Ok(w) => w,
      Err(_) => {
        invalid_count += 1;
        invalid_paragraph = true;
        break 'words;
      }
    };
    if !word_string.is_empty() {
      word_count += 1;
      paragraph_buffer.push_str(&word_string);
      paragraph_buffer.push(' ');
    }
  }
  // Discard paragraphs outside of a reasonable [4,1024] word count range
  if word_count < 4 || word_count > 1024 {
    invalid_count += 1;
    invalid_paragraph = true;
  }

  // If paragraph was valid and contains text, record it
  if !invalid_paragraph {
    paragraph_buffer.push('\n');
    paragraph_count += 1;
    // precompute sha inside the thread, to do more in parallel
    let paragraph_filename = hash_file_path(&class_directory, &paragraph_buffer);
    thread_data.push((paragraph_buffer, paragraph_filename));
  }

This is a bit more direct. We iterate over a paragraph’s words and punctuation, and collect words for our specific use case. The ams_normalize_word_range helper accepts a set of configuration options, choosing whether to e.g. keep or discard math, punctuation and letter case. It also internally handles substituting the MathML representation of formulas with their sub-formula lexemes (a special feature of this dataset, provided via latexml’s tokenization of math expressions). Valid paragraphs are collected with their in-archive name prepared, for follow-up serialization to disk.
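
The overall shape of this step, stripped of the DNM machinery, can be sketched in a few lines. `NormalizeOptions` and `normalize_paragraph` are hypothetical names and the two switches are only examples of the kind of options mentioned above; math handling is left out entirely.

```rust
/// Hypothetical options, mirroring the kind of switches described above.
struct NormalizeOptions {
    discard_punct: bool,
    lowercase: bool,
}

/// Joins the normalized tokens, each followed by a space, and ends with a newline.
/// Returns None when the word count falls outside the [4, 1024] range.
fn normalize_paragraph(tokens: &[&str], options: &NormalizeOptions) -> Option<String> {
    let mut word_count = 0usize;
    let mut buffer = String::new();
    for token in tokens {
        // when punctuation is discarded, drop tokens with no letters or digits
        if options.discard_punct && !token.chars().any(|c| c.is_alphanumeric()) {
            continue;
        }
        let word = if options.lowercase {
            token.to_lowercase()
        } else {
            (*token).to_string()
        };
        word_count += 1;
        buffer.push_str(&word);
        buffer.push(' ');
    }
    if !(4..=1024).contains(&word_count) {
        return None;
    }
    buffer.push('\n');
    Some(buffer)
}
```

Returning an `Option` keeps the length check and the buffer construction in one place, so a caller can simply skip any paragraph that comes back as `None`.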

Lastly, having collected all appropriate paragraphs for this document, we can lock the tar builder and write the data to disk and deallocate it, keeping the RAM footprint of the traversal contained.

  // III. Record valid entries into archive target, having collected all labeled samples for this document
  let mut builder_lock = tar_builder.lock().unwrap();
  for (paragraph_buffer, paragraph_filename) in thread_data.into_iter() {
    builder_lock
      .save(&paragraph_buffer, &paragraph_filename)
      .expect("Tar builder should always succeed.")
  }
  // IV. Bookkeep counts for final report and finish this document
  thread_counts.insert(String::from("paragraph_count"), paragraph_count);
  thread_counts.insert(String::from("invalid_count"), invalid_count);
  thread_counts
}

Each statement entry is named after the SHA-256 of its contents, and is added to one of 46 subdirectories named after the statement class.

Three and a half hours later, a 40 GB tar file containing 22.1 million statement paragraphs is ready for experimentation!

The arXMLiv Statement Classification Dataset, 2019

While the numbers and scope in this post are still tentative, I can report a very promising look into the new extraction run over the 1.37 million arXiv articles, up to 08.2019.

With 22.1 million paragraphs collected, from a total of 97.6 million, as defined by our “extended paragraph” iterator, we are retaining about 22.6% of the total paragraph volume available in the dataset. As a very loose estimate, given that our embeddings statistics show $\approx 15.2$ billion tokens from all paragraphs, we could estimate that this statement set contains $\approx 3$ billion tokens.
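
Spelled out as a back-of-the-envelope calculation, and assuming that statement paragraphs have the same average token length as paragraphs overall (which need not hold, since e.g. captions may well be shorter than average), the proportional scaling reads:

```latex
\[
  \frac{22.1\ \text{million}}{97.6\ \text{million}} \approx 0.226,
  \qquad
  0.226 \times 15.2\ \text{billion} \approx 3.4\ \text{billion tokens}
\]
```

This lands in the same ballpark as the loose figure quoted above, and should be read as an order-of-magnitude estimate rather than a measurement.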

Thus, for our 46 classes of choice, the distinct paragraphs extracted are ranked as follows:

Class	Entries
caption	7,098,238
proof	2,719,458
lemma	1,513,073
theorem	1,510,103
abstract	1,167,923
introduction	1,056,110
proposition	940,306
definition	844,670
remark	797,994
acknowledgement	680,991
conclusion	511,117
corollary	493,600
example	390,229
model	343,543
result	299,991
discussion	192,629
summary	139,725
problem	126,985
experiment	120,689
analysis	120,661
methods	119,913
claim	94,910
observation	70,621
notation	69,567
preliminaries	68,695
property	65,284
conjecture	64,350
simulation	59,396
related work	54,910
condition	46,124
assumption	40,409
question	39,777
background	34,819
contribution	29,205
description	25,337
demonstration	24,984
fact	20,846
motivation	16,887
case	15,058
step	14,255
application	13,212
future work	12,263
implementation	10,849
data	10,589
dataset	9,738
theory	7,184

Table 1: Tentative Statement Classification Dataset, 2019 edition

For convenience, here is the same table in alphabetical order:

Class	Entries
abstract	1,167,923
acknowledgement	680,991
analysis	120,661
application	13,212
assumption	40,409
background	34,819
caption	7,098,238
case	15,058
claim	94,910
conclusion	511,117
condition	46,124
conjecture	64,350
contribution	29,205
corollary	493,600
data	10,589
dataset	9,738
definition	844,670
demonstration	24,984
description	25,337
discussion	192,629
example	390,229
experiment	120,689
fact	20,846
future work	12,263
implementation	10,849
introduction	1,056,110
lemma	1,513,073
methods	119,913
model	343,543
motivation	16,887
notation	69,567
observation	70,621
preliminaries	68,695
problem	126,985
proof	2,719,458
property	65,284
proposition	940,306
question	39,777
related work	54,910
remark	797,994
result	299,991
simulation	59,396
step	14,255
summary	139,725
theorem	1,510,103
theory	7,184

Table 2: Tentative Statement Classification Dataset, 2019 edition (alphabetical)

Note on Reuse

In the worst case, the final tar file holds 7 million files in a single subdirectory (the caption class). Unpacking it as-is can cause outright errors with your local filesystem, or extreme slowness in operations that were not written with large directories in mind. So instead, the tar is best used by walking it directly, and re-mapping the data into another resource. Commonly, I would use the word embeddings to map each token string to its embedding id, map each class to its label id, and transfer that now model-specific data to an HDF5 file, ready to be used in a Jupyter notebook workflow.
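
The re-mapping step itself is small. Below is a minimal sketch that assumes entries are named `<class>/<hash>.txt` and that id 0 is reserved for out-of-vocabulary tokens; both conventions, as well as the `build_vocab` and `encode_entry` names, are illustrative choices and not part of the released data. Reading the tar and writing the HDF5 file are left out.

```rust
use std::collections::HashMap;

/// Builds a token -> id lookup from an ordered vocabulary; id 0 is kept for unknown tokens.
fn build_vocab(tokens_in_order: &[&str]) -> HashMap<String, u32> {
    tokens_in_order
        .iter()
        .enumerate()
        .map(|(index, token)| (token.to_string(), index as u32 + 1))
        .collect()
}

/// Converts an archive entry path and its text into (label id, token ids).
/// Returns None when the class directory is not in `classes`.
fn encode_entry(
    path: &str,
    text: &str,
    vocab: &HashMap<String, u32>,
    classes: &[&str],
) -> Option<(u32, Vec<u32>)> {
    // the first path component is the statement class
    let class_name = path.split('/').next()?;
    let label = classes.iter().position(|c| *c == class_name)? as u32;
    let ids = text
        .split_whitespace()
        .map(|token| vocab.get(token).copied().unwrap_or(0))
        .collect();
    Some((label, ids))
}
```

Encoding each entry as it is read from the archive means the 7 million caption files never need to exist as individual files on disk.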

I haven’t made this new statement set public yet, but certainly intend to do so shortly, after another couple of integrity checks. Until then I warmly recommend getting started with the 2018 statement classification set if modeling scientific discourse piques your interest!