VARD and EEBO TCP

For the past few months, when I've had the chance between teaching, research, and my own work, I've been assisting the Early Modern Conversions project here at McGill in building a corpus tool for Early English Books Online using the data from the Text Creation Partnership (EEBO TCP). It's been interesting: the objective is to create texts with some measure of orthographic consistency so that large-scale text-analysis tools can be used on them - things like topic modelling, for instance. Because of the variant spellings, scholars normally can't do much with these texts.

We've been using a tool called VARD2, which uses statistical analysis to alter variant spellings in early modern English. The specifics are beyond me, but in short, VARD suggests modern-spelling matches for each variant along with a confidence score, and lets you set a threshold for when a replacement should be applied. You can also train the program, creating stop lists, dictionaries, and so on. Combined, this means we can train VARD to a certain degree and then run it on a range of texts, asking it to 'normalize' all the variant spellings that match at 50% probability or better. With me so far?
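To give a concrete sense of the workflow, here's a minimal sketch of shelling out to VARD from PHP at that 50% threshold. Fair warning: the jar name, setup directory, and argument order below are placeholders of my own, not VARD2's documented command line - check your VARD distribution for the real invocation.

```php
<?php
// Rough sketch of a single VARD call from a managing PHP script.
// ASSUMPTIONS: the jar name (clui.jar), setup directory, and argument
// order are illustrative placeholders, not the documented VARD2 CLI.

$vardJar   = '/opt/vard2/clui.jar';  // hypothetical path to VARD's CLI jar
$setupDir  = '/opt/vard2/setup';     // hypothetical training/setup directory
$threshold = 50;                     // normalize variants matched at >= 50%
$input     = 'texts/A00001.txt';
$outputDir = 'normalized/';

$cmd = sprintf(
    'java -Xmx1g -jar %s %s %d %s %s',
    escapeshellarg($vardJar),
    escapeshellarg($setupDir),
    $threshold,
    escapeshellarg($input),
    escapeshellarg($outputDir)
);

exec($cmd, $output, $status);
if ($status !== 0) {
    fwrite(STDERR, "VARD failed on $input\n");
}
```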

The issues we've been running into are twofold. First, the XML we've obtained from the Text Creation Partnership, while valid for the most part, isn't always consistent. There are inconsistently nested elements, for instance, and some validation problems. Not too bad, but enough to create an issue when you're trying to extract only the elements marked 'ENGLISH' from the XML documents: VARD can't handle Latin, so why take the time to try, right? The method then involves producing, for some 40,000 documents, a set of EEBO TCP files that contain only English text in valid XML, which we can then 'VARD', as we've come to call it.
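Here's roughly what that extraction step looks like in PHP. I'm assuming for illustration that the language is recorded in a LANG attribute on paragraph-level elements - the attribute and element names here aren't guaranteed to match your copy of the TCP markup, so adjust the XPath accordingly.

```php
<?php
// Sketch of the extraction step: keep only English-tagged text from an
// EEBO TCP XML file. ASSUMPTION: the language lives in a LANG attribute
// on paragraph-level <P> elements; both names are illustrative.

libxml_use_internal_errors(true);    // tolerate the odd validation problem

$doc = new DOMDocument();
if (!$doc->load('eebo/A00001.xml')) {
    exit("Could not parse A00001.xml\n");
}

$xpath = new DOMXPath($doc);
// Paragraphs explicitly marked English, or left untagged (inheriting an
// English default) -- the right predicate depends on the encoding.
$nodes = $xpath->query('//P[not(@LANG) or @LANG="eng"]');

$english = [];
foreach ($nodes as $node) {
    $english[] = trim($node->textContent);
}

file_put_contents('texts/A00001.txt', implode("\n\n", $english));
```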

The second problem is VARD itself - while we could use a batch command, the reality is that 40,000 texts make VARD choke pretty quickly. It loads everything into memory, doesn't release it (from what I can see), and then tanks when we try to process files ranging from 50KB to 6MB of text. So we've been running it on single files with high memory settings on both the managing PHP script and the VARD Java command-line call - 1GB for PHP, and 1GB for VARD. So far, so good. But it's resource-heavy, even on a 6-core Mac Pro, and takes days to complete.
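In practice the managing script ends up looking something like the sketch below: one fresh JVM per file, so whatever VARD is holding onto gets released when each process exits. Same caveat as before - the jar path and arguments are my placeholders, not the documented interface.

```php
<?php
// One-file-at-a-time batch run: a fresh JVM per text, so VARD's memory
// is freed when each process exits. The jar path and argument order are
// placeholders (see the earlier sketch).

ini_set('memory_limit', '1024M');    // 1GB for the managing PHP script

$files = glob('texts/*.txt');
$total = count($files);

foreach ($files as $i => $file) {
    $cmd = sprintf(
        'java -Xmx1g -jar /opt/vard2/clui.jar /opt/vard2/setup 50 %s normalized/',
        escapeshellarg($file)
    );
    exec($cmd, $out, $status);
    printf("[%d/%d] %s %s\n", $i + 1, $total, basename($file),
        $status === 0 ? 'ok' : 'FAILED');
}
```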

What we WILL have at the end, however, is a set of 40,000+ early modern texts (containing some 1.5 billion words, and involving thousands of variants) with a modicum of normalized spelling. We're not flattening everything - for instance, we're leaving 'peece' alone, because it's entirely contextual whether 'peece' means 'piece' or 'peace'. But we are cleaning up the files by removing non-alphanumeric characters like |{}+ which are used in EEBO to represent line breaks or notes. We're also expanding vowels with macrons and hoping that VARD will discern whether the following letter ought to be an 'm' or an 'n'. Lastly, we've opted to leave the illegible characters alone, as there's no way to be sure what's going on there.
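The cleanup pass itself is simple string surgery. A minimal sketch - note that expanding every macron vowel to vowel + 'm' is an arbitrary default on my part here; the hope, as I said, is that VARD then corrects the 'm' to an 'n' wherever the context demands it.

```php
<?php
// Sketch of the pre-VARD cleanup pass: strip EEBO's layout markers and
// expand macron vowels. ASSUMPTION: vowel + 'm' is an arbitrary default
// expansion; VARD's normalization is expected to fix 'm' vs 'n'.

function cleanText(string $text): string
{
    // Remove the non-alphanumeric layout characters (line breaks, notes)
    $text = str_replace(['|', '{', '}', '+'], '', $text);

    // Expand macron vowels, e.g. 'cōmand' -> 'command', 'thē' -> 'them'
    $macrons = ['ā' => 'am', 'ē' => 'em', 'ī' => 'im', 'ō' => 'om', 'ū' => 'um'];
    return strtr($text, $macrons);
}

echo cleanText('The kyng gaue cōmandement to thē all');
// -> 'The kyng gaue commandement to them all'
```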

As an early modernist, I'm excited at the idea that we'll be able to run text-analysis tools on a third of the entire corpus of printed early modern English texts. While it won't be perfect, it will allow us to get a sense of what kinds of themes and topics might dominate print discourse at a given moment - or at least provide us the means of thinking through these kinds of problems.
