ABIS Infor - 2016-06

Perl Text Analytics

Peter Vanroose (ABIS) - 15 June 2016

Summary

The Internet holds enormous amounts of (text) data, and it is undoubtedly important and useful to do "something" with it, especially in the context of "business intelligence", if we don't want to miss anything essential. However, the large volumes and the enormous diversity of this data make meaningful Text Analytics extremely difficult. Could the programming language Perl offer interesting added value in this context?

Text analytics and Big Data

Everybody seems to be working with "Big Data" nowadays, but not many people know that the term does not primarily refer to large data volumes. Rather, it refers to the ability to set up data analysis quickly, especially when the incoming data don't match a pre-defined formal (database) structure, vary in structure, or arrive at such high speed that they can't be stored while awaiting the analytics to be done.

Want to know more about what Big Data precisely is? We gladly refer you to our course "Big Data Concepts"¹.

A popular special case of data analysis is the analysis of written text, e.g. Twitter messages, web blogs, newspaper articles, e-mails, you name it. Text analytics often aims at (quickly) extracting relevant elements from the text, e.g. detecting mood or temper (angry, enthusiastic, ...), classifying the text, or summarizing the essential information. This field is called NLP (natural language processing).

Perl and the processing of data streams

Perl is an extraordinary programming language. On the one hand, a Perl program can very efficiently execute "low-level" operations at the operating system level: read/write files and data streams, launch sub-processes, perform system calls. On the other hand, Perl has a surprisingly "high-level" syntax, with data structures like dynamic lists, associative arrays and hierarchical structures, and with program structures (iterations, conditional execution, function calls) that can be formulated in a very "natural" way, almost like a natural language. Not surprisingly, Perl's inventor, Larry Wall, is a linguist!

Reading and continuously processing a data stream, whether or not pre-segmented into lines (i.e. End-Of-Line separated blocks), is one such standard activity. Perl's syntax allows a very readable formulation of such operations without compromising the runtime efficiency, which is almost as if the loop were written out in assembler.

A simple example: keep only the sufficiently long lines of input (e.g. those longer than 20 characters, where the count includes the trailing newline):

	while (<>) { print if (length > 20); }

A more interesting example: write out the length of every word of a text (e.g. to generate a word length histogram, in preparation for a writing style analysis):

	$/ = ' '; while (<>) { chomp; print length, "\n"; }  # chomp strips the trailing blank

Or better yet, immediately let Perl generate the word counts per length (i.e., the histogram itself):

	$/ = ' '; while (<>) { chomp; $c{length($_)}++; }  # count the words per length
	foreach (sort { $a <=> $b } keys %c) { print "$_\t$c{$_}\n"; }

The first assignment redefines the "record separator" so that words are read instead of lines. Strictly speaking this is too simplistic, since text does not always contain exactly one blank between two words. The text should first be preprocessed by another Perl program which replaces every "word separator" (a group of blanks or punctuation) by a single blank:

	while (<>) { s/\W+/ /g; print; }
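
The two steps above (normalising the word separators and counting word lengths) can also be combined. The following is only a minimal sketch, under the assumption that a "word" is a maximal run of word characters (\w+); the function name word_length_histogram and the sample sentence are made up for this illustration:

```perl
use strict;
use warnings;

# Count how many words of each length occur in a text.
# Assumption: a word is a maximal run of word characters (\w+),
# so blanks and punctuation never count towards a word's length.
sub word_length_histogram {
    my ($text) = @_;
    my %count;
    $count{ length $1 }++ while $text =~ /(\w+)/g;
    return %count;
}

# Hypothetical sample input, for illustration only.
my %hist = word_length_histogram("Perl reads text, word by word.");
print "$_\t$hist{$_}\n" for sort { $a <=> $b } keys %hist;
```

Matching the words directly avoids the separate preprocessing pass, at the price of holding the text in memory; for a real data stream one would call the matching loop once per input line.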

Regular expressions

The "\W+" in that last Perl program is already a good example of a so-called regular expression: a description of a "text pattern", in this case "one or more non-word characters", i.e. blanks or punctuation. The "RegExp" language is very powerful, albeit rather cryptic for the layman. Regular expressions can also be used in a lot of places outside of Perl. The nice thing is that, unlike in those other environments, this 'language' is built into Perl itself and hence is processed very efficiently. In a Text Analytics context, regular expressions are indispensable for the quick and flexible 'parsing' of incoming data streams.

A relatively simple example: we want to know whether a text contains dates. There are of course many possible ways to write a date, but suppose we restrict ourselves to the formats 15 Jun 2016, 15/06/2016, and 2016-06-15. The following Perl program prints "date found" if the text contains a date in one of these formats, and stays quiet otherwise:

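	@mnd = qw(jan feb mar apr may jun jul aug sep oct nov dec);  # no quotes inside qw()
	while (<>) {
	  # numeric formats: 15/06/2016 and 2016-06-15
	  found() if (  m!\b[0-3]\d/[01]\d/20\d\d\b!
	             or m!\b20\d\d-[01]\d-[0-3]\d\b!);
	  # textual format: 15 Jun 2016 (month name matched case-insensitively)
	  foreach $m (@mnd) { found() if (m/\b[0-3]\d $m 20\d\d\b/i); }
	}
	sub found { print "date found\n"; exit 0; }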
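
If the dates themselves are needed rather than a yes/no answer, the three patterns can be merged into one regular expression. This is just one possible sketch: the names $date_re and find_dates are invented here, and the pattern is as permissive as the program above (it would for instance accept day 39).

```perl
use strict;
use warnings;

my $months = join '|', qw(jan feb mar apr may jun jul aug sep oct nov dec);

# One pattern for the three formats 15/06/2016, 2016-06-15 and 15 Jun 2016.
# The /x flag allows whitespace in the pattern; a literal blank is written [ ].
my $date_re = qr{
    \b (?: [0-3]\d / [01]\d / 20\d\d
         | 20\d\d - [01]\d - [0-3]\d
         | [0-3]\d [ ] (?i: $months ) [ ] 20\d\d
       ) \b
}x;

# Return the list of all date strings found in the given text.
sub find_dates {
    my ($text) = @_;
    my @dates = $text =~ /($date_re)/g;
    return @dates;
}

# Hypothetical sample input, for illustration only.
print "$_\n" for find_dates("Written on 15 Jun 2016, archived 2016-06-15.");
```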

Similar programs could relatively easily and efficiently search for other "patterns" such as moods, based on a "dictionary" of terms reflecting kinds of temper and a minimal syntactic parser that takes words like "not", "particularly", "seldom" etc. into account. The required Perl program quickly becomes a lot more complicated, but remains very efficient and yet very readable thanks to the compact syntax of Perl.
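
To give an idea of the dictionary approach, here is a deliberately crude sketch. Everything in it is an assumption made for this illustration: the tiny dictionary %mood, the list of negators, and the rule that a negator simply flips the sign of the next mood word, however far away that word is. A realistic mood detector would need a much larger dictionary and a proper notion of sentence structure.

```perl
use strict;
use warnings;

# Hypothetical mini-dictionary: word => mood score (+1 positive, -1 negative).
my %mood = (happy => 1, great => 1, love => 1,
            angry => -1, bad => -1, hate => -1);

# Words that flip the score of the next mood word (illustrative list).
my %negator = map { $_ => 1 } qw(not never seldom);

# Sum the mood scores of all dictionary words in the text.
sub mood_score {
    my ($text) = @_;
    my ($score, $flip) = (0, 1);
    for my $word (lc($text) =~ /(\w+)/g) {
        if ($negator{$word}) {
            $flip = -1;                          # remember the negation
        }
        elsif (exists $mood{$word}) {
            $score += $flip * $mood{$word};
            $flip = 1;                           # negation is used up
        }
    }
    return $score;
}

# Hypothetical sample input, for illustration only.
print mood_score("I love Perl, and regular expressions are not bad"), "\n";
```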

More about Perl in our "Perl Programming Fundamentals Course"².

Perl as glue

Apart from the option of implementing a (text analytics) algorithm completely in Perl, the language is especially suited as glue between existing software components: why re-invent the wheel if someone else provides you with an optimal implementation? In most cases, the practical problem then becomes letting the different solutions for sub-problems "speak to each other". Examples are Hadoop for data storage and MapReduce, R for statistical visualisation, Google Earth for geographic visualisation, and so on. Each of these tools expects input in a particular format, expects to run in a certain "setup", and writes its results in a particular manner.

In those cases, Perl is the ideal partner: one or more small Perl programs link together the different components, stream the data between those data processors, and convert where needed from one format to the other.
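
As a small illustration of such a format conversion, suppose one tool writes tab-separated records while the next one expects CSV. The sketch below is only an assumption about what such glue could look like; the function name tsv_to_csv and the script name tsv2csv.pl are invented for this example.

```perl
use strict;
use warnings;

# Convert one tab-separated record into a CSV line.
# Fields containing a comma, a double quote or a newline are quoted,
# and embedded double quotes are doubled (the usual CSV convention).
sub tsv_to_csv {
    my ($line) = @_;
    chomp $line;
    my @fields = split /\t/, $line, -1;    # -1 keeps trailing empty fields
    for (@fields) {
        if (/[",\n]/) { s/"/""/g; $_ = qq("$_"); }
    }
    return join(',', @fields) . "\n";
}

# Hypothetical sample record, for illustration only.
print tsv_to_csv("3\tthe word \"fun\"\tPerl, of course\n");

# Used as a filter between two tools, the main loop could be:
#   print tsv_to_csv($_) while <STDIN>;
# and the script would be run as:  producer | perl tsv2csv.pl | consumer
```

For anything beyond such simple cases, a dedicated CPAN module is usually the safer choice.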

Even for the somewhat more complex functionality that has to be handled inside Perl, you will rarely have to implement it yourself: Perl has a very extensive library of so-called modules, which can be downloaded from CPAN³. One of the most interesting Perl modules in the context of data processing is DBI, Perl's ODBC-style Database Interface (which we treat in our course "ODBC with Perl"⁴), with which relational databases, but also e.g. CSV files, can be read and modified. Furthermore, there are modules which already offer some Text Analytics functionality, e.g. Search::QueryParser, which allows searching through a text with flexible "approximate" search terms, comparable to those of Google Search.

Conclusion

One should certainly not forget to consider Perl when building a Big Data infrastructure, especially when Text Analytics comes into play!

References: