suffixes

Description

suffixes is a program written in C++ which tries to guess root/suffix boundaries within a word. For example, given the word "jumpers" it would (hopefully) output jump|er|s, meaning that the root is jump, and there are two suffixes: er and s. More properly speaking, suffixes tries to guess suffix morpheme boundaries. It will run on any standard UNIX system, and it should compile in UNIX environment like Cygwin, though I haven't tested it personally.

Installation

Download the source using the link above and uncompress it. A makefile is contained, so all that is required to compile it is to run make in the source directory. The executable is called suffixes.

Usage

For now just run suffixes -h for a (terse) description of the command line parameters

Technical details

Here I'm going to try to explain the idea of successor frequency and my implementation of it. First, we need a little linguistic background. Linguistics is the scientific study of natural language. It is usually classified as a social science, although many subdisciplines such as syntax and (generative) phonology involve some pretty high-falutin' mathematics (e.g., formal language theory). Here we're interested in morphology.

Morphology is the study of word structure. In a lose sense we're interested in answering the question, "What is a word and what is a word made of?" Oddly enough, linguists don't have a very good answer for "What is a word?" aside from "We know it when we see it," but their answer for "What is a word made of?" is something called a morpheme. Roughly speaking a morpheme is the smallest subpart of a word which has meaning. Take the word dogs. The substring og has no meaning whatsoever. The substring dog, however, does, meaning a canine, as does the suffix -s, which indicates plurality. Hence dogs consists of two morphemes: dog and -s. Here dog is called the root and -s is a suffix, or, more generally, an affix. The key to morphology is this: a word is more than just a bag of letters.

As a native speaker of a language, given a word, you can usually find the morphemes reliably. Sometimes there might be ambiguity, but those times are rarer than not. But what if we wanted to teach a computer the morphology of, say, English? It sees a word as a bag of letters, with no substring having any substantive meaning on its own.

One of the easiest solutions rests with the idea of successor frequency. First, we have somewhere a big corpus of English, say, every news article printed in English in the year 2005. The program keeps track of each string, and for each substring keeps track of how many letters can follow that substring. For example, take the word jumping#, where '#' denotes the end of a word.

 6 3 1 4 1 1 1
j u m p i n g #

This means that given the string 'j' there are 6 possible letters which can follow it, where here 'possible' means that we saw it in our big corpus of English. There are three letters which can follow the string 'ju', perhaps attested by jump, juice, and just. After 'jum' there is only one possible letter: 'p'. However, after 'jump' there are four possibilities, probably jump#, jumping, jumps, and jumped.

It should probably be clear where the suffix boundary is given the above data; it's between jump and ing, and we know this because there's a spike in the successor frequency there. Of course, there are spikes at the beginning of the word, too, but this is to be expected. Given only two or three letters, there are many words which begin with them but in which these letters form no morpheme. So we limit our investigation to finding suffixes for words of a certain minimum length (at least four characters long). If you're clever you'd realize that we could implement a "predecessor frequency" algorithm by running each string through our algorithm backwards.

If you have any linguistic background you should be able to spot some problems with this right away. For one, we're dealing with written (orthographical) English here, which makes no sense. For example, in certain cases in written English we double a consonant before certain suffixes, e.g., dog versus dogged. The extra 'g' is an artifact of our orthography and has no phonetic realization. Similarly, words which have other superfluous orthography, like a final "silent e" as in retrieve will mess up our algorithm. Properly speaking the input should not be orthographic English but rather a representation of spoken English.

A second, more technical problem, is that some suffixes might end in the same letter, e.g., -ed and -er. Although these represent three different suffixes our algorithm lumps them all together as one. The opposite can happen, too, as in the case with -es, -s, and -z which, while orthographically distinct, all represent the same morpheme, viz., that which designates plurality.

These failings aside, the successor frequency algorithm is surprisingly good at guessing suffix boundaries. As a final solution it needs work, but often it is used as a bootstrapping heuristic for more complex algorithms which need a basic understanding of the morphology before they can proceed to refine it.