During the Christmas vacation, I played some more with Python, as I really like its simplicity and consistency (as a side note, I really wish other languages had the same level of consistency).
I’ve put together a short list of Python resources for text processing. While I haven’t used all of them, in most cases they seemed to be exactly what I was looking for.
Natural Language Processing
Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.
While I have found the following simple tokenizer, I’ve also written my own, which doesn’t use regexps:
    def tokenize(sentence):
        '''Tokenize the given `sentence`.'''
        words = []
        j = 0  # start index of the token currently being scanned
        end = len(sentence) - 1
        for i in range(len(sentence)):
            if not sentence[i].isalnum():
                if (sentence[i] == '.' or sentence[i] == ',') and (i > 0 and i < end):
                    # if inside a number, keep the separator as part of the token
                    if sentence[i - 1].isdigit() and sentence[i + 1].isdigit():
                        continue
                words.append(sentence[j:i])
                j = i + 1
        # flush the trailing token, if any
        if j <= end:
            words.append(sentence[j:])
        # drop the empty strings produced by consecutive separators
        return [w for w in words if w]
The only thing worth mentioning about the above tokenizer is that it does not split formatted numbers such as 1,234.56 (but it will split dates separated by / or -).
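For comparison, roughly the same behaviour can be sketched with a regexp. This is only an illustration under my own assumptions, not the tokenizer linked above: the names `TOKEN_RE` and `tokenize_re` are made up, and `[^\W_]` is used as an approximation of `str.isalnum()`, so the two may disagree on some unusual Unicode characters.

```python
import re

# A token is a run of alphanumeric characters; a '.' or ',' is kept inside
# the token only when it sits directly between two digits.
# NOTE: TOKEN_RE and tokenize_re are hypothetical names for this sketch.
TOKEN_RE = re.compile(r"(?:[^\W_]|(?<=\d)[.,](?=\d))+")


def tokenize_re(sentence):
    '''Regexp-based sketch of the tokenizer described above.'''
    return TOKEN_RE.findall(sentence)


if __name__ == "__main__":
    # Formatted numbers stay whole, dates separated by '/' are split.
    print(tokenize_re("It costs 1,234.56 on 12/25/2009."))
    # ['It', 'costs', '1,234.56', 'on', '12', '25', '2009']
```

The lookbehind and lookahead are what keep the separator from being treated as a word boundary inside a number; everything else that is not alphanumeric simply ends the current token.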
The original Porter Stemmer is also available in Python (it looks like a simple translation of the C version that doesn’t use any Python idioms).