During the Christmas vacation I played some more with Python, as I really like its simplicity and consistency (as a side note, I really wish other languages had the same level of consistency).
I’ve put together a short list of Python resources for text processing. While I haven’t used all of them, in most cases they seem to be exactly what I was looking for.
Natural Language Processing
Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.
While I have found the following simple tokenizer, I’ve also written my own, which doesn’t use regexps:
def tokenize(sentence):
    '''Tokenize the given `sentence`.'''
    words = []
    j = 0  # start index of the current token
    end = len(sentence) - 1
    for i in range(len(sentence)):
        if not sentence[i].isalnum():
            if (sentence[i] == '.' or sentence[i] == ',') and (i > 0 and i < end):
                # if inside a number
                if sentence[i - 1].isdigit() and sentence[i + 1].isdigit():
                    continue
            words.append(sentence[j:i])
            j = i + 1
    # add whatever is left after the last separator
    if j <= end:
        words.append(sentence[j:])
    # drop the empty strings produced by consecutive separators
    return [w for w in words if w]
The only thing worth mentioning about the above tokenizer is that it does not split formatted numbers such as 1,234.56 (but it will split dates separated by / or -).
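For comparison, here is a minimal regex-based sketch that aims for the same behavior as the tokenizer above: a '.' or ',' is kept inside a token only when it sits between two digits. This is not the tokenizer linked earlier. The names TOKEN_RE and regex_tokenize are made up for illustration, and the pattern may treat some non-ASCII characters differently from str.isalnum().

```python
import re

# Hypothetical pattern (an assumption, not the linked tokenizer):
# a run of letters/digits, optionally extended by '.' or ',' when that
# character has a digit on both sides.
TOKEN_RE = re.compile(r"[^\W_]+(?:(?<=\d)[.,](?=\d)[^\W_]+)*")


def regex_tokenize(sentence):
    '''Return the list of tokens found in `sentence`.'''
    return TOKEN_RE.findall(sentence)


if __name__ == "__main__":
    print(regex_tokenize("It costs 1,234.56 dollars, due 12/25/2008."))
    # ['It', 'costs', '1,234.56', 'dollars', 'due', '12', '25', '2008']
```

As with the regex-free version, the formatted number stays in one piece while the date is split at each slash.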
The original Porter Stemmer is also available in Python (it looks like a straightforward translation of the C version that doesn’t use any Python idioms).
2 responses to “Python and Text Processing”
I’ve started using the python nltk. I’ve found it incredibly useful. In only a handful of lines I’ve been able to throw together a knowledge extraction tool for biological experiment annotations.
+1 🙂 It seems to be useful in the software testing field too. I followed this presentation and I’ve been able to make my own tool using nltk. You should check it out too: http://thomas-zimmermann.com/publications/details/bettenburg-fse-2008/