Daily Archives: February 24, 2009

Python and Text Processing

During the Christmas vacation, I played some more with Python, as I really like its simplicity and consistency (as a side note, I really wish other languages had the same level of consistency).

I’ve put together a short list of Python resources for text processing. While I haven’t used all of them, in most cases they seem to be exactly what I was looking for.


Natural Language Processing

Tokenization

Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.

While I have found the following simple tokenizer, I’ve also written my own, which doesn’t use regexps:

def tokenize(sentence):
  '''Tokenize the given `sentence` into alphanumeric words.'''
  words = []
  j = 0  # start index of the token currently being read
  end = len(sentence) - 1
  for i in range(len(sentence)):
    if not sentence[i].isalnum():
      if sentence[i] in '.,' and 0 < i < end:
        # keep '.' or ',' when it sits between two digits (inside a number)
        if sentence[i - 1].isdigit() and sentence[i + 1].isdigit():
          continue
      words.append(sentence[j:i])
      j = i + 1
  if j <= end:
    words.append(sentence[j:])
  # drop the empty strings produced by consecutive separators
  return [w for w in words if w]

The only thing worth mentioning about the above tokenizer is that it does not break formatted numbers such as 3.14 or 1,000 (but it will break dates separated by / or -).
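
For comparison, here is a minimal sketch of how the same number-preserving rule could be written with a regexp. This is not the tokenizer linked above; the names `TOKEN_RE` and `tokenize_re` are made up for this illustration, and I’m assuming plain ASCII input (for exotic Unicode digits, `\d` and `isdigit()` don’t agree exactly).

```python
import re

# One token is a run of alphanumeric characters; a '.' or ',' is also
# accepted when it sits directly between two digits (e.g. 3.14 or 1,000).
TOKEN_RE = re.compile(r"(?:[^\W_]|(?<=\d)[.,](?=\d))+")

def tokenize_re(sentence):
    '''Hypothetical regexp-based variant of the tokenizer.'''
    return TOKEN_RE.findall(sentence)

print(tokenize_re("Pi is 3.14, and 1,000 items cost $2.50."))
# ['Pi', 'is', '3.14', 'and', '1,000', 'items', 'cost', '2.50']
print(tokenize_re("Posted on 2009-02-24"))
# ['Posted', 'on', '2009', '02', '24']
```

The second call shows the limitation mentioned above: the date is split into three tokens because - is not treated as part of a number.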

Stemming

The original Porter Stemmer is also available in Python (it looks like a straight translation of the C version that doesn’t use any Python idioms).
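
To give a feel for what a stemmer does, here is a small sketch of only the first rule group (step 1a, plural suffixes) of the Porter algorithm, written in a more Pythonic way. It is not the linked implementation and is nowhere near a complete stemmer; the function name `porter_step_1a` is my own, and it assumes lowercase input.

```python
def porter_step_1a(word):
    '''Apply only step 1a of the Porter algorithm to a lowercase word.'''
    if word.endswith('sses'):
        return word[:-2]   # SSES -> SS   (caresses -> caress)
    if word.endswith('ies'):
        return word[:-2]   # IES  -> I    (ponies -> poni)
    if word.endswith('ss'):
        return word        # SS   -> SS   (caress stays caress)
    if word.endswith('s'):
        return word[:-1]   # S    -> ''   (cats -> cat)
    return word

for w in ['caresses', 'ponies', 'caress', 'cats']:
    print(w, '->', porter_step_1a(w))
```

Note that the result is a stem, not necessarily a dictionary word ("poni"); the remaining steps of the algorithm strip further suffixes.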
