Python and Text Processing

During the Christmas vacation, I played some more with Python, as I really like its simplicity and consistency (as a side note, I really wish other languages had the same level of consistency).

I’ve put together a short list of Python resources for text processing. While I haven’t used all of them, in most cases they seemed to be exactly what I was looking for.

Natural Language Processing

Tokenization

Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.

While I have found a simple tokenizer, I’ve also written my own, which doesn’t use regexps:

def tokenize(sentence):
  '''Tokenize the given `sentence`.'''
  words = []
  j = 0  # start index of the current token
  end = len(sentence) - 1
  for i in range(len(sentence)):  # range works on both Python 2 and 3
    if not sentence[i].isalnum():
      if sentence[i] in '.,' and 0 < i < end:
        # if inside a number, keep the separator as part of the token
        if sentence[i - 1].isdigit() and sentence[i + 1].isdigit():
          continue
      words.append(sentence[j:i])
      j = i + 1
  if j <= end:
    words.append(sentence[j:])
  # drop the empty strings produced by consecutive separators
  return [w for w in words if w]

The only thing worth mentioning about the above tokenizer is that it does not split formatted numbers such as 1,000.50 (but it will split dates separated by / or -).
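For comparison, here is a minimal regexp-based sketch that should behave the same way on simple ASCII input: runs of letters and digits become tokens, and a `.` or `,` is kept only when it sits between two digits. This is not the tokenizer linked above, and the names `TOKEN_RE` and `tokenize_re` are made up for illustration.

```python
import re

# A token is a run of alphanumerics ([^\W_] is \w without the underscore),
# optionally continued across a '.' or ',' that has a digit on both sides.
TOKEN_RE = re.compile(r"[^\W_]+(?:(?<=\d)[.,](?=\d)[^\W_]+)*")

def tokenize_re(sentence):
    '''Tokenize `sentence`, keeping formatted numbers in one piece.'''
    return TOKEN_RE.findall(sentence)

print(tokenize_re("Price is 1,000.50 on 2008/12/25."))
# ['Price', 'is', '1,000.50', 'on', '2008', '12', '25']
```

The number stays whole while the date is split on `/`, which matches the limitation mentioned above.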

Stemming

The original Porter Stemmer is also available in Python (it looks like a simple translation of the C version, without using any Python idioms).
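To give a feel for what a stemmer does, here is a small sketch of just the first rule group (step 1a, plural handling) of the Porter algorithm. It is not the linked implementation and is nowhere near a complete stemmer; the function name `porter_step_1a` is my own, and the sample words are the ones used in Porter's description of this step.

```python
def porter_step_1a(word):
    '''Apply only step 1a of the Porter algorithm to a lowercase `word`.'''
    if word.endswith('sses'):
        return word[:-2]   # SSES -> SS  (caresses -> caress)
    if word.endswith('ies'):
        return word[:-2]   # IES  -> I   (ponies -> poni)
    if word.endswith('ss'):
        return word        # SS   -> SS  (caress -> caress)
    if word.endswith('s'):
        return word[:-1]   # S    -> ''  (cats -> cat)
    return word

print([porter_step_1a(w) for w in ['caresses', 'ponies', 'caress', 'cats']])
# ['caress', 'poni', 'caress', 'cat']
```

Note that the result ("poni") does not have to be a real word; a stemmer only needs to map related forms to the same string.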

2 responses to “Python and Text Processing”

  1. I’ve started using the Python NLTK. I’ve found it incredibly useful. In only a handful of lines I’ve been able to throw together a knowledge extraction tool for biological experiment annotations.

  2. Adinel

    +1 🙂 It seems that it is also useful in the software testing field. I followed this presentation and I’ve been able to make my own tool using NLTK. You should check it out too: http://thomas-zimmermann.com/publications/details/bettenburg-fse-2008/
