Tag Archives: Python

Python and Text Processing

During the Christmas vacation, I played some more with Python, as I really like its simplicity and consistency (as a side note, I really wish other languages had the same level of consistency).

I’ve put together a short list of Python resources for text processing. While I haven’t used all of them, in most cases they seemed to be exactly what I was looking for.


Natural Language Processing

Tokenization

Tokenization is the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.

While I have found the following simple tokenizer, I’ve also written my own, which doesn’t use regexps:

def tokenize(sentence):
  '''Tokenize the given `sentence`.'''
  words = []
  j = 0
  end = len(sentence) - 1
  for i in range(len(sentence)):  # range also works on Python 3, where xrange no longer exists
    if not sentence[i].isalnum():
      if (sentence[i] == '.' or sentence[i] == ',') and (i > 0 and i < end):
        # if inside a number
        if sentence[i - 1].isdigit() and sentence[i + 1].isdigit(): 
          continue
      words.append(sentence[j:i])
      j = i + 1
  if j <= end:
    words.append(sentence[j:])
  return [w for w in words if w]

The only thing worth mentioning about the above tokenizer is that it does not split formatted numbers such as 1,000.50 (but it will split dates separated by / or -).
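For comparison, here is a minimal sketch of how the same behaviour could be approximated with a regular expression. The function name tokenize_re and the pattern are my own assumptions, not the tokenizer linked above, and the pattern only covers ASCII letters and digits, whereas isalnum() also accepts other alphanumeric characters.

```python
import re

# One token is a run of ASCII letters/digits, optionally continued through
# a '.' or ',' that sits directly between two digits (e.g. "1,000.50").
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:(?<=\d)[.,](?=\d)[A-Za-z0-9]+)*")


def tokenize_re(sentence):
    """Hypothetical regex-based counterpart of tokenize()."""
    return TOKEN_RE.findall(sentence)


print(tokenize_re("Pay 1,000.50 dollars on 12/25/2008."))
# ['Pay', '1,000.50', 'dollars', 'on', '12', '25', '2008']
```

As with the hand-written version, the formatted number survives as one token while the date is split at each slash.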

Stemming

The original Porter Stemmer is also available in Python (it looks like a simple translation of the C version that doesn’t use any Python idioms).


BeautifulSoup or SGMLParser Bug

If you are reading this, you already know what BeautifulSoup is and how useful it is when working with XML/HTML in Python (in case you are not familiar with it, I’d encourage you to read its documentation). So I’ll just skip to the main reason for this post: a bug in parsing <script> tags in HTML documents.

According to the documentation, BeautifulSoup knows how to handle the body of a <script> tag, meaning that it knows to treat its content as a pure string and not perform any additional parsing on it. Unfortunately, I’ve discovered a corner case where it behaves incorrectly.

Here is the sample HTML that will reveal the bug:

<html>
<head></head>
<body>
  <script type='text/javascript'>
    document.write('</script>');
    document.write('<div></div>');
  </script>
</body>
</html>

The problem is that the string ‘</script>’ tricks the parser into believing that the end of the <script> element has been reached. Instead of a single Tag for the <script> element, the result is 2 elements: a Tag and a NavigableString that contains the rest of the script (i.e. what comes after the ‘</script>’ string: '); document.write('<div></div>');).

This basically means that rewriting any HTML that contains a similar fragment will produce broken <script>s. Unfortunately, I haven’t been able to figure out a solution. My impression is that this parsing happens at a very low level, which makes me think that the bug might not be in BeautifulSoup but rather in SGMLParser.

The affected version is 3.0.7a. Meanwhile, a new release has come out, but I haven’t tested it yet. The new BeautifulSoup 3.1.0 has replaced SGMLParser with HTMLParser (in an attempt to make BeautifulSoup compatible with Python 3.0), so this bug might already be fixed.
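A rough way to see how a parser treats this markup is to log the events it emits. The sketch below uses the standard library’s html.parser.HTMLParser from Python 3, not BeautifulSoup 3.0.7a or SGMLParser, so it says nothing definitive about the versions discussed above; the TagLogger class and its attribute names are made up for illustration.

```python
from html.parser import HTMLParser

SAMPLE = """<html>
<head></head>
<body>
  <script type='text/javascript'>
    document.write('</script>');
    document.write('<div></div>');
  </script>
</body>
</html>"""


class TagLogger(HTMLParser):
    """Hypothetical helper that records every start and end tag it sees."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


logger = TagLogger()
logger.feed(SAMPLE)
logger.close()

# If the first '</script>' ends the element, the parser reports two closing
# script tags and treats the <div> inside the JavaScript string as markup.
print(logger.end_tags.count("script"), "div" in logger.start_tags)
```

As far as I know, browsers also end a script element at the first ‘</script>’ they meet, so the usual workaround is on the JavaScript side: write the string as '<\/script>' so that the literal closing tag never appears inside the script body.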

While we are on the subject of bugs, I’d also like to mention one I hit in Python 2.5.2 on Mac OS:

MemoryError
Python(72261) malloc: *** mmap(size=2097152) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Exception exceptions.MemoryError: MemoryError() in  ignored

This one is much simpler, even though the displayed information doesn’t offer enough detail. The error above is the result of adding strings to a list in an infinite loop (so a programming mistake, but one reported with no indication of where it happened).



Python, Unicode and UTF8

I thought I’d put together a short list of links as a reference on how to handle Unicode and UTF8 in Python.

Before jumping to the Python resources, you should also read Joel’s article on Unicode and character sets (before going further, make sure that Unicode and UTF8 are clear to you).

Python, Unicode and UTF8

You can read about Unicode in the documentation for the newer versions, 2.6 and 3.0.

Python 3.0 has completely revamped Unicode usage and even if I don’t think there are many places where Py3k is in production, you should make sure that you read about these changes.
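To make the Python 3 model concrete, here is a minimal sketch (the sample string is my own): str holds Unicode code points, bytes holds encoded data, and encode()/decode() convert between the two.

```python
# In Python 3, str is a sequence of Unicode code points...
text = "na\u00efve caf\u00e9"          # "naïve café"

# ...and bytes is what you get after choosing an encoding such as UTF-8.
data = text.encode("utf-8")

print(len(text))   # 10 code points
print(len(data))   # 12 bytes: the two accented letters take 2 bytes each

# Decoding with the same encoding gives back the original text.
round_trip = data.decode("utf-8")

# Decoding with the wrong codec fails loudly instead of guessing.
try:
    data.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

In Python 2 the same split exists between the unicode and str types, but the two mix implicitly, which is where most of the surprises come from.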

If you have other good links about Python, Unicode and UTF8 just drop a comment.


Quick Python Reference

Python support for Internet Protocols

The documentation for the modules that handle Internet protocols can be found at Internet Protocols and Support.

  • urllib: This module provides a high-level interface for fetching data across the World Wide Web
  • urllib2: defines functions and classes which help in opening URLs (mostly HTTP) in a complex world: basic and digest authentication, redirections, cookies and more.
  • httplib: defines classes which implement the client side of the HTTP and HTTPS protocols. It is normally not used directly — the module urllib uses it to handle URLs that use HTTP and HTTPS.
  • urlparse: defines a standard interface to break URL strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL given a “base URL.”
  • cookielib: defines classes for automatic handling of HTTP cookies. It is useful for accessing web sites that require small pieces of data – cookies – to be set on the client machine by an HTTP response from a web server, and then returned to the server in later HTTP requests.
  • Cookie: defines classes for abstracting the concept of cookies, an HTTP state management mechanism. It supports both simple string-only cookies, and provides an abstraction for having any serializable data-type as cookie value.
  • uuid: provides immutable UUID objects (the UUID class) and the functions uuid1(), uuid3(), uuid4(), uuid5() for generating version 1, 3, 4, and 5 UUIDs as specified in RFC 4122.
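As a quick illustration of two of the modules above, here is a small sketch of URL splitting and joining plus UUID generation. It is written for Python 3, where the urlparse module lives in urllib.parse; the example URLs and the DNS name are placeholders of mine.

```python
import uuid
from urllib.parse import urljoin, urlparse

# Break a URL into its components.
parts = urlparse("http://example.com/a/b.html?q=1#top")
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)

# Resolve a relative URL against a base URL.
absolute = urljoin("http://example.com/a/b.html", "../c.html")
print(absolute)  # http://example.com/c.html

# uuid4() is random; uuid5() is derived from a namespace and a name,
# so the same inputs always give the same UUID.
random_id = uuid.uuid4()
named_id = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")
print(random_id.version, named_id.version)
```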

urllib2 tricks

As far as I can tell, urllib2 supports only GET and POST requests by default (ref). To generate the other types of requests (PUT, DELETE, OPTIONS), I think you’ll need to extend urllib2.Request and override the get_method() method to return the type of request you want to make.
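The idea can be sketched roughly like this. The example targets Python 3, where urllib2.Request became urllib.request.Request; the class name MethodRequest and its http_method parameter are hypothetical names of mine, and no request is actually sent.

```python
import urllib.request


class MethodRequest(urllib.request.Request):
    """Hypothetical Request subclass that lets the caller pick the HTTP verb."""

    def __init__(self, url, http_method=None, **kwargs):
        super().__init__(url, **kwargs)
        self._http_method = http_method

    def get_method(self):
        # Fall back to the default behaviour (GET, or POST when data is set).
        return self._http_method or super().get_method()


put_request = MethodRequest("http://example.com/resource",
                            http_method="PUT", data=b"payload")
plain_request = MethodRequest("http://example.com/resource")

print(put_request.get_method())    # PUT
print(plain_request.get_method())  # GET
```

The request would then be passed to urlopen() as usual. Since Python 3.3, Request also accepts a method argument directly, so the subclass is mostly of interest for older versions.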

Special method names

A class can implement certain operations that are invoked by special syntax (such as arithmetic operations or subscripting and slicing) by defining methods with special names. This is Python’s approach to operator overloading, allowing classes to define their own behavior with respect to language operators.
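A small sketch of what this looks like in practice; the Vector class is an invented example, not something from the documentation.

```python
class Vector:
    """Toy vector type showing a few special methods."""

    def __init__(self, *components):
        self.components = tuple(components)

    def __add__(self, other):           # invoked by: a + b
        if not isinstance(other, Vector) or len(other) != len(self):
            return NotImplemented
        return Vector(*(x + y for x, y in zip(self.components, other.components)))

    def __getitem__(self, index):       # invoked by: a[0] and a[0:1]
        return self.components[index]

    def __len__(self):                  # invoked by: len(a)
        return len(self.components)

    def __eq__(self, other):            # invoked by: a == b
        return isinstance(other, Vector) and self.components == other.components

    def __repr__(self):                 # invoked by: repr(a) and the REPL
        return "Vector%r" % (self.components,)


total = Vector(1, 2) + Vector(3, 4)
print(total, total[0], len(total))
```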

Special method names

The How-To Guide for Descriptors defines descriptors, summarizes the protocol, and shows how descriptors are called.
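To give a feel for the protocol, here is a minimal sketch of a data descriptor that validates an attribute. The Positive and Order classes are invented for this example, and __set_name__ requires Python 3.6 or later.

```python
class Positive:
    """Hypothetical descriptor that only accepts values greater than zero."""

    def __set_name__(self, owner, name):
        # Called once when the owning class is created.
        self.storage_name = "_" + name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self                  # accessed on the class itself
        return getattr(instance, self.storage_name)

    def __set__(self, instance, value):
        if value <= 0:
            raise ValueError("value must be positive")
        setattr(instance, self.storage_name, value)


class Order:
    quantity = Positive()                # attribute access goes through the descriptor

    def __init__(self, quantity):
        self.quantity = quantity         # triggers Positive.__set__


order = Order(3)
print(order.quantity)
```

Assigning order.quantity = -1 would raise ValueError, because the assignment is routed through __set__ rather than writing to the instance directly.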

classmethod and staticmethod

It’s still not very clear to me what the difference between @classmethod and @staticmethod is (except for the first parameter accepted by the decorated method: for @classmethod it is the class).
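One place where that first parameter matters is inheritance. In the sketch below (the Shape and ColoredShape classes are invented for illustration), the class method builds an instance of whichever class it was called on, while the static method is just a plain function kept in the class namespace.

```python
class Shape:
    def __init__(self, sides):
        self.sides = sides

    @classmethod
    def triangle(cls):
        # cls is the class the method was called on, so subclasses
        # get instances of themselves from this factory.
        return cls(3)

    @staticmethod
    def is_valid_side_count(count):
        # No implicit first argument: neither the instance nor the class.
        return count >= 3


class ColoredShape(Shape):
    pass


print(type(Shape.triangle()).__name__)         # Shape
print(type(ColoredShape.triangle()).__name__)  # ColoredShape
print(Shape.is_valid_side_count(2))            # False
```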

Future versions
