BeautifulSoup or SGMLParser Bug

If you are reading this, you already know what BeautifulSoup is and how useful it is while working with XML/HTML in Python (in case you are not familiar with it, I’d encourage you to read its documentation). So I’ll just skip to the main reason of this post: a bug in parsing the <script> tags in HTML documents.

10.1.jpg
According to the documentation, BeautifulSoup knows how to handle the body of a <script> tag, meaning that it knows to treat its content as a pure string and not perform any additional parsing on it. Unfortunately, I’ve discovered a corner case where it behaves incorrectly.

Here is the sample HTML that will reveal the bug:

<html>
<head></head>
<body>
  <script type='text/javascript'>
    document.write('</script>');
    document.write('<div></div>');
  </script>
</body>
</html>

The problem is that the string ‘</script>’ tricks the parser to believe that the end of the <script> tag is reached and so instead of getting a single Tag from the <script> HTML tag it basically results in 2 elements: a Tag and a NavigableString that contains the rest of the <script> tag (i.e. what comes after the ‘</script>’ string: '); document.write('<div></div>');).

This basically means that for any HTML that contains a similar fragment rewriting it will lead to broken <script>s. Unfortunately, I haven’t been able to figure out a solution. My impression is that this parsing happens at a very low level and this makes me think that the bug might not be one of BeatifulSoup but rather a bug in SGMLParser.

The affected version is 3.0.7a. Meanwhile it looks like a new release has seen the light, but I haven’t tested it yet. The new BeautifulSoup 3.1.0 has replaced the SGMLParser with HTMLParser (in the attempt to make BeautifulSoup compatible with Python 3.0) so this bug might be already fixed.

If we are at bugs, I’d also like to mention one in Python 2.5.2 MacOS:

MemoryError
Python(72261) malloc: *** mmap(size=2097152) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Exception exceptions.MemoryError: MemoryError() in  ignored

Things are much simpler with this one, even if the displayed information doesn’t offer enough details. The above bug is basically the result of adding strings to a list in an infinite loop (so a programming problem, but with no indication of the error).


Advertisements

3 Comments

Filed under technolog

3 responses to “BeautifulSoup or SGMLParser Bug

  1. Adinel

    Did you tried the other side of parsing, and by this I’m naming lxml?
    Offtopic note: at this moment I’m giving a shot to Scrapy

  2. Alex Popescu (aka the_mindstorm)

    No, I haven’t tried any of those as until now I’ve been quite happy with BeautifoulSoup.
    Skimming over lxml I’ve noticed that it is a binding for libxml2 which I’m pretty sure that it doesn’t exist on Win machines, but I might be wrong.
    I’ll take a deeper look at those. Thanks for your suggestions.

  3. Adinel

    You’re wellcome! Actually I’m in debt to you, because I’m learning everyday from InfoQ.
    On BeatifulSoup vs lxml here is a nice blog post that I found a couple of days ago

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s