Daily Archives: January 3, 2009

BeautifulSoup or SGMLParser Bug

If you are reading this, you probably already know what BeautifulSoup is and how useful it is when working with XML/HTML in Python (if you are not familiar with it, I’d encourage you to read its documentation). So I’ll skip straight to the main reason for this post: a bug in parsing <script> tags in HTML documents.

According to the documentation, BeautifulSoup knows how to handle the body of a <script> tag, meaning that it knows to treat its content as a pure string and not perform any additional parsing on it. Unfortunately, I’ve discovered a corner case where it behaves incorrectly.

Here is the sample HTML that will reveal the bug:

  <script type='text/javascript'>

The problem is that the string ‘</script>’ tricks the parser into believing that the end of the <script> tag has been reached. So instead of a single Tag for the <script> HTML tag, the result is two elements: a Tag and a NavigableString that contains the rest of the <script> body (i.e. what comes after the ‘</script>’ string: '); document.write('<div></div>');).
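The split can be reproduced outside BeautifulSoup. Below is a minimal sketch that uses html.parser.HTMLParser from the Python 3 standard library, not BeautifulSoup 3.0.7a or SGMLParser. The EventRecorder class, the record_events helper and the SAMPLE string are hypothetical names made up for illustration, and the sample is only modelled on the fragment described above. It records the parser events in order, so you can see where the <script> body gets cut.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Record start tags, end tags and text in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


# Assumed sample: a script whose JavaScript string contains a literal '</script>'.
SAMPLE = (
    "<script type='text/javascript'>"
    "document.write('<script></script>'); document.write('<div></div>');"
    "</script>"
)


def record_events(html):
    """Feed the HTML to a fresh recorder and return the list of events."""
    parser = EventRecorder()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == "__main__":
    for event in record_events(SAMPLE):
        print(event)
```

With this parser too, the script text stops at the first ‘</script>’, and the <div> that was meant to stay inside a JavaScript string is reported as a real tag. As far as I can tell this matches how HTML itself delimits <script> content, which would explain why the behaviour shows up at such a low level.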

This means that rewriting any HTML that contains a similar fragment will produce broken <script>s. Unfortunately, I haven’t been able to figure out a solution. My impression is that this parsing happens at a very low level, which makes me think that the bug might not be in BeautifulSoup itself but rather in SGMLParser.
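One possible workaround, which I have not verified against BeautifulSoup 3.0.7a, is to make sure the literal ‘</script’ never appears inside the script body before the HTML is parsed. In a JavaScript string '\/' is the same character as '/', so writing ‘<\/script>’ should leave the script's behaviour unchanged. The sketch below assumes you control the JavaScript source before it is embedded; escape_script_close, ScriptTextCollector, parse_script and JS are hypothetical names, and the check again uses the Python 3 standard library parser rather than SGMLParser.

```python
import re
from html.parser import HTMLParser

# Matches the start of a closing script tag, in any letter case.
_SCRIPT_CLOSE = re.compile(r"</(script)", re.IGNORECASE)


def escape_script_close(js_source):
    """Turn every '</script' into '<\\/script' so no end tag appears in the body."""
    return _SCRIPT_CLOSE.sub(lambda m: "<\\/" + m.group(1), js_source)


class ScriptTextCollector(HTMLParser):
    """Collect the text attributed to <script> and any other tags that were seen."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.script_text = []
        self.other_tags = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        else:
            self.other_tags.append(tag)

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.script_text.append(data)


def parse_script(js_source):
    """Wrap the JavaScript in a <script> tag, parse it, return (script text, other tags)."""
    html = "<script type='text/javascript'>" + js_source + "</script>"
    collector = ScriptTextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.script_text), collector.other_tags


# Assumed sample body, modelled on the fragment described above.
JS = "document.write('<script></script>'); document.write('<div></div>');"

if __name__ == "__main__":
    print(parse_script(JS))
    print(parse_script(escape_script_close(JS)))
```

Running both calls lets you compare the unescaped body, where the parser reports a stray <div>, with the escaped one, where the whole body should come back as script text.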

The affected version is 3.0.7a. A new release has come out in the meantime, but I haven’t tested it yet. The new BeautifulSoup 3.1.0 replaces SGMLParser with HTMLParser (in an attempt to make BeautifulSoup compatible with Python 3.0), so this bug might already be fixed.

While we are on the subject of bugs, I’d also like to mention one in Python 2.5.2 on MacOS:

Python(72261) malloc: *** mmap(size=2097152) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Exception exceptions.MemoryError: MemoryError() in  ignored

Things are much simpler with this one, even though the output doesn’t offer many details. The error above is the result of appending strings to a list in an infinite loop (so a programming error, but one that the message gives no hint of).
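Since the interpreter only complains once memory is gone, a cheap guard inside the loop can turn this kind of mistake into a readable error. This is a minimal sketch and not the code that triggered the message above; collect_chunks, next_chunk and max_chunks are hypothetical names, and the limit is an arbitrary assumption.

```python
def collect_chunks(next_chunk, max_chunks=100000):
    """Append chunks to a list until next_chunk() returns None.

    Raises RuntimeError if the list grows past max_chunks, which usually
    means the exit condition is never reached.
    """
    chunks = []
    while True:
        chunk = next_chunk()
        if chunk is None:
            break  # the intended way out of the loop
        chunks.append(chunk)
        if len(chunks) > max_chunks:
            raise RuntimeError(
                "collected more than %d chunks; is the loop missing its exit?"
                % max_chunks
            )
    return chunks


if __name__ == "__main__":
    items = iter(["a", "b", "c"])
    # A source that eventually returns None finishes normally.
    print(collect_chunks(lambda: next(items, None)))
```

A source that never returns None, such as lambda: "x", would hit the limit and raise instead of filling memory.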

