Removing HTML from Python Strings

Removing HTML from a Python string is straightforward with the lxml library. A snippet along these lines is all it takes:

from lxml import html

# parse the HTML in the string content into a document tree and place the result in doc
doc = html.document_fromstring(content)
# get the text content of doc (with no markup) and place the result in text_doc
text_doc = doc.text_content()

Thus text_doc now contains the text of the original string with no HTML markup in it. You can then use something like Postmarkup to let users style the text they post on your site with BBCode or a similar markup, without the security risk of allowing them to post HTML.
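For readers who cannot install lxml, the same idea of keeping the text and dropping the tags can be sketched with the standard library's html.parser module. This is a swapped-in technique, not the lxml approach described above, and the names TextExtractor and strip_html are hypothetical helpers made up for this illustration:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML string (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into plain text
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags
        self.chunks.append(data)


def strip_html(content):
    """Return the text of content with the markup removed (assumed name)."""
    parser = TextExtractor()
    parser.feed(content)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


text = strip_html("<p>Hello <b>world</b> &amp; friends</p>")
print(text)
```

As with text_content(), entities are decoded, so the resulting text may itself contain characters such as < and & and should still be escaped before being written into a page.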

If you are using Django, it is essential that you mark the rendered string as safe, for example with the safe template filter ({{ content|safe }}), so that Django does not automatically escape the HTML legitimately generated from the BBCode.

A full example that uses Postmarkup and lxml to first remove any HTML in a post and then render the BBCode into HTML for display on a site:

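from postmarkup import render_bbcode
from lxml import html

# content is the raw string submitted by the user
# parse it and keep only the text, dropping any HTML markup
doc = html.document_fromstring(content)
bbcode = doc.text_content()
# render the remaining BBCode into HTML for display
content = render_bbcode(bbcode)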
