Skip to content Skip to sidebar Skip to footer

Replace Html Entities With The Corresponding Utf-8 Characters In Python 2.6

I have a html text like this: <xml ... > and I want to convert it to something readable: Any easy (and fast) way to do it in Python?

Solution 1:

Python >= 3.4

Official documentation for HTMLParser: Python 3

>>>from html import unescape>>>unescape('&copy; &euro;')
© €

Python < 3.5

Official documentation for HTMLParser: Python 3

>>>from html.parser import HTMLParser>>>pars = HTMLParser()>>>pars.unescape('&copy; &euro;')
© €

Note: this was deprecated in the favor of html.unescape().

Python 2.7

Official documentation for HTMLParser: Python 2.7

>>> import HTMLParser
>>> pars = HTMLParser.HTMLParser()
>>> pars.unescape('&copy; &euro;')
u'\xa9 \u20ac'>>> print _
© €

Solution 2:

Modern Python 3 approach:

>>>import html>>>html.unescape('&copy; &euro;')
© €

https://docs.python.org/3/library/html.html

Solution 3:

There is a function here that does it, as linked from the post Fred pointed out. Copied here to make things easier.

Credit to Fred Larson for linking to the other question on SO. Credit to dF for posting the link.

Post a Comment for "Replace Html Entities With The Corresponding Utf-8 Characters In Python 2.6"