
Converting HTML entities to Unicode and vice versa

yourdevel 2020. 11. 26. 19:55



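Possible duplicate:

How do you convert HTML entities to Unicode and vice versa in Python?


As for the "vice versa" (which I needed myself, leading me to find this question, which didn't help, and subsequently another site that had the answer):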

u'some string'.encode('ascii', 'xmlcharrefreplace')

This returns a plain (byte) string in which every non-ASCII character has been replaced by an XML (HTML) numeric character reference.
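
In Python 3, where every str is already Unicode, the same call returns a bytes object, so a final decode is needed to get text back. A minimal sketch, with a made-up sample string and variable names:

```python
# Python 3: encode() returns bytes; each non-ASCII character becomes
# a decimal character reference such as &#233;.
sample = "Curé «Grâce» 100€"  # hypothetical sample text
as_bytes = sample.encode("ascii", "xmlcharrefreplace")
as_text = as_bytes.decode("ascii")  # back to str for further processing

print(as_bytes)
print(as_text)
```

Note that this step alone leaves ASCII characters such as & and < untouched; it only rewrites characters that ASCII cannot represent.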


You need to have BeautifulSoup.

from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode.  For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities.  For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"

uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)

print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;

Update for Python 2.7 and BeautifulSoup4

Unescape: Unicode HTML to Unicode with HTMLParser (Python 2.7 standard library):

>>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
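
On Python 3 the HTMLParser module became html.parser, and the unescape method used above was removed in Python 3.9; the module-level function html.unescape covers the same job. A minimal sketch reusing the sample string from the session above (the variable same_char is only there for illustration):

```python
import html

escaped = ("Monsieur le Cur&eacute; of the "
           "&laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood")
unescaped = html.unescape(escaped)
print(unescaped)

# Named, decimal and hexadecimal references all resolve to the same character.
same_char = [html.unescape(ref) for ref in ("&eacute;", "&#233;", "&#xe9;")]
print(same_char)
```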

Unescape: Unicode HTML to Unicode with bs4 (BeautifulSoup4):

>>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html, 'html.parser')
>>> soup.text
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood

Escape: Unicode to Unicode HTML with bs4 (BeautifulSoup4):

>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
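
If bs4 is not installed, a similar named-entity escape can be approximated on Python 3 with the standard library's html.entities.codepoint2name table. This is only a sketch, not a reimplementation of EntitySubstitution, and the helper name escape_named is hypothetical:

```python
from html.entities import codepoint2name

def escape_named(text):
    """Replace each character that has an HTML 4 named entity; keep the rest."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        out.append("&%s;" % name if name else ch)
    return "".join(out)

print(escape_named("Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood"))
```

Characters without a named entity pass through unchanged, so the result is not guaranteed to be pure ASCII; chaining .encode('ascii', 'xmlcharrefreplace') afterwards would cover those.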

As hekevintran's answer suggests, you may use cgi.escape(s) for encoding strings, but note that quote escaping is off by default in that function, so it may be a good idea to pass the quote=True keyword argument along with your string. Even with quote=True, the function won't escape single quotes ("'"). Because of these issues the function has been deprecated since version 3.2, and it was removed entirely in Python 3.8.

The suggested replacement is html.escape(s) (new in version 3.2), which escapes both double and single quotes by default.
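
The quote handling of html.escape can be checked with a short Python 3 sketch; the sample markup and variable names are made up for illustration:

```python
import html

s = """<a href="x" title='y'>Tom & Jerry</a>"""

full = html.escape(s)                    # quote=True is the default
no_quotes = html.escape(s, quote=False)  # only &, < and > are escaped

print(full)
print(no_quotes)
```

With the default, the double quote becomes &quot; and the single quote becomes &#x27;, which is what makes the result safe inside attribute values.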

Also html.unescape(s) has been introduced in version 3.4.

So in Python 3.4 and later you can:

  • Use html.escape(text).encode('ascii', 'xmlcharrefreplace').decode() to convert special characters to HTML entities.
  • Use html.unescape(text) to convert HTML entities back to their plain-text representations.
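
Put together, the two calls give a round trip much like the BeautifulSoup-based helpers earlier in the thread. A minimal Python 3 sketch; the function names are hypothetical:

```python
import html

def entities_to_unicode(text):
    """'&amp;' becomes '&', '&euro;' becomes the euro sign, and so on."""
    return html.unescape(text)

def unicode_to_entities(text):
    """Escape markup characters, then turn non-ASCII into numeric references."""
    return html.escape(text).encode("ascii", "xmlcharrefreplace").decode("ascii")

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"
uni = entities_to_unicode(text)
htmlent = unicode_to_entities(uni)

print(uni)
print(htmlent)
```

The round trip is not symmetric: named entities such as &reg; come back as numeric references such as &#174;, which browsers render identically.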

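If someone like me is out there wondering why some entity numbers (codes) like &#153; (for the trademark symbol) and &#128; (for the euro symbol) are not encoded properly, the reason is that those characters are not defined at those positions in ISO-8859-1. The numbers 128 to 159 map to printable characters only in Windows-1252, which is often confused with ISO-8859-1; in ISO-8859-1 and Unicode that range holds control characters, and the two symbols live at U+2122 (&#8482;) and U+20AC (&#8364;).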

Also note that the default character set as of HTML5 is UTF-8; for HTML4 it was ISO-8859-1.

So we have to work around this somehow (by finding and replacing those references first).
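
One possible form of that find-and-replace step, sketched for Python 3. The helper name fix_cp1252_refs and the regular expression are assumptions for illustration; the idea is to treat a decimal reference in the 128 to 159 range as a Windows-1252 byte and rewrite it to the real Unicode code point:

```python
import html
import re

# Decimal references from &#128; to &#159; (no leading zeros, no hex form).
_C1_REF = re.compile(r"&#(1(?:2[89]|[345][0-9]));")

def fix_cp1252_refs(text):
    """Rewrite Windows-1252 style references to proper code points."""
    def repl(match):
        byte = bytes([int(match.group(1))])
        try:
            ch = byte.decode("cp1252")
        except UnicodeDecodeError:
            return match.group(0)  # byte is undefined in cp1252: leave as-is
        return "&#%d;" % ord(ch)
    return _C1_REF.sub(repl, text)

print(fix_cp1252_refs("&#153; &#128; &#233;"))
# For decoding only, html.unescape already applies the HTML5 remapping.
print(html.unescape("&#153; &#128;"))
```

When the goal is just to decode, html.unescape alone may be enough, since it follows the HTML5 parsing rules for these references; the rewrite is mainly useful when the markup itself has to be cleaned up.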

Reference (starting point) from Mozilla's documentation

https://developer.mozilla.org/en-US/docs/Web/Guide/Localizations_and_character_encodings


I used the following function to convert Unicode text ripped from an xls file into an HTML file while preserving the special characters found in the xls file:

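import html

def html_wr(f, dat):
    ''' write dat to file f as html
        . file is assumed to be opened in binary format
        . if dat is empty or None it is replaced with a non-breaking space
        . non-ascii characters are translated to xml character references
    '''
    if not dat:
        dat = '&nbsp;'
    try:
        # pure-ASCII data is written unchanged (it is not HTML-escaped)
        f.write(dat.encode('ascii'))
    except UnicodeEncodeError:
        # otherwise escape &, <, > and quotes, then use numeric references
        f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))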
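
Note that the function above writes pure-ASCII data without escaping it, so a cell containing < or & reaches the file as raw markup. If that is not wanted, a variant that always escapes might look like the sketch below; the name html_wr_escaped, the sample cells, and the io.BytesIO stand-in for the output file are assumptions for illustration:

```python
import html
import io

def html_wr_escaped(f, dat):
    """Write dat to binary file f, escaping markup and non-ASCII characters."""
    if not dat:
        f.write(b"&nbsp;")  # placeholder for empty cells, written as-is
        return
    f.write(html.escape(dat).encode("ascii", "xmlcharrefreplace"))

buf = io.BytesIO()  # stands in for a file opened with mode "wb"
for cell in ["Curé & Co", "", "a < b"]:
    html_wr_escaped(buf, cell)
result = buf.getvalue()
print(result)
```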

hope this is useful to somebody

Reference URL: https://stackoverflow.com/questions/701704/convert-html-entities-to-unicode-and-vice-versa
