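Convert HTML entities to Unicode and vice versa
Possible duplicate:
How do I convert HTML entities to Unicode and vice versa in Python?
As for the "vice versa" (which I needed myself; it led me to this question, which did not help, and then to another site that had the answer):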
u'some string'.encode('ascii', 'xmlcharrefreplace')
This returns a plain string in which the non-ASCII characters have been converted to XML (HTML) numeric entities.
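For readers on Python 3, the same idea can be sketched as below. The helper name `to_char_refs` is hypothetical; only `str.encode` with the `xmlcharrefreplace` error handler comes from the answer above. Note that this step alone leaves `&`, `<` and `>` untouched.

```python
def to_char_refs(text):
    """Replace every non-ASCII character with a numeric character reference."""
    # encode() returns bytes in Python 3, so decode back to str for display
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')


print(to_char_refs('Curé costs 5 €'))
# → Cur&#233; costs 5 &#8364;
```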
You need to have BeautifulSoup.
from BeautifulSoup import BeautifulStoneSoup
import cgi

def HTMLEntitiesToUnicode(text):
    """Converts HTML entities to unicode. For example '&amp;' becomes '&'."""
    text = unicode(BeautifulStoneSoup(text, convertEntities=BeautifulStoneSoup.ALL_ENTITIES))
    return text

def unicodeToHTMLEntities(text):
    """Converts unicode to HTML entities. For example '&' becomes '&amp;'."""
    text = cgi.escape(text).encode('ascii', 'xmlcharrefreplace')
    return text

text = "&amp;, &reg;, &lt;, &gt;, &cent;, &pound;, &yen;, &euro;, &sect;, &copy;"
uni = HTMLEntitiesToUnicode(text)
htmlent = unicodeToHTMLEntities(uni)
print uni
print htmlent
# &, ®, <, >, ¢, £, ¥, €, §, ©
# &amp;, &#174;, &lt;, &gt;, &#162;, &#163;, &#165;, &#8364;, &#167;, &#169;
Update for Python 2.7 and BeautifulSoup4
Unescape Unicode HTML to Unicode with htmlparser (Python 2.7 standard lib):
>>> escaped = u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
>>> from HTMLParser import HTMLParser
>>> htmlparser = HTMLParser()
>>> unescaped = htmlparser.unescape(escaped)
>>> unescaped
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print unescaped
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
Unescape Unicode HTML to Unicode with bs4 (BeautifulSoup4):
>>> html = '''<p>Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood</p>'''
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html)
>>> soup.text
u'Monsieur le Cur\xe9 of the \xabNotre-Dame-de-Gr\xe2ce\xbb neighborhood'
>>> print soup.text
Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood
Escape Unicode to Unicode HTML with bs4 (BeautifulSoup4):
>>> unescaped = u'Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'
>>> from bs4.dammit import EntitySubstitution
>>> escaper = EntitySubstitution()
>>> escaped = escaper.substitute_html(unescaped)
>>> escaped
u'Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood'
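If installing bs4 is not an option, a similar named-entity escape can be approximated with the standard library alone. This is only a sketch and not how `EntitySubstitution` works internally: the function `escape_named` is hypothetical, and it relies on `html.entities.codepoint2name`, which covers the HTML 4.01 entity names only. Characters without a name fall back to numeric references.

```python
from html.entities import codepoint2name


def escape_named(text):
    """Escape text using named entities where HTML 4.01 defines one."""
    parts = []
    for ch in text:
        code = ord(ch)
        name = codepoint2name.get(code)
        if name is not None:
            # Covers &, <, >, " as well as accented letters and symbols
            parts.append('&%s;' % name)
        elif code > 127:
            # No HTML 4.01 name: fall back to a numeric character reference
            parts.append('&#%d;' % code)
        else:
            parts.append(ch)
    return ''.join(parts)


print(escape_named('Monsieur le Curé of the «Notre-Dame-de-Grâce» neighborhood'))
# → Monsieur le Cur&eacute; of the &laquo;Notre-Dame-de-Gr&acirc;ce&raquo; neighborhood
```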
As hekevintran's answer suggests, you may use cgi.escape(s) for encoding strings, but notice that escaping of quotes is off by default in that function (quote=False), so it may be a good idea to pass the quote=True keyword argument alongside your string. Even with quote=True, the function won't escape single quotes ("'"). (Because of these issues the function has been deprecated since version 3.2, and it was removed in Python 3.8.)
It's been suggested to use html.escape(s) instead of cgi.escape(s). (New in version 3.2.)
Also, html.unescape(s) was introduced in version 3.4.
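A quick way to see the quoting difference in html.escape on Python 3 (the sample string is made up for illustration):

```python
import html

sample = '''He said "it's <b>"'''

# Default (quote=True): both double and single quotes are escaped
with_quotes = html.escape(sample)

# quote=False: only &, < and > are escaped
without_quotes = html.escape(sample, quote=False)

print(with_quotes)     # → He said &quot;it&#x27;s &lt;b&gt;&quot;
print(without_quotes)  # → He said "it's &lt;b&gt;"
```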
So in Python 3.4 and later you can:
- Use html.escape(text).encode('ascii', 'xmlcharrefreplace').decode() to convert special characters to HTML entities.
- Use html.unescape(text) to convert HTML entities back to plain-text representations.
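The two calls can be wrapped into a round trip, roughly like this. The function names `to_entities` and `from_entities` are placeholders, not part of any library.

```python
import html


def to_entities(text):
    """Escape markup characters, then turn non-ASCII characters into numeric references."""
    return html.escape(text).encode('ascii', 'xmlcharrefreplace').decode('ascii')


def from_entities(text):
    """Convert named and numeric character references back to plain text."""
    return html.unescape(text)


original = 'Curé & <b> ©'
encoded = to_entities(original)
print(encoded)                 # → Cur&#233; &amp; &lt;b&gt; &#169;
print(from_entities(encoded))  # → Curé & <b> ©
```

The order matters: html.escape runs first so that the ampersands introduced by xmlcharrefreplace are not escaped a second time.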
If someone like me is out there wondering why some entity numbers (codes) like &#153; (for the trademark symbol) and &#128; (for the euro symbol) are not encoded properly: those numbers are Windows-1252 byte values, not Unicode code points. ISO-8859-1 is often treated as the same thing as Windows-1252, but it does not define printable characters in the range 128-159, so those characters do not exist there. The matching Unicode references are &#8482; and &#8364;.
Also note that the default character set as of HTML5 is UTF-8; it was ISO-8859-1 for HTML4.
So we will have to work around this somehow (find and replace those codes first).
Reference (starting point) from Mozilla's documentation:
https://developer.mozilla.org/en-US/docs/Web/Guide/Localizations_and_character_encodings
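A small experiment may make the mismatch easier to see. It assumes the numeric codes in question are 153 and 128, which are the Windows-1252 byte values for the trademark and euro symbols; the variable names are illustrative only.

```python
import html

# Bytes 0x99 and 0x80 are printable in Windows-1252 ...
tm_cp1252 = bytes([0x99]).decode('cp1252')
euro_cp1252 = bytes([0x80]).decode('cp1252')

# ... but in ISO-8859-1 (latin-1) the same bytes are C1 control characters
tm_latin1 = bytes([0x99]).decode('latin-1')

# Encoding from Unicode produces the Unicode code points, not the legacy bytes
tm_ref = '™'.encode('ascii', 'xmlcharrefreplace').decode('ascii')
euro_ref = '€'.encode('ascii', 'xmlcharrefreplace').decode('ascii')

# html.unescape follows the HTML5 rules and remaps the legacy numbers on decoding
tm_decoded = html.unescape('&#153;')
euro_decoded = html.unescape('&#128;')

print(tm_ref, euro_ref)          # → &#8482; &#8364;
print(tm_decoded, euro_decoded)  # → ™ €
```

So on the decoding side html.unescape may already cover the find-and-replace step, while output produced with xmlcharrefreplace uses the Unicode numbers.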
I used the following function to convert Unicode text ripped from an xls file into an HTML file while preserving the special characters found in the xls file:
import html

def html_wr(f, dat):
    ''' write dat to file f as html
        . file is assumed to be opened in binary format
        . if dat is nul it is replaced with non breakable space
        . non-ascii characters are translated to xml
    '''
    if not dat:
        dat = '&nbsp;'
    try:
        # pure-ASCII text is written unchanged
        f.write(dat.encode('ascii'))
    except UnicodeEncodeError:
        # escape markup, then turn non-ASCII characters into numeric references
        f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))
Hope this is useful to somebody.
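A usage sketch for a function of this shape, written for Python 3 and using an in-memory buffer in place of a real file. The cell values are invented, and the function is repeated here only so the example runs on its own.

```python
import html
import io


def html_wr(f, dat):
    """Write dat to the binary file f, escaping it only when it is not pure ASCII."""
    if not dat:
        dat = '&nbsp;'  # empty cells become a non-breaking space
    try:
        f.write(dat.encode('ascii'))
    except UnicodeEncodeError:
        f.write(html.escape(dat).encode('ascii', 'xmlcharrefreplace'))


buf = io.BytesIO()  # stands in for a file opened with mode 'wb'
for cell in ['plain', '', 'Curé <5 €>']:
    html_wr(buf, cell)
    buf.write(b'\n')

result = buf.getvalue()
print(result)
# → b'plain\n&nbsp;\nCur&#233; &lt;5 &#8364;&gt;\n'
```

One thing to keep in mind with this design: pure-ASCII input takes the first branch and is written without escaping, so a cell such as 'a<b' would reach the file as raw markup.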
Reference URL: https://stackoverflow.com/questions/701704/convert-html-entities-to-unicode-and-vice-versa