Remove HTML tags not on an allowed list from a Python string

yourdevel 2020. 11. 9. 21:16

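I have a string containing text and HTML. I want to remove or otherwise disable some HTML tags, such as <script>, while allowing others, so that I can render it on a web page safely. I have a list of allowed tags. How can I process the string to remove any other tags?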

Here's a simple solution using BeautifulSoup:

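from bs4 import BeautifulSoup

VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br']

def sanitize_html(value):
    # Name the parser explicitly so the result does not depend on what is installed.
    soup = BeautifulSoup(value, 'html.parser')

    for tag in soup.find_all(True):
        if tag.name not in VALID_TAGS:
            # Hide the tag itself but keep rendering its contents.
            tag.hidden = True

    # decode_contents() returns the markup as a str.
    return soup.decode_contents()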

If you want to remove the contents of the invalid tags as well, substitute tag.extract() for tag.hidden.
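The difference between hiding a tag (its text survives) and extracting it (its text goes too) can also be sketched without BeautifulSoup. The following is a minimal illustration built on the standard library's html.parser, not the answer's method; the names AllowListSanitizer, DROP_WITH_CONTENT and sanitize_html_stdlib are hypothetical, and choosing script and style as the tags whose contents are dropped is an assumption made for the example.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {'strong', 'em', 'p', 'ul', 'li', 'br'}  # same allow list as above
DROP_WITH_CONTENT = {'script', 'style'}  # assumption: removed together with their text


class AllowListSanitizer(HTMLParser):
    """Re-emit allow-listed tags (without attributes) and escaped text only."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.drop_depth = 0  # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.drop_depth += 1
        elif tag in ALLOWED_TAGS and not self.drop_depth:
            self.parts.append('<%s>' % tag)  # every attribute is discarded

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.drop_depth = max(0, self.drop_depth - 1)
        elif tag in ALLOWED_TAGS and tag != 'br' and not self.drop_depth:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        if not self.drop_depth:
            # Text is always re-escaped, so it can never turn into markup.
            self.parts.append(escape(data, quote=False))


def sanitize_html_stdlib(value):
    parser = AllowListSanitizer()
    parser.feed(value)
    parser.close()
    return ''.join(parser.parts)


print(sanitize_html_stdlib(
    '<p onclick="evil()">Hi <blink>there</blink><script>alert(1)</script></p>'))
# -> <p>Hi there</p>
```

The <blink> tag is dropped but its text is kept, while the <script> element disappears together with its contents. Because every "<" in the output comes either from an allow-listed tag or from escaped text, disallowed markup cannot be reassembled from fragments.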

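You could also look into using lxml and Tidy.

Use lxml.html.clean! It's VERY easy!

# Note: in lxml 5.2 and later this module lives in the separate lxml_html_clean package.
from lxml.html.clean import clean_html
print(clean_html(html))

Suppose the following html: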

html = '''\
<html>
 <head>
   <script type="text/javascript" src="evil-site"></script>
   <link rel="alternate" type="text/rss" src="evil-rss">
   <style>
     body {background-image: url(javascript:do_evil)};
     div {color: expression(evil)};
   </style>
 </head>
 <body onload="evil_function()">
    <!-- I am interpreted for EVIL! -->
   <a href="javascript:evil_function()">a link</a>
   <a href="#" onclick="evil_function()">another link</a>
   <p onclick="evil_function()">a paragraph</p>
   <div style="display: none">secret EVIL!</div>
   <object> of EVIL! </object>
   <iframe src="evil-site"></iframe>
   <form action="evil-site">
     Password: <input type="password" name="password">
   </form>
   <blink>annoying EVIL!</blink>
   <a href="evil-site">spam spam SPAM!</a>
   <image src="evil!">
 </body>
</html>'''

And the results...

<html>
  <body>
    <div>
      <style>/* deleted */</style>
      <a href="">a link</a>
      <a href="#">another link</a>
      <p>a paragraph</p>
      <div>secret EVIL!</div>
      of EVIL!
      Password:
      annoying EVIL!
      <a href="evil-site">spam spam SPAM!</a>
      <img src="evil!">
    </div>
  </body>
</html>

You can customize which elements get cleaned, among other things.

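The above solutions via Beautiful Soup will not work. You might be able to hack something together with Beautiful Soup above and beyond them, because Beautiful Soup provides access to the parse tree. At some point I think I'll try to solve the problem properly, but it's a week-long project or so, and I don't have a free week soon.

Just to be specific, not only will Beautiful Soup throw exceptions for some parsing errors that the above code doesn't catch; there are also plenty of very real XSS vulnerabilities that aren't caught, like: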

<<script>script> alert("Haha, I hacked your page."); </</script>script>
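To see why markup like this is dangerous for any filter that removes disallowed tags in a single pass, consider a deliberately naive stand-in for such a filter. This is an illustrative sketch using a regex, not the Beautiful Soup code itself; the name SCRIPT_TAG is hypothetical.

```python
import re

# Deliberately naive filter: delete <script> and </script> tags in one pass.
SCRIPT_TAG = re.compile(r'</?script>', re.I)

payload = '<<script>script> alert("Haha, I hacked your page."); </</script>script>'

once = SCRIPT_TAG.sub('', payload)
print(once)
# -> <script> alert("Haha, I hacked your page."); </script>
```

Deleting the inner tags splices the surrounding fragments into a brand-new <script> element, so the output of a single removal pass is more dangerous than its input.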

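Probably the best thing you can do instead is to prohibit all HTML by replacing every < with &lt;, and then use a restricted subset like Markdown to render formatting properly. In particular, you can then go back and re-introduce common bits of HTML with a regex. Here's roughly what the process looks like: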

import re
from markdown import markdown  # third-party Markdown package

_lt_     = re.compile('<')
_tc_ = '~(lt)~'   # or whatever, so long as markdown doesn't mangle it.
_tc_pat_ = re.escape(_tc_)  # escape it, otherwise the parentheses act as a regex group
_ok_ = re.compile(_tc_pat_ + '(/?(?:u|b|i|em|strong|sup|sub|p|br|q|blockquote|code))>', re.I)
_sqrt_ = re.compile(_tc_pat_ + 'sqrt>', re.I)     #just to give an example of extending
_endsqrt_ = re.compile(_tc_pat_ + '/sqrt>', re.I) #html syntax with your own elements.
_tcre_ = re.compile(_tc_pat_)

def sanitize(text):
    text = _lt_.sub(_tc_, text)        # neutralise every '<' first
    text = markdown(text)
    text = _ok_.sub(r'<\1>', text)     # re-introduce the whitelisted tags
    text = _sqrt_.sub(r'&radic;<span style="text-decoration:overline;">', text)
    text = _endsqrt_.sub(r'</span>', text)
    return _tcre_.sub('&lt;', text)    # whatever is left stays escaped

I haven't tested that code yet, so there may be bugs. But you see the general idea: you have to blacklist all HTML in general before you whitelist the OK stuff.
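The core of that idea, escaping everything first and only then re-enabling a few bare tags, can be sketched with the standard library alone. This is a simplified illustration that leaves out the Markdown step; the names ALLOWED and escape_then_allow are hypothetical, and only attribute-free tags are restored.

```python
import html
import re

# Assumption for this sketch: a short allow list of attribute-free tags.
ALLOWED = ('b', 'i', 'em', 'strong', 'p', 'br', 'code')

# Matches an escaped opening or closing tag such as &lt;b&gt; or &lt;/b&gt;.
_ok = re.compile(r'&lt;(/?(?:%s))&gt;' % '|'.join(ALLOWED), re.I)


def escape_then_allow(text):
    escaped = html.escape(text)        # blacklist: every <, > and & becomes an entity
    return _ok.sub(r'<\1>', escaped)   # whitelist: restore only the exact allowed tags


print(escape_then_allow('<b>hi</b><script>x</script>'))
# -> <b>hi</b>&lt;script&gt;x&lt;/script&gt;
```

Because the pattern only matches a tag name immediately followed by the escaped closing bracket, a tag carrying attributes, such as <b onclick="...">, stays escaped rather than being restored.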

Here is what I use in my own project. The acceptable_elements/attributes come from feedparser, and BeautifulSoup does the work.

from BeautifulSoup import BeautifulSoup

acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area', 'b', 'big',
      'blockquote', 'br', 'button', 'caption', 'center', 'cite', 'code', 'col',
      'colgroup', 'dd', 'del', 'dfn', 'dir', 'div', 'dl', 'dt', 'em',
      'font', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'i', 'img', 
      'ins', 'kbd', 'label', 'legend', 'li', 'map', 'menu', 'ol', 
      'p', 'pre', 'q', 's', 'samp', 'small', 'span', 'strike',
      'strong', 'sub', 'sup', 'table', 'tbody', 'td', 'tfoot', 'th',
      'thead', 'tr', 'tt', 'u', 'ul', 'var']

acceptable_attributes = ['abbr', 'accept', 'accept-charset', 'accesskey',
  'action', 'align', 'alt', 'axis', 'border', 'cellpadding', 'cellspacing',
  'char', 'charoff', 'charset', 'checked', 'cite', 'clear', 'cols',
  'colspan', 'color', 'compact', 'coords', 'datetime', 'dir', 
  'enctype', 'for', 'headers', 'height', 'href', 'hreflang', 'hspace',
  'id', 'ismap', 'label', 'lang', 'longdesc', 'maxlength', 'method',
  'multiple', 'name', 'nohref', 'noshade', 'nowrap', 'prompt', 
  'rel', 'rev', 'rows', 'rowspan', 'rules', 'scope', 'shape', 'size',
  'span', 'src', 'start', 'summary', 'tabindex', 'target', 'title', 'type',
  'usemap', 'valign', 'value', 'vspace', 'width']

def clean_html( fragment ):
    while True:
        soup = BeautifulSoup( fragment )
        removed = False        
        for tag in soup.findAll(True): # find all tags
            if tag.name not in acceptable_elements:
                tag.extract() # remove the bad ones
                removed = True
            else: # it might have bad attributes
                # a better way to get all attributes?
                for attr in tag._getAttrMap().keys():
                    if attr not in acceptable_attributes:
                        del tag[attr]

        # turn it back to html
        fragment = unicode(soup)

        if removed:
            # we removed tags, and tricky input could exploit that!
            # we need to reparse the html until it stops changing
            continue # next round

        return fragment

Some small tests to make sure this behaves correctly:

tests = [   #text should work
            ('<p>this is text</p>but this too', '<p>this is text</p>but this too'),
            # make sure we cant exploit removal of tags
            ('<<script></script>script> alert("Haha, I hacked your page."); <<script></script>/script>', ''),
            # try the same trick with attributes, gives an Exception
            ('<div on<script></script>load="alert("Haha, I hacked your page.");">1</div>',  Exception),
             # no tags should be skipped
            ('<script>bad</script><script>bad</script><script>bad</script>', ''),
            # leave valid tags but remove bad attributes
            ('<a href="good" onload="bad" onclick="bad" alt="good">1</div>', '<a href="good" alt="good">1</a>'),
]

for text, out in tests:
    try:
        res = clean_html(text)
        assert res == out, "%s => %s != %s" % (text, res, out)
    except out, e:
        assert isinstance(e, out), "Wrong exception %r" % e
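The "reparse until it stops changing" loop inside clean_html can be separated from the cleaning step itself. Below is a minimal sketch of that fixed-point idea; until_stable, strip_once and the max_rounds cap are hypothetical names and choices for illustration, and the toy strip_once merely stands in for a real parser-based cleaning pass.

```python
def until_stable(step, text, max_rounds=10):
    """Apply step repeatedly until its output stops changing."""
    for _ in range(max_rounds):
        cleaned = step(text)
        if cleaned == text:
            return cleaned
        text = cleaned
    # Refuse to return something that is still changing.
    raise ValueError('input did not stabilise after %d rounds' % max_rounds)


def strip_once(text):
    # Toy cleaning pass: remove literal <script> tags a single time.
    return text.replace('<script>', '')


print(until_stable(strip_once, '<scr<script>ipt>x'))
# -> x
```

The round cap is a design choice for the sketch: it keeps hostile input from forcing an unbounded number of passes, at the cost of rejecting input that needs more rounds than allowed.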

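Bleach does better, with more useful options. It's built on html5lib and is ready for production. Check the documentation for the bleach.clean function. Its default configuration escapes unsafe tags like <script> while allowing useful tags like <a>.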

import bleach
bleach.clean("<script>evil</script> <a href='http://example.com'>example</a>")
# '&lt;script&gt;evil&lt;/script&gt; <a href="http://example.com">example</a>'

I modified Bryan's solution with BeautifulSoup to address the problem raised by Chris Drost. It's a little crude, but it does the job:

from BeautifulSoup import BeautifulSoup, Comment

VALID_TAGS = {'strong': [],
              'em': [],
              'p': [],
              'ol': [],
              'ul': [],
              'li': [],
              'br': [],
              'a': ['href', 'title']
              }

def sanitize_html(value, valid_tags=VALID_TAGS):
    soup = BeautifulSoup(value)
    comments = soup.findAll(text=lambda text: isinstance(text, Comment))
    for comment in comments:
        comment.extract()
    # Some markup can be crafted to slip through BeautifulSoup's parser, so
    # we run this repeatedly until it generates the same output twice.
    newoutput = soup.renderContents()
    while True:
        oldoutput = newoutput
        soup = BeautifulSoup(newoutput)
        for tag in soup.findAll(True):
            if tag.name not in valid_tags:
                tag.hidden = True
            else:
                tag.attrs = [(attr, value) for attr, value in tag.attrs if attr in valid_tags[tag.name]]
        newoutput = soup.renderContents()
        if oldoutput == newoutput:
            break
    return newoutput

Edit: Updated to support valid attributes.

I use FilterHTML. It's simple and lets you define a well-controlled whitelist, scrubs URLs, and can even match attribute values against a regex or apply custom filtering functions per attribute. If used carefully, it could be a safe solution. Here's a simplified example from the readme:

import FilterHTML

# only allow:
#   <a> tags with valid href URLs
#   <img> tags with valid src URLs and measurements
whitelist = {
  'a': {
    'href': 'url',
    'target': [
      '_blank',
      '_self'
    ],
    'class': [
      'button'
    ]
  },
  'img': {
    'src': 'url',
    'width': 'measurement',
    'height': 'measurement'
  },
}

filtered_html = FilterHTML.filter_html(unfiltered_html, whitelist)

You could use html5lib, which uses a whitelist to sanitize.

An example:

import html5lib
from html5lib import sanitizer, treebuilders, treewalkers, serializer

def clean_html(buf):
    """Cleans HTML of dangerous tags and content."""
    buf = buf.strip()
    if not buf:
        return buf

    p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"),
            tokenizer=sanitizer.HTMLSanitizer)
    dom_tree = p.parseFragment(buf)

    walker = treewalkers.getTreeWalker("dom")
    stream = walker(dom_tree)

    s = serializer.htmlserializer.HTMLSerializer(
            omit_optional_tags=False,
            quote_attr_values=True)
    return s.render(stream)

I prefer the lxml.html.clean solution, as nosklo points out. Here's how to also remove some empty tags:

from lxml import etree
from lxml.html import clean, fromstring, tostring

remove_attrs = ['class']
remove_tags = ['table', 'tr', 'td']
nonempty_tags = ['a', 'p', 'span', 'div']

cleaner = clean.Cleaner(remove_tags=remove_tags)

def squeaky_clean(html):
    clean_html = cleaner.clean_html(html)
    # now remove the useless empty tags
    root = fromstring(clean_html)
    context = etree.iterwalk(root) # just the end tag event
    for action, elem in context:
        clean_text = elem.text and elem.text.strip(' \t\r\n')
        if elem.tag in nonempty_tags and \
        not (len(elem) or clean_text): # no children nor text
            elem.getparent().remove(elem)
            continue
        elem.text = clean_text # if you want
        # and if you also wanna remove some attrs:
        for badattr in remove_attrs:
            if badattr in elem.attrib:
                del elem.attrib[badattr]
    return tostring(root)

Reference URL: https://stackoverflow.com/questions/699468/remove-html-tags-not-on-an-allowed-list-from-a-python-string
