Hi all,

I've accidentally found a vulnerability in the clean_html function of the lxml Python library. A user can break the scheme of a URL with non-printable characters (\x01-\x08). It seems that all versions, including the latest 3.3.4, are vulnerable.

Here is the PoC:

    from lxml.html.clean import clean_html

    html = '''\
    aaa bbb bbb bbb bbb bbb bbb bbb bbb bbb '''

    print(clean_html(html))

Output:

I've emailed the lxml guys. Hope they'll fix it soon.

----
ksimka (@m_ksimka)
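
To illustrate the scheme-breaking trick described above, here is a minimal sketch of why a plain prefix check can miss a URL such as 'java\x01script:alert(1)', and how removing control characters before the check catches it. The helper names (naive_is_javascript, hardened_is_javascript) and the exact character range are assumptions made for this example. This is not lxml's actual code or its fix.

```python
import re

# Assumption: treat every C0 control character, space and DEL as removable.
# The report itself only mentions \x01-\x08.
_CONTROL_CHARS = re.compile(r'[\x00-\x20\x7f]+')


def naive_is_javascript(url):
    """Hypothetical naive check: matches only a contiguous 'javascript:' prefix."""
    return url.strip().lower().startswith('javascript:')


def hardened_is_javascript(url):
    """Hypothetical hardened check: drop control characters, then test the prefix."""
    cleaned = _CONTROL_CHARS.sub('', url)
    return cleaned.lower().startswith('javascript:')


url = 'java\x01script:alert(1)'
print(naive_is_javascript(url))     # the embedded \x01 hides the scheme
print(hardened_is_javascript(url))  # the scheme is visible after stripping
```

Removing every space and control character is deliberately over-inclusive. A sanitizer can afford to flag a few harmless URLs, but not to let a disguised scheme through.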