Hi, all

I've accidentally found a vulnerability in the clean_html function of the lxml Python library. A user can break the scheme of a URL with non-printable characters (\x01-\x08). It seems that all versions, including the latest 3.3.4, are vulnerable.

Here is a PoC.

    from lxml.html.clean import clean_html

    html = '''\
    aaa bbb bbb bbb bbb bbb bbb bbb bbb bbb
    '''

    print(clean_html(html))

Output:
aaa bbb bbb bbb bbb bbb bbb bbb bbb bbb
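
To illustrate the kind of check that would catch this, here is a minimal sketch. It is not lxml's code and not the eventual fix. The helper name `has_unsafe_scheme`, the list of schemes, and the exact character range being stripped are all my assumptions for the example. The idea is simply to drop control characters from the URL before looking at its scheme, since a scheme split by such a character can slip past a plain prefix comparison.

```python
import re

# Assumption: treat every character in \x00-\x20 plus \x7f as ignorable
# when inspecting the scheme. This is deliberately broader than the
# \x01-\x08 range reported above.
_IGNORABLE = re.compile(r'[\x00-\x20\x7f]+')

# Assumption: an illustrative list of schemes to reject, not lxml's list.
_UNSAFE_SCHEMES = ('javascript:', 'vbscript:')


def has_unsafe_scheme(url):
    """Return True if url starts with an unsafe scheme once control
    characters and whitespace have been removed (hypothetical helper)."""
    normalized = _IGNORABLE.sub('', url).lower()
    return normalized.startswith(_UNSAFE_SCHEMES)


if __name__ == '__main__':
    for candidate in ('javascript:alert(0)',
                      'javas\x01cript:alert(1)',
                      'http://example.com/'):
        print(repr(candidate), has_unsafe_scheme(candidate))
```

The normalized string is used only for the decision. A cleaner built this way would drop or keep the original attribute value rather than rewrite it.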
I've emailed the lxml guys. Hope they'll fix it soon.

----
ksimka (@m_ksimka)