Sunday, May 10, 2009

Converting HTML to text using Python

html2text is a Python script that does a good job of extracting text from HTML files. It handles HTML entities correctly and ignores JavaScript. However, its output is not quite plain text: it produces Markdown, which would then have to be converted to plain text.
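If Markdown output is not what you want, the same idea (decode entities, skip JavaScript, keep only the visible text) can be roughly sketched with the standard library alone. This is not how html2text works internally; it is a minimal alternative built on `html.parser`, and the names `TextExtractor` and `html_to_plain_text` are made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}
    # Tags that should separate words in the output (a small, incomplete set).
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &amp;
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        # Ignore anything found inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def html_to_plain_text(html):
    """Return the visible text of an HTML string as a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For example, `html_to_plain_text("<p>Fish &amp; chips</p><script>var x = 1;</script>")` returns `"Fish & chips"`. The sketch flattens everything onto one line and knows nothing about links, lists, or tables, which is exactly the structure html2text preserves by emitting Markdown.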

2 comments:

Heather said...

I tried to use html2text.py, but I'm not sure which function to call to convert the page I downloaded. Could you help me?
Thanks!

g0bzer said...

Check the last line of the code: it's html2text(data, baseurl).