Sunday, May 10, 2009

Converting HTML to text using Python

html2text is a Python script that does a good job of extracting text from HTML files. It handles HTML entities correctly and ignores JavaScript. However, its output is not quite plain text: it produces Markdown, which then has to be converted to plain text in a separate step.
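
If plain text is the goal and the Markdown step is unwanted, one alternative is to skip html2text and use the standard library's html.parser. The following is only a rough sketch of that swapped-in approach, not what html2text does internally; the names TextExtractor and html_to_plain_text are made up for illustration, and the list of block-level tags is an assumption that is far from complete.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}
    # Assumed, incomplete set of tags that should separate words in the output.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_plain_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

For example, html_to_plain_text("<p>Fish &amp; chips</p>") yields the text with the entity decoded. All layout (links, lists, emphasis) is lost, which is exactly the information html2text preserves by emitting Markdown.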


Heather said...

I tried to use the script, but I'm not sure which function to call to convert the page I downloaded. Would you help me?

g0bzer said...

Check the last line of the code: it's html2text(data, baseurl)
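
A minimal sketch of that call, assuming html2text is importable as a module and exposes html2text(data, baseurl) as described above. The helper names download and page_to_markdown are hypothetical, and the download step uses Python 3's urllib.request rather than whatever the original script used.

```python
import urllib.request

try:
    import html2text  # third-party module; assumed to be installed separately
except ImportError:
    html2text = None


def download(url):
    """Fetch a page and decode it to a string (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def page_to_markdown(data, baseurl=""):
    """Convert an HTML string to Markdown via html2text.html2text."""
    if html2text is None:
        raise RuntimeError("html2text is not installed")
    # baseurl is passed through so html2text can resolve relative links.
    return html2text.html2text(data, baseurl)
```

Typical use would be data = download(url) followed by page_to_markdown(data, url), where url is the address of the page to convert.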