html2text is a Python script that does a good job of extracting text from HTML files. It handles HTML entities correctly and ignores JavaScript. However, its output is not exactly plain text: it produces Markdown, which would then have to be converted to plain text.
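If plain text is all that is needed, the same two behaviours (decoding entities and skipping JavaScript) can be approximated with the standard library alone. The following is only a minimal sketch, not how html2text works internally: it uses Python 3's html.parser, and the names TextExtractor and extract_text are made up for this illustration. It also flattens the page to a single line of text, which may or may not be what you want.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect visible text, skipping script/style."""

    SKIPPED = {"script", "style"}
    # Tags treated as block boundaries, so adjacent blocks do not run together.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True lets the parser decode entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self.skip_depth == 0:
            self.parts.append(data)


def extract_text(html):
    """Return the visible text of an HTML string as one line of plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse all runs of whitespace (including block breaks) to single spaces.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    page = "<p>Fish &amp; chips</p><script>var x = 1;</script><p>Done</p>"
    print(extract_text(page))
```

Unlike html2text, this sketch throws away links, headings and list structure entirely, so it is only a reasonable choice when the formatting does not matter.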
Sunday, May 10, 2009
2 comments:
I tried to use html2text.py, but I'm not sure which function to call to convert the page I downloaded. Could you help me?
Thanks!
Check the last line of the code: the call is html2text(data, baseurl).