Comparison of HTML parsers

Title: Comparison of HTML parsers  
Author: World Heritage Encyclopedia
Language: English
Subject: HTML
Publisher: World Heritage Encyclopedia

Parsing HTML is an automated task performed by HTML parsers. They serve two main purposes:

  • HTML traversal: offering an interface through which programmers can easily access and modify the HTML source code. Canonical example: DOM parsers.
  • HTML cleaning: fixing invalid HTML and improving the layout and indentation style of the resulting markup. Canonical example: HTML Tidy.
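
The traversal use case can be illustrated with a minimal sketch. It assumes Python's standard-library html.parser module, which is not one of the parsers compared in this article and is an event-driven tokenizer rather than a DOM or HTML5-conformant parser. The names LinkCollector and collect_links are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a href=...> element seen."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; attrs is a list of (name, value).
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears while a link with an href is open.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_links(markup):
    """Return the links found in markup, tolerating unclosed tags."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links
```

For the input '<p>See <a href="/doc">the <b>docs</b></a>', which leaves the P element unclosed, collect_links returns [('/doc', 'the docs')]. A DOM parser would instead build a tree that can be queried and modified after parsing; the event-driven style shown here trades that convenience for a smaller footprint.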

| Parser | License | Implementation language(s) | Latest date* | HTML Parsing[1] | Clean HTML** | Update HTML*** |
|---|---|---|---|---|---|---|
| Html Agility Pack | Microsoft Public License | C# | 2012-08-07[2] | Yes | No | ? |
| Beautiful Soup (based on lxml and html5lib)[3] | Python Software Foundation License | Python | 2013-05-31 | Yes | ? | ? |
| Gumbo | Apache License 2.0 | C | 2013-08-13 | Yes | ? | ? |
| html5lib | MIT License | Python (and PHP, six years ago) | 2013-12-23[4] | Yes | Yes | No |
| HTML::Parser | Perl license | Perl | 2013-03-28 | No[5] | ? | ? |
| htmlPurifier | GNU Lesser GPL | PHP | 2009-03-25[6] | No | Yes | Yes |
| HTML Tidy | W3C license | ANSI C | 2009-03-25[7] | Yes[8] | Yes | ? |
| HtmlCleaner | BSD License[9] | Java | 2013-09-05 | No | Yes | ? |
| Hubbub | MIT License | C | 2013-04-19 | Yes | ? | ? |
| Jaunt API | Jaunt Beta License | Java | 2013-08-01 | Yes | Yes | No |
| Jericho HTML Parser | Eclipse Public License | Java | 2012-10-30[10] | No?? | ? | ? |
| jsdom | MIT license | JavaScript | 2013-07-21 | No | ? | ? |
| jsoup | MIT license | Java | 2014-09-27[11] | Yes | Yes | Yes |
| JTidy | JTidy License | Java | 2012-10-09[12] | Yes | Yes | ? |
| libxml2 HTMLparser | MIT License | C | 2012-09-11[13] | Yes | ? | ? |
| NekoHTML | Apache License 2.0 | Java | 2013-02-27[14] | No | ? | ? |
| TagSoup | Apache License 2.0 | Java | 2011-07-07 | No | ? | ? |
| HTML Parser | MIT License | Java | 2012-06-05 | Yes | ? | ? |
| PHP Simple HTML DOM Parser | MIT License | PHP | 2014-08-28 | Yes | No | No |
| PHP DOMDocument class | PHP License | PHP | 2014-10-04 | Yes | No | No |

* Date of the latest release with significant changes.
** Sanitizing (generating standards-compliant web pages, reducing spam, etc.) and cleaning (stripping out surplus presentational tags, removing XSS code, etc.) of HTML code.
*** Updating HTML 4.x to XHTML or HTML5 by converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
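
The CENTER-to-DIV conversion mentioned in the last note can be sketched as follows. This is an illustrative sketch only, again assuming Python's standard-library html.parser rather than any parser from the table; CenterRewriter and update_center are hypothetical names. It re-emits the input event by event and swaps only the CENTER element, so it does not repair invalid markup the way a real cleaner would.

```python
from html.parser import HTMLParser


class CenterRewriter(HTMLParser):
    """Re-emits markup, replacing <center> with a styled <div>."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            # Assumption of this sketch: attributes on CENTER are dropped.
            self.out.append('<div style="text-align:center;">')
        else:
            # Reuse the original source text of the start tag unchanged.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # End tags are re-emitted in lower case.
        self.out.append("</div>" if tag == "center" else f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        # Entity references are always written with a closing semicolon.
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def update_center(markup):
    """Return markup with every CENTER element turned into a centered DIV."""
    rewriter = CenterRewriter()
    rewriter.feed(markup)
    rewriter.close()
    return "".join(rewriter.out)
```

Calling update_center('<CENTER><b>Hi</b></CENTER>') yields '<div style="text-align:center;"><b>Hi</b></div>'. Parsers marked "Yes" in the Update HTML column would be expected to handle many more deprecated constructs than this single case.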


  1. 12.2 Parsing HTML documents — HTML Standard
  2. Nuget Html AgilityPack
  3.
  4. Releases · html5lib/html5lib-python
  5. Bug #53300 for HTML-Parser: HTML 5
  6. HTML Tidy for Windows
  7. HTML Tidy for Windows
  8. Tidy parser example: class.tidynode of PHP
  9. HtmlCleaner is distributed under BSD License
  10. Jericho HTML Parser - Browse /jericho-html/3.3 at
  11. jsoup release 1.8.1
  12. JTidy - Browse /JTidy at
  13. libxml2 Releases
  14. NekoHTML | Change History