QHtmlParser: writing an HTML parser with your brain switched off
While developing MiTubo I recently felt the need to parse HTML pages. The first problem I wanted to solve was proper RSS feed detection: when the user enters a website URL into MiTubo's search box, MiTubo should parse the site's HTML, look for <link rel="alternate"...> URLs in the HEAD section, and let the user subscribe to any video feeds found there.
A quick search on the internet did not provide a clear answer: I found a Qt HTML parser in (stalled) development, and a few other C++ or C parsers (among the latter, lexbor is the most inspiring), but all of them seem to take the approach of parsing the HTML file into a DOM tree, while I was hoping to find a lightweight SAX-like parser, pretty much like Python's html.parser.
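To make the SAX-like approach concrete, here is a minimal sketch of the feed detection described above, written against Python's html.parser rather than QHtmlParser. The class name FeedLinkParser, the helper find_feeds and the set of accepted MIME types are assumptions made for this illustration; they are not taken from MiTubo's code.

```python
from html.parser import HTMLParser

# MIME types treated as feeds; this set is an assumption for the example.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs from <link rel="alternate"> elements inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.feeds = []  # list of (href, title) tuples

    def handle_starttag(self, tag, attrs):
        # The parser reports tag and attribute names in lowercase.
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            attr = dict(attrs)
            # Valueless attributes come through as None, hence the "or".
            rel = (attr.get("rel") or "").lower().split()
            mime = (attr.get("type") or "").lower()
            if "alternate" in rel and mime in FEED_TYPES and attr.get("href"):
                self.feeds.append((attr["href"], attr.get("title")))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_feeds(html):
    """Hypothetical helper: returns the feeds advertised in the page's head."""
    parser = FeedLinkParser()
    parser.feed(html)
    parser.close()
    return parser.feeds


page = """<html><head>
<title>Example</title>
<link rel="stylesheet" href="style.css">
<link rel="alternate" type="application/rss+xml" title="Videos" href="/feed.xml">
</head><body>
<link rel="alternate" type="application/rss+xml" href="/ignored.xml">
</body></html>"""

print(find_feeds(page))  # [('/feed.xml', 'Videos')]
```

No tree is ever built: the parser just calls the handlers as it walks through the text, and the handlers keep only the state they need (here, a single flag and a list).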
Anyway, I don't remember how it happened, but at a certain point I found myself looking at the html.parser source code, and I was surprised to see how compact it was (apart, of course, from the long list of character references for the HTML entities!). Upon a closer look, it also appeared that the code was not making much use of Python's dynamic typing, so I thought I could try rewriting it as a Qt class. And a few hours later QHtmlParser was born.
As this post's title suggests, the process of rewriting html.parser with Qt was quite straightforward, and the nice thing about it is that I didn't have to spend any time reading the HTML standard or trying to figure out how to implement the parser: I just had to translate Python code into C++ code, and thanks to the nice API of QString (which in many ways resembles Python's, or vice versa) this was not too hard. I even left most of the original code comments untouched, and reused quite a few tests from the test suite. It was time well spent. :-)
If you think you might need an HTML parser for your Qt application, you are welcome to give it a try. It's not a library, just a set of files that you can import into your project; for the time being I only have a build file for QBS, but I'll happily accept contributions to make it easier to use QHtmlParser with projects built using other build systems. You can see here the changes I made in MiTubo to start using it and detect RSS feeds in a webpage's HEAD.
That's all for now. And in case you missed the link before, you can find QHtmlParser here.