Tonight, I added a new feature to Structr to enable content syndication. You can now fetch content from any URL and integrate it into Structr pages, anywhere (even as part of any HTML attribute).
Not very innovative at first sight, but of course it's important for a CMS, as it enables real content syndication. But that's not the point of this blog post.
Technically, for content syndication you need to make an HTTP request to a remote URL, get the content and either display it unfiltered, or decide what to do with it otherwise.
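To make the fetch step concrete, here is a minimal, dependency-free sketch using the HTTP client that ships with Java 11 and later. It is not how Structr does it; the class name SimpleFetcher and both method names are made up for illustration, and the timeout is simply whatever the caller passes in.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper, not part of Structr, Jsoup or Jerry.
class SimpleFetcher {

    // Builds a plain GET request with a per-request timeout.
    static HttpRequest buildRequest(String address, long timeoutMillis) {
        return HttpRequest.newBuilder(URI.create(address))
                .timeout(Duration.ofMillis(timeoutMillis))
                .GET()
                .build();
    }

    // Fetches the raw body as a String; the caller decides whether to
    // display it unfiltered or hand it to an HTML parser first.
    static String fetch(String address, long timeoutMillis)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(timeoutMillis))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response = client.send(
                buildRequest(address, timeoutMillis),
                HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }
}
```

Treating every non-200 status as an error is a simplification made for this sketch; a real syndication feature might want to handle some of those cases differently.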
As I had promised Stephan Tilkov, I wanted to test Jerry, a Java library for HTML parsing, traversing and manipulating, against Jsoup (already used for the Importer functionality of Structr).
And this is the result:
First test was to get and parse a remote page of typical size and speed. As I didn't want to flood anyone else, I chose http://structr.org/about.
Jsoup comes with a convenient interface where getting and parsing a web page is just a one-liner:
return Jsoup.parse(new URL(address), 5000).html();
This will fetch the content of address, parse it as an HTML document, and return the HTML source code.
As Jerry didn't seem to offer the same interface, I implemented a short helper method like this:
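private static String getFromUrl(final String requestUrl) throws IOException {
    DefaultHttpClient client = new DefaultHttpClient();
    HttpGet get = new HttpGet(requestUrl);
    get.setHeader("Connection", "close");
    // Read the whole response body as UTF-8 text
    return IOUtils.toString(client.execute(get).getEntity().getContent(), "UTF-8");
}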
The output of the above method was then passed to Jerry:
String raw = getFromUrl(address);
Jerry doc = jerry(raw);
return doc.html();
Judging by these charts, Jsoup seemed like the clear winner. But I had some timing logging enabled, and since the logs showed the following, I wanted to give Jerry another chance.
INFO: Jsoup took 35 ms to get and parse page.
INFO: Jerry took 87 ms to get and 7 ms to parse page.
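A get/parse split like the one in these log lines can be measured by taking a timestamp around each phase. The following is only a sketch of that idea, not the logging code actually used here; the class FetchParseTimer, its Timing holder and the toLogLine method are hypothetical names, and the two phases are passed in as lambdas so either library could be plugged in.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical timing helper for illustration only.
class FetchParseTimer {

    // Holds the duration of both phases plus the parsed result.
    static final class Timing<T> {
        final long fetchMillis;
        final long parseMillis;
        final T result;

        Timing(long fetchMillis, long parseMillis, T result) {
            this.fetchMillis = fetchMillis;
            this.parseMillis = parseMillis;
            this.result = result;
        }

        // Formats the measurement in the style of the log lines above.
        String toLogLine(String name) {
            return String.format("%s took %d ms to get and %d ms to parse page.",
                    name, fetchMillis, parseMillis);
        }
    }

    // Runs the fetch step, then the parse step, timing each separately.
    static <R, T> Timing<T> time(Supplier<R> fetch, Function<R, T> parse) {
        long start = System.nanoTime();
        R raw = fetch.get();
        long fetched = System.nanoTime();
        T parsed = parse.apply(raw);
        long done = System.nanoTime();
        return new Timing<>(
                TimeUnit.NANOSECONDS.toMillis(fetched - start),
                TimeUnit.NANOSECONDS.toMillis(done - fetched),
                parsed);
    }
}
```

A fetch step that throws checked exceptions such as IOException would have to be wrapped before it fits into the Supplier. Single runs are also noisy (JIT warm-up, network jitter), so repeating the measurement a few times gives a fairer picture.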
I read on Jerry's pages that there is an HTTP client available in the underlying Jodd library. So I exchanged Apache HttpGet for the Jodd HTTP methods:
HttpRequest httpRequest = HttpRequest.get(requestUrl);
HttpResponse response = httpRequest.send();
return response.body();
But that didn't change much:
INFO: Jerry took 80 ms to get and 1 ms to parse page.
Jerry comes with a syntax for selecting elements nearly identical to jQuery, which is great. Jsoup has a similar approach, featuring a powerful but proprietary syntax.
In the end, Jerry clearly lacks speed: in my tests, Jsoup was about twice as fast at retrieving the raw page content, which makes the parsing time negligible.