ManticMoo.COM All Articles Jeff's Articles
Jeffrey P. Bigham

Using htmlparser without having it download the page


htmlparser is an incredibly handy parser, but one thing I didn't like about it at first is that it seems to force you to use its built-in function for downloading the web page. That isn't always ideal: the built-in download isn't very flexible, and it rules out working from a cached copy that lives in a database or some other location that isn't easily reachable as a file or URL. The setInputHTML method of the Parser object would seem to get around most of these complaints, but not quite. If you parse from a String and then extract links, images, or other objects that are referenced relative to the base URL, the parser has no base URL to resolve them against. You'll get URLs back that look like "/images/foo.gif", which you obviously can't download directly. To get around this problem, you must set the base URL yourself, which isn't exactly straightforward; once you do, all relative URLs are resolved correctly.
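To make "resolved correctly" concrete, here is a small sketch of what resolving a relative URL against a base URL means. It uses only the JDK's java.net.URI rather than htmlparser, and the class name RelativeUrlDemo and its resolve method are made up for illustration; htmlparser's own resolution may differ in edge cases.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not part of htmlparser.
final class RelativeUrlDemo {

    // Resolve a link found in a page against the URL the page came from.
    // Absolute links are returned as they are.
    static String resolve(String baseUrl, String link) {
        return URI.create(baseUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.domain.com/foo.html";
        // Root-relative link, like the one in the text above.
        System.out.println(resolve(base, "/images/foo.gif"));
        // Link relative to the directory of the current page.
        System.out.println(resolve(base, "bar.html"));
    }
}
```

With the base URL from this article, "/images/foo.gif" becomes "http://www.domain.com/images/foo.gif", which is the kind of absolute URL you can actually download.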

The code below does all of these things. It first sets the input HTML from a String and then sets the base URL from a provided String.

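import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;

String url = "http://www.domain.com/foo.html";
String input_html = getHTML(url);  // your own code: read from a cache, a database, etc.

Parser my_parser = new Parser();
my_parser.setInputHTML(input_html);
// The base URL must be set after setInputHTML(), not before.
my_parser.getLexer().getPage().setBaseUrl(url);

// filter is whatever org.htmlparser.NodeFilter selects the nodes you want.
NodeList nodes = my_parser.parse(filter);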

Obviously, you need to fill in the pieces that actually set the URL and HTML to the right values for your application (getHTML and filter above are placeholders), but that's how it works in a nutshell. Note: Be sure not to set the base URL before setting the input HTML. Doing so won't work, presumably because setInputHTML replaces the page object that the base URL was stored on. Happy coding!
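If you want a starting point for the getHTML piece, here is one possible sketch that checks a local cache first and only downloads on a miss. Everything in it is an assumption on my part: the HtmlCache class, its put and getHTML methods, the in-memory Map standing in for a database, and the choice to decode pages as UTF-8. It uses only the JDK (Java 9 or later for readAllBytes).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical cache for illustration; a real one might sit on top of a database.
final class HtmlCache {

    private final Map<String, String> pages = new HashMap<>();

    // Store a page you already have, keyed by the URL it came from.
    void put(String url, String html) {
        pages.put(url, html);
    }

    // Return the cached HTML if present; otherwise download and cache it.
    String getHTML(String url) {
        String cached = pages.get(url);
        if (cached != null) {
            return cached;
        }
        try (InputStream in = URI.create(url).toURL().openStream()) {
            // Assumption: the page is UTF-8. Real pages may declare another charset.
            String html = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            pages.put(url, html);
            return html;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

You would then call cache.getHTML(url) in place of the bare getHTML(url) in the snippet above, and pass the same url to setBaseUrl so the cached page's relative links resolve against where it originally came from.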
