
Using a web proxy to rewrite pages

by Jeffrey P. Bigham

There are a number of reasons you might want to rewrite webpages transparently on the fly. Doing so with a proxy makes the changes completely cross-platform and easy to support, because nothing has to be installed in the browser itself. This is in sharp contrast to the other obvious alternative, a browser plugin.

The solution here uses the Apache webserver configured as a forward proxy along with the mod_proxy_html Apache module.

Apache

The first step is to download and install the Apache webserver.
It can be obtained from here: http://httpd.apache.org/.

Configuration & Installation

Next, configure, build and install Apache with the proxy and headers modules enabled. From the Apache source directory, run:


./configure --enable-proxy --enable-headers
make
make install
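
After installing, it can be worth confirming that both modules actually made it into the binary. The sketch below assumes the modules were compiled in statically (so that they show up in the output of httpd -l) and that the binary lives at /usr/local/apache2/bin/httpd; both are assumptions, so adjust the path to your own prefix, and note that modules built as shared objects will not appear in this list.

```shell
# Assumed path to the newly built binary; change it to match your own prefix.
HTTPD="${HTTPD:-/usr/local/apache2/bin/httpd}"

# Print one line per required module saying whether it is compiled in.
check_modules() {
    compiled="$("$HTTPD" -l)"
    for mod in mod_proxy.c mod_headers.c; do
        case "$compiled" in
            *"$mod"*) echo "$mod: ok" ;;
            *)        echo "$mod: MISSING" ;;
        esac
    done
}

# Run the check only if the binary is actually present.
if [ -x "$HTTPD" ]; then
    check_modules
fi
```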

Installing mod_proxy_html

The next step is to install mod_proxy_html. The original author provides installation instructions, although they are a bit sparse on the details of how you might alter the module for your own purposes. Basically, you'll need to download the source and compile it with the command below:


{PREFIX}/apache/bin/apxs -c -I {LIBXML2 PATH}/include/libxml2/ -i mod_proxy_html.c

You'll notice that this requires libxml2, which can be downloaded from here: http://xmlsoft.org/downloads.html
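
If libxml2 is already installed with its development files, the include path for the command above can usually be discovered rather than guessed. The sketch below assumes the xml2-config helper that ships with libxml2 is on your PATH; the function name libxml2_include_dir is made up for this example.

```shell
# Assumed helper command; xml2-config is installed alongside libxml2's headers.
XML2_CONFIG="${XML2_CONFIG:-xml2-config}"

# Print the first include directory reported by xml2-config, without the -I prefix.
libxml2_include_dir() {
    "$XML2_CONFIG" --cflags | tr ' ' '\n' | sed -n '/^-I/{s/^-I//p;q;}'
}

# Example use with the compile command (PREFIX is still yours to fill in):
#   {PREFIX}/apache/bin/apxs -c -I "$(libxml2_include_dir)" -i mod_proxy_html.c
```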

Configuring Apache

Configuring Apache to act as a proxy and to use your new module takes only a few directives. Add the following directives at the appropriate places in the httpd.conf file that came with your Apache distribution.


Listen 8000

...

LoadFile /usr/lib/libxml2.so
LoadModule proxy_html_module modules/mod_proxy_html.so

...

<IfModule mod_proxy.c>
 ProxyRequests On
 SetOutputFilter proxy-html
</IfModule>

And that's basically it! The first directive tells Apache to listen for proxy requests on port 8000. The next pair links in the libxml2 library used by mod_proxy_html and then loads the module itself. The final group turns on the forward proxy and tells Apache to filter output through mod_proxy_html.
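
Once Apache has been restarted with this configuration, a quick way to see whether the proxy answers is to send a request through it from the command line. The sketch below assumes curl is installed and that the proxy runs on the same machine on port 8000, as configured above; the function name proxy_status and the target URL are placeholders for illustration.

```shell
# Assumed values: the proxy address matches the Listen directive above.
PROXY="${PROXY:-http://localhost:8000}"
CURL="${CURL:-curl}"

# Print the HTTP status code returned when fetching the URL in $1 through the proxy.
proxy_status() {
    "$CURL" --silent --output /dev/null --write-out '%{http_code}' --proxy "$PROXY" "$1"
}

# Example (requires the proxy to be running):
#   proxy_status http://example.com/
```

A status of 200 suggests the request was relayed successfully; a status of 000 generally means curl could not reach the proxy at all.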

Filtering HTML

Actually filtering HTML content using the mod_proxy_html module requires some C programming, but given the framework exposed by the module it's fairly straightforward.
