Related to the previous article: step 1 of a resilient WordPress setup is to mirror the web page somehow. A second WordPress instance with file-level synchronization of the WordPress directory and MySQL multi-master replication sounds great…not.
WordPress keeps files in its directory tree (YAPB pictures and plugins, for example), and while rsync could handle those, it gets messy quickly. Multi-master MySQL is possible too, but overkill for my purpose.
The easier and more universal way is to simply grab the web page and keep a static copy available. That loses the ability to log in and edit/write articles, but that's fine since most visitors will simply read.
The initial idea was to use wget, but that fell short: by default it neither copied the JavaScript referenced in &lt;script&gt; tags nor downloaded the CSS files, so I got the pure text content plus some formatting, but not enough to call it a “mirror”. Another idea was to use PhantomJS to render the web page and save it as a picture, but for a blog that would be one very long picture, so that idea was thrown out quickly. Going back to the wget approach I found httrack, and while not perfect on the first try, it made a much better copy out-of-the-box, and with a bit of tuning I could mirror my blog page quite well. There are visible differences, but they are minimal. So httrack it was.
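For reference, wget's mirroring-oriented flags get you part of the way (this is a generic invocation based on the wget man page, not the exact command I ran):

# --mirror: recursive download with timestamping
# --page-requisites: also fetch CSS, images, scripts needed to render a page
# --convert-links: rewrite links so the copy is browsable offline
# --adjust-extension: save pages with .html extensions
wget --mirror --page-requisites --convert-links --adjust-extension http://www.example.com/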
Naturally this became a Docker container. It’s hosted on Docker Hub under hkubota/webmirror.
The Dockerfile is simple:
FROM debian:8
MAINTAINER Harald Kubota <[email protected]>
RUN apt-get update ; apt-get -y install lighttpd wget curl openssh-client ; apt-get clean
# httrack in usr/local/
COPY usr/ /usr/
RUN ldconfig -v
# The script to run
COPY mirror.sh /root/
# The lighttpd configuration
COPY lighttpd.conf /root/
ENTRYPOINT ["dumb-init", "/root/mirror.sh"]
# It's a web server, so expose port 80
EXPOSE 80
WORKDIR /root
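The work happens in mirror.sh: fetch the site with httrack, serve the result with lighttpd, repeat periodically. The actual script ships in the image; a minimal sketch of the idea, assuming a document root of /var/www/html and the environment variables from the docker run example further down, would be:

#!/bin/sh
# Sketch of mirror.sh. Assumptions: lighttpd.conf points its document
# root at /var/www/html; env variables as in the docker run example.
: "${web_source:?web_source must be set}"
: "${recursive:=2}"      # recursion depth, default 2
: "${refresh:=24}"       # hours between refreshes
: "${max_time:=300}"     # abort a copy after this many seconds

lighttpd -f /root/lighttpd.conf    # daemonizes and serves the mirror

while true; do
    # $other_flags is deliberately unquoted so extra flags split into words
    httrack "$web_source" -O /var/www/html -r"$recursive" -E"$max_time" $other_flags
    sleep $(( refresh * 3600 ))
done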
It mainly uses httrack, which I compiled from source, and lighttpd as a web server, since the mirrored pages need to be served via a web server again. wget, curl and openssh-client are there more for completeness, as I was testing with httrack and wget and ssh'ing out.
I tested it on several other web pages (www.heise.de, www.theregister.co.uk and some others) and it works quite well. Note that the default is 2 levels of recursion, which is enough to make everything on the first page clickable. Also, the copying stops after 5 minutes, as I sometimes had endless loops happening. If your network bandwidth is very fast or slow, you might have to adjust this.
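Those defaults translate into an httrack call along these lines (flag meanings from the httrack man page; the exact call inside mirror.sh may differ slightly):

httrack "http://www.heise.de/" -O /var/www/html -r2 -E300
# -rN: recursion depth (2 = everything linked from the first page)
# -EN: maximum mirror time in seconds (300 = 5 minutes)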
To run the hkubota/webmirror Docker image, do:
docker run -e web_source="http://www.heise.de/" -e recursive=2 -e refresh=24 -e max_time=300 -e other_flags="-v" -p 80:80 -d hkubota/webmirror
If you want to watch what happens (mainly the httrack output), replace the “-d” with “-it”, or watch the logs via “docker logs CONTAINER”.
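To quickly verify the container is doing its job (the container ID will of course differ on your machine):

docker logs -f $(docker ps -q --filter ancestor=hkubota/webmirror)   # follow httrack output
curl -sI http://localhost/ | head -n 1    # should report 200 OK once the first copy is done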
Next is the actual load balancer…