So I chose instead to write a simple shell script which sends me an e-mail message whenever a web page in my watch list changes. Its name is 'URL Monitor':
#!/bin/sh
# Path: /usr/local/bin/url-monitor
mkdir -p /var/cache/url-monitor
# Each entry in the configuration file is six consecutive lines.
cat /etc/url-monitor.conf | while read -r NAME; do
    read -r URL || exit 1
    read -r INTERVAL || exit 2
    read -r STRIP_REGEX || exit 3
    read -r NEEDLE || exit 4
    read -r REPLACEMENT || exit 5
    # Skip this entry if the cached copy is younger than its interval.
    if [ -f "/var/cache/url-monitor/$NAME.html" ]; then
        MTIME=$(stat --format=%Y "/var/cache/url-monitor/$NAME.html")
        NOW=$(date +%s)
        AGE=$((NOW - MTIME))
        if [ "$AGE" -lt "$INTERVAL" ]; then
            continue
        fi
    fi
    # Fetch the page, join it into a single line, strip the unwanted parts,
    # collapse whitespace and rewrite relative URLs.
    wget -q -T 10 -O - "$URL" | perl -pe 's/[\r\n]/ /g' | perl -pe "s/$STRIP_REGEX//gi" | perl -pe 's/\s+/ /g' | perl -pe "s/$NEEDLE/$REPLACEMENT/gi" > "/var/cache/url-monitor/$NAME.html.new"
    # -s is false for a missing or empty file.
    if [ ! -s "/var/cache/url-monitor/$NAME.html.new" ]; then
        echo "Failed to fetch - $NAME" >&2
        exit 6
    fi
    if [ -f "/var/cache/url-monitor/$NAME.html" ]; then
        if diff -q "/var/cache/url-monitor/$NAME.html" "/var/cache/url-monitor/$NAME.html.new" > /dev/null 2>&1; then
            # Unchanged: refresh the timestamp and move on without notifying.
            rm -f "/var/cache/url-monitor/$NAME.html.new"
            touch "/var/cache/url-monitor/$NAME.html"
            continue
        else
            mv -f "/var/cache/url-monitor/$NAME.html.new" "/var/cache/url-monitor/$NAME.html"
        fi
    else
        mv -f "/var/cache/url-monitor/$NAME.html.new" "/var/cache/url-monitor/$NAME.html"
    fi
    # Send notification
    {
        echo 'From: URL Monitor <url-monitor@gleamynode.net>'
        echo 'To: Trustin Lee <trustin@gmail.com>'
        echo "Subject: $NAME - updated"
        echo 'Content-Type: text/html'
        echo
        cat "/var/cache/url-monitor/$NAME.html"
        echo
    } | sendmail trustin
done
This quick-and-dirty shell script strips the unnecessary parts from each fetched web page, caches the result, and e-mails me (the local user 'trustin') whenever the newly fetched copy differs from the cached one. The following is an example configuration file (/etc/url-monitor.conf):
JavaWorld: Featured Tutorials
http://www.javaworld.com/features/index.html
86400
(^.*<div id="toplist">|<p><a class="red".*$)
\/javaworld\/
http:\/\/www.javaworld.com\/javaworld\/
DDJ.com: High Performance Computing
http://www.ddj.com/hpc-high-performance-computing/archives.jhtml
86400
(^.*Feature Articles\s*-->|<br clear="left">.*$)
\/hpc-high-performance-computing\/
http:\/\/www.ddj.com\/hpc-high-performance-computing\/
TicketLink: Concerts
http://concert.ticketlink.co.kr/place/map.jsp?area=001
600
(^.*<div id="index_list">|<\/div>.*$)
javascript:fncDetail\(([^',]*),?'([^']+)'\)
http:\/\/concert.ticketlink.co.kr\/detail\/place_end01.jsp?flag=$1&pro_cd=$2&area=001&curpane=1&ria_open=yes
Lono.pe.kr
http://lono.pe.kr/src/
86400
(^.*\[\[Start\]\]-->|<!--\[\[.*$)
\/src\/
http:\/\/www.lono.pe.kr\/src\/
Each entry consists of six lines with the following meaning:
- 1st line - the subject of the web page
- 2nd line - the URL of the web page
- 3rd line - revisit interval (in seconds)
- 4th line - what to strip out (in regex)
- 5th line - something to replace (in regex), probably relative URLs
- 6th line - what you want to replace the expression specified in the 5th line with
As you may have noticed, it's very primitive and requires you to modify the script itself to configure certain parameters, such as the e-mail addresses. However, I think that's fine as long as the number of web pages I have to monitor (read: those which don't provide RSS) is small. Cron runs the script every three minutes:
# Path: /etc/cron.d/url-monitor.cron
SHELL=/bin/bash
PATH=/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=trustin
HOME=/root
# Run the URL monitor every three minutes
*/3 * * * * root /usr/local/bin/url-monitor
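Although cron fires every three minutes, a page is only fetched again once its cached copy is at least as old as the entry's revisit interval. A minimal sketch of that check on its own, assuming GNU stat as the script does; the function name is_due and the temporary file are hypothetical:

```sh
#!/bin/sh
# is_due: hypothetical helper isolating the age check.
# Succeeds when FILE is missing or at least INTERVAL seconds old.
# Assumes GNU stat (--format), like the script itself.
is_due() {
    file=$1 interval=$2
    [ -f "$file" ] || return 0          # never fetched before
    mtime=$(stat --format=%Y "$file")   # last modification, epoch seconds
    now=$(date +%s)
    [ $((now - mtime)) -ge "$interval" ]
}

cache=$(mktemp)   # freshly created, so its age is about 0 seconds
if is_due "$cache" 86400; then
    echo "fetch now"
else
    echo "skip this run"
fi
rm -f "$cache"
```

With this arrangement the cron schedule only sets the finest granularity; an entry with an interval of 600 is checked on roughly every fourth run, and one with 86400 about once a day.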