URL Monitor - Getting Notified When a Web Page is Updated

2008-04-05 17:42

I've been using RSS Generator for a while to generate RSS feeds for web pages that don't provide one. However, the service is often unreliable, probably because of the enormous load from various RSS readers. Another drawback is that the URL of a generated feed is so long that some web-based RSS readers refuse it.

So I chose instead to write a simple shell script that sends me an e-mail message whenever a web page in my watch list changes. Its name is 'URL Monitor':
#!/bin/sh
# Path: /usr/local/bin/url-monitor

mkdir -p /var/cache/url-monitor

# Each entry in the configuration file consists of six lines.
cat /etc/url-monitor.conf | while read -r NAME; do
    read -r URL || exit 1
    read -r INTERVAL || exit 2
    read -r STRIP_REGEX || exit 3
    read -r NEEDLE || exit 4
    read -r REPLACEMENT || exit 5

    # Skip this page if the cached copy is younger than the revisit interval.
    if [ -f "/var/cache/url-monitor/$NAME.html" ]; then
        MTIME=$(stat --format=%Y "/var/cache/url-monitor/$NAME.html")
        NOW=$(date +%s)
        AGE=$(($NOW - $MTIME))
        if [ "$AGE" -lt "$INTERVAL" ]; then
            continue
        fi
    fi

    # Fetch the page, join it into a single line, strip the unwanted parts,
    # squeeze whitespace, and rewrite the links. perl runs without -i because
    # it filters stdin here instead of editing a file in place.
    wget -q -T 10 -O - "$URL" |
        perl -pe 's/[\r\n]/ /g' |
        perl -pe "s/$STRIP_REGEX//gi" |
        perl -pe 's/\s+/ /g' |
        perl -pe "s/$NEEDLE/$REPLACEMENT/gi" > "/var/cache/url-monitor/$NAME.html.new"

    # -s is false for a missing or empty file.
    if [ ! -s "/var/cache/url-monitor/$NAME.html.new" ]; then
        echo "Failed to fetch - $NAME" >&2
        exit 6
    fi

    # Unchanged: drop the new copy and refresh the timestamp of the cached one.
    if [ -f "/var/cache/url-monitor/$NAME.html" ] &&
       diff -q "/var/cache/url-monitor/$NAME.html" "/var/cache/url-monitor/$NAME.html.new" > /dev/null 2>&1; then
        rm -f "/var/cache/url-monitor/$NAME.html.new"
        touch "/var/cache/url-monitor/$NAME.html"
        continue
    fi

    # New or changed: the fetched copy becomes the cached copy.
    mv -f "/var/cache/url-monitor/$NAME.html.new" "/var/cache/url-monitor/$NAME.html"

    # Send notification
    {
        echo 'From: URL Monitor <url-monitor@gleamynode.net>'
        echo 'To: Trustin Lee <trustin@gmail.com>'
        echo "Subject: $NAME - updated"
        echo 'Content-Type: text/html'
        echo
        cat "/var/cache/url-monitor/$NAME.html"
        echo
    } | sendmail trustin
done

This quick and dirty shell script strips the unnecessary parts from each fetched web page, caches the result, and notifies me (local user 'trustin') by e-mail whenever the newly fetched content differs from the cached copy. The following is an example configuration file (/etc/url-monitor.conf):
JavaWorld: Featured Tutorials
http://www.javaworld.com/features/index.html
86400
(^.*<div id="toplist">|<p><a class="red".*$)
\/javaworld\/
http:\/\/www.javaworld.com\/javaworld\/
DDJ.com: High Performance Computing
http://www.ddj.com/hpc-high-performance-computing/archives.jhtml
86400
(^.*Feature Articles\s*-->|<br clear="left">.*$)
\/hpc-high-performance-computing\/
http:\/\/www.ddj.com\/hpc-high-performance-computing\/
TicketLink: Concerts
http://concert.ticketlink.co.kr/place/map.jsp?area=001
600
(^.*<div id="index_list">|<\/div>.*$)
javascript:fncDetail\(([^',]*),?'([^']+)'\)
http:\/\/concert.ticketlink.co.kr\/detail\/place_end01.jsp?flag=$1&pro_cd=$2&area=001&curpane=1&ria_open=yes
Lono.pe.kr
http://lono.pe.kr/src/
86400
(^.*\[\[Start\]\]-->|<!--\[\[.*$)
\/src\/
http:\/\/www.lono.pe.kr\/src\/
Each entry consists of six lines with the following meaning:
  • 1st line - the subject of the web page
  • 2nd line - the URL of the web page
  • 3rd line - revisit interval (in seconds)
  • 4th line - what to strip out (in regex)
  • 5th line - something to replace (in regex), probably relative URLs
  • 6th line - what you want to replace the expression specified in the 5th line with
Once configured, url-monitor has to be run periodically. I added the following file to /etc/cron.d:
# Path: /etc/cron.d/url-monitor.cron
SHELL=/bin/bash
PATH=/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=trustin
HOME=/root

# Run the URL monitor every three minutes
*/3 * * * * root /usr/local/bin/url-monitor
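Cron starts the script every three minutes, but a page is fetched only once its own revisit interval has passed, so a 600-second page is fetched on roughly every fourth run. The age check can be sketched as a standalone function. The name is_due is hypothetical, and `stat --format` assumes GNU coreutils, as the script does.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the cached copy is missing or is at
# least $2 seconds old, i.e. when the page should be fetched again.
# $1 = path of the cached file, $2 = revisit interval in seconds.
is_due() {
    cache_file=$1
    interval=$2
    # No cached copy yet: always fetch.
    [ -f "$cache_file" ] || return 0
    mtime=$(stat --format=%Y "$cache_file")   # GNU stat: mtime as epoch seconds
    now=$(date +%s)
    [ $((now - mtime)) -ge "$interval" ]
}
```

For example, `is_due "/var/cache/url-monitor/Lono.pe.kr.html" 86400 && echo fetch` prints "fetch" only when the cached copy is missing or at least a day old.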
As you may have noticed, it's very primitive and requires you to modify the script itself to configure certain parameters, such as the e-mail addresses. However, I think that's fine as long as the number of web pages I have to monitor (read: pages that don't provide RSS) is small.
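If editing the script ever becomes a nuisance, one possible approach is to read the hard-coded values from environment variables and fall back to the current ones. This is only a sketch: the URL_MONITOR_* variable names and the print_headers function are made up for illustration, and the defaults are the values used in the script above.

```shell
#!/bin/sh
# Hypothetical settings: each variable keeps its value if it is already set
# in the environment, otherwise it receives the default used by the script.
: "${URL_MONITOR_CONF:=/etc/url-monitor.conf}"
: "${URL_MONITOR_CACHE:=/var/cache/url-monitor}"
: "${URL_MONITOR_FROM:=URL Monitor <url-monitor@gleamynode.net>}"
: "${URL_MONITOR_TO:=Trustin Lee <trustin@gmail.com>}"

# Print the notification headers for the page named in $1,
# followed by the blank line that separates headers from the body.
print_headers() {
    printf 'From: %s\n' "$URL_MONITOR_FROM"
    printf 'To: %s\n' "$URL_MONITOR_TO"
    printf 'Subject: %s - updated\n' "$1"
    printf 'Content-Type: text/html\n\n'
}
```

The variables could then be set in the cron file next to MAILTO, which would leave the script untouched.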