SLUG Mailing List Archives
Re: [Re: [SLUG] How to Use
- To: Sydney Linux Users Group <slug@xxxxxxxxxxx>
- Subject: Re: [Re: [SLUG] How to Use
- From: lukekendall@xxxxxxxxxxxxxxxx
- Date: Sat, 24 May 2003 15:55:48 +1000 (EST)
On 24 May, Matthew Palmer wrote:
> > Louis> I just tried the "wget -r http://www.domain.com/dir/" suggestion, and the
> > following happened:
> > 1. Created a dir called "www.domain.com/dir";
> > 2. In "www.domain.com/dir", I only see the index.html file.
> Does the index.html file have any links to other pages on the same site? If
> not, then wget has done its job - it has taken a recursive snapshot of the
> site.
> If index.html has links to other files residing on the same site, then we do
> have problems, and if you could furnish us with more info on the structure
> of the site (including its actual location, if feasible) we might be able
> to give more hints.
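A quick way to answer Matthew's question is to grep the fetched index.html
for link targets: if every href points off-site (or the links are generated
by JavaScript), then "wget -r" has nothing local to follow. A small sketch -
the sample file here just stands in for the real index.html that wget fetched:

```shell
# Stand-in index.html, purely for illustration; in practice you would
# grep the file that wget actually downloaded.
cat > index.html <<'EOF'
<html><body>
<a href="page1.html">same-site link</a>
<a href="http://elsewhere.example.com/x.html">off-site link</a>
</body></html>
EOF
# List every link target; relative (same-site) hrefs are what "wget -r"
# can follow.
grep -o 'href="[^"]*"' index.html
```

Relative hrefs like page1.html in that output mean wget should have fetched
more than just index.html.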
Louis, it sounds like what you did *should* have worked, for most web
sites.
Here's a script I use to snapshot a web site that I and several other
people maintain. I keep a local copy under CVS control. This way,
no information can be lost.
It won't help you much, though, if the site doesn't let you do a wget -r.
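One common reason a site "doesn't let you" do a wget -r, even though the
pages do link to each other, is a robots.txt that forbids recursive agents.
GNU wget honours robots.txt by default but can be told to ignore it - a
command-line sketch, to be used only with the site owner's blessing:

```shell
# -r recursive, -np don't ascend to the parent directory,
# -e robots=off ignore the site's robots.txt exclusions.
wget -r -np -e robots=off "http://www.domain.com/dir/"
```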
#!/bin/sh
#
# Take a snapshot of a web site, update local files under cvs control,
# make suggestions about adding files if new ones appear,
# and output the results in a simple summarised form.
# Designed to be run from cron if necessary, even by root.
#
# Author: Luke Kendall
# Copyright (C) Luke Kendall 2002, but hereby placed into the public domain.
#
MYNAME=`basename "$0"`
OWNER=$LOGNAME
HOME_DIR=$HOME
LOCAL_COPY=.
QUOTA="-Q 50M"
USAGE="usage: $MYNAME [-owner logname] [-dir local-directory] [-quota <number>M] -url URL

 -dir    local-directory is the directory where the site should be copied to.
         This should be specified *after* -owner, if a different owner is
         specified.
 -owner  logname is the login name of the person considered to be the owner
         of the files being downloaded.  Useful if you want the downloads
         to be done by root.  This option is only useful for root.
         Defaults to login name of the user.
 -quota  You have to specify a quota that wget understands.  The default is
         50M, meaning 50 megabytes.
 -url    You have to provide a URL.  Leave off the http://.

e.g.
 urlsnapshot -owner luke -dir smrg -url www.geocities.com/sydsmrg/"

usage()
{
    echo "$USAGE" >&2
    exit 1
}

while [ $# != 0 ]
do
    [ $# -lt 2 ] && usage
    case $1 in
    -dir)
        # Prefer a directory under the owner's home; fall back to one
        # under the current directory.
        for LOCAL_COPY in $HOME_DIR/$2 ./$2
        do
            if [ -d "$LOCAL_COPY" ]
            then
                break
            fi
        done
        ;;
    -owner)
        OWNER=$2
        HOME_DIR=`grep "^$OWNER:" /etc/passwd | cut -d : -f 6`
        if [ -z "$HOME_DIR" ]
        then
            echo "$MYNAME: $OWNER not found in /etc/passwd" >&2
            exit 1
        fi
        ;;
    -quota)
        if [ "$2" = "0" -o "$2" = "0M" ]
        then
            QUOTA=""
        else
            QUOTA="-Q $2"
        fi
        ;;
    -url)
        SOURCE_URL=$2
        ;;
    *)
        echo "$MYNAME: unknown options $*" >&2
        usage
        ;;
    esac
    shift 2
done
[ $# != 0 ] && usage
[ -z "$SOURCE_URL" ] && usage

cd "$LOCAL_COPY" || exit 1
LOCAL_COPY=`pwd`    # make the path absolute for the log redirections below
if [ ! -w . ]
then
    echo "$MYNAME: no write permission in `pwd`" >&2
    exit 1
fi

# Capture normal output and overwrite nightly log.
# Error output is allowed through unchanged.
{
    # Append to log file; non-verbose; apply the quota; don't recurse
    # upward into the parent directory; recursive fetch.
    wget -a $LOCAL_COPY/sms.log -nv $QUOTA -np -r "$SOURCE_URL"
    chown -R $OWNER .
    WHEN=`date "+%a %e %b %Y"`
    # cvs requires you to be su-ed to root, or at least LOGNAME
    # set to be your real account.
    cvs update > /tmp/sms$$ 2>&1
    cvs commit -m "Nightly snapshot for $WHEN"
} > $LOCAL_COPY/sms.log

# Report less-significant changes:
if grep -v "^?" /tmp/sms$$ > /dev/null
then
    echo "Changed files:"
    grep -v "^?" /tmp/sms$$
fi
if grep "^[?MP]" /tmp/sms$$ > /dev/null
then
    if grep "^?" /tmp/sms$$ > /dev/null
    then
        echo "New files now in local cache appear prefixed with \`?':"
        grep "^?" /tmp/sms$$
        echo "You should consider doing this:"
        while read status fname
        do
            if [ "x$status" = "x?" ]
            then
                if file "$fname" | grep data > /dev/null
                then
                    bin="-kb"   # binary file: add to cvs with -kb
                elif file "$fname" | grep text > /dev/null
                then
                    bin=""
                    # Don't do this, below.  It would mean that every run, wget
                    # would see that the source file was bigger, and download
                    # it again.  We'd have to replicate the directory elsewhere,
                    # and use that altered copy for cvs control and uploads.
                    #
                    # Tidy up any freshly-fetched file.
                    #ex - "$fname" <<-'EOF'
                    #   /<!-- text below generated by server. PLEASE REMOVE/,$d
                    #   wq
                    #EOF
                else
                    bin=""
                    echo "# This one seems weird: `file \"$fname\"`"
                fi
                echo "    cvs add $bin '$fname'"
            elif [ "x$status" = "xM" ]
            then
                echo "# $fname has been automatically committed."
                echo "# It is assumed that you uploaded the changed file."
            fi
        done < /tmp/sms$$
        echo "    cvs diff"
        echo "    cvs commit"
    fi
else
    echo "No new files on web site"
fi
rm -f /tmp/$MYNAME.log
mv /tmp/sms$$ /tmp/$MYNAME.log
chown $OWNER /tmp/$MYNAME.log
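Since the script is designed to run from cron, a crontab entry along these
lines would take the snapshot nightly. The install path and arguments are
illustrative, echoing the usage example in the script:

```shell
# Hypothetical root crontab entry: nightly snapshot at 03:15,
# with the downloaded files owned by luke.
15 3 * * * /usr/local/bin/urlsnapshot -owner luke -dir smrg -url www.geocities.com/sydsmrg/
```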