- To: "Sean Murphy" <slug@xxxxxxxxxx>
- Subject: Re: [SLUG] Extracting URL's from a web page
- From: Zhasper <slug@xxxxxxxxxxx>
- Date: Mon, 23 Jul 2007 12:31:53 +1000
- Cc: slug@xxxxxxxxxxx
Once more, this time to the list.
http://aspn.activestate.com/ASPN/docs/ActivePython/2.5/diveintopython/html/html_processing/extracting_data.html
has some sample recipes that should give you a good starting point.
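As a rough idea of the shape such a script can take, here is a minimal sketch using only the Python 3 standard library. It is not the recipe from that page: the names LinkExtractor, extract_links, fetch and crawl are made up for illustration, and the example.com URL in the usage note is a placeholder. It collects each link's URL and visible text, then follows links breadth-first down to a fixed depth, assuming every page is ordinary HTML that can be fetched without a login.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect (absolute URL, link text) pairs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # URL of the <a> element we are currently inside
        self._text = []    # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    """Return every (url, text) pair found in one page of HTML."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url):
    """Download a page as text. No error handling or rate limiting here."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_depth, fetch=fetch):
    """Follow links level by level; return (depth, url, text) tuples."""
    seen = {start_url}
    found = []
    frontier = [start_url]
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for url in frontier:
            for link, text in extract_links(fetch(url), url):
                if link not in seen:  # visit each URL only once
                    seen.add(link)
                    found.append((depth, link, text))
                    next_frontier.append(link)
        frontier = next_frontier
    return found
```

Something like crawl("http://example.com/", 3) would return the links found on the first three levels. In practice you would want to filter inside the loop so that only the links you care about are followed, otherwise the number of pages fetched grows very quickly with depth.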
On 22/07/07, Sean Murphy <slug@xxxxxxxxxx> wrote:
All.
I wish to extract specific links from a web page. A part of the requirement
is to be able to drill down about three to four levels to extract the
information.
The first page will have 26 to 30 links I want to extract. The number of
links on the levels beneath the first page is a lot higher.
I only want the text, not the underlying HTML code, but if I get the URL
path I can deal with that.
Wget grabs too much information for my use. I only know the higher levels of
Perl and I am starting to learn Ruby. So my coding under Linux is not very
advanced at all.
Sean Murphy
Skype: smurf20005
Life is a challenge, treat it that way.
--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
--
There is nothing more worthy of contempt than a man who quotes himself
- Zhasper, 2004