Web Scraping for Fun and Profit

Kenytt Avery

<k.avery@computer.org>

The big three

Stop using your web browser

What to use instead

wget

lynx screenshot

                                                  Sys Admin Magazine (p1 of 16)

   [logo_cmp_black.gif]
   [Type=count&AdID=50446&FlightID=31994&TargetID=2715&Segments=1411,1462
   ,1628,3158,3448,3977,4875&Targets=1215,2715,2878&Values=34,46,51,63,77
   ,80,92,101,140,203,442,646,944,945,963,1104,1184,1388,1405,1426,1431,1
   736,1766,1785,1944,1970,2310,2352&RawValues=&random=cjRkbtk,baRdiupcea
   imw]
   [logo_sa_new.jpg] [spacer_319cce.gif]
   ____________________ Search
     [Jump to:_________]
   [spacer_999999.gif]
   [spacer_999999.gif]
   January 2005
   [Type=count&AdID=50783&FlightID=32184&TargetID=2587&Segments=1411,3035
   ,3448,4875&Targets=2587,2878&Values=34,46,51,63,77,80,92,101,140,290,4
   42,646,918,944,945,963,1184,1388,1405,1426,1431,1736,1766,1785,1944,19
   70,2310,2352&RawValues=&random=dfnKozk,baRdiupceaimr]

Feature Article

(NORMAL LINK)   Use right-arrow or <return> to activate.
  Arrow keys: Up and Down to move.  Right to follow a link; Left to go back.
 H)elp O)ptions P)rint G)o M)ain screen Q)uit /=search [delete]=history list

links screenshot

                                                      Sys Admin Magazine (1/10)

[IMG]                           _____________________           [ Search ]


 January 2005                                              Current Issue

                    Feature Article                       [IMG]
                                                          Table of contents
 Open Source Anti-Virus for the Whole Network: ClamAV     Buy this issue.
 * James Mikusi
                                                           Unix Review Spotlight
 Mikusi provides an overview of the ClamAV anti-virus
 tool, which filters any given input and outputs a        Changes to the CIW
 basic summary stating whether a virus was detected.      Associate Certificatio
                                                          Exam
                        Columns                           The CIW (Certified
                                                          Internet Webmaster)
 Checking Your Bookmarks * Randal L. Schwartz             Foundation exam has
                                                          recently upgraded to
 Questions and Answers * Amy Rich                         version 5. This
                                                          vendor-neutral exam is
 -------------------------------------------------        at the core of all the
http://www.samag.com/

lynx, links, and elinks

websnarf

$ websnarf 'http://docs.sun.com/app/docs/doc/806-2221-10/6jbf1novc?a=view'
Snarfing http://docs.sun.com/app/docs/doc/806-2221-10/6jbf1novc?a=view...to docs.sun.com: Solaris 8 Sun Hardware Platform Guide.txt
$ cat 'docs.sun.com: Solaris 8 Sun Hardware Platform Guide.txt'
http://docs.sun.com/app/docs/doc/806-2221-10/6jbf1novc?a=view

   sun.com       How To Buy  |  My Sun  |  Worldwide Sites  |  Search sun.com

   Sun Microsystems Logo  [IMG]Products and          [IMG]Support and
                          Services                   Training


   docs.sun.com - Sun Product Documentation
...
   Table 1-1 Platform Names for Sun Systems

   +------------------------------------------------------------------------+
   | System                    | Platform Name            | Platform Group  |
   |---------------------------+--------------------------+-----------------|
   | SPARCclassic              | SUNW,SPARCclassic        | sun4m           |
   |---------------------------+--------------------------+-----------------|
   | SPARCstation LX           | SUNW,SPARCstation-LX     | sun4m           |
   |---------------------------+--------------------------+-----------------|
   | SPARCstation LX+          | SUNW,SPARCstation-LX+    | sun4m           |
   |---------------------------+--------------------------+-----------------|
   | SPARCstation 4            | SUNW,SPARCstation-4      | sun4m           |
   |---------------------------+--------------------------+-----------------|
   | SPARCstation 5            | SUNW,SPARCstation-5      | sun4m           |
   |---------------------------+--------------------------+-----------------|

How did we do it?

#!/bin/sh
# websnarf: save each page as a plain-text file named after its <title>.

if [ $# -lt 1 ]; then
        echo "Usage: `basename $0` url..." 1>&2
        exit 1
fi

for url in "$@"; do
        echo -n "Snarfing $url..." 1>&2
        title=`html_title "$url"`             # Perl helper shown on the next slide
        echo "to $title.txt" 1>&2
        echo "$url" > "$title.txt"            # first line of the file: the URL
        echo >> "$title.txt"
        elinks -dump "$url" >> "$title.txt"   # rendered text of the page
done

Jumping ahead a bit...

#!/usr/bin/perl -w
use strict;
use warnings;

use File::Basename;
use LWP::Simple;
use HTML::TokeParser;

my $program = basename $0;
my $url = shift or die "Usage: $program url\n";
my $html = get($url) or die "Can't retrieve $url\n";
my $p = HTML::TokeParser->new(\$html);

# Print the page's <title> text, or fall back to the URL itself.
if ($p->get_tag("title")) {
        print $p->get_trimmed_text, "\n";
} else {
        print $url, "\n";
}

Interacting directly with the server

$ telnet www.freebsd.org 80
Trying 216.136.204.117...
Connected to www.freebsd.org (216.136.204.117).
Escape character is '^]'.
GET /
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="generator" content="HTML Tidy, see www.w3.org" />
    <meta http-equiv="Content-Type"
    content="text/html; charset=iso-8859-1" />

    <title>The FreeBSD Project</title>
    <meta name="description" content="The FreeBSD Project" />
    <meta name="keywords"
    content="FreeBSD, BSD, UNIX, Support, Gallery, Release, Application, Softwar
e, Handbook, FAQ, Tutorials, Bugs, CVS, CVSup, News, Commercial Vendors, homepage, CTM, Unix" />
...

Elementary HTTP

Why do versions matter?

$ telnet www.uuasc.org 80
Trying 216.237.5.34...
Connected to compata.com (216.237.5.34).
Escape character is '^]'.
GET /
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <title>Compata - Advanced Computer Applications</title>

$ telnet www.uuasc.org 80
Trying 216.237.5.34...
Connected to compata.com (216.237.5.34).
Escape character is '^]'.
GET / HTTP/1.0
Host: www.uuasc.org

HTTP/1.1 200 OK
Date: Sun, 09 Jan 2005 22:37:55 GMT
Server: Apache/2.0.51 (Fedora)
Last-Modified: Wed, 15 Dec 2004 06:49:44 GMT
ETag: "b04563-1181-f14eb200"
Accept-Ranges: bytes
Content-Length: 4481
Connection: close
Content-Type: text/html; charset=ISO-8859-1
Content-Language: en

<!doctype HTML public "-//IETF//DTD HTML//EN">
<HTML>
<head>
<title>UNIX Users Association of Southern California
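
This is why versions matter: www.uuasc.org and compata.com share the same IP
address, so the server relies on the Host header to decide which virtual host
to serve. The bare "GET /" carries no headers at all and gets the default
site (Compata); the HTTP/1.0 request with a Host header gets UUASC. The same
request can be made from a script. Below is a minimal sketch using
IO::Socket::INET from core Perl (the host and path are the ones from the
transcripts above):

#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::INET;

# Sketch: speak HTTP directly, sending the Host header that
# name-based virtual hosting needs.
my $host = 'www.uuasc.org';
my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
        Proto    => 'tcp',
) or die "Can't connect to $host: $!\n";

# HTTP wants CRLF line endings and a blank line to end the headers.
print $sock "GET / HTTP/1.0\r\n",
            "Host: $host\r\n",
            "Connection: close\r\n",
            "\r\n";

print while <$sock>;    # status line, headers, and body
close $sock;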

Proxy servers in a nutshell

LWP + HTTP::Daemon = Simple Proxy Server

$ ./bin/dump-proxy
listening on http://vanadium.sabren.com:42513/
$VAR1 = bless( {
                 '_protocol' => 'HTTP/1.1',
                 '_content' => '',
                 '_uri' => bless( do{\(my $o = 'http://www.uuasc.org/')}, 'URI::http' ),
                 '_headers' => bless( {
                                        'proxy-connection' => 'keep-alive',
                                        'accept-charset' => 'ISO-8859-1,utf-8;q=0.7,*;q=0.7',
                                        'user-agent' => 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0',
                                        'keep-alive' => '300',
                                        'accept' => 'text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
                                        'accept-language' => 'en-us,en;q=0.5',
                                        'accept-encoding' => 'gzip,deflate',
                                        'host' => 'www.uuasc.org'
                                      }, 'HTTP::Headers' ),
                 '_method' => 'GET'
               }, 'HTTP::Request' );
$VAR1 = bless( {
                 '_protocol' => 'HTTP/1.1',
                 '_content' => '<!doctype HTML public "-//IETF//DTD HTML//EN">
<HTML>
<head>
<title>UNIX Users Association of Southern California</title>
</head>
...
                 '_rc' => '200',
                 '_headers' => bless( {
                                        'client-date' => 'Sun, 09 Jan 2005 23:14:59 GMT',
                                        'etag' => '"b04563-1181-f14eb200"',
                                        'content-type' => 'text/html; charset=ISO-8859-1',
                                        'connection' => 'close',
                                        'client-response-num' => 1,
                                        'last-modified' => 'Wed, 15 Dec 2004 06:49:44 GMT',
                                        'content-language' => 'en',
                                        'accept-ranges' => 'bytes',
                                        'date' => 'Sun, 09 Jan 2005 23:14:59 GMT',
                                        'title' => 'UNIX Users Association of Southern California',
                                        'client-peer' => '216.237.5.34:80',
                                        'content-length' => '4481',
                                        'server' => 'Apache/2.0.51 (Fedora)'
                                      }, 'HTTP::Headers' ),
                 '_msg' => 'OK',

How did we do it?

#!/usr/bin/perl -w

use strict;
use Socket;
use HTTP::Daemon;
use LWP::UserAgent;
use Data::Dumper;

$| = 1;    # unbuffered output, so the dumps appear as requests arrive

my $port = $ARGV[0] || 0;    # port 0 lets the OS pick an ephemeral port
my $daemon = HTTP::Daemon->new(LocalPort => $port, Reuse => 1);
my $agent = LWP::UserAgent->new;

warn "listening on @{[$daemon->url]}\n";

my $conn;
while ($conn = $daemon->accept) {
        my $request = $conn->get_request;             # request from the browser
        print Dumper($request);
        my $response = $agent->request($request);     # forward it to the real server
        print Dumper($response);
        $conn->send_response($response);              # relay the answer back
}

Checking to see whether a page has changed

Note some of the response headers we've seen:
HTTP/1.1 200 OK
Date: Sun, 09 Jan 2005 22:37:55 GMT
Server: Apache/2.0.51 (Fedora)
Last-Modified: Wed, 15 Dec 2004 06:49:44 GMT
ETag: "b04563-1181-f14eb200"
Accept-Ranges: bytes
Content-Length: 4481
Connection: close
Content-Type: text/html; charset=ISO-8859-1
Content-Language: en
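
Last-Modified and ETag are validators: send them back on the next request as
If-Modified-Since and If-None-Match, and the server can reply "304 Not
Modified" instead of resending the whole page. A minimal sketch with
LWP::UserAgent (the URL and validator values are the ones from the UUASC
response above):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Conditional GET: reuse the validators from a previous response.
my $url  = 'http://www.uuasc.org/';
my $etag = '"b04563-1181-f14eb200"';            # ETag from last time
my $last = 'Wed, 15 Dec 2004 06:49:44 GMT';     # Last-Modified from last time

my $ua = LWP::UserAgent->new;
my $response = $ua->get($url,
        'If-None-Match'     => $etag,
        'If-Modified-Since' => $last,
);

if ($response->code == 304) {
        print "Not modified; keep the copy we already have\n";
} elsif ($response->is_success) {
        print "Page changed; new ETag: ",
              $response->header('ETag') || '(none)', "\n";
} else {
        die "Error fetching $url: ", $response->status_line, "\n";
}

(LWP::Simple's mirror() function does the If-Modified-Since half of this
automatically when saving a page to a file.)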

Regexes and HTML

Summarizing Slashdot

http://it.slashdot.org/article.pl?sid=05/01/09/2220220&light=1&threshold=5&mode=flat
$page =~ m{</P><H2>(.*?)</H2>.*?<FONT SIZE="2"><B>(.*?)</B></FONT><BR>\s*(.*?)<P>\s*<script LANGUAGE="JAVASCRIPT"}ms;
print "$1\n\n"; # title
print "$2\n\n"; # dept

my $story = $3;
$story =~ s/^\s*//;
$story =~ s/\s*$//;
$story =~ s/<[^>]*>//gs;
print "$story\n\n";

while ($page =~ m{\)</FONT>\s*<P>(.*?)</TD>}msg) {
...
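
The snippet above assumes the whole page has been slurped into $page. A
minimal sketch of the surrounding scaffolding (the file handling only; the
patterns are the ones shown above):

#!/usr/bin/perl
use strict;
use warnings;

# Slurp the saved Slashdot page into one string so the multi-line
# regexes can match across line boundaries.
my $file = shift or die "Usage: summarize file.html\n";
open my $fh, '<', $file or die "Can't open $file: $!\n";
my $page = do { local $/; <$fh> };
close $fh;

# ... apply the m{...}ms patterns above to $page ...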

$ ./bin/summarize slashdot-article.html | fmt
Classic Gerald Weinberg Essay Reprinted

from the talking-to-fatso dept.

danielread writes "Programmer abuse has been a popular topic recently,
especially within the gaming industry. However, excessive overtime and
overwork are not new problems for software professionals. Twenty years ago,
acclaimed author Gerald Weinberg wrote an essay called 'Personal Chemistry
and the Healthy Body,' which is as relevant for programmers today as it was
two decades ago. Given this topic's recent resurgence, Mr. Weinberg was
generous enough to let developer.* Magazine reprint this classic essay."

------------------------------

I read the essay, but I couldn't find the passage where it talks about how
essential caffeine is to programming. I think I'm going to have to go back
and look harder...
...

Parsing HTML

HTML::TokeParser

use HTML::TokeParser;

my $parser = HTML::TokeParser->new($FILENAME)
	or die "Can't open $FILENAME: $!\n";
while (my $token = $parser->get_token( )) {
	my $type = $token->[0];
	if    ($type eq 'S')  { ... }   # start tag
	elsif ($type eq 'E')  { ... }   # end tag
	elsif ($type eq 'T')  { ... }   # text
	elsif ($type eq 'C')  { ... }   # comment
	elsif ($type eq 'D')  { ... }   # declaration
	elsif ($type eq 'PI') { ... }   # processing instruction
	else { die "$type isn't a valid HTML token type" }
}
from Perl Cookbook, Second Edition, O'Reilly and Associates, 2003.

HTML::Parser

#!/usr/bin/perl
use strict;
use warnings;

use File::Basename;
use LWP::Simple;
use HTML::Parser;

my $program = basename $0;
my $url = shift or die "Usage: $program url\n";
my $html = get($url) or die "Can't retrieve $url\n";

my $found_title = 0;

package TitleParser;
use base 'HTML::Parser';    # a no-argument new() delivers parse events
                            # to the start() and text() methods below

my $p = TitleParser->new;
$p->parse($html);
$p->eof;
print "$url\n";    # only reached if no <title> was found

sub start {
	my ($self, $tag, $attr) = @_;

	if ($tag eq 'title') {
		$found_title = 1;
	}
}

sub text {
	my ($self, $text) = @_;

	if ($found_title) {
		print "$text\n";
		exit 0;
	}
}

HTML::TreeBuilder

#!/usr/bin/perl -w
use strict;
use warnings;

use File::Basename;
use LWP::Simple;
use HTML::TreeBuilder;

my $program = basename $0;
my $url = shift or die "Usage: $program url\n";
my $html = get($url) or die "Can't retrieve $url\n";
my $root = HTML::TreeBuilder->new;

$root->parse($html);
$root->eof;

my $title = $root->look_down(_tag => 'title');
if ($title) {
	print $title->as_text;
} else {
	print $url;
}
print "\n";

HTML::TreeBuilder (cont'd)

#!/usr/bin/perl -w
use strict;
use HTML::TreeBuilder 3;  # make sure our version isn't ancient
my $root = HTML::TreeBuilder->new;
$root->parse(  # parse a string...
q{
   <ul>
     <li>Ice cream.</li>
     <li>Whipped cream.
     <li>Hot apple pie <br>(mmm pie)</li>
   </ul>
});
$root->eof( );  # done parsing for this tree
$root->dump;   # print( ) a representation of the tree
$root->delete; # erase this tree because we're done with it

<html> @0 (IMPLICIT)
  <head> @0.0 (IMPLICIT)
  <body> @0.1 (IMPLICIT)
    <ul> @0.1.0
      <li> @0.1.0.0
        "Ice cream."
      <li> @0.1.0.1
        "Whipped cream. "
      <li> @0.1.0.2
        "Hot apple pie "
        <br> @0.1.0.2.1
        "(mmm pie)"
from Perl & LWP, O'Reilly and Associates, 2002.

XHTML and XPath

  • XHTML is HTML reconstituted as XML: every element is properly closed, e.g.,
    <img src="foo.gif" />
    instead of
    <img src="foo.gif">
  • XPath is a language for addressing parts of an XML document
  • Example: //td/i selects the two <i> elements (foo and baz) in the table
    below (see the sketch after this list):
    <table border="1">
    	<tr>
    		<td><i>foo</i></td>
    	</tr>
    	<tr>
    		<td>bar</td>
    	</tr>
    	<tr>
    		<td><i>baz</i></td>
    	</tr>
    </table>
  • See http://www.zvon.org/xxl/XPathTutorial/General/examples.html for a tutorial
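
Since XPath operates on well-formed XML, it pairs naturally with XHTML. As a
sketch only, here is how the selection above might be done with XML::XPath
(one CPAN XPath module among several; the deck doesn't prescribe one):

#!/usr/bin/perl
use strict;
use warnings;
use XML::XPath;

# Sketch: select the italicized cells from the example table.
my $xhtml = <<'END';
<table border="1">
  <tr><td><i>foo</i></td></tr>
  <tr><td>bar</td></tr>
  <tr><td><i>baz</i></td></tr>
</table>
END

my $xp = XML::XPath->new(xml => $xhtml);
for my $node ($xp->findnodes('//td/i')) {
        print $node->string_value, "\n";    # prints "foo" and "baz"
}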

RSS and Syndication

  • An XML format for syndication
    syndicate, v. To sell (a comic strip or column, for example) through a syndicate for simultaneous publication in newspapers or periodicals
  • Originally developed for use in portals (my.netscape.com, My Yahoo!, etc.)
  • Consists of a list of headlines and hyperlinks, with optional descriptions and metadata (e.g., last update time)
  • Sometimes used to republish content (e.g., recent headlines), but usually accessed from a specialized client
<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version="2.0">
	<channel>
		<title>MacNN | The Macintosh News Network: Linux/Unix</title>
		<link>http://www.macnn.com/</link>
		<description>MacNN is the leading source for news about Apple and the Mac industry. It offers news, reviews, discussion, tips, troubleshooting, links, and reviews every day. The best place for Mac News. Period.</description>
		<language>en-us</language>
		<lastBuildDate>Mon, 10 Jan 2005 00:55:02 -0500</lastBuildDate>
		<image>
			<title>The Macintosh News Network</title>
			<url>http://www4.macnn.com/macnn/MacNN_120x50_BW_w_DS.gif</url>
			<link>http://www.macnn.com</link>
		</image>
		<item>
			<title>Portlock now supports Yellow Dog Linux on PowerPC</title>
			<link>http://www.macnn.com/news/26877</link>
			<description>Portlock today announced the latest release of Portlock Storage Manager, which adds support for Yell...</description>
			<pubDate>Mon,  8 Nov 2004 09:25:00 -0500</pubDate>
		</item>
		<item>
			<title>Sun reclaims Apple exec for Solaris marketing</title>
			<link>http://www.macnn.com/news/26832</link>
			<description>Sun Microsystems has hired a new vice president of marketing for its Solaris operating system, lurin...</description>
			<pubDate>Tue,  2 Nov 2004 18:50:00 -0500</pubDate>
		</item>

RSS Newsreader screenshots

Straw for GNOME

http://www.nongnu.org/straw/

RSS Newsreader screenshots (cont'd)

Raggle

http://www.raggle.org/

RSS Newsreader screenshots (cont'd)

Mozilla Firefox "Live Bookmarks"

http://www.mozilla.org/products/firefox/

RSS Newsreader screenshots (cont'd)

And Thunderbird

http://www.mozilla.org/products/thunderbird/

RSS 0.91/0.92

  • "Rich Site Summary"
  • Netscape and Dave Winer, Userland
<rss version="0.91">
  <channel>
    <title>XML.com</title>
    <link>http://www.xml.com/</link>
    <description>XML.com features a rich mix of information and services for the XML community.</description>
    <language>en-us</language>
    <item>
      <title>Normalizing XML, Part 2</title>
      <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
      <description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
    </item>

RSS 1.0

  • "RDF Site Summary"
  • Rael Dornfest, O'Reilly and Associates
  • RSS-DEV Working Group
  • Based on W3C's Resource Description Framework
    http://www.w3.org/RDF/
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.org/rss/1.0/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
>
  <channel rdf:about="http://www.xml.com/cs/xml/query/q/19">
    <title>XML.com</title>
    <link>http://www.xml.com/</link>
    <description>XML.com features a rich mix of information and services for the XML community.</description>
    <language>en-us</language>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/normalizing.html"/>
        <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/som.html"/>
        <rdf:li rdf:resource="http://www.xml.com/pub/a/2002/12/04/svg.html"/>
      </rdf:Seq>
    </items>
  </channel>
  <item rdf:about="http://www.xml.com/pub/a/2002/12/04/normalizing.html">
    <title>Normalizing XML, Part 2</title>
    <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
    <description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
    <dc:creator>Will Provost</dc:creator>
    <dc:date>2002-12-04</dc:date>    
  </item>

RSS 2.0

  • "Really Simple Syndication"
  • Dave Winer, Scripting News
  • RSS 0.92 with enhancements
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>XML.com</title>
    <link>http://www.xml.com/</link>
    <description>XML.com features a rich mix of information and services for the XML community.</description>
    <language>en-us</language>
    <item>
      <title>Normalizing XML, Part 2</title>
      <link>http://www.xml.com/pub/a/2002/12/04/normalizing.html</link>
      <description>In this second and final look at applying relational normalization techniques to W3C XML Schema data modeling, Will Provost discusses when not to normalize, the scope of uniqueness and the fourth and fifth normal forms.</description>
      <dc:creator>Will Provost</dc:creator>
      <dc:date>2002-12-04</dc:date>    
    </item>
Examples from "What is RSS?" by Mark Pilgrim, http://www.xml.com/pub/a/2002/12/18/dive-into-xml.html

Producing RSS

# create an RSS 0.91 file
use XML::RSS;
my $rss = new XML::RSS (version => '0.91');
$rss->channel(title          => 'freshmeat.net',
	link           => 'http://freshmeat.net',
	language       => 'en',
	description    => 'the one-stop-shop for all your Linux software needs',
	rating         => '(PICS-1.1 "http://www.classify.org/safesurf/" 1 r (SS~~000 1))',
	copyright      => 'Copyright 1999, Freshmeat.net',
	pubDate        => 'Thu, 23 Aug 1999 07:00:00 GMT',
	lastBuildDate  => 'Thu, 23 Aug 1999 16:20:26 GMT',
	docs           => 'http://www.blahblah.org/fm.cdf',
	managingEditor => 'scoop@freshmeat.net',
	webMaster      => 'scoop@freshmeat.net'
	);

$rss->image(title       => 'freshmeat.net',
	url         => 'http://freshmeat.net/images/fm.mini.jpg',
	link        => 'http://freshmeat.net',
	width       => 88,
	height      => 31,
	description => 'This is the Freshmeat image stupid'
	);

$rss->add_item(title => "GTKeyboard 0.85",
	link  => "http://freshmeat.net/news/1999/06/21/930003829.html",
	description => 'blah blah'
	);

$rss->skipHours(hour => 2);
$rss->skipDays(day => 1);

$rss->textinput(title => "quick finder",
	description => "Use the text input below to search freshmeat",
	name  => "query",
	link  => "http://core.freshmeat.net/search.php3"
	);
# print the RSS as a string
print $rss->as_string;

# or save it to a file
$rss->save("fm.rdf");

Consuming RSS


#!/usr/bin/env python
import sys, feedparser

def main():
	for url in sys.argv[1:]:
		feed = feedparser.parse(url)
		for entry in feed['entries']:
			print entry['title'], ' (', entry['link'], ')'

if __name__ == '__main__':
	main()

$ ./bin/rssdump http://xml.metafilter.com/rss.xml
Torture Tapes ( http://www.metafilter.com/mefi/38493 )
If SEPTA is still around in six months... ( http://www.metafilter.com/mefi/38492 )
We Don't Need No Stinking Drummer! ( http://www.metafilter.com/mefi/38491 )
music ( http://www.metafilter.com/mefi/38490 )
Al Hartley ( http://www.metafilter.com/mefi/38489 )
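
The same job can be done from Perl. A minimal sketch using LWP::Simple and
XML::RSS (the module used for producing feeds earlier); the output format
mirrors the Python example above:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use XML::RSS;

# Sketch: fetch each RSS feed and print "title ( link )" for every item.
foreach my $url (@ARGV) {
        my $content = get($url);
        unless (defined $content) {
                warn "Can't retrieve $url\n";
                next;
        }
        my $rss = XML::RSS->new;
        $rss->parse($content);
        foreach my $item (@{ $rss->{items} }) {
                print "$item->{title} ( $item->{link} )\n";
        }
}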

Extracting RSS

Template::Extract

http://blog.simon-cozens.org/bryar.cgi/id_6522

print "Content-type: text/xml\n\n";
my $x = Template::Extract->new();
my %params;

path_info() =~ /(\w+)/ or die "No file name given!";
open IN, "rss/$1" or die "Can't open $file: $!";
while (<IN>) { /(\w+): (.*)/ and $params{$1} = $2; last if !/\S/; }

my $template = do {local $/; <IN>;};
$rss = new XML::RSS;
$rss->channel( title => $params{title}, link => $params{link},
		description => $params{description} );
my $doc = join "\n", grep { /\S/ } split /\n/, get($params{link});
$doc =~ s/\r//g;
$doc =~ s/^\s+//g;
for (@{$x->extract($template, $doc)->{records}}) {
	$rss->add_item(
		title => $_->{title},
		link => $_->{url},
		description => $_->{content}
	);
}
print $rss->as_string;

[% FOR records %]
<!--START OF ABSTRACT OF NEWSITEM-->
[% ... %]
<a href="[% url %]"><acronym title="Click here to read this article">
[% title %]</acronym></a> ([% date %]) <BR>
[% ... %]<font size="2">[% content %]</font></font></div>
[% ... %]