ablog

Notes from a clumsy, restless engineer

Trying out HTML::AutoPagerize

I tried out HTML::AutoPagerize.
The code is copied almost verbatim from HTML::AutoPagerize - Utility to load AutoPagerize SITEINFO stuff - metacpan.org.

#!/usr/bin/perl
use strict;
use warnings;

use Data::Dumper;
use HTML::AutoPagerize;
use LWP::Simple ();

my $autopager = HTML::AutoPagerize->new;
$autopager->add_site(
	url         => 'http://.+\.tumblr\.com/', # regular expression matched against the page URL
	nextLink    => '//div[@id="content" or @id="container"]/div[last()]/a[last()]',
	# "and" rather than "or": with "or" the predicate is true for every div that has a class
	pageElement => '//div[@id="content" or @id="container"]/div[@class!="footer" and @class!="navigation"]',
);

my $uri  = 'http://otsune.tumblr.com/';
my $html = LWP::Simple::get($uri);
defined $html or die "Failed to fetch $uri";

my $res = $autopager->handle($uri, $html);
if ($res) {
	my $next_link = $res->{next_link};    # URI object
	my $content   = $res->{page_element}; # XML::XPathEngine::NodeSet object. may be empty
	print Dumper $next_link;
}
  • Output
$ ./autopagerize_tumblr.pl
$VAR1 = bless( do{\(my $o = 'http://otsune.tumblr.com/page/2')}, 'URI::http' );
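
The `url` field of an AutoPagerize SITEINFO entry is a regular expression, so a dot matches any character unless it is escaped. The following is a minimal sketch of that difference using only core Perl; the helper `matches` and the second test URL are made up for illustration and are not part of HTML::AutoPagerize.

```perl
use strict;
use warnings;

# Hypothetical helper: does $url match the SITEINFO-style pattern $pattern?
sub matches {
    my ($pattern, $url) = @_;
    return $url =~ /$pattern/ ? 1 : 0;
}

my $loose  = 'http://.+.tumblr.com/';     # dots left unescaped: each "." matches any character
my $strict = '^http://.+\.tumblr\.com/';  # dots escaped, pattern anchored at the start

for my $url ('http://otsune.tumblr.com/', 'http://example.org/xtumblr-com/') {
    printf "%-35s loose=%d strict=%d\n",
        $url, matches($loose, $url), matches($strict, $url);
}
```

Both patterns accept the tumblr URL, but only the loose one also accepts the unrelated second URL. For a single known site this rarely matters; it starts to matter once many SITEINFO entries are loaded and the first matching entry wins.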

Trying out WWW::Mechanize::AutoPager

I tried out WWW::Mechanize::AutoPager.
The code is copied almost verbatim from WWW::Mechanize::AutoPager - Automatic Pagination using AutoPagerize - metacpan.org.

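#!/usr/bin/perl
use strict;
use warnings;

use WWW::Mechanize;
use WWW::Mechanize::AutoPager;
use Data::Dumper;

my $mech = WWW::Mechanize->new; # autocheck is on by default, so a failed get() dies
$mech->autopager->add_site(
	url         => 'http://.+\.tumblr\.com/', # regular expression matched against the page URL
	nextLink    => '//div[@id="content" or @id="container"]/div[last()]/a[last()]',
	# "and" rather than "or": with "or" the predicate is true for every div that has a class
	pageElement => '//div[@id="content" or @id="container"]/div[@class!="footer" and @class!="navigation"]',
);

$mech->get('http://otsune.tumblr.com/');

# Follow next links until the current page has none.
# ($@ is only set by eval, so it cannot be used to detect a failed get() here.)
while (defined(my $next_link = $mech->next_link)) {
	print Dumper $next_link;
	$mech->get($next_link);
}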
  • Output
$ ./autopager_tumblr.pl
$VAR1 = bless( do{\(my $o = 'http://otsune.tumblr.com/page/2')}, 'URI::http' );
$VAR1 = bless( do{\(my $o = 'http://otsune.tumblr.com/page/3')}, 'URI::http' );
$VAR1 = bless( do{\(my $o = 'http://otsune.tumblr.com/page/4')}, 'URI::http' );

...

$VAR1 = bless( do{\(my $o = 'http://otsune.tumblr.com/page/86')}, 'URI::http' );
$VAR1 = bless( do{\(my $o = 'http://otsune.tumblr.com/page/87')}, 'URI::http' );
$VAR1 = bless( do{\(my $o = 'http://otsune.tumblr.com/page/88')}, 'URI::http' );
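
A pagination loop like this has to stop as soon as the current page has no next link, and it must not request an undefined URL. The following is a minimal sketch of that loop shape in core Perl only; the `%next_of` hash is a hypothetical stand-in for the site, and `walk_pages` is an illustrative name, not part of WWW::Mechanize::AutoPager.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the site: each page maps to its next page,
# or to undef on the last page.
my %next_of = (
    '/'       => '/page/2',
    '/page/2' => '/page/3',
    '/page/3' => undef,
);

# Follow next links from $start and return every page visited, in order.
sub walk_pages {
    my ($start) = @_;
    my @visited = ($start);
    my $current = $start;
    while (defined(my $next = $next_of{$current})) {
        push @visited, $next;   # in a real script: $mech->get($next)
        $current = $next;
    }
    return @visited;
}

print join(' -> ', walk_pages('/')), "\n";
```

Checking for the next link before fetching means a site with a single page is handled without a special case.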

Extracting text from HTML into a CSV file with WWW::Mechanize::AutoPager + Web::Scraper

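What I wanted to do

  • Log in to a web application.
  • Go to the search page.
  • Press the search button.
  • From the list of search results, open each detail page, extract its text, and write it to a CSV file.
  • Go to the next page and repeat.
  • Stop when there is no next page.

Source code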

  • foo_scraper.pl
#!/usr/bin/env perl
use strict;
use warnings;
use WWW::Mechanize;
use WWW::Mechanize::AutoPager;
use Web::Scraper;
use Text::CSV;
use utf8;
use Encode;

my $list_scraper = scraper { process '/html/body/div/div[3]/table/tbody/tr/td[3]/a', 'link[]' => '@href'; };
my $detail_scraper = scraper {
	process '/html/body/div/div/form/fieldset/div[1]/span', '1' => 'TEXT';
	process '/html/body/div/div/form/fieldset/div[2]/span', '2' => 'TEXT';
	process '/html/body/div/div/form/fieldset/div[3]/span', '3' => 'TEXT';
	process '/html/body/div/div/form/fieldset/div[4]/span', '4' => 'TEXT';
	process '/html/body/div/div/form/fieldset/div[5]/span', '5' => 'TEXT';
	process '/html/body/div/div/form/fieldset/div[6]/span', '6' => 'TEXT';
	process '/html/body/div/div/form/fieldset/div[7]/span', '7' => 'TEXT';
	process '/html/body/div/div/form/fieldset/div[8]/span', '8' => 'TEXT';
	process '/html/body/div/div/form/fieldset/div[9]/span', '9' => 'TEXT';
	process '/html/body/div/div/form/fieldset/div[10]/span', '10' => 'TEXT';
	process '/html/body/div/div/form/fieldset/div[11]/span', '11' => 'TEXT';
	process '/html/body/div/div/form/fieldset/div[12]/span', '12' => 'TEXT';
	process '/html/body/div/div/form/fieldset/div[13]/span', '13' => 'TEXT';
	process '/html/body/div/div/form/fieldset/div[14]/span', '14' => 'TEXT';
	process '/html/body/div/div/form/fieldset/div[15]/span', '15' => 'TEXT';
	process '/html/body/div/div/form/fieldset/div[16]/span', '16' => 'TEXT';
	process '/html/body/div/div/form/fieldset/div[17]/span', '17' => 'TEXT';
	process '/html/body/div/div/form/fieldset/div[18]/span', '18' => 'TEXT';
	process '/html/body/div/div/form/fieldset/div[19]/span', '19' => 'TEXT';
};

my $mech = WWW::Mechanize->new;            # autocheck is on by default: failed requests die
my $csv  = Text::CSV->new({ binary => 1 }) # binary => 1 so non-ASCII text is accepted
	or die 'Cannot create Text::CSV object: ' . Text::CSV->error_diag;
$mech->autopager->add_site(
	url      => 'http://foo\.db\.ablog\.co\.jp/',
	nextLink => '//a[@title="Go to next page"]',
);
$mech->get('http://foo.db.ablog.co.jp/foo-bar/login');
$mech->submit_form(fields => { name => 'foo', password => 'foo' }); # login
$mech->get('http://foo.db.ablog.co.jp/foo-bar/contract');           # move to list page
$mech->submit_form();                                               # submit search button

my @url_list;
while (1) {
	my $list = $list_scraper->scrape($mech->content); # scrape list page
	push @url_list, @{ $list->{link} || [] };         # add urls to array (no links => nothing added)
	my $next_link = $mech->next_link;
	last unless defined $next_link;                   # quit after scraping the last page
	$mech->get($next_link);                           # move to next page
}

open(my $out_file, '>:encoding(UTF-8)', 'list.csv') or die "Cannot open list.csv: $!";
foreach my $path (@url_list) {
	$mech->get("http://foo.db.ablog.co.jp/foo-bar/$path"); # move to detail page
	my $detail = $detail_scraper->scrape($mech->content);  # scrape detail page
	$csv->combine(@{$detail}{1 .. 19})                     # fields 1..19 in fixed column order
		or die 'Cannot build CSV row: ' . $csv->error_input;
	print {$out_file} $csv->string, "\n";                  # write a line to file
}
close($out_file) or die "Cannot close list.csv: $!";

__END__

It took only about 60 lines.
CPAN modules are great! Thank you, CPAN authors!
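
When a CSV row is built from a hash keyed by field number, two details decide whether the columns line up: the keys need a numeric sort, and a field missing from the scrape result must not shift the fields after it. The following is a minimal sketch in core Perl; the `%detail` hash and its values are made-up sample data, not output from the script.

```perl
use strict;
use warnings;

# Hypothetical scrape result: field 2 was not found on the page, so its key is absent.
my %detail = (1 => 'alpha', 3 => 'gamma', 10 => 'kappa');

my @by_string = sort keys %detail;               # string order:  1, 10, 3
my @by_number = sort { $a <=> $b } keys %detail; # numeric order: 1, 3, 10

# Slicing by the keys that happen to exist shifts columns when a field is missing ...
my @shifted = @detail{@by_number};
# ... while slicing by a fixed column list keeps positions stable (missing => empty).
my @stable = map { defined $_ ? $_ : '' } @detail{1 .. 3, 10};

print "string sort : @by_string\n";
print "numeric sort: @by_number\n";
print "shifted row : ", join(',', @shifted), "\n";
print "stable row  : ", join(',', @stable), "\n";
```

In the shifted row 'gamma' lands in the second column; in the stable row the second column stays empty and 'gamma' remains in the third.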

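The unit of the TIME+ column in top is seconds

The TIME+ column in top is measured in seconds: it is displayed as minutes:seconds.hundredths.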

$ top
Tasks: 197 total,   5 running, 192 sleeping,   0 stopped,   0 zombie
Cpu(s): 98.3%us,  1.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:   2075600k total,  2021828k used,    53772k free,     7464k buffers
Swap:  2096472k total,   294108k used,  1802364k free,  1758712k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                   
 8091 oracle    25   0 2176m 217m 185m R 99.2 10.7 739:34.52 oracle  
...

739:34.52 means 739 minutes and 34.52 seconds.


Addendum (2016/09/29):
TIME+ is CPU time.

38. TIME+ -- CPU Time, hundredths
The same as TIME, but reflecting more granularity through
hundredths of a second.

top(1) - Linux manual page
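
To make the reading concrete, here is a minimal sketch that converts a TIME+ value into seconds of CPU time. It assumes the minutes:seconds.hundredths form shown above and rejects anything else; `time_plus_to_seconds` is an illustrative name, not something provided by top.

```perl
use strict;
use warnings;

# Convert a TIME+ value of the form minutes:seconds.hundredths into seconds.
sub time_plus_to_seconds {
    my ($time_plus) = @_;
    my ($min, $sec) = $time_plus =~ /\A(\d+):(\d{2}\.\d{2})\z/
        or die "unexpected TIME+ format: $time_plus";
    return $min * 60 + $sec;
}

my $seconds = time_plus_to_seconds('739:34.52');
printf "%.2f seconds (about %.1f hours)\n", $seconds, $seconds / 3600;
```

For the oracle process above this gives 739 * 60 + 34.52 = 44374.52 seconds, a little over 12 hours of CPU time.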