python - Specifying rules in the Linkextractors in scrapy -
Is there a format for specifying rules in a LinkExtractor in Scrapy? I have read the documentation, but it is not clear to me. In my case, the URL values keep increasing after the first page (something like &pg=2, and so on). For illustration, see below:
start_urls = ['http://www.examples.com']

rules = [
    Rule(LinkExtractor(allow=['www.examples.com/sports/companies?searchterm=news+sports&pg=2']),
         'parse_torrent'),
]
Please let me know if there is a right way to specify the URL in the rules so that I can scrape page 1, page 2, ... page 100.
Thanks.
If you want to extract the links starting from a page (in your case, http://www.examples.com), you should create a spider that inherits from CrawlSpider and use the following regular expression:
rules = (
    Rule(LinkExtractor(allow=[r'www\.examples\.com/sports/companies\?searchterm=news\+sports&pg=\d+']),
         callback='parse_torrent'),
)

Note that ., ? and + are escaped here because they are regex metacharacters, and that callback is an argument of Rule, not of LinkExtractor.
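For reference, a minimal runnable sketch of such a CrawlSpider might look like this (the import paths are the Scrapy 1.x ones; the spider name, allowed_domains, and the body of parse_torrent are placeholders, not part of the original answer):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(CrawlSpider):
    name = 'examples'  # placeholder name
    allowed_domains = ['www.examples.com']
    start_urls = ['http://www.examples.com']

    # The regex metacharacters in the literal URL (., ? and +)
    # are escaped so the pattern matches the real link text.
    rules = (
        Rule(LinkExtractor(allow=[r'www\.examples\.com/sports/companies\?searchterm=news\+sports&pg=\d+']),
             callback='parse_torrent',
             follow=True),
    )

    def parse_torrent(self, response):
        # placeholder callback: extract your items from the page here
        self.log('visited %s' % response.url)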
But since you already seem to know the URL pattern, I suggest generating the URLs yourself:
from scrapy.http.request import Request

def start_requests(self):
    # xrange's upper bound is exclusive, so 101 covers pages 1..100
    for i in xrange(1, 101):
        url = 'http://www.examples.com/sports/companies?searchterm=news+sports&pg=' + str(i)
        yield Request(url=url, callback=self.parse_torrent)
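Put together, a complete spider using this second approach could look like the sketch below (again, the spider name and the parse_torrent body are assumptions). When start_requests is defined, Scrapy calls it instead of generating requests from start_urls, so the pagination is fully under your control:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'examples'  # placeholder name

    def start_requests(self):
        for page in xrange(1, 101):  # Python 2; use range() on Python 3
            url = ('http://www.examples.com/sports/companies'
                   '?searchterm=news+sports&pg=%d' % page)
            yield scrapy.Request(url=url, callback=self.parse_torrent)

    def parse_torrent(self, response):
        # placeholder callback: extract your items from the page here
        self.log('visited %s' % response.url)

Generating the requests up front avoids depending on the site's "next page" links, at the cost of hard-coding the page range.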
python python-2.7 web-scraping scrapy