python - Specifying rules in the Linkextractors in scrapy -
Is there a format for specifying rules in a LinkExtractor in Scrapy? I have read the documentation, but it is not clear to me. In my case, the URL values keep increasing after the first page (something like &pg=2, and so on). For illustration, see below:
start_urls = ['http://www.examples.com']

rules = [
    Rule(LinkExtractor(allow=['www.examples.com/sports/companies?searchterm=news+sports&pg=2']),
         'parse_torrent'),
]
Please let me know if there is a right way to specify the URL in the rules so that I can scrape page 1, page 2, ... page 100.
Thanks.
If you want to extract the links starting from a page (in your case, http://www.examples.com), you should create a spider that inherits from CrawlSpider and use the following regular expression:
rules = (
    Rule(LinkExtractor(allow=[r'www\.examples\.com/sports/companies\?searchterm=news\+sports&pg=\d+']),
         callback='parse_torrent'),
)

Note that ., ? and + are escaped here because they are regex metacharacters, and that callback is an argument of Rule, not of LinkExtractor.
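For reference, a minimal runnable sketch of such a CrawlSpider might look like this (the import paths are the Scrapy 1.x ones; the spider name, allowed_domains, and the body of parse_torrent are placeholders, not part of the original answer):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(CrawlSpider):
    name = 'examples'  # placeholder name
    allowed_domains = ['www.examples.com']
    start_urls = ['http://www.examples.com']

    # The regex metacharacters in the literal URL (., ? and +)
    # are escaped so the pattern matches the real link text.
    rules = (
        Rule(LinkExtractor(allow=[r'www\.examples\.com/sports/companies\?searchterm=news\+sports&pg=\d+']),
             callback='parse_torrent',
             follow=True),
    )

    def parse_torrent(self, response):
        # placeholder callback: extract your items from the page here
        self.log('visited %s' % response.url)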
But since you already seem to know the URL pattern, I suggest generating the URLs yourself:
from scrapy.http.request import Request

def start_requests(self):
    # xrange's upper bound is exclusive, so 101 covers pages 1..100
    for i in xrange(1, 101):
        url = 'http://www.examples.com/sports/companies?searchterm=news+sports&pg=' + str(i)
        yield Request(url=url, callback=self.parse_torrent)
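Put together, a complete spider using this second approach could look like the sketch below (again, the spider name and the parse_torrent body are assumptions). When start_requests is defined, Scrapy calls it instead of generating requests from start_urls, so the pagination is fully under your control:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'examples'  # placeholder name

    def start_requests(self):
        for page in xrange(1, 101):  # Python 2; use range() on Python 3
            url = ('http://www.examples.com/sports/companies'
                   '?searchterm=news+sports&pg=%d' % page)
            yield scrapy.Request(url=url, callback=self.parse_torrent)

    def parse_torrent(self, response):
        # placeholder callback: extract your items from the page here
        self.log('visited %s' % response.url)

Generating the requests up front avoids depending on the site's "next page" links, at the cost of hard-coding the page range.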
python python-2.7 web-scraping scrapy