python - Scrapy yield request from one spider to another
I have the following code:
#firstspider.py
class FirstSpider(scrapy.Spider):
    name = 'first'
    start_urls = ['https://www.basesite.com']
    next_urls = []

    def parse(self, response):
        for url in response.css('bunch > of > css > here'):
            self.next_urls.append(url.css('more > css > here'))
            l = ItemLoader(item=Item(), selector=url.css('more > css'))
            l.add_css('add', 'more > css')
            ...
            ...
            yield l.load_item()

        for url in self.next_urls:
            new_urls = self.start_urls[0] + url
            yield scrapy.Request(new_urls, callback=SecondSpider.parse_url)


#secondspider.py
class SecondSpider(scrapy.Spider):
    name = 'second'
    start_urls = ['https://www.basesite.com']

    def parse_url(self):
        """Parse team data."""
        return self  # here self is the HtmlResponse, not a 'response' object

    def parse(self, response):
        """Parse all."""
        summary = self.parse_url(response)
        return summary


#thirdspider.py
class ThirdSpider(scrapy.Spider):
    # take links from the second spider and continue:
    ...
I want to be able to pass the URL scraped in spider 1 to spider 2 (which lives in a different script). I'm curious why, when I do, 'response' is an HtmlResponse and not a Response object (when I use a similar method within the same class as spider 1, I don't have that issue).
What am I missing here? How do I pass the original response(s) to the second spider? (And from the second onto the third, etc.?)
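For contrast, a rough sketch of the same-class version the question alludes to (selectors and yielded fields here are placeholders, not taken from the original code): because the callback is a bound method of the running spider, Scrapy invokes it with a normal downloaded response.

import scrapy


class FirstSpider(scrapy.Spider):
    name = 'first'
    start_urls = ['https://www.basesite.com']

    def parse(self, response):
        # Placeholder selector standing in for the question's CSS.
        for href in response.css('a::attr(href)').getall():
            # The callback belongs to this same spider, so the engine calls it
            # with the response for the followed URL as its argument.
            yield scrapy.Request(response.urljoin(href), callback=self.parse_url)

    def parse_url(self, response):
        yield {'url': response.url}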
You can use Redis as a shared resource between the spiders: https://github.com/rmax/scrapy-redis
Run N spiders (which don't close on idle state), each of them connected to the same Redis and waiting for tasks (URL, request headers) there.
As a side effect, X_spider can push task data into Redis under Y_spider's specific key (the Y_spider name).
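A minimal sketch of that hand-off, assuming scrapy-redis is installed, a Redis server is reachable on localhost, and the scrapy_redis scheduler is enabled in settings.py (SCHEDULER = "scrapy_redis.scheduler.Scheduler", REDIS_URL = "redis://localhost:6379"); the 'second:start_urls' key and the selectors are illustrative, not taken from the question:

import redis
import scrapy
from scrapy_redis.spiders import RedisSpider


class FirstSpider(scrapy.Spider):
    """Scrapes links and pushes them onto the key the second spider listens on."""
    name = 'first'
    start_urls = ['https://www.basesite.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Same Redis instance all spiders share (assumes default localhost:6379).
        self.redis = redis.Redis()

    def parse(self, response):
        # Placeholder selector standing in for the question's CSS.
        for href in response.css('a::attr(href)').getall():
            # Push the follow-up task into Redis instead of yielding a Request
            # with another spider's method as the callback.
            self.redis.lpush('second:start_urls', response.urljoin(href))


class SecondSpider(RedisSpider):
    """Stays idle and consumes URLs as they appear under its redis_key."""
    name = 'second'
    redis_key = 'second:start_urls'

    def parse(self, response):
        # 'response' here is a normal Scrapy response for the pushed URL.
        yield {'url': response.url, 'title': response.css('title::text').get()}

With this setup the second spider keeps running while idle and picks up whatever URLs the first spider pushes; a third spider can be chained the same way by giving it its own key.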