Scrapy start_requests not entering callback function
I don't know why the callback function (parse) is not getting called for the start_requests URLs. The crawl terminates without ever entering the parse function.
This is my cbrspider.py file:
    class CbrSpider(scrapy.Spider):
        name = "cbr"
        allowed_domains = ["careerbuilder.com"]
        start_urls = (
            'http://www.careerbuilder.com/browse/category/computer-and-mathematical',
        )

        def start_requests(self):
            for i in range(1, 2):
                yield Request("http://ip.42.pl/raw", callback=self.parse_init)
            for i in range(1, 2):
                yield Request("http://www.careerbuilder.com/jobs-net-developer?page_number=" + str(i) + "&sort=date_desc", callback=self.parse)
            for i in range(1, 3):
                yield Request("http://www.careerbuilder.com/jobs-it-manager?page_number=" + str(i) + "&sort=date_desc", callback=self.parse)

        def parse_init(self, response):
            self.ip = response.xpath('//body/p/text()').extract()

        def parse(self, response):
            print "enter parse function"
            for sel in response.xpath('//*[@class="job-list"]'):
                item = CareerbuilderItem()
                item['ip'] = self.ip[0]
                item['name'] = sel.xpath('//div//h2[@class="job-title"]/a/text()').extract()[0]
                item['location'] = sel.xpath('//div[@class="columns small-12 medium-3 end"]//h4[@class="job-text"]/text()').extract()[0]
                yield item
This (which is pretty much your code, except for the item object scraping, because I didn't have your item definition) seems to run correctly (see the output below) on Python 2.7.9, Scrapy 1.0.5, Twisted 16.0.0. Which Python version are you using?
The script used to run it:
    from subprocess import call

    call(["scrapy", "crawl", "cbr"])
or
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    process.crawl('cbr')
    process.start()  # the script will block here until the crawling is finished
The code:
    from scrapy import Spider, Request

    class CbrSpider(Spider):
        name = "cbr"
        allowed_domains = ["careerbuilder.com"]
        start_urls = (
            'http://www.careerbuilder.com/browse/category/computer-and-mathematical',
        )

        def start_requests(self):
            for i in range(1, 2):
                yield Request("http://ip.42.pl/raw", callback=self.parse_init)
            for i in range(1, 2):
                yield Request("http://www.careerbuilder.com/jobs-net-developer?page_number=" + str(i) + "&sort=date_desc", callback=self.parse)
            for i in range(1, 3):
                yield Request("http://www.careerbuilder.com/jobs-it-manager?page_number=" + str(i) + "&sort=date_desc", callback=self.parse)

        def parse_init(self, response):
            self.ip = response.xpath('//body/p/text()').extract()

        def parse(self, response):
            print "enter parse function"
            for sel in response.xpath('//*[@class="job-list"]'):
                item = {}
                item['ip'] = self.ip[0]
                item['name'] = sel.xpath('//div//h2[@class="job-title"]/a/text()').extract()[0]
                item['location'] = sel.xpath('//div[@class="columns small-12 medium-3 end"]//h4[@class="job-text"]/text()').extract()[0]
                yield item
Part of the output:
    2016-05-09 13:11:18 [scrapy] INFO: Scrapy 1.0.5 started (bot: crawl_hhgreg)
    2016-05-09 13:11:18 [scrapy] INFO: Optional features available: ssl, http11
    2016-05-09 13:11:18 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'crawl_hhgreg.spiders', 'SPIDER_MODULES': ['crawl_hhgreg.spiders'], 'BOT_NAME': 'crawl_hhgreg'}
    2016-05-09 13:11:18 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
    2016-05-09 13:11:18 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2016-05-09 13:11:18 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2016-05-09 13:11:18 [scrapy] INFO: Enabled item pipelines: JsonWriterPipeline
    2016-05-09 13:11:18 [scrapy] INFO: Spider opened
    2016-05-09 13:11:18 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-05-09 13:11:18 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2016-05-09 13:11:20 [scrapy] DEBUG: Crawled (200) <GET http://ip.42.pl/raw> (referer: None)
    2016-05-09 13:11:22 [scrapy] DEBUG: Crawled (200) <GET http://www.careerbuilder.com/jobs-net-developer?page_number=1&sort=date_desc> (referer: None)
    enter parse function
    2016-05-09 13:11:22 [scrapy] DEBUG: Scraped from <200 http://www.careerbuilder.com/jobs-net-developer?page_number=1&sort=date_desc>
    {'ip': u'62.38.254.183',
     'name': u'systems developer (treasury management) - 6111 n river rd',
     'location': u'\nrosemont, il\n'}
    2016-05-09 13:11:23 [scrapy] DEBUG: Crawled (200) <GET http://www.careerbuilder.com/jobs-it-manager?page_number=1&sort=date_desc> (referer: None)
    enter parse function
    2016-05-09 13:11:23 [scrapy] DEBUG: Scraped from <200 http://www.careerbuilder.com/jobs-it-manager?page_number=1&sort=date_desc>
    {'ip': u'62.38.254.183',
     'name': u'medical technologist',
     'location': u'\nhonolulu, hi\n'}
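As a side note, the paginated URLs in start_requests are easier to read and tweak with str.format than with string concatenation. A minimal sketch of the same pagination logic (plain Python, no Scrapy required; job_urls is a hypothetical helper name, not part of your spider):

```python
# Build the paginated CareerBuilder search URLs outside of Scrapy,
# so the loop bounds and query string can be sanity-checked on their own.
BASE = "http://www.careerbuilder.com/jobs-{query}?page_number={page}&sort=date_desc"

def job_urls(query, pages):
    """Yield one results-page URL per page for the given job query."""
    for page in range(1, pages + 1):
        yield BASE.format(query=query, page=page)

for url in job_urls("it-manager", 2):
    print(url)
```

Each URL yielded here matches what the concatenation in start_requests produces, so the helper can replace both loops by calling it once per query.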