Google App Engine - cron job fails in GAE Python
I have a script in Google App Engine that is started every 20 minutes by cron.yaml. It works locally, on my own machine. When I go (manually) to the URL that starts the script online, it also works. However, the script always fails to complete online, on Google's instances, when cron.yaml is in charge of starting it.
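For reference, the cron.yaml entry looks roughly like this (the /scrape URL and the description are my guesses, not necessarily the real values):

cron:
- description: scrape news articles
  url: /scrape
  schedule: every 20 minutes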
The log shows no errors, just these 2 debug messages:
D 2013-07-23 06:00:08.449 type(soup): <class 'bs4.BeautifulSoup'> END type(soup)
D 2013-07-23 06:00:11.246 type(soup): <class 'bs4.BeautifulSoup'> END type(soup)
Here's the script:
# coding: utf-8
import jinja2, webapp2, urllib2, re
from bs4 import BeautifulSoup as bs
from google.appengine.api import memcache
from google.appengine.ext import db

class Article(db.Model):
    content = db.TextProperty()
    datetime = db.DateTimeProperty(auto_now_add=True)
    companies = db.ListProperty(db.Key)
    url = db.StringProperty()

class Company(db.Model):
    name = db.StringProperty()
    ticker = db.StringProperty()

    @property
    def articles(self):
        return Article.gql("WHERE companies = :1", self.key())

def companies_key(companies_name=None):
    return db.Key.from_path('Companies', companies_name or 'default_companies')

def articles_key(articles_name=None):
    return db.Key.from_path('Articles', articles_name or 'default_articles')

# Scrape the latest articles for every company and store them in the datastore.
def scrape():
    companies = memcache.get("companies")
    if not companies:
        companies = Company.all()
        memcache.add("companies", companies, 30)
    for company in companies:
        company_links = links(company.ticker)
        company_links = set(company_links)
        for link in company_links:
            if link != "None":
                article_object = Article()
                text = fetch(link)
                article_object.content = text
                article_object.url = link
                article_object.companies.append(company.key())  # doesn't work.
                article_object.put()

# Download a page and return its visible text, or "None" on failure.
def fetch(link):
    try:
        html = urllib2.urlopen(link).read()
        soup = bs(html)
    except:
        return "None"
    text = soup.get_text()
    text = text.encode('utf-8')
    text = text.decode('utf-8')
    text = unicode(text)
    if text != "None":
        return text
    else:
        return "None"

# Collect article links for a ticker from Google Finance.
def links(ticker):
    url = "https://www.google.com/finance/company_news?q=NASDAQ:" + ticker + "&start=10&num=10"
    html = urllib2.urlopen(url).read()
    soup = bs(html)
    div_class = re.compile("^g-section.*")
    divs = soup.find_all("div", {"class": div_class})
    links = []
    for div in divs:
        a = unicode(div.find('a', attrs={'href': re.compile("^http://")}))
        link_regex = re.search("(http://.*?)\"", a)
        try:
            link = link_regex.group(1)
            soup = bs(link)
            link = soup.get_text()
        except:
            link = "None"
        links.append(link)
    return links
...and the script's handler in main:
class ScrapeHandler(webapp2.RequestHandler):
    def get(self):
        scrape.scrape()
        self.redirect("/")
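For completeness, the routing in main looks roughly like this (the /scrape path is my assumption; the real route is whatever URL cron.yaml points at):

import webapp2

app = webapp2.WSGIApplication([
    ('/scrape', ScrapeHandler),  # assumed path; must match the url field in cron.yaml
], debug=True)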
My guess is that the problem might be the double loop in the scrape script, but I don't understand why.
Update: The articles are indeed being scraped (as many as there should be), and there are no log errors, or debug messages at all. Looking at the log, the cron job seemed to execute perfectly. Even so, App Engine's cron job panel says the cron job failed.
I'm pretty sure the error is due to a DeadlineExceededError, which I did not run into locally. The scrape() script did its thing on fewer companies and articles locally, and so did not run into the exceeded deadline.
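If it really is a DeadlineExceededError, a sketch of one possible restructuring (not tested; scrape_company() is a hypothetical helper wrapping the inner loop of scrape(), and the deferred builtin would need to be enabled in app.yaml) would be to fan the work out to the task queue with the deferred library, so each company is scraped in its own request:

# Sketch only: run each company's scraping in its own task queue request,
# so no single request has to finish all the fetching before the deadline.
from google.appengine.ext import deferred

def scrape():
    for company in Company.all():
        # Each deferred call becomes a separate task queue request.
        deferred.defer(scrape_company, company.key())

def scrape_company(company_key):  # hypothetical helper, not in the original code
    company = Company.get(company_key)
    for link in set(links(company.ticker)):
        if link != "None":
            article_object = Article()
            article_object.content = fetch(link)
            article_object.url = link
            article_object.companies.append(company.key())
            article_object.put()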