google app engine - Cron job fails in GAE Python


I have a script on Google App Engine that is started every 20 minutes by cron.yaml. It works locally, on my own machine. When I (manually) go to the URL that starts the script online, it also works. However, the script always fails to complete online, on Google's instances, when cron.yaml is in charge of starting it.
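The cron.yaml entry is roughly the following (the /scrape path is just an example; it matches whatever route the handler shown further down is mounted on):

cron:
- description: scrape company news articles
  url: /scrape          # must match the route registered for the handler
  schedule: every 20 minutes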

The log shows no errors, just these 2 debug messages:

D 2013-07-23 06:00:08.449 type(soup): <class 'bs4.BeautifulSoup'> end type(soup)
D 2013-07-23 06:00:11.246 type(soup): <class 'bs4.BeautifulSoup'> end type(soup)

Here's the script:

# coding: utf-8
import jinja2, webapp2, urllib2, re
from bs4 import BeautifulSoup as bs
from google.appengine.api import memcache
from google.appengine.ext import db

class Article(db.Model):
    content = db.TextProperty()
    datetime = db.DateTimeProperty(auto_now_add=True)
    companies = db.ListProperty(db.Key)
    url = db.StringProperty()

class Company(db.Model):
    name = db.StringProperty()
    ticker = db.StringProperty()

    @property
    def articles(self):
        return Article.gql("WHERE companies = :1", self.key())

def companies_key(companies_name=None):
    return db.Key.from_path('Companies', companies_name or 'default_companies')

def articles_key(articles_name=None):
    return db.Key.from_path('Articles', articles_name or 'default_articles')

def scrape():
    companies = memcache.get("companies")
    if not companies:
        companies = Company.all()
        memcache.add("companies", companies, 30)
    for company in companies:
        company_links = set(links(company.ticker))
        for link in company_links:
            if link != "none":
                article_object = Article()
                text = fetch(link)
                article_object.content = text
                article_object.url = link
                article_object.companies.append(company.key())  # doesn't work.
                article_object.put()

def fetch(link):
    try:
        html = urllib2.urlopen(link).read()
        soup = bs(html)
    except:
        return "none"
    text = soup.get_text()
    text = text.encode('utf-8')
    text = text.decode('utf-8')
    text = unicode(text)
    if text != "none":
        return text
    else:
        return "none"

def links(ticker):
    url = "https://www.google.com/finance/company_news?q=NASDAQ:" + ticker + "&start=10&num=10"
    html = urllib2.urlopen(url).read()
    soup = bs(html)
    div_class = re.compile("^g-section.*")
    divs = soup.find_all("div", {"class": div_class})
    links = []
    for div in divs:
        # HTML of the first absolute link in the div, as a unicode string
        a = unicode(div.find('a', attrs={'href': re.compile("^http://")}))
        link_regex = re.search("(http://.*?)\"", a)
        try:
            link = link_regex.group(1)
            # re-parse the extracted href to unescape any HTML entities
            soup = bs(link)
            link = soup.get_text()
        except:
            link = "none"
        links.append(link)
    return links

...and the script's handler in main.py:

class ScrapeHandler(webapp2.RequestHandler):
    def get(self):
        scrape.scrape()
        self.redirect("/")
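For completeness, the route mapping in main.py is roughly this (the '/scrape' path is an assumption; whatever it is, it has to match the url field in cron.yaml):

app = webapp2.WSGIApplication([
    ('/scrape', ScrapeHandler),   # assumed path, same as in cron.yaml
], debug=True)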

My guess is that the problem might be the double loop in the scrape script, but I don't understand why.

UPDATE: The articles are indeed being scraped (as many as there should be), and there are no log errors, or debug messages at all. Looking at the log, the cron job seemed to execute perfectly. Even so, App Engine's cron job panel says the cron job failed.

I'm pretty sure the error is due to a DeadlineExceededError, which did not occur when running locally. Locally the scrape() script did its thing on fewer companies and articles, and so did not run over the deadline.
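If it really is a DeadlineExceededError, one option (just a sketch, not what my code currently does) would be to fan the work out with the deferred library so each task only handles a single company; scrape_company below is a hypothetical helper, and the deferred builtin would have to be enabled in app.yaml:

# Sketch only: one deferred task per company instead of doing every URL fetch
# inside the single cron request.
import webapp2
from google.appengine.ext import deferred

import scrape  # the module shown above


def scrape_company(ticker):
    # Hypothetical helper: roughly the body of the outer loop in scrape(),
    # but limited to one ticker so the request stays short.
    for link in set(scrape.links(ticker)):
        if link != "none":
            article = scrape.Article()
            article.content = scrape.fetch(link)
            article.url = link
            article.put()


class ScrapeHandler(webapp2.RequestHandler):
    def get(self):
        # The cron request now only enqueues small tasks, which is fast.
        for company in scrape.Company.all():
            deferred.defer(scrape_company, company.ticker)
        self.redirect("/")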

