Python scraping with Mechanize, cookies (login) and proxies
I am scraping data from a badly designed, badly managed government agricultural website application that requires me to log in, kicks me out, and temporarily blocks my IP (SE Asia, don't ask!). I am trying to use Python and Mechanize along with a proxy list ( http://proxy-hunter.blogspot.sg/2013/01/01-01-13-l1l2l3-http-proxies-1502.html ) to do the work of scraping the permits. However, although I can get the script to log in, it does not seem to use any of the proxies in the set_proxies list. Can anyone see why, and suggest how I can fix it? I found an answer on Stack Overflow, but the answer was "don't use Mechanize". Well, I have a working script minus the proxy aspect, so that isn't much of a helpful answer.
My current code (it reads an IP-lookup site to test whether the IP changes, i.e. whether the proxies are used; it doesn't, it prints my real IP):
import mechanize
import cookielib
from BeautifulSoup import BeautifulSoup
import html2text
import urllib2

# Mechanize browser/cookie stuff
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Firefox')]

# Proxies to use
br.set_proxies({"84.2.35.44:80": "http", "1.62.68.201:6675": "http", "119.254.90.18:8080": "http"})

# Testing whether the proxy works by checking the IP
page = br.open('http://whatismyipaddress.com/')
soup = BeautifulSoup(br.response().read())
ip = soup.findAll('div', {'style': 'text-align:center;padding-top:4px;'})
print ip
So this runs in the terminal, but it comes up with my real IP rather than one of the 3 proxies I set.
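One likely cause is the direction of the mapping: mechanize's set_proxies() expects a dict keyed by URL scheme with the proxy address as the value, whereas the call above is keyed by address. Because a dict can hold only one "http" key, only one proxy can be active at a time, so the list has to be kept separately. A minimal sketch, assuming the three addresses from the question are still alive; the name proxy_map is hypothetical and not part of mechanize:

```python
# Proxy addresses from the question, kept as a plain list.
PROXIES = ["84.2.35.44:80", "1.62.68.201:6675", "119.254.90.18:8080"]


def proxy_map(address, schemes=("http",)):
    """Build the scheme -> "host:port" dict that set_proxies() expects.

    Hypothetical helper: it only shapes the dict, it does not contact the proxy.
    """
    return {scheme: address for scheme in schemes}


# With the browser object from the question, one proxy is selected like this:
#   br.set_proxies(proxy_map(PROXIES[0]))
```

Re-running the IP check after that call should show whether traffic now leaves through the chosen proxy.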
Finally, please consider in your answer that I will be logging in to the website in my actual code, using something like:
br.open('http://www.terriblegovernmentsite.hk/login.php')
br.select_form(nr=3)
br.form['login[user]'] = 'researcher01'
br.form['login[pass]'] = 'mypassword'
br.submit()
However, this is a bit irrelevant, since I can't get the proxy to work for a simple read of a static webpage, never mind whilst handling cookies.
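Since free proxies die often and the site blocks IPs, one possible approach is to try each address in turn and stay on the first one that answers. The sketch below is only an illustration under assumptions: open_via_first_working_proxy is a hypothetical helper, browser is expected to be a mechanize.Browser() (anything with set_proxies() and open() works), and proxy failures are assumed to surface as OSError subclasses such as urllib's URLError.

```python
def open_via_first_working_proxy(browser, url, addresses, errors=(OSError,)):
    """Open url through the first proxy in addresses that does not fail.

    Hypothetical helper. Returns (address, response). The errors tuple is an
    assumption about what a dead proxy raises; widen it if needed.
    """
    for address in addresses:
        # set_proxies() takes scheme -> "host:port"; one proxy per scheme.
        browser.set_proxies({"http": address})
        try:
            return address, browser.open(url)
        except errors:
            # This proxy is dead or refused the request; try the next one.
            continue
    raise RuntimeError("no proxy in the list could open %s" % url)
```

The cookie jar and the proxy setting both live on the browser object, so once a working proxy is found, the login form submission and the later page requests should go through that same proxy with the same session cookies.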
Thanks in advance for any help, it is much appreciated.