Python scraping with Mechanize, cookies (login) and proxies

- June 15, 2015

I am scraping a badly designed and badly managed government agricultural website whose application data requires me to log in, and which kicks me out and temporarily blocks my IP (SE Asia, don't ask!). I am trying to use Python and mechanize along with a proxy list ( http://proxy-hunter.blogspot.sg/2013/01/01-01-13-l1l2l3-http-proxies-1502.html ) to do the work of scraping the permits. However, I can get the script to log in, but it does not seem to use the proxies in the set_proxies list. Can anyone explain why and suggest how I can fix it? I found an answer on StackOverflow, but the answer was "don't use mechanize". Well, I've got a working script minus the proxy aspect, so that isn't much of a helpful answer.

Current code (it reads an IP-lookup site to test whether the IP changes, i.e. whether the proxies are used; they aren't, and it prints my real IP):

    import mechanize
    import cookielib
    from BeautifulSoup import BeautifulSoup
    import html2text
    import urllib2

    # mechanize browser/cookie stuff
    br = mechanize.Browser()
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)

    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    br.addheaders = [('User-agent', 'Firefox')]

    # proxies to use
    br.set_proxies({"84.2.35.44:80": "http",
                    "1.62.68.201:6675": "http",
                    "119.254.90.18:8080": "http"})

    # testing whether the proxy works by checking the IP
    page = br.open('http://whatismyipaddress.com/')
    soup = BeautifulSoup(br.response().read())
    ip = soup.findAll('div', {'style': 'text-align:center;padding-top:4px;'})
    print ip

So this works in the terminal, but it comes up with my real IP rather than any of the 3 proxies I set.
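One likely cause, offered as a sketch rather than a confirmed diagnosis: mechanize's documented usage passes set_proxies a dict keyed by URL scheme, with the proxy's "host:port" as the value, so a dict keyed by address would never match an "http" request and the connection would go out directly. Such a dict also holds only one proxy per scheme, so using several means switching between them. The helper names below (build_proxy_map, proxy_cycle, open_via_proxy) are hypothetical, and the proxy addresses are the ones from the question, which may no longer be live.

```python
import itertools


def build_proxy_map(address, schemes=("http",)):
    """Return a {scheme: "host:port"} dict, the shape set_proxies is given
    in mechanize's documented examples."""
    return {scheme: address for scheme in schemes}


def proxy_cycle(addresses):
    """Yield one proxy map per address, looping over the list forever."""
    for address in itertools.cycle(addresses):
        yield build_proxy_map(address)


def open_via_proxy(url, proxy_map, timeout=10):
    """Fetch url through a single proxy using a fresh mechanize browser."""
    import mechanize  # imported lazily so the helpers above run without it

    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.set_proxies(proxy_map)
    return br.open(url, timeout=timeout).read()


# Addresses taken from the question; assumed to be plain HTTP proxies.
PROXIES = ["84.2.35.44:80", "1.62.68.201:6675", "119.254.90.18:8080"]
rotation = proxy_cycle(PROXIES)
first_map = next(rotation)  # the mapping for the first proxy in the list
```

Calling open_via_proxy('http://whatismyipaddress.com/', next(rotation)) on each request would then move to the next proxy in the list. Free proxies fail often, so wrapping that call in a try/except and skipping to the next map is probably worthwhile.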

Finally, please consider in your answer that I am logging in to the website in my actual code, using something like:

    br.open('http://www.terriblegovernmentsite.hk/login.php')
    br.select_form(nr=3)
    br.form['login[user]'] = 'researcher01'
    br.form['login[pass]'] = 'mypassword'
    br.submit()

However, this is a bit irrelevant, since I can't get the proxy to work for a simple read of a static webpage, never mind whilst handling cookies.
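For the login case, a minimal sketch of how the proxy and the cookie jar might be combined, assuming the scheme-keyed form of the set_proxies dict. The function names make_browser and login are hypothetical, and the form index and field names are whatever the real login page uses. The jar is passed in so that the same session cookies can be reused if a new browser is created for a different proxy, although some sites may tie a session to one IP, in which case a new login would be needed after each switch.

```python
def make_browser(proxy_address, cookie_jar=None):
    """Build a mechanize browser that sends HTTP traffic through one proxy."""
    import mechanize  # imported lazily; login() below does not need it

    br = mechanize.Browser()
    if cookie_jar is not None:
        # Reuse an existing jar so session cookies outlive this browser.
        br.set_cookiejar(cookie_jar)
    br.set_handle_robots(False)
    br.set_proxies({"http": proxy_address})  # scheme -> "host:port"
    return br


def login(br, login_url, form_index, fields):
    """Open the login page, fill the chosen form and submit it.

    fields maps form control names to the values to enter.
    """
    br.open(login_url)
    br.select_form(nr=form_index)
    for name, value in fields.items():
        br.form[name] = value
    return br.submit()
```

With the details from the question this would be called roughly as login(make_browser("84.2.35.44:80"), 'http://www.terriblegovernmentsite.hk/login.php', 3, {...}), with the dict holding the site's actual user and password field names.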

Thanks in advance for any help, much appreciated.
