Spark DataFrame from an external SQL source - is it automatically updated? -


I have a simple question about loading a large external source of data using Spark:

Map<String, String> options = new HashMap<String, String>();
options.put("url", "jdbc:postgresql:dbserver");
options.put("dbtable", "schema.tablename");

DataFrame myDF = sqlContext.read()
    .format("jdbc")
    .options(options)
    .load();

I wanted to know: if the external SQL database is updated, will the changes be reflected in the DataFrame, or do I need to call the load function again to repopulate the DataFrame?

In case I need to call the load function again, is there a more efficient way in Spark to update the DataFrame when the external source changes?

The short answer is that it doesn't, but the details are relatively subtle. In general, Spark cannot even guarantee a consistent state of the database: each executor fetches its own part of the data inside a separate transaction, so if the data is being actively modified there is no guarantee that all executors see the same state of the database.
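To make the "separate transaction per executor" point concrete, here is a partitioned JDBC read. This is only a sketch, and the numeric id column used as partitionColumn is an assumption about your schema; each resulting partition is fetched by its own task with its own WHERE clause, and therefore its own transaction:

Map<String, String> partitionedOptions = new HashMap<String, String>();
partitionedOptions.put("url", "jdbc:postgresql:dbserver");
partitionedOptions.put("dbtable", "schema.tablename");
// Assumed numeric "id" column used to split the table into ranges.
partitionedOptions.put("partitionColumn", "id");
partitionedOptions.put("lowerBound", "1");
partitionedOptions.put("upperBound", "1000000");
partitionedOptions.put("numPartitions", "4");

// Each of the 4 partitions runs its own SELECT ... WHERE id >= x AND id < y,
// so concurrent writes can be visible to some partitions and not to others.
DataFrame partitionedDF = sqlContext.read()
    .format("jdbc")
    .options(partitionedOptions)
    .load();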

This becomes even more complicated when you consider explicit and implicit (shuffle files) caching, possible executor failures, and cache evictions. If you want a consistent view of the database, it has to be supported by both the data model and the queries. In general this means the data source should support consistent point-in-time queries, and every query executed by Spark should use a specific timestamp.
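As a rough sketch of what pinning every query to a specific timestamp could look like with the JDBC source: passing a subquery as the dbtable option is a standard trick, but the history table and its valid_from / valid_to columns are assumptions about your schema, since PostgreSQL has no built-in "AS OF" syntax:

// Sketch only: assumes the database keeps a queryable history of the table.
String asOf = "2017-01-01 00:00:00";

Map<String, String> snapshotOptions = new HashMap<String, String>();
snapshotOptions.put("url", "jdbc:postgresql:dbserver");
// The JDBC source accepts a subquery in place of a table name, so every task
// spawned by Spark reads data pinned to the same timestamp.
snapshotOptions.put("dbtable",
    "(SELECT * FROM schema.tablename_history"
  + " WHERE valid_from <= '" + asOf + "' AND valid_to > '" + asOf + "') AS snapshot");

DataFrame snapshotDF = sqlContext.read()
    .format("jdbc")
    .options(snapshotOptions)
    .load();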

The last question is hard to answer without knowing more about your use case, but there are at least two problems:

  • Spark data structures are not suited to small incremental updates. The cost of scheduling is relatively high, and incremental unions introduce problems of their own, such as long lineages and complex partition management (see the sketch after this list).
  • There is no vendor-independent method to monitor database changes.
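For illustration, here is what an incremental refresh might look like. The increasing id column used as a high-water mark is an assumption, and the main point is that every refresh makes the lineage longer:

// Hypothetical incremental refresh: read only rows added since the last load
// and union them onto the existing DataFrame.
long lastSeenId = 1000000L;  // assumed high-water mark from the previous load

Map<String, String> deltaOptions = new HashMap<String, String>();
deltaOptions.put("url", "jdbc:postgresql:dbserver");
deltaOptions.put("dbtable",
    "(SELECT * FROM schema.tablename WHERE id > " + lastSeenId + ") AS delta");

DataFrame newRows = sqlContext.read()
    .format("jdbc")
    .options(deltaOptions)
    .load();

// The lineage of "updated" grows with every refresh, which is one of the
// problems described in the list above.
DataFrame updated = myDF.unionAll(newRows);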
