Spark from SQL external source: automatically updated?
I have a simple question. I am loading a large external source of data using Spark:
import java.util.HashMap;
import java.util.Map;

Map<String, String> options = new HashMap<String, String>();
options.put("url", "jdbc:postgresql:dbserver");
options.put("dbtable", "schema.tablename");
DataFrame myDf = sqlContext.read().format("jdbc").options(options).load();
I wanted to know: if the external SQL database is updated, do the changes get reflected in the DataFrame, or do I need to call the load function again to repopulate it?
In case I need to call the load function again, is there a more efficient way in Spark to update the DataFrame when the external source changes?
The short answer is that it doesn't, but the details are relatively subtle. In general, Spark cannot even guarantee a consistent state of the database: each executor fetches its own part of the data inside a separate transaction, so if the data is being actively modified there is no guarantee that the executors see the same state of the database.
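To make that concrete, here is a minimal sketch (Spark 1.x Java API, as in the question) of a partitioned JDBC read. The partitioning column and the bounds are assumptions chosen for illustration; with these options each partition range is fetched by a separate task over its own JDBC connection, which is exactly why concurrent writes can leave the partitions mutually inconsistent.

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.DataFrame;

// Split the read into 8 range partitions on a numeric column.
// "id", the bounds and the partition count are hypothetical values.
Map<String, String> options = new HashMap<String, String>();
options.put("url", "jdbc:postgresql:dbserver");
options.put("dbtable", "schema.tablename");
options.put("partitionColumn", "id");   // hypothetical numeric key
options.put("lowerBound", "1");
options.put("upperBound", "1000000");
options.put("numPartitions", "8");

// Each of the 8 partitions runs its own SELECT ... WHERE id BETWEEN ... query,
// so each one sees the database at its own, independent read time.
DataFrame partitionedDf = sqlContext.read().format("jdbc").options(options).load();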
This becomes more complicated when you consider explicit and implicit (shuffle files) caching, possible executor failures, and cache evictions. If you want a consistent view of the database, it has to be supported by both the data model and the queries. In general that means the data source should support consistent point-in-time queries, and every query Spark executes should use a specific timestamp.
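One way to approximate that with the plain JDBC source is to push a snapshot predicate into the query itself. The sketch below assumes the table carries a reliable "updated_at" timestamp column (an assumption about your schema, not something Spark provides); the JDBC source accepts a parenthesized subquery in place of a table name.

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.DataFrame;

// Pin every read to one snapshot time so all executors query the same view.
// The "updated_at" column and the timestamp value are hypothetical.
String snapshotTime = "2016-05-01 00:00:00";
String pointInTimeQuery =
    "(SELECT * FROM schema.tablename WHERE updated_at <= '" + snapshotTime + "') AS snapshot";

Map<String, String> options = new HashMap<String, String>();
options.put("url", "jdbc:postgresql:dbserver");
options.put("dbtable", pointInTimeQuery);  // a subquery is accepted here

DataFrame snapshotDf = sqlContext.read().format("jdbc").options(options).load();

Note that this only approximates a point-in-time view if rows are never updated or deleted in place; true snapshot semantics need support from the database itself (for example temporal tables).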
The last question is hard to answer without knowing more about your use case, but there are at least two problems:
- Spark data structures are not suited for small incremental updates. The cost of scheduling is relatively high, and incremental unions introduce different problems, like long lineages and complex partition management.
- There is no vendor-independent method for monitoring database changes.
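So in practice the simplest option is usually just to reload: there is no built-in refresh for a JDBC-backed DataFrame. A minimal sketch, reusing the options map and the myDf variable from the question:

import org.apache.spark.sql.DataFrame;

// Drop the stale cached copy (a no-op if it was never cached) ...
myDf.unpersist();

// ... and read the source again to pick up the external changes.
DataFrame refreshed = sqlContext.read()
        .format("jdbc")
        .options(options)
        .load();
refreshed.cache();   // optionally re-cache the fresh snapshot
myDf = refreshed;    // replace the old reference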