Hadoop MapReduce: How to ensure multiple tasks are executed in parallel among all nodes
I have a task list file in HDFS containing a list of CPU-bound tasks to be executed on a small 5-node Hadoop MapReduce cluster (map only). For instance, the task list file contains 10 lines, each of which corresponds to a task command. Since the execution of each task takes a long time, it would be much more efficient to execute the 10 listed tasks in parallel across the 5 nodes.
However, the task list file is pretty small, so its data block is located on a single node, and that node would execute all 10 tasks based on the data locality principle. Is there a solution to ensure the 10 tasks are executed in parallel on the 5 nodes?
By default, MapReduce runs one mapper per input split. A split corresponds to a block, so if you have a large file, you get one mapper per block (default block size 128 MB), and each 128 MB chunk is processed in parallel with the other chunks.
In your case, you have a series of lines in a small file: that is a single split, which is therefore processed by a single mapper.
However, instead of having 1 file of 10 lines, you can create 10 files of 1 line each. You would then have 10 splits, and MapReduce would run 10 mappers across the cluster in parallel (depending on available resources) to process the tasks.
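One way to produce those per-task files is to split the list with the standard `split` utility before uploading it to HDFS. A minimal sketch, assuming a local `tasks.txt` and illustrative task commands (the file name, command names, and output prefix are all placeholders):

```shell
# Create a sample 10-line task list (one command per line; contents illustrative)
printf 'run_task %d\n' $(seq 1 10) > tasks.txt

# Split into one file per line: task_aa, task_ab, ... (10 files in total)
split -l 1 tasks.txt task_

# Count the pieces to confirm there is one file per task
ls task_* | wc -l
```

Each piece can then be uploaded to the job's input directory (for example with `hdfs dfs -put task_* /input/tasks/`, where the directory is your choice), giving the job 10 splits and hence up to 10 mappers, scheduled across the cluster as resources allow.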