MapReduce: How to get mapper to process multiple lines?

Goal: I want to be able to specify the number of mappers used on an input file. Equivalently, I want to specify the number of lines of the file each mapper takes. A simple example:

For an input file of 10 lines (of unequal length; example below), I want there to be 2 mappers -- each mapper processing 5 lines.

This is an arbitrary example file of 10 lines. Each line does not have to be of the same length or contain the same number of words. This is what I have:

(I have it so that each mapper produces one "<map,1>" key-value pair, which then gets summed in the reducer.)

package org.myorg;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class Test {

    // Produce one "<map,1>" pair per map() call
    public static class Map extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("map"), one);
        }
    }

    // Reduce by taking the sum
    public static class Red extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job1 = Job.getInstance(conf, "pass01");
        job1.setJarByClass(Test.class);
        job1.setMapperClass(Map.class);
        job1.setCombinerClass(Red.class);
        job1.setReducerClass(Red.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        FileOutputFormat.setOutputPath(job1, new Path(args[1]));

        // // Attempt #1
        // conf.setInt("mapreduce.input.lineinputformat.linespermap", 5);
        // job1.setInputFormatClass(NLineInputFormat.class);

        // // Attempt #2
        // NLineInputFormat.setNumLinesPerSplit(job1, 5);
        // job1.setInputFormatClass(NLineInputFormat.class);

        // // Attempt #3
        // conf.setInt(NLineInputFormat.LINES_PER_MAP, 5);
        // job1.setInputFormatClass(NLineInputFormat.class);

        // // Attempt #4
        // conf.setInt("mapreduce.input.fileinputformat.split.minsize", 234);
        // conf.setInt("mapreduce.input.fileinputformat.split.maxsize", 234);

        System.exit(job1.waitForCompletion(true) ? 0 : 1);
    }
}
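A likely reason attempts #1, #3, and #4 have no effect: Job.getInstance(conf) takes a copy of the Configuration, so conf.setInt(...) calls made after the job has been created never reach the job. Setting the values on the job's own configuration (or before creating the job) avoids this. This is a configuration-fragment sketch of the corrected lines, assuming the job1 variable from the driver above:

```java
// Sketch: set the NLineInputFormat option on the job's OWN configuration.
// Job.getInstance(conf) copies conf, so later conf.setInt(...) calls on the
// original conf object are not seen by the job.
job1.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 5);
job1.setInputFormatClass(NLineInputFormat.class);

// Equivalently, the static helper writes into the job's configuration directly:
// NLineInputFormat.setNumLinesPerSplit(job1, 5);
```

Attempt #2 uses exactly that helper, so it should work provided both of its lines (the helper call and setInputFormatClass) are uncommented together.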

The above code, run on the example data above, produces:

map 10

I want the output to be:

map 2

where the first mapper does the first 5 lines, and the second mapper does the second 5 lines.

You can use NLineInputFormat.

With NLineInputFormat, you can specify exactly how many lines should go to each mapper. E.g. if your file has 500 lines and you set the number of lines per mapper to 10, you get 50 mappers (instead of 1, assuming the file is smaller than an HDFS block).
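The mapper count in that example is just a ceiling division: NLineInputFormat creates one split per N lines, so the number of map tasks is ceil(totalLines / linesPerMap). A quick standalone check of the figures above (the class and method names here are illustrative, not part of the Hadoop API):

```java
public class SplitCount {
    // ceil(totalLines / linesPerMap) using integer arithmetic
    static int numMappers(int totalLines, int linesPerMap) {
        return (totalLines + linesPerMap - 1) / linesPerMap;
    }

    public static void main(String[] args) {
        System.out.println(numMappers(500, 10)); // 500-line file, 10 lines per mapper -> 50
        System.out.println(numMappers(10, 5));   // the 10-line example file -> 2
    }
}
```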

Edit:

Here is an example using NLineInputFormat:

Mapper class:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapperNLine extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value);
    }
}

Driver class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Driver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.printf(
                "Two parameters are required for DriverNLineInputFormat - <input dir> <output dir>\n");
            return -1;
        }

        Job job = new Job(getConf());
        job.setJobName("NLineInputFormat example");
        job.setJarByClass(Driver.class);

        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 5);

        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MapperNLine.class);
        job.setNumReduceTasks(0);

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
        System.exit(exitCode);
    }
}

With the input provided above, the output of the sample mapper is written to 2 files, as 2 mappers get initialized:

part-m-00001

0	this is
8	an arbitrary example file
34	of 10 lines.
47	each line does
62	not have to be

part-m-00002

77	of
80	the same
89	length or contain
107	the same
116	number of words
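The keys in this output are the byte offsets of each line in the input file — the LongWritable key that NLineInputFormat hands the mapper. Each offset is the sum of all previous line lengths plus one byte per newline. This standalone sketch reproduces them (the class name is illustrative, and the line text is reconstructed to be consistent with the offsets above):

```java
public class OffsetDemo {
    // Byte offset of line i = sum of lengths of lines 0..i-1, plus one '\n' each
    static long[] offsets(String[] lines) {
        long[] out = new long[lines.length];
        long off = 0;
        for (int i = 0; i < lines.length; i++) {
            out[i] = off;
            off += lines[i].length() + 1; // +1 for the newline terminator
        }
        return out;
    }

    public static void main(String[] args) {
        // Reconstructed 10-line example file (text lengths match the offsets above)
        String[] lines = {
            "this is", "an arbitrary example file", "of 10 lines.",
            "each line does", "not have to be", "of", "the same",
            "length or contain", "the same", "number of words"
        };
        long[] offs = offsets(lines);
        for (int i = 0; i < lines.length; i++) {
            System.out.println(offs[i] + "\t" + lines[i]);
        }
    }
}
```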

Tags: java, hadoop, input-split, mapreduce
