MapReduce: How to get mapper to process multiple lines?

Goal: I want to be able to specify the number of mappers used on an input file. Equivalently, I want to specify the number of lines of the file each mapper takes. A simple example:

For an input file of 10 lines (of unequal length; example below), I want there to be 2 mappers -- each mapper processing 5 lines.

This is an arbitrary example file of 10 lines. Each line does not have to be of the same length or contain the same number of words. This is what I have:

(I have it so that each mapper produces one "<map,1>" key-value pair, which then gets summed in the reducer.)

package org.myorg;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class Test {

    // Produce one "<map,1>" pair per map() call
    public static class Map extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text("map"), one);
        }
    }

    // Reduce by taking the sum
    public static class Red extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job1 = Job.getInstance(conf, "pass01");
        job1.setJarByClass(Test.class);
        job1.setMapperClass(Map.class);
        job1.setCombinerClass(Red.class);
        job1.setReducerClass(Red.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        FileOutputFormat.setOutputPath(job1, new Path(args[1]));

        // // Attempt #1
        // conf.setInt("mapreduce.input.lineinputformat.linespermap", 5);
        // job1.setInputFormatClass(NLineInputFormat.class);

        // // Attempt #2
        // NLineInputFormat.setNumLinesPerSplit(job1, 5);
        // job1.setInputFormatClass(NLineInputFormat.class);

        // // Attempt #3
        // conf.setInt(NLineInputFormat.LINES_PER_MAP, 5);
        // job1.setInputFormatClass(NLineInputFormat.class);

        // // Attempt #4
        // conf.setInt("mapreduce.input.fileinputformat.split.minsize", 234);
        // conf.setInt("mapreduce.input.fileinputformat.split.maxsize", 234);

        System.exit(job1.waitForCompletion(true) ? 0 : 1);
    }
}
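A likely reason attempts #1, #3, and #4 have no effect: Job.getInstance(conf) takes a copy of the Configuration, so conf.setInt(...) calls made after the job has been created never reach the job. Setting the values on the job's own configuration (or before creating the job) avoids this. This is a configuration-fragment sketch of the corrected lines, assuming the job1 variable from the driver above:

```java
// Sketch: set the NLineInputFormat option on the job's OWN configuration.
// Job.getInstance(conf) copies conf, so later conf.setInt(...) calls on the
// original conf object are not seen by the job.
job1.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 5);
job1.setInputFormatClass(NLineInputFormat.class);

// Equivalently, the static helper writes into the job's configuration directly:
// NLineInputFormat.setNumLinesPerSplit(job1, 5);
```

Attempt #2 uses exactly that helper, so it should work provided both of its lines (the helper call and setInputFormatClass) are uncommented together.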

The above code, run on the example data above, produces:

map 10

I want the output to be:

map 2

where the first mapper does the first 5 lines, and the second mapper does the second 5 lines.

You can use NLineInputFormat.

With NLineInputFormat, you can specify exactly how many lines should go to each mapper. E.g. if your file has 500 lines and you set the number of lines per mapper to 10, you get 50 mappers (instead of 1, assuming the file is smaller than an HDFS block).
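The mapper count in that example is just a ceiling division: NLineInputFormat creates one split per N lines, so the number of map tasks is ceil(totalLines / linesPerMap). A quick standalone check of the figures above (the class and method names here are illustrative, not part of the Hadoop API):

```java
public class SplitCount {
    // ceil(totalLines / linesPerMap) using integer arithmetic
    static int numMappers(int totalLines, int linesPerMap) {
        return (totalLines + linesPerMap - 1) / linesPerMap;
    }

    public static void main(String[] args) {
        System.out.println(numMappers(500, 10)); // 500-line file, 10 lines per mapper -> 50
        System.out.println(numMappers(10, 5));   // the 10-line example file -> 2
    }
}
```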

Edit:

Here is an example using NLineInputFormat:

Mapper class:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapperNLine extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value);
    }
}

Driver class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Driver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.out.printf(
                "Two parameters are required for DriverNLineInputFormat - <input dir> <output dir>\n");
            return -1;
        }

        Job job = new Job(getConf());
        job.setJobName("NLineInputFormat example");
        job.setJarByClass(Driver.class);

        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 5);

        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MapperNLine.class);
        job.setNumReduceTasks(0);

        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
        System.exit(exitCode);
    }
}

With the input provided above, the output of the sample mapper is written to 2 files, as 2 mappers get initialized:

part-m-00001

0	this is
8	an arbitrary example file
34	of 10 lines.
47	each line does
62	not have to be

part-m-00002

77	of
80	the same
89	length or contain
107	the same
116	number of words
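The keys in this output are the byte offsets of each line in the input file — the LongWritable key that NLineInputFormat hands the mapper. Each offset is the sum of all previous line lengths plus one byte per newline. This standalone sketch reproduces them (the class name is illustrative, and the line text is reconstructed to be consistent with the offsets above):

```java
public class OffsetDemo {
    // Byte offset of line i = sum of lengths of lines 0..i-1, plus one '\n' each
    static long[] offsets(String[] lines) {
        long[] out = new long[lines.length];
        long off = 0;
        for (int i = 0; i < lines.length; i++) {
            out[i] = off;
            off += lines[i].length() + 1; // +1 for the newline terminator
        }
        return out;
    }

    public static void main(String[] args) {
        // Reconstructed 10-line example file (text lengths match the offsets above)
        String[] lines = {
            "this is", "an arbitrary example file", "of 10 lines.",
            "each line does", "not have to be", "of", "the same",
            "length or contain", "the same", "number of words"
        };
        long[] offs = offsets(lines);
        for (int i = 0; i < lines.length; i++) {
            System.out.println(offs[i] + "\t" + lines[i]);
        }
    }
}
```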

Tags: java, hadoop, input-split, mapreduce
