sorting - What's the fastest approach to merging a small number of large, already sorted lists in Hadoop?

May 15, 2012

I've got a small Hadoop (CDH 5.1.0, MRv2/YARN) cluster (5 nodes, each with 4 CPUs, 16 GB RAM and 600 GB disk) which contains a small number (~30) of ~15 GB SequenceFiles. The SequenceFiles contain pairs of BytesWritable/BytesWritable, and the keys are not uniformly distributed across the possible keyspace; it's rather lumpy. However, these files are sorted.

I need to merge these to create a single, sorted SequenceFile, as efficiently as possible. I've tried a number of approaches already, but haven't been successful.

Initially, I tried using a MapReduce job with RandomSampler and TotalOrderPartitioner, and around 1000 reducers. However, it turns out that because of the non-uniformity of the input keys, RandomSampler isn't good at distributing the data evenly across partitions, and I end up with 999 reducers succeeding and 1 failing due to running out of local disk.
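To make the skew concrete, here is a minimal, self-contained sketch in plain Java (no Hadoop dependency). It only mimics the general idea of range partitioning by binary search over sorted split points; it does not reproduce RandomSampler or TotalOrderPartitioner themselves. The class `PartitionSkew`, the integer keys and the data set are all hypothetical, chosen purely for illustration. It shows that split points which do not match the key distribution send most records to one partition, whereas split points taken at evenly spaced ranks of the actual keys balance the load.

```java
import java.util.Arrays;

final class PartitionSkew {
    /**
     * Partition index for a key, given sorted, distinct split points.
     * Partition i holds keys in [splits[i-1], splits[i]).
     */
    static int partitionFor(int key, int[] splits) {
        int pos = Arrays.binarySearch(splits, key);
        // Found at pos: the key equals a split point and starts partition pos + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    /** Counts how many keys land in each partition. */
    static int[] countPerPartition(int[] keys, int[] splits) {
        int[] counts = new int[splits.length + 1];
        for (int key : keys) {
            counts[partitionFor(key, splits)]++;
        }
        return counts;
    }

    /** Hypothetical lumpy data: 100 sparse keys plus 900 keys packed into a narrow range. */
    static int[] lumpyKeys() {
        int[] keys = new int[1000];
        for (int i = 0; i < 100; i++) keys[i] = i * 100;            // 0, 100, ..., 9900
        for (int j = 0; j < 900; j++) keys[100 + j] = 5001 + 2 * j; // 5001, 5003, ..., 6799
        Arrays.sort(keys);
        return keys;
    }

    /** Split points taken at evenly spaced ranks of the sorted keys. */
    static int[] quantileSplits(int[] sortedKeys, int partitions) {
        int[] splits = new int[partitions - 1];
        for (int i = 1; i < partitions; i++) {
            splits[i - 1] = sortedKeys[(int) ((long) i * sortedKeys.length / partitions)];
        }
        return splits;
    }

    public static void main(String[] args) {
        int[] keys = lumpyKeys();
        // Split points spaced evenly over the keyspace, ignoring where the keys actually are.
        System.out.println(Arrays.toString(
            countPerPartition(keys, new int[] {2500, 5000, 7500}))); // [25, 25, 925, 25]
        // Split points derived from the real key distribution.
        System.out.println(Arrays.toString(
            countPerPartition(keys, quantileSplits(keys, 4))));      // [250, 250, 250, 250]
    }
}
```

In the first run one partition receives 925 of the 1000 records, which is the same shape of failure as the single reducer that runs out of local disk.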

It occurs to me that this approach doesn't appear to take advantage of the fact that the input data is sorted; it would solve the problem even if the input data were in random order.

I notice there's a SequenceFile.Sorter class that aims to merge SequenceFiles into a single sorted output. While it's a single-threaded process, would it improve on the MR approach? Is there a different MR approach I could take to exploit the fact that the input data is sorted? It would seem that the fastest way is a simple merge, but is there a way to do it in parallel across the cluster?
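For reference, the "simple merge" of already sorted inputs is a k-way merge: keep the current head of each input in a min-heap and repeatedly emit the smallest. The sketch below is plain Java over in-memory lists, not Hadoop code, and it is not a description of how SequenceFile.Sorter is implemented. The class `KWayMerge` and its `merge` method are hypothetical names, and the generic element type stands in for the BytesWritable keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class KWayMerge {
    /** Current head element of one input, plus the iterator it was read from. */
    private static final class Head<T> {
        final T value;
        final Iterator<T> rest;

        Head(T value, Iterator<T> rest) {
            this.value = value;
            this.rest = rest;
        }
    }

    /** Merges already sorted inputs into one sorted list using a heap of at most k entries. */
    static <T> List<T> merge(List<? extends Iterable<T>> sortedInputs,
                             Comparator<? super T> order) {
        PriorityQueue<Head<T>> heap =
            new PriorityQueue<>((a, b) -> order.compare(a.value, b.value));
        // Seed the heap with the first element of every non-empty input.
        for (Iterable<T> input : sortedInputs) {
            Iterator<T> it = input.iterator();
            if (it.hasNext()) {
                heap.add(new Head<>(it.next(), it));
            }
        }
        List<T> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            // The smallest head across all inputs is the next output element.
            Head<T> smallest = heap.poll();
            out.add(smallest.value);
            if (smallest.rest.hasNext()) {
                heap.add(new Head<>(smallest.rest.next(), smallest.rest));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> inputs = Arrays.asList(
            Arrays.asList("a", "d", "x"),
            Arrays.asList("b", "c", "z"),
            Arrays.asList("a", "y"));
        System.out.println(merge(inputs, Comparator.<String>naturalOrder()));
        // prints [a, a, b, c, d, x, y, z]
    }
}
```

With k inputs and n records in total this does O(n log k) comparisons and reads each input strictly sequentially, which is why a plain merge looks attractive for ~30 sorted files. The open question is how to spread that work across the cluster rather than run it on one node.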

sorting hadoop mapreduce
