Expected speedup from embarrassingly parallel task using Python Multiprocessing
I'm learning to use Python's multiprocessing package for embarrassingly parallel problems, so I wrote serial and parallel versions that determine the number of primes less than or equal to a natural number n. Based on what I read in a blog post and a Stack Overflow question, I came up with the following code:
serial
    import math
    import time

    def is_prime(start, end):
        """Determine how many primes lie within the given range."""
        numprime = 0
        for n in range(start, end + 1):
            isprime = True
            for x in range(2, math.floor(math.sqrt(n)) + 1):
                if n % x == 0:
                    isprime = False
                    break
            if isprime:
                numprime += 1
        if start == 1:
            numprime -= 1  # since 1 is not prime
        return numprime

    if __name__ == "__main__":
        natnum = 0
        while natnum < 2:
            natnum = int(input('Enter a natural number greater than 1: '))
        starttime = time.time()
        finalresult = is_prime(1, natnum)
        print('Elapsed time:', time.time() - starttime, 'seconds')
        print('The number of primes <=', natnum, 'is', finalresult)
parallel
    import math
    import multiprocessing as mp
    import numpy
    import time

    def is_prime(vec, output):
        """Determine how many primes are in the given vector."""
        numprime = 0
        for n in vec:
            isprime = True
            for x in range(2, math.floor(math.sqrt(n)) + 1):
                if n % x == 0:
                    isprime = False
                    break
            if isprime:
                numprime += 1
        if vec[0] == 1:
            numprime -= 1  # since 1 is not prime
        output.put(numprime)

    def chunks(vec, n):
        """Evenly split a list into chunks of size n."""
        for i in range(0, len(vec), n):
            yield vec[i:i + n]

    if __name__ == "__main__":
        natnum = 0
        while natnum < 2:
            natnum = int(input('Enter a natural number greater than 1: '))
        numproc = 0
        while numproc < 1:
            numproc = int(input('Enter the number of desired parallel processes: '))
        starttime = time.time()
        numsplits = math.ceil(natnum / numproc)
        splitlist = list(chunks(tuple(range(1, natnum + 1)), numsplits))
        output = mp.Queue()
        processes = [mp.Process(target=is_prime, args=(splitlist[jobid], output))
                     for jobid in range(numproc)]
        for p in processes:
            p.start()
        for p in processes:
            p.join()
        print('Elapsed time:', time.time() - starttime, 'seconds')
        procresults = [output.get() for p in processes]
        finalresult = numpy.sum(numpy.array(procresults))
        print('Results for each process:\n', procresults)
        print('The number of primes <=', natnum, 'is', finalresult)
Here are the results for n = 10000000 (for the parallel run I requested 8 processes):
    $ python serial_prime_test.py
    Enter a natural number greater than 1: 10000000
    Elapsed time: 162.1960825920105 seconds
    The number of primes <= 10000000 is 664579

    $ python parallel_prime_test.py
    Enter a natural number greater than 1: 10000000
    Enter the number of desired parallel processes: 8
    Elapsed time: 49.41204643249512 seconds
    Results for each process:
     [96469, 86603, 83645, 80303, 81796, 79445, 78589, 77729]
    The number of primes <= 10000000 is 664579
So it looks like I can get a little over a 3x speedup. Here are my questions:
Clearly this is not a linear speedup, so how much better can I do (or what speedup should I realistically expect)? It looks like Amdahl's law answers this, but I don't know how to determine what fraction of my program is strictly serial. Any help is appreciated.
Edit: There are 4 physical cores, capable of hyperthreading.
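For a rough feel for the numbers: Amdahl's law says a program with parallel fraction p run on N workers speeds up by S = 1 / ((1 - p) + p/N). The small sketch below back-solves p from the measured times above, assuming the 4 physical cores do all the parallel work (hyperthreading ignored):

    # Back out the Amdahl parallel fraction p from a measured speedup S on N cores:
    #   S = 1 / ((1 - p) + p/N)  =>  p = (1 - 1/S) / (1 - 1/N)
    serial_time = 162.196    # measured serial run, seconds
    parallel_time = 49.412   # measured 8-process run, seconds
    N = 4                    # physical cores (an assumption; hyperthreads excluded)

    S = serial_time / parallel_time
    p = (1 - 1 / S) / (1 - 1 / N)
    print(f"speedup = {S:.2f}x, implied parallel fraction = {p:.0%}")
    # With these numbers: speedup is about 3.28x and p comes out near 93%,
    # i.e. roughly 7% of the run behaves as if it were serial (or overhead).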
I think you want to split up the work differently.
Although your program divides the range of candidate integers evenly across cores, the work in each range is not even. That means some cores finish early and have nothing to do, while others are still running. That loses parallel efficiency, fast.
Just to make the point, imagine you have 1000 cores. The first core sees only very small candidate numbers and doesn't take long to factor them, then goes idle. The last (thousandth) core sees only very big candidate numbers, and takes much longer to factor them. So it keeps running while the first core sits idle. Those are wasted cycles. The same thing happens with 4 cores.
What you want to do, when the amount of work being handed to a core is unknown, is to hand all cores a lot of modest-sized chunks, many more than there are cores. Then the cores can finish at uneven rates, and each core goes back to find a bit more work to do. This is essentially a work-list algorithm. You end up with unevenness at the very end, but only on small chunks, so not much is wasted.
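A quick back-of-envelope check of how uneven the even split really is, using sqrt(n) as a rough upper bound on the trial-division cost of candidate n (a sketch only, not part of the timing runs):

    import math

    # Estimate the relative work in each of 8 equal chunks of 1..10_000_000,
    # using sqrt(n) as an upper bound on the cost of trial-dividing candidate n.
    N, k = 10_000_000, 8
    chunk = N // k
    work = [sum(math.isqrt(n) for n in range(i * chunk + 1, (i + 1) * chunk + 1))
            for i in range(k)]
    print([round(w / work[0], 2) for w in work])
    # roughly [1.0, 1.8, 2.4, 2.8, 3.2, 3.5, 3.8, 4.1]:
    # the last chunk carries about 4x the work of the first.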
I'm not a Python programmer, so I coded a solution in PARLANSE instead.
    (includeunique `console.par')
    (includeunique `timer.par')

    (define upper_limit 10000000)
    (define candidates_per_segment 10)
    (define candidates_per_segment2 (constant (* candidates_per_segment 2)))

    (= [prime_count natural] 0)
    [prime_finding_team team]

    (define primes_in_segment
       (action (procedure [lower natural] [upper natural])
          (;;
             (do [candidate natural] lower upper 2
                (block test_primality
                   (local (= [divisor natural] 3)
                      (;;
                         (while (< (* divisor divisor) candidate)
                            (ifthenelse (== (modulo candidate divisor) 0)
                               (exitblock test_primality)
                               (+= divisor 2)
                            )ifthenelse
                         )while
                         (ifthen (~= (* divisor divisor) candidate)
                            (consume (atomic+= prime_count))
                         )ifthen
                      );;
                   )local
                )block
             )do
          );;
       )action
    )define

    (define main
       (action (procedure void)
          (local [timer timer:timer]
             (;;
                (console:put (. `number of primes found: '))
                (timer:reset (. timer))
                (do [i natural] 1 upper_limit candidates_per_segment2
                   (consume (draft prime_finding_team primes_in_segment
                               `lower':i
                               `upper':(minimum upper_limit (- (+ i candidates_per_segment2) 2))))
                )do
                (consume (wait (@ (event prime_finding_team))))
                (timer:stop (. timer))
                (console:putnatural prime_count)
                (console:putnewline)
                (timer:printelapsedtime (. timer) (. `parallel computed in '))
                (console:putnewline)
             );;
          )local
       )action
    )define
PARLANSE looks like Lisp, but works and compiles more like C.
The worker is primes_in_segment; it takes a range of candidate values defined by its parameters lower and upper. It tries each candidate in that range, and increments (atomically) the total prime_count if that candidate is prime.
The total range is split into small packets of ranges (sequences of odd numbers) by the loop in main. The parallelism happens on the draft command, which creates a parallel execution grain of computation (not a Windows thread) and adds it to prime_finding_team, an aggregate set of work representing all the prime factoring. (The purpose of a team is to allow all this work to be managed as a unit, e.g., destroyed if necessary, though that is not needed in this program). The arguments to draft are the function to be run by the forked grain, and its parameters. The work is accomplished by a PARLANSE-managed set of (Windows) threads using a work-stealing algorithm. If there is too much work, PARLANSE throttles the work-generating grains, and spends its energy executing grains that are pure computation.
One could pass just one candidate value to each grain, but then there is more fork overhead per candidate and the total runtime gets accordingly worse. We chose 10 empirically to ensure that the fork overhead per range of candidates is small; setting the candidates per segment to 1000 doesn't buy much additional speedup.
The do loop simply manufactures work as fast as it can. PARLANSE throttles the draft step when there is enough parallelism to be useful. The wait on the team event causes the main program to wait for all team members to complete.
We ran this on an HP hex-core AMD Phenom II X6 1090T at 3.2 GHz. The execution runs are below; first with 1 CPU:
    >run -p1 -v ..\teamprimes
    PARLANSE RTS: version 19.1.53
    # processors = 1
    number of primes found: 664579
    parallel computed in 13.443294 seconds
    ---- PARLANSE RTS: performance statistics
      duration = 13.527557 seconds.
      CPU time statistics:
        kernel CPU time: 0.031s
        user CPU time: 13.432s
      memory statistics:
        peak memory usage: 4.02 MBytes
      steals: 0  attempts: 0  success rate: 0.0%  work rediscovered: 0
    exiting final status 0.
Then with 6 CPUs (it scales nicely):
    >run -p6 -v ..\teamprimes
    PARLANSE RTS: version 19.1.53
    # processors = 6
    number of primes found: 664579
    parallel computed in 2.443123 seconds
    ---- PARLANSE RTS: performance statistics
      duration = 2.538972 seconds.
      CPU time statistics:
        kernel CPU time: 0.000s
        user CPU time: 14.102s
        total CPU time: 14.102s
      memory statistics:
        peak memory usage: 4.28 MBytes
      steals: 459817  attempts: 487334  success rate: 94.4%  work rediscovered: 153
You'll note the total CPU time for the parallel version is about the same as for the serial version; that's because they are doing the same work.
given python's "fork" , "join" operations, i'm sure there's python equivalent can code. might run out of space or threads because of possibility of many forks @ same time. (with candidates_per_segment @ 10, there 1 1000000 live grains). automatic throttling generation of work thing have. substitute, can set candidates_per_segment much larger number, e.g., 10000, means 1000 threads worst case. (i think still pay high cost due python's interpretive nature). set candidates per segment closer , closer 1e7/4, you'll approach exact behavior have nowadays python code.
python parallel-processing python-multiprocessing embarrassingly-parallel parallelism-amdahl