Parsing large file with MPI in C++
I have a C++ program in which I want to parse a huge file, looking for some regexes that I've implemented. The program works fine when executed sequentially, but I want to run it using MPI.
I started the adaptation to MPI by differentiating the master (the one that coordinates the execution) from the workers (the ones that parse the file in parallel) in the main function. Something like this:
    MPI::Init(argc, argv);
    ...
    if (rank == 0) {
        // the master sends the starting and ending byte to every worker
        for (int i = 1; i < total_workers; i++) {
            array[0] = (i - 1) * first_worker_file_part;
            array[1] = i * first_worker_file_part;
            MPI::COMM_WORLD.Send(array, 2, MPI::INT, i, 1);
        }
    }
    if (rank != 0)
        readDocument();
    ...
    MPI::Finalize();
The master sends every worker an array of 2 positions that contains the byte where it has to start reading the file in position 0 and the byte where it needs to stop reading in position 1.
The readDocument() function looks like this (no parsing yet, just each worker reading its part of the file):
    void readDocument() {
        int* array = new int[2];
        MPI::Status status;
        // receive the byte range from the master; the count should be 2, not 10
        MPI::COMM_WORLD.Recv(array, 2, MPI::INT, 0, 1, status);
        int read_length = array[1] - array[0];
        char* buffer = new char[read_length];
        if (infile) {                   // infile is an already-open ifstream
            infile.seekg(array[0]);     // start reading at the supposed byte
            infile.read(buffer, read_length);
        }
    }
I've tried different examples, writing the output of the reads to a file and running with different numbers of processes. What happens is that when I run the program with 20 processes instead of 10, for example, it takes twice as long to read the file. I expected it to take half the time and I can't figure out why this is happening.
Also, on a different matter, I want to make the master wait for all the workers to finish their execution and then print the final time. Is there a way to "block" it while the workers are processing? Something like cond_wait in C pthreads?
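For the blocking question, MPI already provides a collective barrier and a wall-clock timer, so no pthreads-style condition variable is needed. Below is a minimal sketch, using the same deprecated C++ bindings as the question; the placement of the barrier relative to the actual work is only illustrative:

    #include <mpi.h>
    #include <iostream>

    int main(int argc, char* argv[]) {
        MPI::Init(argc, argv);
        int rank = MPI::COMM_WORLD.Get_rank();

        double start = MPI::Wtime();   // wall-clock timer provided by MPI

        // ... master distributes byte ranges, workers read/parse their part ...

        MPI::COMM_WORLD.Barrier();     // every rank, master included, blocks here
                                       // until all ranks have reached this point

        if (rank == 0)
            std::cout << "total time: " << MPI::Wtime() - start << " s\n";

        MPI::Finalize();
        return 0;
    }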
In my experience, people working on computer systems with parallel file systems tend to know about parallel file systems, so your question marks you out, initially, as probably not working on such a system.
Without specific hardware support, reading from a single file boils down to the system positioning a single read head and reading a sequence of bytes from the disk into memory. This situation is not materially altered by the complex realities of many modern file systems, such as RAID, which may in fact store the file across multiple disks. When multiple processes ask the operating system for access to files at the same time, the O/S parcels out disk access according to some notion, perhaps of fairness, so that no process gets starved. At worst the O/S spends so much time switching disk access from process to process that the rate of reading drops significantly. The most efficient approach, in terms of throughput, is for a single process to read the entire file in one go while the other processes do other things.
This situation, multiple processes contending for scarce disk I/O resources, applies whether or not those processes are part of a parallel, MPI (or similar) program or are entirely separate programs running concurrently.
This is the impact you observe: instead of 10 processes each waiting for its own 1/10th share of the file, you have 20 processes each waiting for a 1/20th share. Ah, but, you cry, since each process is reading half as much data the whole gang should take the same amount of time to read the file. No, I respond, you've forgotten to add the time it takes the O/S to position and reposition the read/write heads between accesses. Read time comprises latency (how long it takes for reading to start once the request has been made) and throughput (how fast the I/O system can pass the bytes to and fro).
It should be easy to come up with reasonable estimates of latency and bandwidth that explain why reading with 20 processes takes twice as long as with 10.
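To make that concrete, here is a back-of-the-envelope sketch; the seek time, bandwidth and per-turn request size below are illustrative assumptions, not measurements of any particular system:

    #include <cstdio>

    // Toy model of one interleaved disk request: a fixed repositioning
    // latency plus a transfer term. All numbers are assumptions.
    int main() {
        const double seek_s    = 0.010;                // ~10 ms to reposition the head
        const double bandwidth = 100.0 * 1024 * 1024;  // ~100 MB/s sustained transfer
        const double request   = 64.0 * 1024;          // ~64 KB serviced per turn

        double transfer_s = request / bandwidth;       // ~0.6 ms of useful transfer
        double total_s    = seek_s + transfer_s;

        std::printf("per request: %.1f ms, of which %.0f%% is repositioning\n",
                    total_s * 1000.0, 100.0 * seek_s / total_s);
        return 0;
    }

With numbers like these, almost all of each interleaved request is spent repositioning rather than transferring bytes, so the more often the O/S has to switch between competing readers, the further the aggregate read rate falls.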
How can you solve this? You can't, not without a parallel file system. You might find that having the master process read the whole file and then parcel it out is faster than your current approach. You might not; you might find that the current approach is the fastest for the whole computation. If the read time is, say, 10% of the total computation time, you might decide it's a reasonable overhead to live with.
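If you want to try the "master reads, workers receive" alternative, a minimal sketch follows. It reuses the deprecated C++ bindings from the question; the file name is hypothetical, and for brevity the leftover bytes at the end of the file and any error handling are ignored:

    #include <mpi.h>
    #include <fstream>
    #include <vector>

    int main(int argc, char* argv[]) {
        MPI::Init(argc, argv);
        int rank = MPI::COMM_WORLD.Get_rank();
        int size = MPI::COMM_WORLD.Get_size();

        if (rank == 0) {
            std::ifstream infile("huge_file.txt", std::ios::binary); // hypothetical name
            infile.seekg(0, std::ios::end);
            long file_size = infile.tellg();
            infile.seekg(0, std::ios::beg);
            long chunk = file_size / (size - 1);

            std::vector<char> buffer(chunk);
            for (int i = 1; i < size; i++) {
                // one sequential pass over the file: read a chunk, ship it off
                infile.read(buffer.data(), chunk);
                MPI::COMM_WORLD.Send(buffer.data(), (int)chunk, MPI::CHAR, i, 0);
            }
        } else {
            MPI::Status status;
            MPI::COMM_WORLD.Probe(0, 0, status);        // find out how big the chunk is
            int count = status.Get_count(MPI::CHAR);
            std::vector<char> buffer(count);
            MPI::COMM_WORLD.Recv(buffer.data(), count, MPI::CHAR, 0, 0);
            // ... run the regex parsing over buffer here ...
        }

        MPI::Finalize();
        return 0;
    }

The point of this layout is that the disk only ever sees one sequential reader, while the network (or shared memory) carries the chunks to the workers; whether that wins in practice depends on your system, so measure both.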
c++ file mpi