Graph traversal is a widely used algorithm in a variety of fields, including social networks, business analytics, and high-performance computing. There has been a push for HPC machines to be rated not just in petaflops, but also in "GigaTEPS" (billions of traversed edges per second), and the Graph500 benchmark has been established for this purpose. Graph traversal on single nodes has been well studied and optimized on modern CPU architectures.

Task parallelism is an attractive approach to automatically load balance the computation in a parallel system and adapt to dynamism exhibited by parallel systems. Exploiting task parallelism through work stealing has been extensively studied in shared- and distributed-memory contexts. In this paper, we study the design of a system that uses work stealing for dynamic load balancing of task-parallel programs executed on hybrid distributed-memory CPU-graphics processing unit (GPU) systems in a global-address space framework. We take into account the unique nature of the accelerator model employed by GPUs, the significant performance difference between GPU and CPU execution as a function of problem size, and the distinct CPU and GPU memory domains. We consider various alternatives in designing a distributed work stealing algorithm for CPU-GPU systems, while taking into account the impact of task distribution and data movement overheads. These strategies are evaluated using microbenchmarks that capture various execution configurations as well as the state-of-the-art CCSD(T) application module from the computational chemistry domain.

Graph processing is used in many fields of science, such as sociology, risk prediction, and biology. Although analysis of graphs is important, it also poses numerous challenges, especially for large graphs that have to be processed on multicore systems. In this paper, we present a PGAS (Partitioned Global Address Space) version of the level-synchronous breadth-first search (BFS) algorithm and its implementation written in Java. Java has so far not been used extensively in high-performance computing, but because of its popularity, portability, and increasing capabilities it is becoming more widely exploited, especially for data analysis. The level-synchronous BFS has been implemented using the PCJ (Parallel Computations in Java) library. We present implementation details and compare the scalability and performance of our implementation with the MPI implementation of the Graph500 benchmark. We show good scalability and performance of our implementation in comparison with MPI code written in C. We present the challenges we faced and the optimizations that were necessary to obtain good performance.
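To make the term "level-synchronous BFS" concrete, here is a minimal single-process sketch in Java. It is not the PCJ implementation described above and not the Graph500 reference code. The class name `LevelSyncBfs`, the `Result` holder, and the `teps` helper are hypothetical names chosen for illustration. The edge counter simply counts every adjacency-list entry scanned, which is a simplification of how Graph500 counts traversed edges.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: a sequential level-synchronous BFS.
// Names and structure are assumptions, not taken from PCJ or Graph500.
class LevelSyncBfs {

    /** BFS level of each vertex (-1 if unreachable) and adjacency entries scanned. */
    static final class Result {
        final int[] level;
        final long edgesTraversed;

        Result(int[] level, long edgesTraversed) {
            this.level = level;
            this.edgesTraversed = edgesTraversed;
        }
    }

    static Result bfs(List<List<Integer>> adj, int root) {
        int[] level = new int[adj.size()];
        Arrays.fill(level, -1);
        level[root] = 0;

        List<Integer> frontier = new ArrayList<>();
        frontier.add(root);
        long edges = 0;
        int depth = 0;

        // Each loop iteration processes exactly one BFS level.
        while (!frontier.isEmpty()) {
            List<Integer> next = new ArrayList<>();
            for (int u : frontier) {
                for (int v : adj.get(u)) {
                    edges++;                 // simplified edge count (assumption)
                    if (level[v] == -1) {    // first visit: v belongs to the next level
                        level[v] = depth + 1;
                        next.add(v);
                    }
                }
            }
            // A distributed version would exchange newly found vertices
            // and synchronize all processes here before the next level.
            frontier = next;
            depth++;
        }
        return new Result(level, edges);
    }

    /** Traversed edges per second for a given edge count and elapsed time. */
    static double teps(long edges, double seconds) {
        return edges / seconds;
    }

    public static void main(String[] args) {
        // Small undirected graph: 0-1, 0-2, 1-3, 2-3; vertex 4 is isolated.
        List<List<Integer>> adj = List.of(
                List.of(1, 2), List.of(0, 3), List.of(0, 3),
                List.of(1, 2), List.<Integer>of());
        long start = System.nanoTime();
        Result r = bfs(adj, 0);
        double seconds = Math.max(System.nanoTime() - start, 1) / 1e9;
        System.out.println("levels = " + Arrays.toString(r.level));
        System.out.println("TEPS   = " + teps(r.edgesTraversed, seconds));
    }
}
```

The point of the sketch is the loop structure: all vertices at distance d are expanded before any vertex at distance d + 1. That is why a distributed version needs a synchronization step between levels.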