Fall 2017 ECE Distinguished Seminar
How to Trade-off Latency for Parallelism
Walid Najjar, Ph.D.
Department of Computer Science and Engineering
University of California Riverside
Friday, November 17, 2017, 2:00 pm
The “big data” era brings in several new challenges to high-performance computing: (1) massive amounts of data on an unprecedented scale, (2) growing gap between processing and memory (the “memory wall”), (3) the move to in-memory processing using non-volatile main memory (NVMM) and the increased reliance on irregular applications for the processing of that data. Big data applications (e.g. recommendation systems, belief propagation, community detection, fraud detection, dimensionality reduction etc) rely of a algorithms that exhibit very poor localities, spatial or temporal (e.g. singular value, eigenvalue or tensor decompositions, generalized SpMV, stochastic gradient descent or time series similarity search) and hence cannot take advantage of the massive cache hierarchies available on modern high-end multicore processors. These cache structures take up to 80% of the chip area and a proportional amount of its power budget. Previous studies on database applications, both analytical and transactional, have shown that over 60% of the CPU cycles were spent on data cache stalls.
Latency masking multithreading, where threads relinquish control after issuing a memory request, has been demonstrated as an effective approach to achieving a higher throughput. Multithreaded CPUs are designed for a fixed maximum number of threads tailored for an average application. FPGAs, however, can be customized to specific applications. Using Little’s Law, we can show that the concurrency achieved by this execution model is product of the memory bandwidth and the average memory latency. In other words, the parallelism is equal to the number of outstanding memory requests. Hence, this model is capable of masking latency and achieving a high degree of parallelism provided (1) a large memory bandwidth and (2) a large number of independent threads. Inter-thread synchronization is achieved by using content addressable memories (CMAs), on the FPGA fabric, to cache the shared locks. This model has been evaluated on various applications such as bioinformatics, sparse linear algebra and database operations.
Walid A. Najjar is a Professor in the Department of Computer Science and Engineering at the University of California Riverside. His areas of research include computer architectures and compilers for parallel and high-performance computing, embedded systems, FPGA-based code acceleration and reconfigurable computing. He received a B.E. in Electrical Engineering from the American University of Beirut, and the M.S. and Ph.D. in Computer Engineering from the University of Southern California. He was on the faculty of the Department of Computer Science at Colorado State University, and with the USC-Information Sciences Institute. He was elected Fellow of the IEEE and the AAAS.