Objectives
Abstract: GPU-based heterogeneous HPC systems offer new opportunities for the exascale era thanks to their high performance and low energy consumption. However, many studies have shown that GPUs are less reliable than CPUs, and the reliability issues that may arise from integrating general-purpose GPUs into HPC systems have not been adequately analyzed. If the inherent reliability issues of GPUs are not sufficiently understood, the cost of integrating GPUs into HPC systems, where many critical calculations are performed, may increase. This talk presents a survey of the reliability issues of GPU-based heterogeneous systems. To this end, studies analyzing the error/failure logs of homogeneous and heterogeneous systems will be compared in order to understand the reliability behavior of GPU-based heterogeneous systems. The error tolerance techniques developed for GPU-based systems will also be reviewed, and the challenges will be clarified.
Short bio: Dr. Gulay Yalcin is an Assistant Professor at Abdullah Gül University in Turkey. She holds a B.Sc. degree in Computer Engineering from Hacettepe University and an M.Sc. degree in Computer Engineering from TOBB University of Economics and Technology. Between 2009 and 2014, she pursued her Ph.D. degree in the Computer Architecture department at Universitat Politècnica de Catalunya. During her Ph.D. studies, she also worked as a student researcher at Barcelona Supercomputing Center. Prior to joining Abdullah Gül University as a faculty member in Computer Engineering in December 2015, she continued working at Barcelona Supercomputing Center as a postdoctoral researcher. Her research focuses on highly reliable and energy-efficient fault-tolerant designs in computer architecture.