Speaker: Dr. Wei Zhang, IBM T.J. Watson Research Center
Deep learning is a powerful machine learning tool that achieves promising results in image classification, natural language processing, speech recognition, and many other application domains. Deep learning is particularly effective when training data is abundant and the model has many parameters, and it therefore demands massive computing resources (e.g., HPC clusters). How to efficiently use computation hardware at large scale to solve deep learning optimization problems is a fundamental research topic. In this talk, I will first present the fundamentals of distributed deep learning algorithms, and then present several lessons we learned over the past three years of research into building large-scale deep learning systems. The covered topics include (i) our work on the tradeoff between model accuracy and runtime performance, (ii) how to build scale-up multi-GPU systems in a training-as-a-service scenario on the cloud, and (iii) how to build scale-out systems that run at the scale of hundreds of GPUs on HPC machines. The resulting systems typically shorten training time from weeks to hours while maintaining or improving baseline model accuracy. The lessons and experiences were drawn from several real-world systems: IBM's Natural Language Classifier (NLC), one of IBM's most widely used cognitive services; IBM's Speech to Text (STT) service, the key speech recognition technology behind IBM's Jeopardy and the recent Debater projects; and our experience running these systems on CORAL machines (i.e., the precursor of IBM's Summit supercomputer, the fastest HPC machine in the world).
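To give a flavor of the distributed deep learning fundamentals the talk covers, here is a minimal sketch of synchronous data-parallel training: each worker computes a gradient on its own data shard, the gradients are averaged (the "all-reduce" step), and every replica applies the same update. This is a simplified NumPy simulation on a least-squares problem, not the talk's actual systems; the function names and the four-worker split are illustrative assumptions.

```python
# Minimal sketch of synchronous data-parallel SGD (illustrative only):
# K workers each compute a local gradient on their shard; the gradients
# are averaged (simulating an all-reduce) and applied identically.
import numpy as np

def local_gradient(w, X, y):
    # Least-squares loss gradient on one worker's data shard.
    return 2.0 * X.T @ (X @ w - y) / len(y)

def sync_sgd_step(w, shards, lr=0.1):
    # "All-reduce": average the per-worker gradients, then update.
    grads = [local_gradient(w, X, y) for X, y in shards]
    return w - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
true_w = np.arange(1.0, 6.0)
y = X @ true_w
# Partition the data across 4 simulated workers.
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))

w = np.zeros(5)
for _ in range(200):
    w = sync_sgd_step(w, shards)
print(np.allclose(w, true_w, atol=1e-3))  # → True
```

With more workers, the per-step gradient is averaged over more shards while communication cost grows, which is the root of the accuracy-versus-runtime tradeoff the talk discusses.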
Dr. Wei Zhang (B.Eng.'05, Beijing University of Technology; M.Sc.'08, Technical University of Denmark; Ph.D.'13, University of Wisconsin-Madison) is a research staff member at IBM T.J. Watson Research Center. He currently works in the machine learning acceleration department, and his research interests include systems and large-scale optimization. His recent work on distributed deep learning has been published at ICDM (2016, 2017), IJCAI (2016, 2017), MASCOTS (2017), DAC (2017), AAAI (2018), NIPS (2017, 2018), and ICML (2018). His ICDM'16 paper was a best paper award runner-up, and his MASCOTS'17 paper was a best paper nominee. His NIPS'17 and ICML'18 papers were both invited for 20-minute oral presentations at the conferences. Prior to his IBM career, he studied under Prof. Shan Lu at UW-Madison, focusing on concurrent software system reliability. While at Wisconsin, he published papers at ASPLOS (2010, 2011, 2013), PLDI (2011), OSDI (2012), and OOPSLA (2013). His PLDI'11 paper won the SIGPLAN Research Highlights Award.