IBM Scales AI With a New Distributed Deep Learning Library
IBM has reached a milestone in making large artificial intelligence (AI) models practical to train with its new distributed deep learning (DDL) library, which scales to 256 GPUs across 64 IBM Power systems. IBM Research also set a new image recognition accuracy mark of 33.8 percent, HPCwire reports.
In benchmark testing on a data set of 7.5 million images, the PowerAI DDL attained 95 percent scaling efficiency on the Caffe deep learning framework and completed training in just seven hours. This beats the previous industry record, set by Microsoft in 2014 for a similar task, of 29.8 percent accuracy in 10 days.
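To put the 95 percent figure in context, here is a minimal sketch of how scaling efficiency is commonly defined: the speedup achieved over a single GPU divided by the number of GPUs. The article does not say exactly how IBM measured it, so this definition is an assumption, and the function names are purely illustrative.

```python
def scaling_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal (linear) speedup actually achieved.

    Assumes the common definition: speedup over one GPU divided by GPU count.
    """
    return speedup / num_gpus


def effective_speedup(efficiency: float, num_gpus: int) -> float:
    """Speedup over a single GPU implied by a given scaling efficiency."""
    return efficiency * num_gpus


if __name__ == "__main__":
    gpus = 256          # GPU count reported in the article
    efficiency = 0.95   # scaling efficiency reported in the article

    speedup = effective_speedup(efficiency, gpus)
    print(f"Implied speedup on {gpus} GPUs: {speedup:.1f}x")
    print(f"Ideal (linear) speedup: {gpus}x")
```

Under that assumed definition, 95 percent efficiency on 256 GPUs means the cluster does roughly the work of 243 perfectly utilized GPUs, with the remainder lost to communication and synchronization overhead.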
Increased Speed and Accuracy
The speed and accuracy gains of IBM’s PowerAI DDL will bring many advantages to enterprise clients, according to Sumit Gupta, vice president of AI and HPC for IBM’s Cognitive Systems business unit.
“If it takes 16 days to train an AI model, it’s not really practical,” Gupta told HPCwire. “You only have a few data scientists when you work in a large enterprise, and you really need to make them productive, so bringing down that 16 days to seven hours makes data scientists much more productive.”
Time-constrained applications are also poised to receive a major boost from faster machine learning, perhaps even more so in the future.
“In security, military, fraud protection and autonomous vehicles, you often only have minutes or seconds to train a system to deal with a new exploit or problem, but currently, it generally takes days,” said market analyst Rob Enderle, according to HPCwire. “This effectively reduces days to hours and provides a potential road map to get to minutes and even seconds.”
New DDL Addresses Performance Bottlenecks
IBM’s PowerAI DDL coordinates training across multiple servers rather than relying on a single deep learning framework or on individual GPUs to handle multinode communication, an approach that can create performance bottlenecks.
According to IBM, “this contention typically causes massive deep learning models on popular open-source deep learning frameworks to run over days and weeks,” TechRepublic reports. To address the issue, IBM used “dozens of servers connected to hundreds of GPU accelerators popular in gaming systems with near-perfect scaling.”
Meanwhile, further improvement is on the horizon: IBM researchers find it plausible to scale beyond 256 GPUs, HPCwire reports.
“We don’t see a reason why the method would slow down when we double the size of the system,” said Gupta, according to the source.