Deep Learning Performance with P100 GPUs

Deep Learning (DL), an area of Machine Learning, has achieved significant progress in recent years. Its application area includes pattern recognition, image classification, Natural Language Processing (NLP), autonomous driving and so on. Deep learning attempts to learn multiple levels of features of the input large data sets with multi-layer neural networks and make predictive decision for the new data. This indicates two phases in deep learning: first, the neural network is trained with large number of input data; second, the trained neural network is used to test/inference/predict the new data. Due to the large number of parameters (the weight matrix connecting neurons in different layers and the bias in each layer, etc.) and training set size, the training phase requires tremendous amounts of computation power.

To approach this problem, we utilize accelerators which include GPU, FPGA and DSP and so on. This blog focuses on GPU accelerator. GPU is a massively parallel architecture that employs thousands of small but efficient cores to accelerate the computational intensive tasks. Especially, NVIDIA? Tesla? P100? GPU uses the new Pascal? architecture to deliver very high performance for HPC and hyperscale workloads. In PCIe-based servers, P100 delivers around 4.7 and 9.3 TeraFLOPS of double and single precision performance, respectively. And in NVLink?-optimized servers, P100 delivers around 5.3 and 10.6 TeraFLOPS of double and single precision performance, respectively. This blog focuses on P100 for PCIe-based servers. P100 is also equipped with High Bandwidth Memory 2 (HBM2) which offers higher bandwidth than the traditional GDDR5 memory. Therefore, the high compute capability and high memory bandwidth make GPU an ideal candidate to accelerate deep learning applications.dell latitude e5540 battery

Deep Learning Frameworks and Dataset

In this blog, we will present the performance and scalability of P100 GPUs with different deep learning frameworks on a cluster. Three deep learning frameworks were chosen: NVIDIA’s fork of Caffe (NV-Caffe), MXNet and TensorFlow. Caffe is a well-known and widely used deep learning framework which is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. It focuses more on the image classification and it supports multiple GPUs within a node but not across nodes. MXNet, jointly developed by collaborators from multiple universities and companies, is a lightweight, portable and flexible deep learning framework designed for both efficiency and flexibility. This framework scales to multiple GPUs within a node and across nodes. TensorFlow, developed by Google’s Brain team, is a library for numerical computation using data flow graphs. TensorFlow also supports multiples GPUs and can scale to multiple nodes.

All of the three deep learning frameworks we chose are able to perform the image classification task. With this in mind, we chose the well-known ImageNet Large Scale Visual Recognition Competition (ILSVRC) 2012 dataset. This training dataset contains 1281167 training images and 50000 validation images. All images are grouped into 1000 categories or classes. Another reason we chose ILSVRC 2012 dataset is that its workload is large enough to perform long time training and it is a benchmark dataset used by many deep learning researchers.dell optiplex 980 power supply

Testing Methodology

This blog quantifies the performance of deep learning frameworks using NVIDIA’s P100-PCIe GPU and Dell’s PowerEdge C4130 server architecture. Figure 1 shows the testing cluster. The cluster includes one head node which is Dell’s PowerEdge R630 and four compute nodes which are Dell’s PowerEdge C4130. All nodes are connected by an InfiniBand network and they share disk storage through NFS. Each compute node has 2 CPUs and 4 P100-PCIe GPUs. All of the four compute nodes have the same configurations. Table 1 shows the detailed information about the hardware configuration and software used in every compute node.

We measured the metrics of both images/sec and training time.

The images/sec is the measurement for training speed while the training time is the wall clock time for training, I/O operation and other overhead. The images/sec number was obtained from “samples/sec” in MXNet and TensorFlow in the output log files. NV-Caffe listed “M s/N iter” as output which means M seconds were taken to process N iterations, or N batches. The metric images/sec was calculated by “batch_size*N/M”. The batch size is the number of training samples in one forward/backward pass through all layers of a neural network. The images/sec number was averaged across all iterations to take into account the deviations.dell optiplex 990 power supply

The training time was obtained from “Time cost” in MXNet output logs. For NV-Caffe and TensorFlow, their output log files contained the wall-clock timestamps during the whole training. So the time difference from the start to the end of the training was calculated as the training time.

Since NV-Caffe did not support distributed training, it was not executed on multiple nodes. The MXNet framework was able to run on multiple nodes. However, the caveat was that it could only use the Ethernet interface (10 Gb/s) on the compute nodes by default, and therefore the performance was not as high as expected. To solve this issue, we have manually changed its source code so that the high-speed InfiniBand interface (EDR 100 Gb/s) was used. The training with TensorFlow on multiple nodes was able to run but with poor performance and the reason is still under investigation.

Table 2 shows the input parameters used in different deep learning frameworks. In all deep learning frameworks, the neural network training requires many epochs or iterations. Whether the term epoch or iteration is used is determined by each framework. An epoch is a complete pass through all samples in a given dataset, while one iteration processes only one batch of samples. Therefore, the relationship between iterations and epochs is: epochs = (iterations*batch_size)/training_samples. Every framework only needs either epochs or iterations so that another parameter can be easily determined by this formula. Since our goal was to measure the performance and scalability of Dell’s server and not to train an end-to-end image classification model, the training was a subset of the full model training which was large enough to reflect performance. Therefore we chose a smaller number of epochs or iterations so that they could finish in a reasonable time. Although only partial training was performed, the training speed (images/sec) remained relatively constant over this period.optiplex 390 power supply

The batch size is one of the hyperparameters the user needs to tune when training a neural network model with mini-batch Stochastic Gradient Descent (SGD). The batch size in the table are commonly used sizes. Whether these batch sizes are optimized for model accuracy is left in future work. For all neural networks in all frameworks, we increased the batch size proportionally with increasing number of GPUs. In the meantime, the number of iterations was adjusted so that the total number of samples was fixed no matter how many GPUs were used. Since epoch has nothing to do with batch size, its value was not changed when a different number of GPUs was used. For MXNet GoogleNet, there was runtime error if different bath sizes were used for different number of GPUs, so we used constant batch size. Learning rate is another hyperparameter that needs to be tuned. In this experiment, the default value in each framework was used.b173zan01.0