Deep learning, a subset of machine learning, is based on artificial neural networks. The term “deep” in deep learning refers to multiple layers employed in the network.
In modern deep learning, deeper networks often perform better. However, adding layers increases model size and computational cost. To benefit from depth in practice, the network needs an optimized implementation.
In deep learning applications, accelerators provide higher throughput than general-purpose processors (CPUs), but there are several other metrics and usage scenarios in which CPUs are preferred or superior.
Optimization is not limited to training or running inference on a single CPU; it is equally important during deployment and during distributed training, that is, training on a system with multiple CPUs.
TensorFlow CPU Optimization
TensorFlow is an end-to-end, open-source framework for large-scale machine learning and deep learning. It provides a complete, adaptable ecosystem of tools, libraries, and community resources for developing and deploying machine learning models.
The CPU optimization techniques implemented in TensorFlow fall into two groups:
- Graph-Level optimizations
- Operator-Level optimizations
TensorFlow is a deep learning library with a Python API in which a model is represented as a data-flow graph. Optimizing the operations in the graph accelerates the corresponding neural network.
One of the graph-level optimizations implemented by Intel is:
Operator Fusion
As the name suggests, this optimization fuses a sequence of operators of different types into a single operator. Fusion avoids writing intermediate results to memory and reading them back, so it reduces execution time for memory-bound operations.
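The effect of fusing operators can be illustrated with a minimal sketch. Plain Python lists stand in for tensors here, and the function names (bias_add, relu, fused_bias_add_relu) are hypothetical; this is not TensorFlow's or Intel's implementation, only the general idea of replacing two passes over the data with one.

```python
# Illustrative sketch only: plain Python lists stand in for tensors.

def bias_add(xs, bias):
    # Separate operator: reads every input and writes an intermediate list.
    return [x + bias for x in xs]

def relu(xs):
    # Separate operator: reads the intermediate list back and writes the output.
    return [x if x > 0.0 else 0.0 for x in xs]

def fused_bias_add_relu(xs, bias):
    # Fused operator: a single pass over the data with no intermediate list.
    out = []
    for x in xs:
        y = x + bias
        out.append(y if y > 0.0 else 0.0)
    return out

inputs = [-2.0, -0.5, 0.0, 1.5, 3.0]
unfused = relu(bias_add(inputs, 0.5))
fused = fused_bias_add_relu(inputs, 0.5)
print(unfused == fused)  # both paths compute the same result
```

The fused version produces the same values while touching each element once, which is where the saving in memory traffic comes from.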
The operator-level optimizations implemented in TensorFlow include:
Batch Normalization Folding
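Batch normalization is commonly used during model training, where it normalizes layer outputs using batch statistics. During inference these statistics are fixed, so the batch normalization operations are no longer needed as separate steps. They can be folded into the weights of the preceding layer and removed, which improves inference performance.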
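A minimal sketch of the folding arithmetic follows, assuming a single linear neuron followed by batch normalization with fixed inference-time statistics. All names and numeric values are hypothetical and chosen only for illustration; real frameworks apply the same algebra per output channel.

```python
import math

def linear(x, weights, bias):
    # y = w . x + b for a single output neuron.
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def batch_norm(y, mean, var, gamma, beta, eps=1e-5):
    # Inference-time batch normalization with fixed statistics.
    return gamma * (y - mean) / math.sqrt(var + eps) + beta

def fold_batch_norm(weights, bias, mean, var, gamma, beta, eps=1e-5):
    # Absorb the normalization into the linear layer's parameters.
    scale = gamma / math.sqrt(var + eps)
    folded_weights = [w * scale for w in weights]
    folded_bias = (bias - mean) * scale + beta
    return folded_weights, folded_bias

# Hypothetical parameters for illustration.
weights, bias = [0.5, -1.0, 2.0], 0.1
mean, var, gamma, beta = 0.3, 4.0, 1.5, -0.2
x = [1.0, 2.0, 3.0]

reference = batch_norm(linear(x, weights, bias), mean, var, gamma, beta)
folded_w, folded_b = fold_batch_norm(weights, bias, mean, var, gamma, beta)
folded = linear(x, folded_w, folded_b)
print(math.isclose(reference, folded, rel_tol=1e-9))
```

After folding, inference runs only the linear layer, so the normalization costs nothing at run time.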
Filter Caching
Convolutional neural networks use filters whose values are learned during training and remain constant during inference. Intel's optimizations convert these filters to an internal format and cache the result. The cached filter can then be reused without repeating the conversion, which accelerates performance on the CPU.
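The caching idea can be sketched as follows. The class FilterCache and the transpose used as the "internal format" are assumptions made for illustration; the actual internal layout used by Intel's optimizations is not described here.

```python
class FilterCache:
    """Caches filters after a one-time layout conversion (hypothetical sketch)."""

    def __init__(self):
        self._cache = {}
        self.conversions = 0  # counts how often the conversion actually runs

    def _to_internal_layout(self, filter_rows):
        # Stand-in for an expensive conversion: a simple transpose.
        self.conversions += 1
        return tuple(zip(*filter_rows))

    def get(self, name, filter_rows):
        # Convert on first use only; later calls reuse the cached result.
        if name not in self._cache:
            self._cache[name] = self._to_internal_layout(filter_rows)
        return self._cache[name]

cache = FilterCache()
conv1_filter = [[1, 2, 3], [4, 5, 6]]
for _ in range(3):  # e.g. three inference calls using the same filter
    converted = cache.get("conv1", conv1_filter)
print(cache.conversions)
```

Because filter values do not change during inference, the cache never needs invalidating in this sketch; a training setting would.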
Caffe CPU Optimization
Caffe is another deep learning library, mostly used for vision-related tasks. Several optimization techniques are used to improve its performance:
- Scalar and Serial Optimizations
- Code Parallelization with OpenMP
Scalar and Serial Optimizations
This involves the following techniques:
Code Vectorization:
Using the CPU's vector instruction set, operations such as addition and subtraction are vectorized, which improves computational performance.
Generic Code Optimizations:
- Reducing algorithm complexity
- Reducing the number of calculations
- Unwinding loops
- Common-code elimination
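Two of the generic optimizations listed above, reducing the number of calculations and eliminating common code, can be shown in a small sketch. The normalization function is a hypothetical example and is not taken from Caffe.

```python
import math

def normalize_naive(values):
    # Recomputes the same norm for every element: O(n^2) work overall.
    return [v / math.sqrt(sum(x * x for x in values)) for v in values]

def normalize_optimized(values):
    # The common expression is computed once and reused: O(n) work overall.
    norm = math.sqrt(sum(x * x for x in values))
    return [v / norm for v in values]

data = [3.0, 4.0]
print(normalize_naive(data) == normalize_optimized(data))
```

Both functions return the same values; the optimized one simply stops repeating work whose result cannot change inside the loop.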
Code Parallelization with OpenMP
Neural network layers such as Convolution, Deconvolution, and ReLU were optimized by applying OpenMP multithreading parallelization.
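OpenMP is an API for C, C++, and Fortran, so the sketch below does not use it. Instead, it uses Python's concurrent.futures as a stand-in to show the work-splitting pattern that an OpenMP parallel loop applies to a layer such as ReLU: divide the data into chunks and process the chunks on separate threads. The function names are hypothetical, and in CPython the global interpreter lock means this pure-Python version illustrates the structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def relu_chunk(chunk):
    # Work done by one thread on its share of the data.
    return [x if x > 0.0 else 0.0 for x in chunk]

def parallel_relu(values, num_workers=4):
    # Split the input into roughly equal contiguous chunks, one per worker.
    chunk_size = max(1, (len(values) + num_workers - 1) // num_workers)
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() returns results in the same order as the input chunks.
        results = list(pool.map(relu_chunk, chunks))
    return [y for chunk in results for y in chunk]

activations = parallel_relu([-1.0, 2.0, -3.0, 4.0, 5.0], num_workers=2)
print(activations)
```

Element-wise layers like ReLU split cleanly because no chunk depends on another; layers such as convolution need more care about how the output is partitioned.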
PyTorch CPU Optimization
Intel Extension for PyTorch improves performance through graph optimization and operator optimization and provides easy-to-use APIs. In addition, Apache TVM, an open-source deep learning compiler from the Apache Software Foundation, aims to deliver large performance improvements when PyTorch models are optimized for CPUs.
The optimizations include the following:
- Channels Last
- Auto Mixed Precision (AMP)
- Graph Optimization
- Operator Optimization
Channels Last
Changing the memory format from NCHW to NHWC accelerates convolutional neural networks in PyTorch.
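The difference between the two memory formats comes down to how a logical index (n, c, h, w) maps to a position in flat memory. The sketch below computes both mappings in plain Python; the helper names are hypothetical and it does not use PyTorch.

```python
def offset_nchw(n, c, h, w, C, H, W):
    # Channels first: all pixels of one channel are stored together.
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w, C, H, W):
    # Channels last: all channels of one pixel are stored together.
    return ((n * H + h) * W + w) * C + c

def nchw_to_nhwc(flat, N, C, H, W):
    # Reorder a flat NCHW buffer into NHWC order.
    out = [None] * len(flat)
    for n in range(N):
        for c in range(C):
            for h in range(H):
                for w in range(W):
                    src = offset_nchw(n, c, h, w, C, H, W)
                    dst = offset_nhwc(n, c, h, w, C, H, W)
                    out[dst] = flat[src]
    return out

# One image, two channels, one row, three columns.
nchw = ["c0w0", "c0w1", "c0w2", "c1w0", "c1w1", "c1w2"]
nhwc = nchw_to_nhwc(nchw, N=1, C=2, H=1, W=3)
print(nhwc)
```

In the NHWC result the two channel values of each pixel sit next to each other, which is the access pattern that many vectorized convolution kernels prefer.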
Auto Mixed Precision (AMP)
Intel has added support for low-precision data types such as BFloat16 in PyTorch, which boosts performance.
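BFloat16 keeps the sign bit and the 8 exponent bits of a 32-bit float but only the top 7 mantissa bits, so it has the same range with less precision. The sketch below shows this with Python's struct module. It converts by truncating the low 16 bits, which is a simplification; hardware and libraries typically round instead. The function names are hypothetical.

```python
import struct

def float32_to_bfloat16_bits(value):
    # Reinterpret the float32 bit pattern as an integer and keep the top 16 bits.
    bits32 = struct.unpack("<I", struct.pack("<f", value))[0]
    return bits32 >> 16  # truncation, not rounding (simplification)

def bfloat16_bits_to_float(bits16):
    # Pad the 16 stored bits with zeros to rebuild a float32 value.
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

def round_trip(value):
    return bfloat16_bits_to_float(float32_to_bfloat16_bits(value))

print(round_trip(1.0))      # exactly representable, unchanged
print(round_trip(3.14159))  # loses low-order mantissa bits
```

Halving the storage per value halves memory traffic, which is the main source of the speedup on memory-bound workloads; the cost is the reduced precision visible in the second value.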
Graph Optimization
Similar to TensorFlow, frequently used operator patterns such as Conv2D+ReLU and Linear+ReLU are fused to achieve better performance.
Operator Optimization
Intel has optimized several operators and implemented several customized operators in order to improve performance.
Problems Brought by Lack of CPU Optimizations
With the recent success of deep learning, powerful computational devices such as GPUs and TPUs are in high demand, but both cost far more than CPUs. They also have less memory capacity than a CPU-based system.
Some of the problems brought by the lack of CPU optimizations are:
The alternatives to CPUs are GPUs and TPUs, but their memory bottleneck is one of the biggest problems: it forces developers to use smaller batch sizes, which slows down training. A lack of CPU optimization nevertheless forces the use of these devices.
Parallel computation is required by some machine learning architectures, such as RNNs with long sequences and DNNs such as InceptionNet that apply filters of various sizes. GPUs have many cores, but each core is slower than a CPU core, so this kind of parallelism can be better suited to CPUs and can improve performance. Without CPU optimizations, this parallelism cannot be exploited.
CPUs are well suited to mobile systems, where they can provide performance similar to a GPU. However, improper use of the CPU can be costly and can even worsen performance, so an unoptimized CPU implementation may not be suitable for mobile systems.
Boosting CPU Performance
The performance of machine learning and deep learning models depends on many factors, one of which is resource constraints. Ignoring such factors makes model training difficult and sometimes even results in model collapse. Addressing the lack of CPU optimization yields a significant performance boost and a reduction in cost, which makes CPU optimization an integral part of deep learning frameworks.