AI computing characteristics
Designing and deploying a specialized chip requires balancing multiple metrics. Which metrics matter most depends on the target scenario, and the design approach varies accordingly. Common chip design metrics include:
Power consumption: The energy consumed by a chip circuit during operation.
Peak performance: The maximum number of operations the chip can perform per second.
Throughput: The amount of data that a chip can process per unit of time.
Area: The die area. More transistors mean a larger die; a more advanced process node allows the same number of transistors to fit into a smaller die.
Flexibility: The higher the flexibility and programmability, the more scenarios the chip can be adapted to.
Cost: Includes chip design cost and the per-chip fabrication and packaging cost.
First, from the perspective of computing power, the more cores a chip devotes to computation, the higher its demand for data-transfer bandwidth. A GPU, for example, contains hundreds to thousands of computing cores, and when they read and write data simultaneously they consume a considerable amount of bandwidth. Second, from the perspective of the compute-to-bandwidth balance, peak efficiency is only reached when the peak computing throughput does not exceed what the available memory bandwidth can feed. Although on-chip storage offers extremely high bandwidth and read/write performance, it occupies scarce chip area, so performance and bandwidth must be traded off against each other. Third, in terms of generality, the more programmable a chip is, the more application scenarios it can serve; however, accommodating more applications introduces compromises and redundancy into the architecture, which often degrades performance on any single task. A design therefore has to balance customization, which maximizes performance on a specific workload, against software programmability, which broadens the range of usable scenarios.
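The compute-versus-bandwidth balance described above can be illustrated with a simple roofline-style estimate. The sketch below computes the attainable throughput for several arithmetic intensities; the peak-throughput and bandwidth figures are purely illustrative assumptions, not measurements of any particular chip.

```python
# Roofline-style estimate of attainable throughput (illustrative numbers only).

def attainable_tflops(peak_tflops: float, bandwidth_gbs: float, flops_per_byte: float) -> float:
    """Attainable throughput is capped by either peak compute or memory bandwidth."""
    # Bandwidth-limited throughput = arithmetic intensity (FLOP/byte) * bandwidth (bytes/s)
    bandwidth_limited_tflops = flops_per_byte * bandwidth_gbs / 1000.0  # GB/s -> TFLOPS
    return min(peak_tflops, bandwidth_limited_tflops)

# Hypothetical accelerator: 100 TFLOPS peak compute, 1000 GB/s off-chip bandwidth.
peak, bw = 100.0, 1000.0
for intensity in (1, 10, 100, 1000):  # FLOPs performed per byte moved from memory
    print(f"intensity {intensity:>4} FLOP/B -> {attainable_tflops(peak, bw, intensity):6.1f} TFLOPS")
```

With low arithmetic intensity the workload is bandwidth-bound, and raising peak compute alone does not help; only once enough work is done per byte moved does the chip reach its compute roof.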
Memory access
Deep neural networks consist of a large number of layers, each containing many parameters and intermediate feature data, which leads to heavy data access and high computational complexity. The convolutional layer, for example, contains many multidimensional convolution kernels, and the sliding-window convolution requires the kernel parameters to participate in the computation many times. Because the total number of parameters generally far exceeds the on-chip cache capacity, parameters must be fetched repeatedly from main memory most of the time. The early AlexNet already had about 60 million parameters, while GPT-3, the Transformer-based model underlying ChatGPT, has 175 billion. Parameter sets of this size not only require enormous storage space but also place heavy demands on access bandwidth, memory management, and compute for efficient execution.
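To make the scale of these parameter sets concrete, the short sketch below computes the parameter count of a single convolutional layer and the storage it needs at different numeric precisions; the layer shape is an arbitrary example chosen for illustration.

```python
# Parameter count and storage footprint of a convolutional layer (example shape).

def conv_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Number of learnable parameters in a 2D convolution with k x k kernels."""
    return c_out * (c_in * k * k + (1 if bias else 0))

params = conv_params(c_in=256, c_out=512, k=3)
print(f"conv 256->512 channels, 3x3 kernels: {params:,} parameters")

# Storage needed at common numeric precisions (bytes per parameter).
for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"  {name}: {params * bytes_per_param / 2**20:.1f} MiB")
```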
Usually, chip designs improve system parallelism by increasing the number of computing cores. As parallelism grows, data reads and writes become the bottleneck, that is, the design hits the "memory wall". The memory-access bottleneck can be mitigated in the following ways:
1. Increase the on-chip cache size and the number of registers, or widen the memory-access bus, to improve memory-access efficiency and reduce the time spent waiting for data;
2. Keep data on chip as much as possible, avoiding repeated accesses to main memory and reducing the time computing units spend waiting;
3. Adopt a dataflow pattern in which data moves and is exchanged directly between computing units without passing through main memory.
Because the outputs of one layer in a deep network are reused by the next layer, and the model parameters are reused across many clock cycles, all three approaches exploit data reuse. The first reduces the overhead of repeatedly writing data back and reloading it by enlarging on-chip storage. The second places the reused data directly in an on-chip buffer, giving more precise and controllable reuse granularity. The third lets the results computed in one cycle flow to other computing cores, so they can participate directly in the next cycle's computation without any memory access. The systolic array used in Google's TPU and the popular dataflow architectures are designed along these lines.
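The data-reuse idea behind the first two approaches can be illustrated with loop tiling: a block of each operand is held in a small "on-chip" buffer and reused across many multiply-accumulates before being evicted. The following is a minimal NumPy sketch of the concept, not the TPU's systolic-array implementation.

```python
import numpy as np

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 32) -> np.ndarray:
    """Blocked matrix multiply: each tile of A and B is loaded once and reused
    for tile*tile multiply-accumulates, mimicking on-chip buffer reuse."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                # a_tile and b_tile stand in for data held in on-chip buffers.
                a_tile = a[i:i + tile, p:p + tile]
                b_tile = b[p:p + tile, j:j + tile]
                c[i:i + tile, j:j + tile] += a_tile @ b_tile
    return c

a = np.random.rand(128, 128).astype(np.float32)
b = np.random.rand(128, 128).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-3)
```

Each tile of A is reused against a whole row of tiles from B before it leaves the buffer, so the number of main-memory fetches per multiply-accumulate drops roughly in proportion to the tile size.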
Power dissipation
Because the number of parameters is large and on-chip storage space is limited, a deep learning workload cannot keep all of its data on chip. The vast majority of data resides in main memory, which inevitably causes frequent memory accesses. In most deep learning tasks, the power consumed by data access exceeds the power consumed by computation. The registers closest to the computing units cost the least energy per access, while off-chip DRAM, the farthest level, costs roughly 200 times as much. Keeping data in on-chip storage and reusing it as much as possible therefore effectively reduces the power consumed by data access. However, on-chip storage is constrained by cost and area and cannot grow without bound, so solving the power problem caused by large-scale data access is a key challenge for low-power AI chips.
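The back-of-the-envelope sketch below shows how the access counts at each level of the memory hierarchy dominate the energy budget. The relative energy figures are illustrative assumptions; only the roughly 200x register-to-DRAM gap comes from the text above, and the SRAM figure and access counts are made up for the example.

```python
# Rough energy estimate for data movement (illustrative relative costs).
# Energy per access, normalized so that a register access costs 1 unit.
ENERGY_PER_ACCESS = {"register": 1, "on_chip_sram": 10, "off_chip_dram": 200}

def access_energy(access_counts: dict) -> float:
    """Total data-access energy given the number of accesses to each level."""
    return sum(ENERGY_PER_ACCESS[level] * count for level, count in access_counts.items())

# Same workload, with and without on-chip reuse of weights and activations.
no_reuse   = {"register": 1_000_000, "on_chip_sram": 0,       "off_chip_dram": 1_000_000}
with_reuse = {"register": 1_000_000, "on_chip_sram": 900_000, "off_chip_dram": 100_000}

print("no reuse  :", access_energy(no_reuse))
print("with reuse:", access_energy(with_reuse))
```

Even with these rough numbers, shifting most accesses from DRAM to on-chip SRAM cuts the data-access energy by several times, which is why reuse is the first lever for low-power design.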
Sparsity
Sparsity refers to the fact that deep learning models have very large parameter and feature tensors whose computations involve a large number of multiplications and additions by zero. Multiplying any number by 0 gives 0, and adding 0 to any number leaves it unchanged. If a multiply-accumulate involving 0 can output its result directly without occupying a computing unit, the power consumed by that operation is saved. If zero-valued data need not be read from or written to memory at all, the cost of data transfer is reduced as well.
In deep learning, sparsity comes in two main forms: model (static) sparsity and activation (dynamic) sparsity. Model sparsity concerns the parameters: on the one hand, the parameters themselves contain many zeros or very small values; on the other hand, regularization and gating functions added during training further increase parameter sparsity. Dynamic sparsity arises during the model's execution and depends on both the input data and the parameters; for example, the outputs of operators such as Dropout and ReLU are highly sparse. Statistics show that the sparsity of classic networks such as AlexNet, VGG, and ResNet can reach around 90%. Exploiting this sparsity effectively can significantly improve a network's computational efficiency.
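As a simple illustration of how activation sparsity can be measured and exploited, the sketch below counts the zeros produced by ReLU and skips the multiply-accumulates that involve them. Real accelerators do this with dedicated zero-detection hardware; the Python loop is only for clarity.

```python
import numpy as np

def sparsity(x: np.ndarray) -> float:
    """Fraction of elements that are exactly zero."""
    return float(np.count_nonzero(x == 0)) / x.size

def sparse_dot(weights: np.ndarray, activations: np.ndarray):
    """Dot product that skips multiply-accumulates whose activation is zero.
    Returns the result and the number of MACs actually performed."""
    acc, macs = 0.0, 0
    for w, a in zip(weights, activations):
        if a != 0.0:            # zero-skipping: no compute, no weight fetch needed
            acc += w * a
            macs += 1
    return acc, macs

pre_act = np.random.randn(1024).astype(np.float32)
act = np.maximum(pre_act, 0.0)          # ReLU output: roughly half the values are zero
w = np.random.randn(1024).astype(np.float32)

result, macs = sparse_dot(w, act)
print(f"activation sparsity: {sparsity(act):.2f}, MACs performed: {macs} of {act.size}")
```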
Mixed precision
Because neural networks contain inherent redundancy, a carefully designed and tuned low-precision network can match the accuracy of its high-precision counterpart, or lose only a negligible amount. At the same time, low-precision computing greatly reduces the compute and storage burden and lowers power consumption. Experiments show that training AlexNet, VGG, ResNet, and other networks with 16-bit floating-point multiplication and 32-bit floating-point accumulation causes negligible accuracy loss, and inference with 8-bit fixed-point multiplication and 16-bit fixed-point accumulation loses almost no accuracy either.
Low-precision computing has become a trend in AI chips, especially inference chips. It requires not only algorithmic support for low-precision training and model quantization, but also support for low-precision operations in the instruction set architecture and the hardware computing units; it is a combined software-hardware solution. With the growing demand for low power and high performance in AI, neural networks have gradually moved from 32-bit floating-point operations to various low-precision formats such as 16-bit, 8-bit, 4-bit, and even binary networks.
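The idea that inference can use narrow multiplications with wider accumulation can be sketched as follows: weights and activations are quantized to int8 with per-tensor scales, the products are accumulated in a wider integer type, and the result is rescaled back to floating point. This is a simplified symmetric-quantization example, not any particular chip's scheme.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization of a float tensor to int8 plus a scale."""
    scale = float(np.max(np.abs(x))) / 127.0 or 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

x = np.random.randn(512).astype(np.float32)   # activations
w = np.random.randn(512).astype(np.float32)   # weights

qx, sx = quantize_int8(x)
qw, sw = quantize_int8(w)

# 8-bit multiplies accumulated in a wider integer (int32), then rescaled to float.
acc = np.sum(qx.astype(np.int32) * qw.astype(np.int32))
approx = acc * sx * sw

print(f"fp32 dot product  : {float(np.dot(x, w)):.3f}")
print(f"int8 approximation: {approx:.3f}")
```

The accumulator is deliberately wider than the operands so that summing many small products does not overflow, which mirrors the multiply-narrow, accumulate-wide pattern described above.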
Versatility
The number of layers in deep learning networks keeps increasing, the layer types are becoming richer, and the topologies are becoming more complex. Network depth has evolved from the dozen-plus layers of VGG to the hundreds or even thousands of layers of ResNet. Networks now include not only functionally diverse layers such as convolutions, depthwise separable convolutions, fully connected layers, recurrent layers, up- and down-sampling, scale transformations, and activation functions, but also complex connection topologies such as residual connections, long short-term memory (LSTM) networks, and Transformers. Because different layer types differ greatly in their compute and memory-access characteristics, the hardware architectures best matched to them also differ greatly. For example, hardware designed around the characteristics of convolutional networks may reach less than 20% of its peak performance when running LSTM networks.
Performance and versatility in specialized AI chips must be balanced against each other. The better a chip's performance and power efficiency on certain specific network structures, the less flexible and less versatile it tends to be. Google's TPU, for example, easily achieves far higher performance and energy efficiency than a GPU, but at the cost of the chip's programmability and versatility. At present, the network architectures used in different application scenarios such as speech, text, image, and video cannot be fully unified, and even within the same field, different scenarios and tasks use somewhat different networks. New deep learning algorithms and network structures are still evolving, so a chip may not even reach production before the network architecture it targets has been superseded by a better one.
AI chip companies currently pursue different technical approaches. The most aggressive hard-wire specific algorithms into the chip; this yields the shortest development cycle and the best performance-per-watt for a single algorithm, but limits the chip's generality and flexibility, as with the first-generation Google TPU. A second approach upgrades existing programmable processors, achieving a good balance between performance and versatility at relatively controllable cost; GPUs remain the mainstream in this category. A third approach designs a brand-new chip architecture, which can strike an even better balance between performance and versatility but demands high development investment and a longer R&D cycle, as with Cambricon NPUs and Google TPUs.
The development of AI chips is still at an early stage, and the market is dominated by customized, special-purpose AI chips and weakly programmable AI chips with limited flexibility. As algorithms and chip manufacturing processes continue to mature, high-performance AI chips that support the characteristics of potential new network architectures while offering sufficient flexibility and scalability will gradually arrive. The history of GPUs is instructive: early GPUs were designed specifically for graphics acceleration, and the limitations of the fabrication processes of the time forced customized, special-purpose hardware designs to meet the performance and power requirements of rendering. As the graphics-acceleration industry developed, algorithms iterated rapidly, and fabrication technology improved, GPGPUs with weak programmability gradually emerged. Later, CUDA made GPUs far more programmable and greatly expanded their application domains, so that GPUs could accelerate not only video rendering but also more general parallel computing tasks such as scientific analysis, astronomical computation, and AI acceleration.