A Survey of Transformer Compression Techniques for Edge Devices
Project: TinyAttention
Abstract
Deploying large language models on edge devices requires aggressive compression. We survey quantization, pruning, knowledge distillation, and low-rank factorization, and compare them on a common benchmark run on mobile-class hardware. We propose a decision chart that maps latency and memory budgets to recommended techniques.
Permalink: /paper/a-survey-of-transformer-compression-techniques-for-edge-devices-0a8e8c