Fundamental
-
2024-08-21, [Fundamental] FlashDecoding Series
-
2024-08-19, [Fundamental] Model Quantization
-
2024-08-11, [Fundamental] Rotary Position Embedding (RoPE)
-
2024-08-10, [Fundamental] From Online Softmax to Flash Attention V3
Distributed System
-
2023-01-19, The Design of a Practical System for Fault-Tolerant Virtual Machines
-
2023-01-18, Google File System (GFS)
-
2023-01-17, MapReduce
Quantization
-
2024-08-15, ✅[April-May 2024] Model Quantization - 🥕QuaRot & SpinQuant
Milestone: rotation matrices mitigate outliers
-
2024-06-25, ✅[Oct 2023] Model Quantization - QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
MLSys 2024, MoE quantization
-
2024-05-25, ✅[April 2024] Model Quantization - AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
MLSys 2024 Best Paper
KV Cache/Prompt Cache/Attention Acceleration
-
2024-08-10, [Fundamental] From Online Softmax to Flash Attention V3
Milestone: FlashAttention 1-3
-
2024-06-21, ✅[April 2024] Prompt Cache: Modular Attention Reuse for Low-Latency Inference
MLSys 2024, prompt cache optimization
Edge
-
2024-06-17, ✅[Mar 2024] Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs
Edge Transformer deployment optimization, from OPPO