Interests & Projects

Efficient LLM


LLMs are often more than 1000× larger than traditional AI models, creating major efficiency challenges. We focus on efficient LLM acceleration through compression-based software-hardware co-design, using techniques such as quantization and sparsity across ASICs, FPGAs, GPUs, and PIMs. Our methods support compressed inference, fine-tuning, and training, aligning model structures with the underlying hardware for optimal performance. We further explore KV cache compression, sparse attention, and long-context learning to reduce runtime cost. Beyond traditional platforms, we investigate neuromorphic computing paradigms, leveraging event-driven SNNs and in-memory computing to realize orders-of-magnitude gains in energy efficiency.
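As a deliberately minimal illustration of the kind of compression referred to above, the sketch below shows symmetric per-channel INT8 weight quantization in NumPy. It is a generic example, not a description of any specific system from our projects; the function names, the 127 clipping range, and the small random weight matrix are illustrative assumptions.

```python
import numpy as np

def quantize_per_channel_int8(weights: np.ndarray):
    """Symmetric per-output-channel INT8 quantization of a weight matrix.

    weights: (out_features, in_features) float array.
    Returns INT8 weights and the per-channel scales needed to dequantize.
    """
    # One scale per output channel, chosen so the channel's max maps to 127.
    max_abs = np.max(np.abs(weights), axis=1, keepdims=True)
    scales = np.where(max_abs > 0, max_abs / 127.0, 1.0)
    q = np.clip(np.round(weights / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an approximate float weight matrix from INT8 values and scales."""
    return q.astype(np.float32) * scales

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 8)).astype(np.float32)  # toy weight matrix
    q, s = quantize_per_channel_int8(w)
    w_hat = dequantize(q, s)
    print("max abs quantization error:", np.max(np.abs(w - w_hat)))
```

Storing the INT8 values plus one scale per channel cuts weight memory roughly 4× relative to FP32, which is the basic trade-off that hardware-aware quantization schemes build on.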