Sharding Large Language Models: Achieving Efficient Distributed Inference

Techniques for loading LLMs onto smaller GPUs and running parallel inference with Hugging Face Accelerate