In this blog post we will explore Fully Sharded Data Parallelism (FSDP), a technique for efficiently training large neural network models in a distributed manner. We’ll examine FSDP from a bird’s eye view and shed light on its underlying mechanisms.

Why FSDP?

When