Lalit Chourey
Lead Software Engineer at Meta Platforms, Inc.
About Lalit in his own words:
With 11+ years of software engineering experience at Meta and Microsoft, I have consistently spearheaded impactful projects that have benefited billions of Meta users and millions of Microsoft enterprise customers. Since joining Meta in January 2020, I have led Meta's LLM training infrastructure. Here are some of my most significant contributions in this area:
SYSTEM TO ORCHESTRATE AND SCHEDULE LARGE SCALE TRAINING
As a lead software engineer, I designed and led a team of 10+ engineers to build, from the ground up, a distributed training scheduler that runs and monitors ML model training on several thousand GPUs concurrently. This new system is a multi-tenant platform serving different teams within Meta: it supports granular GPU capacity allocation at the org, team, or sub-team level, as well as capacity borrowing between teams to make optimal use of the available GPU capacity. The system also brings several efficiencies over the previous legacy system, reducing the time to launch a training run on several thousand GPUs from 40 minutes to only a few minutes. Features like in-place restart and resuming training from checkpoints further improve overall training efficiency and training time. This system manages a fleet of several hundred thousand GPUs valued at billions of dollars, and the resulting gains in GPU utilization have saved the company several million dollars. It is now used to train several thousand models of different sizes, including LLaMA (Meta's large language model).
GENERATIVE AI TRAINING OBSERVABILITY
Over the past year, I have dedicated my efforts to spearheading the development of a cutting-edge Generative AI training observability system. This system provides invaluable insights into the training process by gathering terabytes of telemetry data and generated training artifacts from thousands of training hosts and analyzing them in real time. Training Large Language Models (LLMs) presents several unique challenges, including hardware limitations, massive datasets, and the complexity of the models themselves. This system plays a crucial role in addressing these challenges by providing real-time visibility into the training process, enabling early detection of potential issues, and facilitating optimization for improved performance and efficiency. Beyond traditional system-level monitoring, the system I designed is essential for tracking training progress in Generative AI. Unlike conventional tools like TensorBoard, which lack robust multi-modality support for generated text and media, this training observability system offers comprehensive support for all training artifacts, enabling Gen AI researchers to subjectively evaluate the quality of generated content during evaluation. We are in the process of patenting this technology and contemplating open-sourcing it to empower the broader AI community. We believe it has the potential to become the go-to tool for Generative AI use cases, akin to what TensorBoard represents for traditional deep learning.
Although it was developed within Meta, the training infrastructure I lead has played a pivotal role in training Meta's open-source LLaMA model. As the most widely used open-source model, with approximately 350 million downloads as of July 2024, LLaMA's success underscores the substantial indirect impact my work has had on the broader AI landscape (https://ai.meta.com/blog/llama-usage-doubled-may-through-july-2024/).
In addition to my primary duties at Meta, I actively contribute to the broader AI community through a variety of channels:
RESEARCH PAPER REVIEWER FOR PREMIER AI JOURNALS & CONFERENCES I help uphold the integrity of academic contributions in the AI field by reviewing research papers for prominent journals and conferences. This role involves providing detailed, constructive feedback to authors to help refine their submissions and strengthen their contributions to the scientific community. I am an active reviewer for the following AI journals and conferences:
[AI JOURNALS]
* IEEE Transactions on Neural Networks and Learning Systems (TNNLS) (impact factor 10.4)
* IEEE Transactions on Artificial Intelligence (TAI)
* Springer Neural Processing Letters (NPL)
[TECH CONFERENCES]
* The 39th Annual AAAI Conference on Artificial Intelligence (IAAI, Philadelphia, USA)
* 11th International Conference on Soft Computing & Machine Intelligence (ISCMI, Melbourne, Australia)
TECHNICAL BOOK REVIEWER I review technical books for notable publishers, ensuring high standards and providing essential feedback to authors and publishers to uphold the quality of technical literature.
* Packt Publishing (https://www.packtpub.com/)
* Manning Publications (https://www.manning.com/)
* O’Reilly Media (https://www.oreilly.com/)
PUBLISHING SCHOLARLY ARTICLES I have written scholarly articles highlighting the latest trends and applications of AI to foster knowledge of and excitement about the field. Here are some of my contributions:
* Harnessing the Power of LLMs: A New Era of Intelligent Cybersecurity (https://www.rsaconference.com/library/blog/harnessing-the-power-of-llms-a-new-era-of-a-intelligent-cybersecurity)
* Generative AI: The Next Frontier in Technological Evolution and Ethical Debate
(https://techbullion.com/generative-ai-the-next-frontier-in-technological-evolution-and-ethical-debate/)
* Towards Truly Autonomous AI Agents: Bridging the Gap Between Reactive Responses and Proactive Behaviors (https://www.irjmets.com/uploadedfiles/paper//issue_7_july_2024/60636/final/fin_irjmets1722238916.pdf)
JUDGING HACKATHONS I am set to judge at HackMIT 2024, providing critical assessment of participant submissions.
ML BOOTCAMP AND LAB AT META I lead ML bootcamps at Meta, offering engineers hands-on experience in machine learning. These bootcamps have been well received, with over 30 engineers from my organization participating each month.