“Whatever good things we build end up building us.” I read this quote by Jim Rohn years ago and it really rings true these days. Let me explain. The Open Compute Project (OCP) Global Summit 2024 is fast approaching and we have been building an amazing program. Let’s look at one track specifically.
This marks the inaugural year of the OCP’s Open Systems for AI Strategic Initiative—a coordinated effort across OCP’s projects focused on tackling the challenges of delivering sustainable, efficient, and high-performance AI solutions. The OCP Foundation has organized a special AI Track at the Global Summit, serving as a forum to collaborate and showcase the efforts, results, and vision of the OCP community—one that blends early-mover expertise from hyperscale data center operators with that of leading supply chain vendors for open systems in AI.
We assembled an AI Track covering much of the technical stack, including hardware, software, and networking. It is packed with 21 sessions—a mix of insightful presentations, panel discussions, and networking opportunities with industry leaders from OCP’s community of leading voices in AI—held on Wednesday, October 16, from 8:00 AM to 5:00 PM.
Here’s a sneak peek into what you can expect.
First, a link to the AI Track Schedule.
The morning sessions begin with an introduction to OCP’s strategic initiatives in AI, followed by industry leaders from NVIDIA and Google addressing the immense demands AI places on data center infrastructure. Experts from AMD, NVIDIA, and Google will discuss advancements in data formats for deep learning, aiming to streamline data processing and storage. Cormac Brick from Google delves into the evolution of machine learning software ecosystems, while SK Hynix’s Euicheol Lim explores how memory-computing fusion technology is revolutionizing large language models (LLMs) from data centers to edge devices. David Schmidt of Dell wraps up the morning by discussing strategies for rapidly deploying open AI solutions at scale.
Midday highlights focus on scaling AI infrastructure efficiently. Representatives from Ampere, Arm, and Supermicro will explore delivering AI inference at scale, optimizing hardware and software for real-time services. Meta’s Salina Dbritto and Jeremy Yang address the complexities of integrating AI systems into existing infrastructures, highlighting best practices for seamless integration. Kiran Bhat from Solidigm discusses optimizing high-density storage for AI workloads, emphasizing scalability and cost-effectiveness. Supermicro and Broadcom share lessons learned in orchestrating large-scale AI clusters, followed by a panel on enhancing power efficiency and sustainability in AI data centers through renewable energy integration and advanced cooling technologies.
The afternoon sessions shift focus to networking and system optimization. AMD’s J Metz and Intel’s Uri Elzur present collaborative efforts between OCP and the Ultra Ethernet Consortium to advance high-performance networking solutions for AI and HPC. Kurtis Bowman of AMD and Fangzhi Wen of Alibaba introduce UALink, a new industry standard for high-speed, low-latency interconnection of AI accelerators, including end-user perspectives on scaling computational capabilities. Intel’s Deb Chatterjee unveils the Falcon Reliable Transport, enhancing RDMA solutions for faster data transfer and improved scalability. Jonathan Koomey and Hassan Moezzi discuss using digital twins for predictive maintenance and performance optimization in AI data centers. Updates on Meta’s FBOSS for next-generation AI fabrics and techniques for congestion management in Ethernet-based AI clusters are also presented.
The closing sessions feature a case study by Meta’s Anil Agrawal and Balaji Vembu on implementing out-of-band CPER logging to improve system reliability in AI/ML systems. Molex’s Chris Kapuscinski and Astera Labs’ Chris Petersen discuss how PCIe Active Electrical Cables enable the scaling of large language model computing clusters through enhanced signal integrity and system performance. NVIDIA’s David Iles and Barak Gafni address mitigating “noisy neighbors” in cloud networks using Spectrum-X technology to ensure consistent AI application efficiency. The day concludes with a panel of industry leaders discussing the future of open systems in AI, emphasizing the importance of openness and collaboration in fostering innovation.
Final Thoughts
The OCP Foundation team has curated an exciting lineup of speakers and topics for the OCP Global Summit, covering key areas of the AI stack. But as you can see from the thoughtfully assembled schedule, the AI Track is more than just the presentations; it’s a space for the OCP Community to come together, share ideas, and foster innovation, collaboration, and inspiration.
With the launch of the Open Systems for AI Strategic Initiative this year, we saw a unique opportunity: rather than focusing solely on hardware contributions, we could provide a survey of the end-to-end AI infrastructure stack, both within and beyond current OCP projects. By doing so, we aim to strengthen future Open Systems for AI concepts, ensuring they reflect openness and cutting-edge thinking.
To follow up on these sessions and continue the conversations and momentum, head over to the OCP Open Systems for AI Strategic Initiative and subscribe for updates on open calls, development activities, and progress.
The complete OCP Global Summit schedule is here and the AI Track schedule can be found here.
Mark your calendar for October 16, and don’t miss this opportunity to be part of the conversation shaping the future of open infrastructure for AI. We’ll see you there.