The content on this page was provided by an independent third party and syndicated by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multimillion-dollar challenge in AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. This means that for most large AI-focused enterprises and AI clouds, failure-driven restarts are inevitable, making them a major barrier to scaling AI’s impact.
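To put those mean-time-to-failure figures in daily terms, the short Python sketch below converts them into an average number of interruptions per day. It assumes failures arrive at a steady average rate of one per MTTF interval, which is a simplification for illustration and not a model taken from the Meta FAIR research; the function and variable names are hypothetical.

```python
HOURS_PER_DAY = 24.0

# MTTF figures quoted in the text above (hours), keyed by cluster size in GPUs.
mttf_hours = {1_024: 7.9, 16_384: 1.8}


def expected_failures_per_day(mttf: float) -> float:
    """Average interruptions per day, assuming a steady rate of one failure per MTTF."""
    return HOURS_PER_DAY / mttf


failures_per_day = {
    gpus: expected_failures_per_day(hours) for gpus, hours in mttf_hours.items()
}

for gpus, rate in failures_per_day.items():
    print(f"{gpus:>6} GPUs: about {rate:.1f} failures per day")
```

Under that assumption, the 1,024-GPU cluster sees roughly three interruptions a day and the 16,384-GPU cluster roughly thirteen.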

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.

TorchPass tackles this problem by resolving costly AI workload failures before the job stops or needs to restart. For enterprises running large AI workloads and for AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. AI clouds can now address impacted GPUs while preserving the training run as planned, which translates into better customer SLAs and stronger overall economics, improving their ability to protect margin and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
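The figures above can be sanity-checked with a little arithmetic. The Python sketch below takes "approximately three hours" and "under ten minutes" at face value (using ten minutes as an upper bound) and, as an assumption, applies the 7.9-hour mean time to failure cited earlier for a 1,024-GPU cluster to estimate the lost time per failure. The variable and function names are illustrative only.

```python
# Figures quoted in the text, converted to minutes per day.
baseline_lost_min = 3 * 60        # "approximately three hours per day"
with_migration_lost_min = 10      # "under ten minutes", treated as an upper bound

# Assumption: the 7.9-hour MTTF cited for a 1,024-GPU cluster applies here.
mttf_hours = 7.9
failures_per_day = 24.0 / mttf_hours


def reduction_percent(before: float, after: float) -> float:
    """Percentage reduction when going from `before` to `after`."""
    return 100.0 * (before - after) / before


reduction_pct = reduction_percent(baseline_lost_min, with_migration_lost_min)
minutes_per_failure = with_migration_lost_min / failures_per_day

print(f"Reduction in lost time: {reduction_pct:.1f}%")
print(f"Lost time per failure:  {minutes_per_failure:.1f} minutes")
```

With ten minutes as the upper bound, the reduction works out to about 94%, consistent with the stated 95% once lost time falls below ten minutes. The implied cost of roughly 3.3 minutes per failure also lines up with the approximately three-minute recovery time quoted above.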

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX (SemiAnalysis’ independent benchmark for large-scale AI training), stress-tested Clockwork.io TorchPass and found it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead in training. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out of memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC from March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork

