Deep Dive: TidalGuard: Quarantining Bad Hardware without Service Disruption
Speaker: Michael Sevilla, TidalScale
Abstract: TidalScale has created a distributed hypervisor (also called a Software-Defined Server) capable of running unmodified operating systems, libraries, databases, and applications on a cluster of cooperating standard physical servers. Using our software, customers create a virtual machine that aggregates all the resources of the cluster. The system self-optimizes by migrating resources among servers over standard Ethernet, guided by machine learning algorithms and behavioral introspection. Our newest release includes a feature called TidalGuard that monitors hardware behavior and allows customers to swap out a node without application or operating system downtime. TidalGuard quarantines the outgoing node and evicts compute, storage, and IO to other nodes, then makes the outgoing node available for repair, update, or upgrade. The swap time is bounded only by the active resources on the outgoing node and the network bandwidth; even in the worst case, we need to migrate only a fraction of the virtual machine's working set. In the more common case, when the working set on the outgoing server is small, fewer of the server's memory pages and CPUs need to be evicted and performance is much better. With TidalGuard, we build on the scalability achievements of previous releases with added resiliency, reliability, and availability.
Bio: Michael is a software engineer at TidalScale working on distributed computation and storage. He received his PhD from the University of California, Santa Cruz, where he studied file system metadata load balancing, consistency/durability semantics, and namespace structures. Previously, he worked at Los Alamos National Laboratory, where he designed storage systems for HPC applications, and on Hewlett Packard Enterprise's Advanced Development Team, where he worked on storage solutions and reproducibility tools for big data processing stacks.
This event is open to current CRSS sponsors. Please contact Cynthia McCarley for information on how to join the Zoom teleconference. Video and slides from the event will be available to CRSS sponsors; please contact Cynthia McCarley to obtain them.
Wednesday, April 27, 2022 at 11:00 AM
Zoom (Link available by invitation)