Reliability and Failure Analysis Engineer

Cerebras

Sunnyvale, CA, USA
Posted 6+ months ago
Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach empowers machine learning users to effortlessly prototype and experiment with large-scale ML applications, without the hassle of managing multiple GPUs or TPUs.
Cerebras' current customers include national labs, global corporations across multiple industries, and top-tier healthcare systems. In January, we announced a multi-year, multi-million-dollar partnership with Mayo Clinic, underscoring our commitment to transforming AI applications across various fields.
The Role

As a Reliability and Failure Analysis Engineer at Cerebras, you will be responsible for conducting in-depth failure analysis on Cerebras products and delivering detailed reports on findings. This role involves hands-on lab work, including troubleshooting, repairing, and assembling products and sub-assemblies. You will establish and execute product audits from the manufacturing line to ensure both functionality and reliability through comprehensive testing processes. Additionally, you will support reliability testing and AI hardware validation to guarantee long-term performance, durability, and suitability for AI workloads. Collaboration across teams, including Engineering, Manufacturing Engineering, and Quality Assurance, is key for receiving assignments and providing critical insights into product performance and quality.

Responsibilities
  • Utilize advanced techniques such as SEM, FTIR, and other spectroscopy methods to perform detailed physical failure analysis on AI hardware components, identifying issues in solder joints, internal connections, and overall material integrity.
  • Identify and assess product risks using Design for Reliability (DfR) mechanisms, collaborating with design teams to mitigate potential failures during development.
  • Provide hands-on leadership in experimental design, test development, and execution for reliability testing on AI hardware.
  • Perform reliability predictions of failure mechanisms and conduct risk assessments for products under development and those in the field.
  • Conduct Failure Mode and Effect Analysis (FMEA) to develop and implement robust maintenance plans, including Preventive and Predictive Maintenance (PPM) for AI hardware components.
  • Lead root cause failure analysis for AI hardware systems, setting trigger points and implementing corrective actions based on findings.
  • Analyze failure and cost data using reliability analysis tools to develop strategies for improving the reliability and performance of AI hardware.
  • Collaborate with engineering, purchasing, and quality assurance teams to integrate Design for Reliability principles into new hardware designs and capital projects.
  • Lead reliability improvement initiatives, coordinating detailed physical failure analysis to enhance product durability and performance under AI workloads.
  • Review and analyze inspection reports, leveraging data-driven insights to evaluate inspection quality, fitness for service, and appropriate maintenance or inspection intervals.
Skills & Qualifications
  • M.S. or PhD in Materials Science or a related field such as Applied Physics or Mechanical Engineering, with 8-10 years of proven experience in failure analysis or related disciplines.
  • Strong hands-on expertise in materials characterization and spectroscopy techniques (e.g., FTIR, ESCA, XRF, Raman, Dual Beam FIB/SEM) and experience with non-metal and composite material failure analysis.
  • Experience in AI hardware reliability testing and failure analysis, with a deep understanding of the performance demands of AI workloads.
  • Proven ability to perform and coordinate both electrical and physical failure analysis to support product development, design debugging, yield improvement, and quality control for Cerebras products.
  • Familiarity with project management tools such as Confluence and Jira to track progress and collaborate effectively across teams.
  • Excellent communication skills, both verbal and written, with the ability to present complex technical findings to cross-functional partners.
  • Exceptional problem-solving skills, with the ability to prioritize and manage multiple high-priority tasks simultaneously, delivering timely results in a fast-paced environment.
Why Join Cerebras

People who are serious about software make their own hardware. At Cerebras we have built a breakthrough architecture that is unlocking new opportunities for the AI industry. With dozens of model releases and rapid growth, we’ve reached an inflection point in our business. Members of our team tell us there are five main reasons they joined Cerebras:

  1. Build a breakthrough AI platform beyond the constraints of the GPU
  2. Publish and open source their cutting-edge AI research
  3. Work on one of the fastest AI supercomputers in the world
  4. Enjoy job stability with startup vitality
  5. Our simple, non-corporate work culture that respects individual beliefs

Read our blog: Five Reasons to Join Cerebras in 2024.

Apply today and join the forefront of groundbreaking advancements in AI.

Cerebras Systems is committed to creating an equal and diverse environment and is proud to be an equal opportunity employer. We celebrate different backgrounds, perspectives, and skills. We believe inclusive teams build better products and companies. We try every day to build a work environment that empowers people to do their best work through continuous learning, growth and support of those around them.

