Senior SRE Engineer

Fortanix

Software Engineering
Bengaluru, Karnataka, India
Posted on Aug 23, 2024

As a Senior Site Reliability Engineer at Fortanix, you will be at the forefront of ensuring the reliability, scalability, and performance of our production environments. You'll design and build operations as code, architecting automated solutions that enhance system stability. Partnering closely with our product engineering teams, you'll take a hands-on role in continuously improving the reliability and resilience of our platforms. You'll also develop and implement a comprehensive, actionable monitoring framework that detects issues and helps prevent them from impacting our users.

In this role, you'll be a critical part of our production on-call rotation, responding to incidents with agility and conducting post-incident reviews to drive continuous improvement. If you're passionate about automation, enjoy tackling complex reliability challenges, and thrive in a fast-paced, high-impact environment, this role is for you!

Join us to shape the future of secure computing with a focus on building reliable, scalable, and secure production systems.

Key Responsibilities

  • System Architecture & Design
    • Collaborate with software development teams to design scalable, reliable, and secure systems.
    • Architect and build robust infrastructure to handle growth and ensure system uptime.
  • Automation & Infrastructure as Code (IaC)
    • Automate infrastructure deployment and management using tools like Terraform, Ansible, or CloudFormation.
    • Implement continuous integration and continuous deployment (CI/CD) pipelines for automated testing and deployment.
    • Write automation scripts and code for scaling and self-healing systems.
  • Monitoring & Incident Management
    • Design and implement comprehensive monitoring and alerting solutions to detect anomalies and issues before they impact users.
    • Implement logging and observability tools to gain insight into system health and performance (e.g., Prometheus, Grafana, ELK stack).
    • Manage on-call rotations, ensure timely responses to incidents, and perform root cause analysis and post-mortems.
  • Performance Tuning & Optimization
    • Perform load testing and system benchmarking to identify performance bottlenecks.
    • Optimize application and infrastructure performance, reducing latency and improving response times.
  • Security & Compliance
    • Ensure systems are secure by design, incorporating security best practices (e.g., encryption, firewalls, access controls).
    • Stay up to date on security vulnerabilities and patch systems accordingly.
    • Implement compliance standards (e.g., GDPR, HIPAA) where applicable.
  • Collaboration & Mentoring
    • Work closely with developers to ensure that applications are designed for reliability and scalability.
    • Serve as a mentor to junior engineers, fostering a culture of reliability and best practices.
    • Collaborate across teams (DevOps, Development, QA) to enhance system robustness.
  • Disaster Recovery & High Availability
    • Develop and maintain disaster recovery and business continuity plans.
    • Ensure high availability by designing systems that can withstand failures without service disruption.
  • Capacity Planning & Scalability
    • Forecast future system demand and plan for capacity increases as needed.
    • Design infrastructure that scales automatically to handle increased loads.
  • Continuous Improvement & Reliability Culture
    • Analyze incidents and failures to identify opportunities for improving system reliability.
    • Drive a culture of reliability across the engineering organization, advocating for best practices and SRE principles.
  • Cloud & Hybrid Infrastructure Management
    • Manage cloud infrastructure (AWS, GCP, Azure) and hybrid environments, ensuring optimal usage of cloud resources.
    • Implement cost optimization strategies for cloud resources while maintaining performance and reliability.

This role requires a deep understanding of both software engineering and infrastructure management, as well as strong collaboration and problem-solving skills.