Senior SRE Engineer

Fortanix

Software Engineering
Bengaluru, Karnataka, India
Posted 6+ months ago

As a Senior Site Reliability Engineer at Fortanix, you will be at the forefront of ensuring the reliability, scalability, and performance of our cutting-edge production environments. You’ll design and build operations as code, architecting automated solutions that enhance system stability. Partnering closely with our product engineering teams, you'll have a hands-on role in continuously improving the reliability of our platforms, ensuring our systems are robust and resilient. You'll develop and implement a comprehensive, actionable monitoring framework that detects and prevents issues before they impact our users.

In this role, you'll be a critical part of our production on-call rotation, responding to incidents with agility and executing post-incident reviews to drive continuous improvement. If you’re passionate about automation, enjoy tackling complex reliability challenges, and thrive in a fast-paced, high-impact environment, this role is for you!

Join us to shape the future of secure computing with a focus on building reliable, scalable, and secure production systems.

Key Responsibilities

  • System Architecture & Design
    • Collaborate with software development teams to design scalable, reliable, and secure systems.
    • Architect and build robust infrastructure to handle growth and ensure system uptime.
  • Automation & Infrastructure as Code (IaC)
    • Automate infrastructure deployment and management using tools like Terraform, Ansible, or CloudFormation.
    • Implement continuous integration and continuous deployment (CI/CD) pipelines for automated testing and deployment.
    • Write automation scripts and code for scaling and self-healing systems.
  • Monitoring & Incident Management
    • Design and implement comprehensive monitoring and alerting solutions to detect anomalies and issues before they impact users.
    • Implement logging and observability tools to gain insight into system health and performance (e.g., Prometheus, Grafana, ELK stack).
    • Manage on-call rotations, ensure timely responses to incidents, and perform root cause analysis and post-mortems.
  • Performance Tuning & Optimization
    • Perform load testing and system benchmarking to identify performance bottlenecks.
    • Optimize application and infrastructure performance, reducing latency and improving response times.
  • Security & Compliance
    • Ensure systems are secure by design, incorporating security best practices (e.g., encryption, firewalls, access controls).
    • Stay up to date on security vulnerabilities and patch systems accordingly.
    • Implement compliance standards (e.g., GDPR, HIPAA) where applicable.
  • Collaboration & Mentoring
    • Work closely with developers to ensure that applications are designed for reliability and scalability.
    • Serve as a mentor to junior engineers, fostering a culture of reliability and best practices.
    • Collaborate across teams (DevOps, Development, QA) to enhance system robustness.
  • Disaster Recovery & High Availability
    • Develop and maintain disaster recovery and business continuity plans.
    • Ensure high availability by designing systems that can withstand failures without service disruptions.
  • Capacity Planning & Scalability
    • Forecast future system demand and plan for capacity increases as needed.
    • Design infrastructure that scales automatically to handle increased loads.
  • Continuous Improvement & Reliability Culture
    • Analyze incidents and failures to identify opportunities for improving system reliability.
    • Drive a culture of reliability across the engineering organization, advocating for best practices and SRE principles.
  • Cloud & Hybrid Infrastructure Management
    • Manage cloud infrastructure (AWS, GCP, Azure) and hybrid environments, ensuring optimal usage of cloud resources.
    • Implement cost optimization strategies for cloud resources while maintaining performance and reliability.

This role requires a deep understanding of both software engineering and infrastructure management, as well as strong collaboration and problem-solving skills.