SRE Team Lead/Manager - AWS - Azure - Terraform - London

Mentmore Recruitment

Posted on Feb 28, 2025 by Mentmore Recruitment
London, United Kingdom
IT
Immediate Start
£120k Annual
Full-Time

Lead Site Reliability Engineer - Azure/AWS - Terraform - London.

My financial services client is looking for a hands-on Lead Site Reliability Engineer who will be responsible for ensuring the reliability and scalability of their infrastructure and services. This is a senior role requiring technical expertise, leadership, and a commitment to continuous improvement. You must have team lead/mentoring experience and be able to balance technical delivery, team productivity, performance measurement, and collaboration across teams and stakeholders.

Duties & Responsibilities:

  • Hands-On Engineering & Technical Leadership
  • Design, develop, and maintain cloud infrastructure (Azure/AWS) using Terraform and automation.
  • Lead troubleshooting, performance optimisation, and incident resolution to enhance reliability.
  • Ensure best practices in CI/CD pipelines, observability, and infrastructure deployment.
  • Promote Transparency, Inspection, and Adaptation by making both system and team health data accessible and actionable.
  • Work with engineering leads, business stakeholders, and the Head of Platform Operations to define and enforce SLAs, SLOs, and engineering standards that support scalability, reliability, and operational efficiency.
  • Design solutions with a systems-thinking approach, ensuring infrastructure, observability, and automation strategies support sustainable growth.
  • Support capacity planning, scalability, and security best practices, proactively identifying risks and opportunities to enhance platform resilience.

Experience Required:

  • Proven leadership experience in technical teams, with a focus on mentoring, professional development, and fostering a culture of innovation, reliability, and engineering excellence.
  • Proven experience in Site Reliability Engineering, DevOps, or Systems Engineering, with hands-on experience in both Azure and AWS environments.
  • Demonstrable expertise in high-performance, scalable, and highly available systems, with experience in optimising reliability, capacity planning, and system performance.
  • Deep expertise in DevOps principles, including automation, infrastructure as code (Terraform, Ansible, or Chef), GitOps workflows, CI/CD best practices (GitHub Actions, GitLab CI/CD, Azure DevOps), and collaborative ways of working.
  • Strong background in containerisation (Docker) and orchestration (Kubernetes), with a focus on scalability and resilience.
  • Hands-on experience with monitoring, observability, and incident management tools (Prometheus, Grafana, ELK, Azure Monitor, Application Insights, Kusto) and a data-driven approach to improving system reliability.
  • Strong understanding of regulatory and security requirements, such as ISO 27001, PCI DSS, Cyber Essentials Plus (CE+), and SOX, with experience implementing compliance-driven engineering practices.
  • Advocate for modern DevOps and SRE best practices, championing collaboration, transparency, automation, continuous learning, and continuous improvement across teams.
  • Excellent communication skills, able to engage stakeholders, collaborate cross-functionally, and drive alignment on reliability and operational priorities.


Reference: 2905237621

https://jobs.careeraddict.com/post/100518642

This Job Vacancy has Expired!
