SRE Team Lead/Manager - AWS - Azure - Terraform - London

Mentmore Recruitment

Posted on Feb 18, 2025 by Mentmore Recruitment
London, United Kingdom
IT
Immediate Start
£120k Annual
Full-Time

My financial services client is looking for a Site Reliability Team Lead/Manager who will be responsible for ensuring the reliability and scalability of their infrastructure and services. This is a senior role requiring technical expertise, leadership, and a commitment to continuous improvement. You must have team lead/mentoring experience and be able to balance technical delivery, team productivity, performance measurement, and collaboration across teams and stakeholders.

Duties & Responsibilities:

  • Hands-On Engineering & Technical Leadership
  • Design, develop, and maintain cloud infrastructure (Azure/AWS) using Terraform and automation.
  • Lead troubleshooting, performance optimisation, and incident resolution to enhance reliability.
  • Ensure best practices in CI/CD pipelines, observability, and infrastructure deployment.
  • Promote Transparency, Inspection, and Adaptation by making both system and team health data accessible and actionable.
  • Work with engineering leads, business stakeholders, and the Head of Platform Operations to define and enforce SLAs, SLOs, and engineering standards that support scalability, reliability, and operational efficiency.
  • Design solutions with a systems-thinking approach, ensuring infrastructure, observability, and automation strategies support sustainable growth.
  • Improve deployment pipelines, automation, and operational workflows across squads, fostering consistency and best practices.
  • Support capacity planning, scalability, and security best practices, proactively identifying risks and opportunities to enhance platform resilience.
  • Team Productivity, Performance & Agile Ways of Working

Experience Required:

  • Proven leadership experience in technical teams, with a focus on mentoring, professional development, and fostering a culture of innovation, reliability, and engineering excellence.
  • Proven experience in Site Reliability Engineering, DevOps, or Systems Engineering, with hands-on experience in both Azure and AWS environments.
  • Demonstrable expertise in high-performance, scalable, and highly available systems, with experience in optimising reliability, capacity planning, and system performance.
  • Deep expertise in DevOps principles, including automation, infrastructure as code (Terraform, Ansible, or Chef), GitOps workflows, CI/CD best practices (GitHub Actions, GitLab CI/CD, Azure DevOps), and collaborative ways of working.
  • Strong background in containerisation (Docker) and orchestration (Kubernetes), with a focus on scalability and resilience.
  • Hands-on experience with monitoring, observability, and incident management tools (Prometheus, Grafana, ELK, Azure Monitor, Application Insights, Kusto) and a data-driven approach to improving system reliability.
  • Strategic mindset, able to align technical initiatives with business goals, drive scalability and performance improvements, and proactively tackle complex challenges.


Reference: 2899747967

https://jobs.careeraddict.com/post/100012198

This Job Vacancy has Expired!
