CareerAddict

Senior Site Reliability Engineer

Project Recruit

Posted on Jun 18, 2025 by Project Recruit
London, United Kingdom
IT
Immediate Start
Annual Salary
Contract/Project

Senior Site Reliability Engineer

Our client, a leading global supplier of IT services, requires a Senior Site Reliability Engineer to be based at their client's office in London, UK.

This is a hybrid role based in the UK, requiring attendance at the London office three days per week (Tuesday to Thursday), with the flexibility to work remotely on Mondays and Fridays. The position also includes on-call responsibilities on both weekdays and weekends.

This is a 6+ month temporary contract to start ASAP.

Day rate: Competitive market rate

As a DevOps Engineer, you will play a critical role in managing cloud infrastructure, ensuring the reliability of production systems, and improving end-to-end deployment pipelines. This role combines deep operational responsibilities with a strong focus on automation, observability, and continuous improvement. You will be responsible for maintaining high system availability, enabling rapid delivery through CI/CD, and supporting development teams with robust infrastructure and tooling. A key part of the role includes proactive monitoring using Prometheus, Grafana, and Splunk, as well as participating in on-call rotations to respond to live incidents. Collaboration across engineering, security, and product teams is essential to build scalable and resilient systems.

Key Responsibilities

  • Deploy, configure, and monitor AWS services, ensuring high availability, scalability, and security
  • Respond to and resolve infrastructure and service incidents with root cause analysis and preventive measures
  • Handle change requests, track recurring issues, and work on long-term fixes to improve system stability
  • Implement and maintain observability solutions using Prometheus, Grafana, and Splunk
  • Spearhead the integration of the Prometheus-Grafana stack in production environments, enhancing real-time visibility and reducing incident detection time
  • Write PromQL queries for custom monitoring dashboards, alerting, and diagnostics
  • Manage and optimise CI/CD pipelines for automated testing, deployment, and rollback strategies
  • Develop and maintain automation scripts in Python, Bash, Go, or SQL for routine infrastructure tasks
  • Utilise Git-based workflows for infrastructure changes, version control, and automated deployments
  • Operate, troubleshoot, and optimise Kubernetes clusters and containerised workloads
  • Lead the refactoring of legacy deployment processes into robust CI/CD pipelines using GitHub Actions and Kubernetes
  • Participate in a rotating on-call schedule to ensure 24/7 availability of production systems

Key Requirements

Essential:

  • Working knowledge and prior hands-on experience using AWS services at the DevOps Engineer level
  • Incident, change, and problem management experience. This role is heavily operations-oriented and includes on-call requirements
  • Proven track record of responding to critical AWS incidents under pressure, quickly identifying root causes, and deploying hotfixes using automation
  • Strong background in the setup and operation of enterprise observability tooling, specifically Prometheus, Grafana, and Splunk, including use of PromQL
  • Proficient in one or more of Python, Go, Bash, or SQL
  • Familiar with GitHub, GitOps, container orchestration, and Kubernetes operations
  • Working experience of configuration and deployment management with CI/CD

Desirable:

  • Hands-on experience with Terraform or CloudFormation for infrastructure provisioning and automation
  • Strong knowledge of Splunk for log analysis and troubleshooting
  • Strong problem-solving skills and analytical thinking

Special Working Conditions

  • The role includes participation in an on-call rotation covering both weekdays and weekends

Due to the volume of applications received, unfortunately we cannot respond to everyone.

If you do not hear back from us within 7 days of sending your application, please assume that you have not been successful on this occasion.


Reference: 2967006854

https://jobs.careeraddict.com/post/104566781

This Job Vacancy has Expired!

