CareerAddict

Site Reliability Engineer

CV-Library

Posted on Jun 26, 2026 by CV-Library
Basingstoke, Hampshire, United Kingdom
IT
Immediate Start
Annual Salary
Full-Time - Remote
Site Reliability Engineer - Fully Remote

What We're Looking For

We're looking for someone who enjoys solving complex operational challenges through engineering rather than manual intervention. You'll be proactive, collaborative, and passionate about improving reliability through automation and continuous improvement.

If you're excited about building resilient cloud platforms and making a measurable impact on service reliability, we'd love to hear from you.

Key Responsibilities

Incident Management & Operations

Participate in a 24/7 on-call rota as a primary or escalation point.
Lead or support major incident response, including triage, mitigation, and resolution.
Coordinate with Engineering, Infrastructure, Security, and Product teams during incidents.
Develop, maintain, and continuously improve operational runbooks and playbooks.
Conduct blameless post-incident reviews and drive follow-up improvements.

Monitoring & Alerting

Monitor the health of infrastructure, applications, and services.
Design and optimise alerting strategies aligned with service reliability objectives (SLIs/SLOs).
Reduce alert fatigue through continuous tuning and optimisation.
Build and maintain dashboards using technologies such as:
Grafana
Prometheus
Datadog
Splunk
AWS CloudWatch

Reliability Engineering & Automation

Automate repetitive operational tasks to minimise manual effort.
Improve Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR).
Develop automation tools and scripts using Python, Bash, Go, or similar languages.
Implement self-healing and auto-remediation where appropriate.
Work closely with engineering teams to improve application and platform reliability.

Platform & Infrastructure

Support and troubleshoot Linux-based production environments.
Manage cloud infrastructure, primarily within AWS.
Support containerised environments using Docker and Kubernetes.
Assist with capacity planning, availability reviews, and production readiness for new releases.

Skills & Experience

Essential

Strong Linux systems administration experience.
Experience supporting production environments and managing incidents.
Hands-on experience with AWS cloud infrastructure.
Experience with Docker and Kubernetes.
Scripting or programming experience with Python, Bash, Go, or similar.
Solid understanding of networking fundamentals, including DNS, TCP/IP, and load balancing.
Experience working in a 24/7 operations or NOC environment.
Ability to remain calm and effective during high-pressure production incidents.
Excellent communication and stakeholder coordination skills.

Desirable

Experience working with Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
Previous experience helping organisations transition from traditional NOC operations to an SRE model.
Infrastructure as Code experience using Terraform, Ansible, or similar tools.
Exposure to security, compliance, or regulated environments.

Spectrum IT Recruitment (South) Limited is acting as an Employment Agency in relation to this vacancy.

Reference: 225298651

https://jobs.careeraddict.com/post/113469967