Platform Engineer - Observability
Posted on Jun 12, 2026 by Swisstech Recruitment
Not Specified, United Kingdom
IT
1 Jun 2026
£500 Daily
Contract/Project
Key Responsibilities:
Observability Platform Implementation:
- Deliver the implementation of the observability platform based on Grafana Mimir, Loki, Tempo, Grafana Alloy, and Grafana Enterprise tooling.
- Design and implement highly available observability services across multiple co-location and production sites.
- Configure telemetry ingestion pipelines for metrics, logs, and future distributed tracing workloads.
- Develop and maintain observability architecture documentation, high-level designs, low-level designs, and operational runbooks.
- Define platform standards for telemetry collection, labelling, metadata enrichment, retention policies, and data governance.
- Implement multi-tenant observability controls and tenant isolation strategies.
- Configure and maintain object-storage-backed telemetry platforms for long-term retention and scalability.
Telemetry Collection & Integration:
- Deploy and manage Grafana Alloy collectors across Kubernetes clusters, Linux hosts, network infrastructure, storage platforms, and hardware management systems.
- Integrate telemetry from Kubernetes, GPU infrastructure, HPE hardware, storage platforms, network devices, and cloud-native services.
- Develop and maintain observability integrations using OpenTelemetry standards and protocols.
- Establish onboarding processes for new platforms, applications, and infrastructure services.
- Collaborate with application teams to define observability requirements and future tracing adoption strategies.
Alerting & Operational Insights:
- Design and implement alerting frameworks using recording rules, Alertmanager, and operational best practices.
- Develop operational dashboards and service health views for infrastructure, platform, and application services.
- Support integration of observability events with ITSM and incident-management platforms.
- Define SLIs, SLOs, alert thresholds, and operational KPIs.
- Continuously improve platform observability, incident detection, and root-cause analysis capabilities.
Reliability & Automation:
- Implement Infrastructure-as-Code and GitOps practices for observability platform deployment and configuration management.
- Develop automation for dashboard provisioning, alert deployment, tenant onboarding, and telemetry configuration.
- Design and validate disaster recovery, resilience, and failover capabilities across observability services.
- Contribute to platform security, compliance, and operational governance initiatives.
- Work with operational teams to ensure observability services remain reliable, scalable, and maintainable.
Required Experience & Skills:
- Significant experience implementing and operating enterprise observability or monitoring platforms.
- Strong understanding of metrics, logs, traces, OpenTelemetry, and modern observability principles.
- Experience with Grafana ecosystem technologies including Grafana, Prometheus, Grafana Mimir, Grafana Loki, Grafana Tempo, and Grafana Alloy.
- Experience designing Kubernetes-native solutions and operating distributed platforms at scale.
- Knowledge of Linux systems administration and cloud-native infrastructure.
- Experience implementing Infrastructure-as-Code and GitOps approaches (preferably including Ansible).
- Skilled in developing automation and operational tooling using Python and/or Go.
- Previous exposure to creating technical architecture, operational documentation, and deployment designs.
- Experience with object storage technologies and distributed data platforms.
- Strong understanding of monitoring, alerting, and operational event management.
Reference: 3121705477
https://jobs.careeraddict.com/post/113401943