URGENT Site Reliability Engineer needed - Stockholm - Long-term!
Posted on Jan 15, 2020 by Linksap Europe Ltd
My customer has a newly formed delivery team whose main responsibility is to ensure the availability and performance of connectivity services.
We are searching for a creative, senior, hands-on operational SRE who will help us to grow and evolve.
You will be part of a distributed team, located in Stockholm and St Petersburg (Russia). The current team consists of 2 SREs, and we aim to grow to 5 SREs in the coming year.
90% of our work involves monitoring, controlling and scaling a microservices-based connectivity platform deployed in IBM Cloud. The platform is connected to multiple systems, which are mostly hosted by IT or other suppliers.
Currently we have a strong focus on the monitoring side and are eager to move to a proactive approach with predictive analytics and behavior analysis.
Our infrastructure is based on:
- Distributed microservices architecture
- Java and Node.js Back End applications
- Orchestration: Kubernetes, Terraform
- Messaging: Kafka, MQTT
- Database: DB2, MongoDB, Redis
Your responsibilities:
- Evolve how we work with container deployment and orchestration at scale
- Maintain the Kubernetes clusters in different regions
- Build automated infrastructure to deliver metrics from production environments
- Handle monitoring, alerting, and incident resolution, and provide root cause analysis for incidents
- Identify performance bottlenecks
- Infrastructure as code
- Automation as much as possible
- Continuous improvement of the infrastructure
A successful candidate should have:
- MSc/BSc or higher degree in a relevant area
- 3+ years of experience in hosting and operating microservices-based systems
- 3+ years of experience in running cloud-based production environments
- Experience with Docker, Kubernetes, Linux
- Experience with monitoring systems such as Prometheus, Grafana, Opsgenie, etc.
- Experience with log management systems such as Splunk, Stackdriver, etc.
- Experience with CI/CD pipelines
- Good programming skills
- Proficient in English
Good to have:
- IoT platform experience
- Experience building scalable platforms
- Experience with disaster recovery
- Experience with Chaos Engineering
- Strong troubleshooting skills