Site Reliability Engineer (SRE) - FS

Job description

Position Overview
We are seeking a skilled and motivated Site Reliability Engineer (SRE) with 3+ years of experience. The role focuses on ensuring the availability, performance, and scalability of platforms and services using modern observability and orchestration tools.

Key Responsibilities

  • Design and maintain scalable, resilient, and secure infrastructure.
  • Monitor system health using ELK, Grafana, Prometheus, and ITRS Geneos.
  • Manage containerized applications in Kubernetes environments.
  • Develop automated deployment pipelines and configuration management tools.
  • Collaborate with development and product teams to embed reliability into service design.
  • Support real-time data streaming platforms using Kafka.
  • Respond to incidents, conduct root cause analysis, and implement preventive measures.
  • Contribute to SRE best practices, documentation, and runbooks.

Required Skills & Experience

  • 3+ years in SRE, DevOps, or infrastructure-focused roles.
  • Hands-on experience with observability tools (ELK, Grafana, Prometheus, ITRS Geneos).
  • Proficiency in Kubernetes and container orchestration.
  • Experience with Kafka for data streaming.
  • Strong scripting skills (Python, Bash) and automation experience.
  • Solid understanding of Linux, networking, and cloud infrastructure.
  • Strong troubleshooting and collaboration skills.

Preferred Qualifications

  • Experience in financial services or regulated industries.
  • Familiarity with CI/CD pipelines and Infrastructure as Code (Terraform, Ansible).
  • Certifications in Kubernetes, AWS, or related technologies.