Job Description
We are seeking a skilled and detail-oriented Observability Engineer to join our remote team. In this role, you will design, implement, and maintain observability solutions that ensure the high availability, performance, and reliability of our systems. Your work will give teams real-time insight through metrics, logs, and traces, enabling faster incident response and a better understanding of system behavior.
Key Responsibilities:
Develop and maintain observability platforms (e.g., Prometheus, Grafana, OpenTelemetry, ELK, Datadog, New Relic)
Design and implement monitoring strategies across distributed systems
Collaborate with DevOps, SRE, and engineering teams to define SLIs, SLOs, and dashboards
Create automated alerts and integrations to improve incident detection and resolution
Analyze performance data and logs to identify trends, bottlenecks, and areas for optimization
Ensure observability tools are reliable, secure, and scalable
Provide documentation and training to empower teams to use observability tools effectively
Qualifications:
2+ years of experience in observability, monitoring, or site reliability engineering
Strong knowledge of monitoring, logging, and tracing tools and best practices
Experience with cloud infrastructure (AWS, GCP, or Azure)
Proficiency in scripting or automation (e.g., Python, Bash, Terraform)
Excellent problem-solving skills and the ability to work independently in a remote setting