Site Reliability Engineer, Gitaly (Remote)
Project Detail
Gitaly is the Git data storage tier of GitLab, providing a reliable, secure and fast distributed Git data store over gRPC. For more information about Gitaly, see the team’s Direction page.
Gitaly’s high-availability storage requires developers who understand distributed storage systems and their management, observability and availability. The Cluster team contributes features, fixes bugs and improves the performance of this software stack.
We are currently building a new distributed cluster solution and improving our Disaster Recovery readiness.
What you’ll do
- Work with peer SREs to maintain Gitaly’s environments within GitLab’s SaaS offerings, including cost and performance optimization, capacity planning, migrations and debugging production issues.
- Participate in architectural discussions and decisions surrounding Gitaly within the broader GitLab ecosystem.
- Design RPC interfaces for the Gitaly service.
- Scope, estimate and describe tasks to reach the team’s goals.
- Develop production automation and tooling for Gitaly, for use both in SaaS and self-managed installations.
- Help ensure that Gitaly development tooling, releases and other processes serve the goals of the team and the product.
- Develop Gitaly in line with the product’s goals, with a focus on reliability and maintainability.
- Instrument, monitor and profile Gitaly in the production environment.
- Build dashboards and alerts to monitor the health of your services.
- Conduct acceptance testing of the features you’ve built.
- Educate all team members on best practices relating to high availability.
- Write performant, maintainable, and elegant code and peer review others’ code.
- Be positive and solution-oriented.
- Constantly improve the quality and security of the product.
- Take the initiative to improve the software, in small or large ways, to address pain points you encounter as a developer.
- Help qualify developer candidates during hiring.
- Respond to user emergencies, platform alerts and support requests, including regular on-call duties.
What you’ll bring
- Mandatory: experience running highly-available systems in production environments at scale.
- Mandatory: hands-on experience with cloud technologies, including Kubernetes.
- Mandatory: proven professional experience building, debugging and optimizing software in large-scale, high-volume environments.
- Mandatory: proven professional experience writing and testing high-quality code.
- Mandatory: a good understanding of building instrumented, observable software systems.
- Highly desirable: experience with Terraform for infrastructure as code.
- Highly desirable: proven professional experience writing and testing quality code in Go.
- Highly desirable: a good understanding of Git’s internal data structures or experience running Git servers.
- Highly desirable: experience with gRPC.
- Highly desirable: willingness to learn Ruby.