Software Engineer III, Site Reliability Engineering, Google Cloud

United States • Hybrid remote

About the job

Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google Cloud's services (both our internally critical and our externally visible systems) have reliability and uptime appropriate to customers' needs and a fast rate of improvement. Additionally, SREs keep an ever-watchful eye on our systems' capacity and performance.

Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. On the SRE team, you’ll have the opportunity to manage the complex challenges of scale which are unique to Google Cloud, while using your expertise in coding, algorithms, complexity analysis and large-scale system design. SRE's culture of diversity, intellectual curiosity, problem solving and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow.

With your technical expertise, you will manage project priorities, deadlines, and deliverables. You will design, develop, test, deploy, maintain, and enhance software solutions.

Responsibilities

Write product or system development code.
Review code developed by other engineers and provide feedback to ensure best practices (e.g., style guidelines, checking code in, accuracy, testability, and efficiency).
Contribute to existing documentation or educational content and adapt content based on product/program updates and user feedback.
Triage product or system issues and debug/track/resolve by analyzing the sources of issues and the impact on hardware, network, or service operations and quality.
Participate in, or lead, design reviews with peers and stakeholders to decide among available technologies.

Senior Principal Site Reliability Engineer

Zillow

About the team

The SRE team at Zillow Group empowers ZG Product Teams to efficiently run “Zillow 2.0” services by reducing human error, aggressively focusing on automation, and providing deep insight into application behavior and health! We do that by incorporating aspects of software engineering and applying them to infrastructure and operations problems as a way to create and manage scalable and reliable distributed software systems.

About the role

We are looking for a Principal SRE or DevOps engineer with a demonstrated track record of building secure, large-scale, highly available services using automation and Infrastructure as Code, who is well versed in cloud architecture (with a focus on Kubernetes), and who loves to delight the engineers they support.

As a Senior Principal SRE, you will:

Architect, develop and deploy systems, processes and environments that support hundreds of services and microservices developed by engineers across ZG.
Interact with the engineers we support and other internal stakeholders as a consultant and spokesperson for the work, direction and philosophy of SRE.
Ensure developers across Zillow Group create the systems and infrastructure that power “Zillow 2.0”.
Consult on design and implementation decisions, and assist developers in troubleshooting and debugging related problems.

This role has been categorized as a Remote position. “Remote” employees do not have a permanent corporate office workplace and, instead, work from a physical location of their choice which must be identified to the Company. Employees may live in any of the 50 US States, with limited exceptions. In certain cases, an employee in a remote-designated job may need to live in a specific region or time zone to support customers or clients as part of their role.

In Colorado, Connecticut, Nevada and New York City, the standard base pay range for this role is $215,600.00 - $344,400.00 annually. This base pay range is specific to Colorado, Connecticut, Nevada and New York City and may not be applicable to other locations.

In addition to a competitive base salary, this position is also eligible for equity awards based on factors such as experience, performance and location. Actual amounts will vary depending on experience, performance and location.

Who you are

8+ years of relevant SRE, DevOps, Systems Engineering or Infrastructure Engineering experience.
A proven track record as a technical leader with broad multi-functional impact. Previous experience as a people manager is a large plus.
Experience with SDLC principles, architecture and operations.
Experience with Infrastructure as Code tools and processes.
Experience scripting/coding with Python, Java and/or Go.
Experience writing comprehensive documentation.
Experience working with senior leadership both inside and outside of engineering.
Excellent written and verbal communication skills.
Passionate about building and fostering good engineering practices and processes.

Site Reliability/Chaos Engineer

Fidelity TalentSource

Fidelity TalentSource is your destination for discovering your next temporary role at Fidelity Investments. We are currently sourcing for a Chaos Engineer to work in Fidelity’s Site Reliability Center of Excellence in Durham, NC.

The Role

Workplace Investing (WI) is seeking a Site Reliability Engineering (SRE) Chaos Engineering Contractor with 10+ years of industry experience.

We are looking for a Chaos Engineering Lead who combines strategic thought leadership skills, a strong development & automation background and sound business judgment. As a Chaos Engineering Lead, you will actively contribute to the day-to-day planning, design, execution, and reporting of chaos testing. You will also bring industry experience and “outside in” thought leadership to discover new opportunities, drive efficiencies in testing, and influence future Chaos Engineering standards and best practices.

This is an exciting opportunity to join a passionate SRE Center of Excellence (COE) team that is dedicated to providing a truly predictable customer experience. In times of market volatility and high volumes, there is an increased expectation of a consistent service level. In WI, we strive to meet this expectation by building reliability into our ecosystem. This will be achieved through defining & implementing practices in Resiliency Engineering, Automation, Observability & Chaos Testing while also ingraining a proactive Chaos Culture that prioritizes reliability-first design.

The Expertise and Skills You Bring

Proven track record performing chaos testing to build confidence in the system's capability to withstand turbulent conditions in production
Possess an architectural mindset with proven ability to review architecture to derive Chaos Strategy and expose vulnerabilities.
Experience working with standalone applications and across E2E Customer Transactions (multi app)
Possess an automation mentality to drive scalable Chaos approaches
Hands-on experience with Cloud technologies (AWS/Azure)
Excellent Java and Groovy Shell scripting skills.
Proven hands-on experience working with modern container services (Docker/Kubernetes)
Proven hands-on experience working with Web Services and Databases
Strong understanding of CI/CD Engineering
Strong understanding of Quality Engineering
Proven use of Chaos engineering tools (e.g. Gremlin)
Strong understanding of Performance testing tools
Strong understanding of Observability tooling (e.g. Datadog, Grafana, Kibana)
Strong understanding of API testing tools (SoapUI, Postman, Soatest)
Understanding of Agile Methodology

Behavioral

Thought leadership and Research capabilities
Ability to evaluate and propose best-of-breed tools and engineering best-practices
Deeply self-motivated with the ability to work independently, coordinating activities within cross-regional and multi-functional teams
Ability to deal with ambiguous situations
A passion for excellence, innovation, and teamwork; eager to learn and adapt every day
Proven ability to quickly learn, adapt and thrive in a fast-paced, dynamic and deadline-driven environment
Excellent Communication Skills

Preferred:

Experience in AI/ML
Expertise in Angular and Python

The Team

The SRE COE comprises a team of passionate experts dedicated to deriving and implementing site reliability practices across a number of key workstreams, including Observability, Resiliency, Chaos Engineering and Operations.

You will have accountability for delivering strategic change across a diverse set of applications, technologies, and squads.