Site Reliability Engineer

  • Germany

Experian

Company Description

Experian® Decision Analytics (DA) integrates predictive data and analytics into valuable business decisions that provide greater insight into decision performance and help companies keep pace with changing business priorities. We do this by applying expert consulting, analytical tools, software and systems to convert data into valuable business decisions.


Our expertise spans a variety of industries, and we provide software to some of the world’s largest finance companies, telcos and other blue-chip companies. The crown jewel in our software suite is PowerCurve, which provides best-in-class decisioning across the whole customer life-cycle, from customer acquisition to in-life and collections, as well as in fraud detection and identity resolution systems. PowerCurve can run on hosted and cloud platforms.


The ideal candidate will have either experience of operations, a passion for automation and an interest in software development, or experience of software development, a passion for automation and an interest in operational excellence. If you have incident management skills and can manage rationally and calmly during a crisis, that would be an added bonus. You will be expected to work occasional peak weekends and to meet some on-call requirements. This is the beginning of a growing team and we are looking for individuals to grow with it.


As a senior engineer, you will be expected to role-model behaviours and provide technical leadership within the team.

You will lead the team’s technical vision, bridging the gap across platforms, infrastructure, automation and software. You will be able to review and design non-functional requirements, prioritise key areas of operational architecture and guide both operational staff and software feature engineers on SRE best practice.



Job Description

Primary Accountabilities:

  • Uptime of Experian One – Experian’s Cloud SaaS offering for Decision Analytics.


Significant Demands:

  • Enhancing and automating the monitoring and alerting of our platform
  • Responding to incidents and restoring service, but also identifying issues before they happen.
  • Gaining a strong understanding of the systems to efficiently triage issues and find owners for problem resolution
  • An ability to identify an issue or a manual process and ensure that it never occurs again
  • Incident management; able to co-ordinate others and be co-ordinated during service disruptions with a focus on restoring availability
  • Ability to write complex queries using various tools, and to lead others to excel in this field
  • Reviewing systems designs and implementations to identify resiliency, scalability and monitoring issues prior to implementation
  • Strong knowledge of Kubernetes, Infrastructure as Code and high-availability principles.
  • Excellent communication skills in English with colleagues across the globe.


Working Practices and Relationships:

  • Strong relationships with other members of the SRE team, primarily based in Kuala Lumpur but also in London, Arizona and Sofia
  • Working relationships with colleagues in other departments and with third parties who support backing applications.
  • Collaborative relationships with developers, security and architects to influence them to build resilient, maintainable solutions
  • Proficiency in one programming or scripting language and willingness to apply software development best practices to an operational role


Principal Responsibilities

  • Work closely with the development team to bring new software releases to production more effectively
  • Develop and deploy OSS tools for use within the Cloud Operations group
  • Develop, deploy and manage a highly scalable, high-availability platform that monitors and maintains service performance and availability metrics
  • Automate everything, from deployment and monitoring to management and incident response; treat ‘Everything as Code’
  • Take ownership of our configuration management platforms
  • Collaborate with developers to bring new features and services into production
  • Assist in identifying and mitigating security threats to meet strict security compliance requirements
  • Develop and improve operational practices and procedures
  • Produce high-level design documentation where required
  • Innovation in all areas: people, process and technology.
  • Ensure close collaboration between Development and Operations, enabling smoother operation between teams.


More about you


  • Direct experience of supporting complex, highly scaled systems in production
  • Linux knowledge, with experience troubleshooting issues and predicting them in advance
  • Networking, troubleshooting and monitoring
  • Cloud Native application designs for high performance, scalability and resilience
  • Incident Management and co-ordination, Blameless PIRs
  • Kubernetes, OpenShift, Splunk, Dynatrace, ThousandEyes, ServiceNow, Jira, Jenkins, Python, Prometheus
  • Java, Cassandra, Redis, Rundeck, MongoDB, Apigee, Okta, PostgreSQL, AWS, Azure, GCP
  • Infrastructure as Code, GitOps
  • Line management, coaching or mentoring.


Key Behaviours

  • Excellent communication skills. Written and verbal fluency in English is required
  • Highly organised, with good attention to detail
  • #CustomerObsessed
  • Working across boundaries: geographical, team, language and cultural
  • Curious, and both willing and able to learn new technologies and practices
  • Cloud aware: you understand how cloud technologies differ from other technical approaches and are able to explain these differences to others.
  • Lives and breathes availability and operational excellence in technology


Is this you?

  • You strive to remove repetitive tasks from your daily existence
  • You are a keen follower of technology trends
  • You believe that software is to be used, not to be admired.
  • You solve for the future as well as the immediate
  • You empower others to deliver
  • You develop trust, you make conflict constructive, create commitment, drive accountability and drive results
  • You are articulate, clear, concise, and you can tailor your approach to the audience
  • You can manage stakeholders at all levels and influence decision making


Why this role is critical to us


As part of the next phase in DA growth, Engineering Services is looking to expand the Site Reliability Engineering team to offer round-the-globe cover. As an organisation, we are fully convinced that everything should be automated and that software should run software, and we believe in the Site Reliability Engineering model. We have established a platform using cutting-edge technology, such as Kubernetes, containers, pipelines and monitoring. Continuous improvements are required to ensure our platform and tools are reliable and scalable enough to meet increasing demand across various regions in future. This position is crucial for us to maintain the pace in scaling the SRE team.

Qualifications
  • Proven track record of managing AWS workloads, with particular focus on ECS and EKS clusters
  • Good understanding of key architectural principles such as high availability, disaster recovery and platform resiliency
  • DevOps skillset including Terraform and Jenkins
  • AWS optimisation techniques
  • Excellent interpersonal and communication skills
  • Ability to innovate and show initiative

Additional Information

Experian Careers – Creating a better tomorrow together


Find out what it's like to work for Experian by clicking here

Source
remotive.com
