Chief / Senior Manager - Site Reliability Engineering - Application (Production Support), Navi Mumbai
Job Description:
- Be responsible for production support and release management for the assigned applications (SRE C1): Elastic Stack (ELK), Application Performance Management (APM), and Disaster Recovery (DR).
- Should possess excellent troubleshooting and analytical skills.
- This senior leadership role requires strong technical expertise, strategic thinking, and proven experience in managing mission-critical systems at scale.
Elastic Stack (ELK) Cluster Lead
Architect, deploy, and optimize ELK clusters for enterprise observability.
Ensure log ingestion, parsing, and visualization meet compliance and
performance standards.
Drive automation for scaling, resilience, and performance tuning.
Application Performance Management (APM) Cluster Lead
Define and implement APM strategy across critical applications.
Lead deployment and integration of APM tools (such as Dynatrace, AppDynamics,
New Relic, and Datadog).
Establish KPIs, SLAs, and proactive monitoring frameworks to ensure
application reliability.
Design synthetic monitoring for critical business journeys and key
metrics.
Disaster Recovery (DR) Oversight
Own DR strategy, planning, and execution for enterprise applications.
Conduct regular DR drills, audits, and compliance checks.
Align DR processes with business continuity and regulatory requirements.
Ensure robust replication between primary and secondary sites.
Oversee daily backups.
Ensure all disaster recovery processes and documentation meet the obligations
mandated by the regulators.
Provide comprehensive audit reports for DR/DC environments.
Lead the command structure when a disruptive event occurs and direct the
recovery teams, such as network, database, and application.
Coordinate the dissemination of critical information to senior management
and external stakeholders.
Conduct thorough evaluations of incidents to determine failures and issues.
SRE Practices
Champion SRE principles: reliability, scalability, automation, and continuous
improvement.
Monitor error budgets, SLIs, SLOs, and SLAs for critical systems.
Drive incident management, root cause analysis, and long-term remediation.
Company Profile
A leading Non-Banking --- Company (NBFC) that caters to the growing needs of an aspirational India, serving both individual and business clients. Incorporated