ROLE DESCRIPTION:
- Collaborate with architects, business users, and source (data) team members to discover data sources and assess the technical feasibility of ingesting them into the consolidated data environment/platform
- Iteratively design and build core components of our Data Platform (Hortonworks Big Data implementations with ecosystem products such as Kafka, Spark Streaming, Hive, Sqoop, MySQL, etc.; an ingestion sketch follows this list)
- Take ownership of design and development processes, ensuring incorporation of best practices, code sanity, and versioning across environments through tools such as Git
- Ability to draw intelligence from structured, unstructured, large, and complex data sets to meet functional and non-functional business requirements
- Ability to guide team members through unit development and troubleshooting of development/production issues
- Proven ability to operate in a mixed mode of incremental development, production support, and support to the business for their impromptu data requests
- Willingness to participate in quarterly (team deliverables) planning, providing actionable inputs and triaging dependencies
- For this role, hands-on experience with big data technologies such as Spark (Streaming), Kafka, Hadoop, Hive, Sqoop, etc., and previous experience on a data team are non-negotiable
- Process structured and unstructured data, validate data quality, and help design automated data quality tests in the Data Platform (a validation sketch follows this list)
- Interact with a large range of proprietary and open data sources, leveraging data in all sorts of formats, including files
- Assist in the development and support of data sets required by the business
- Collaborate with architects, engineers, data scientists, and business team members on project planning and goals
- Assemble large, complex data sets that meet functional and non-functional business requirements
- Identify, design, and implement internal process improvements: automating manual processes, optimizing data delivery, and re-designing infrastructure for greater scalability
- Build the infrastructure required for optimal extraction, transformation, and loading of data from a wide variety of data sources (an extraction sketch follows this list)
- Help scale data science models into production-level models that can handle real-time data (a streaming scoring sketch follows this list)
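
As a concrete illustration of the ingestion path named above (Kafka into Spark Streaming into Hive), here is a minimal PySpark Structured Streaming sketch. The broker address, topic, schema, checkpoint path, and table names are illustrative assumptions, not details of the actual platform.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical event schema; real payloads would differ.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_ts", TimestampType()),
])

spark = (SparkSession.builder
         .appName("kafka-to-hive-ingestion")
         .enableHiveSupport()
         .getOrCreate())

# Stream JSON events from Kafka (broker and topic are placeholders).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events-topic")
          .load()
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# Append each micro-batch to a Hive-backed table; the checkpoint
# directory tracks Kafka offsets across restarts.
(events.writeStream
 .outputMode("append")
 .option("checkpointLocation", "/tmp/checkpoints/events")
 .toTable("analytics.events")
 .awaitTermination())
```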
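
For the automated data quality tests mentioned in the list, one lightweight approach is a scheduled PySpark job that asserts simple invariants and fails loudly so the scheduler can flag the run. The table and column names below are assumptions carried over from the ingestion sketch.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("dq-checks")
         .enableHiveSupport()
         .getOrCreate())

df = spark.table("analytics.events")  # hypothetical table from the ingestion sketch

# Rule 1: the key column must never be null.
null_keys = df.filter(col("event_id").isNull()).count()

# Rule 2: the key column must be unique.
dupes = df.count() - df.select("event_id").distinct().count()

failures = []
if null_keys > 0:
    failures.append(f"{null_keys} rows with null event_id")
if dupes > 0:
    failures.append(f"{dupes} duplicate event_id values")

# Raising makes the job exit non-zero, so a scheduler can alert on it.
if failures:
    raise AssertionError("; ".join(failures))
print("All data quality checks passed")
```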
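
The stack lists Sqoop for relational ingestion; an equivalent parallel pull with Spark's JDBC source is sketched below, since it keeps extraction in the same codebase. All connection details, bounds, and table names are placeholders.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mysql-extract")
         .enableHiveSupport()
         .getOrCreate())

# Parallel JDBC read from MySQL, partitioned on a numeric key so each
# executor pulls one slice of the table. Every detail here is a placeholder.
orders = (spark.read
          .format("jdbc")
          .option("url", "jdbc:mysql://mysql-host:3306/shop")
          .option("dbtable", "orders")
          .option("user", "etl_user")
          .option("password", "change-me")
          .option("partitionColumn", "order_id")
          .option("lowerBound", "1")
          .option("upperBound", "1000000")
          .option("numPartitions", "8")
          .load())

# Land the extract as a Hive table for downstream transformation.
orders.write.mode("overwrite").saveAsTable("staging.orders")
```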
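
Finally, for scaling data science models to real-time data, one common pattern is to apply the model inside the streaming job itself, so scoring rides on the same Kafka-to-Spark pipeline. The threshold "model" below is a stand-in for a real trained model, and every name is hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

spark = (SparkSession.builder
         .appName("realtime-scoring")
         .enableHiveSupport()
         .getOrCreate())

# Stand-in for a trained model; in practice this would be a loaded
# Spark ML PipelineModel or a broadcast scikit-learn model.
def score(amount):
    return 1.0 if amount is not None and amount > 100.0 else 0.0

score_udf = udf(score, DoubleType())

# Same Kafka source pattern as the ingestion sketch (placeholders throughout).
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "features-topic")
          .load()
          .selectExpr("CAST(CAST(value AS STRING) AS DOUBLE) AS amount"))

scored = stream.withColumn("score", score_udf(col("amount")))

# Persist scores to a Hive-backed table for downstream consumers.
(scored.writeStream
 .outputMode("append")
 .option("checkpointLocation", "/tmp/checkpoints/scoring")
 .toTable("analytics.scores")
 .awaitTermination())
```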