


Sairam

Email: ******.************@*****.*** Phone: 940-***-****

Sr. Cloud Data Engineer

Professional Summary

Almost 9 years of IT experience as a data engineer, with expertise in designing data-intensive applications using the Hadoop ecosystem and delivering Big Data analytics, cloud data engineering (AWS, Azure), data visualization, data warehousing, reporting, and data quality solutions.

Hands-on expertise with the Hadoop ecosystem, including strong knowledge of Big Data technologies such as HDFS, Spark, YARN, Kafka, MapReduce, Apache Cassandra, HBase, Zookeeper, Hive, Oozie, Impala, Pig, and Flume.

Worked extensively with PySpark, applying knowledge of SparkContext, Spark SQL, the DataFrame API, Spark Streaming, and pair RDDs to increase the efficiency and optimization of existing Hadoop approaches.
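
A minimal PySpark sketch of the kind of DataFrame and Spark SQL work described above; the paths, column names, and SparkSession setup are illustrative assumptions, not project code.

    # Hedged PySpark sketch: load a CSV, aggregate with the DataFrame API,
    # and run the equivalent Spark SQL query. Paths and column names are placeholders.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("example-etl").getOrCreate()

    # Assumed input: a CSV with 'customer_id' and 'amount' columns.
    orders = spark.read.csv("s3a://example-bucket/orders.csv", header=True, inferSchema=True)

    # DataFrame API aggregation
    totals = orders.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))

    # The same aggregation through Spark SQL
    orders.createOrReplaceTempView("orders")
    totals_sql = spark.sql(
        "SELECT customer_id, SUM(amount) AS total_amount FROM orders GROUP BY customer_id"
    )

    totals.write.mode("overwrite").parquet("s3a://example-bucket/output/order_totals/")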

Good understanding of Spark architecture, including Spark Core, Spark SQL, DataFrames, Spark Streaming, driver and worker nodes, stages, executors, and tasks.

In-depth understanding and experience with real-time data streaming technologies such as Kafka and Spark Streaming.

Hands-on experience with AWS components such as EMR, EC2, S3, RDS, IAM, Auto Scaling, CloudWatch, SNS, Athena, Glue, Kinesis, Lambda, Redshift, and DynamoDB to ensure a secure zone for the organization in the AWS public cloud.

Skilled in setting up Kubernetes clusters using tools like kubeadm, kops, or managed Kubernetes services (e.g., Amazon EKS, Google GKE, Azure AKS).

Proven experience deploying software development solutions for a wide range of high-end clients, including Big Data Processing, Ingestion, Analytics, and Cloud Migration from On-Premises to AWS Cloud.

Expertise in Azure infrastructure management (Azure Web Roles, Worker Roles, SQL Azure, Azure Storage).

Strong experience working with Informatica ETL, including Informatica PowerCenter Designer, Workflow Manager, Workflow Monitor, Informatica Server, and Repository Manager.

Good understanding of Spark architecture with Databricks and Structured Streaming; set up Databricks on AWS and Microsoft Azure, configured Databricks workspaces for business analytics, managed clusters in Databricks, and managed the machine learning lifecycle.

Demonstrated understanding of the Fact/Dimension data warehouse design model, including star and snowflake design methods.

Experience in working with Waterfall and Agile development methodologies.

Experience identifying data anomalies by performing statistical analysis and applying data mining techniques.

Experience in Hadoop development and administration; proficient with Hadoop and its ecosystem components, including Hive, HDFS, Pig, Sqoop, HBase, Spark, and Python.

Experience in developing custom UDFs for Pig and Hive.

Experienced in building Snowpipe pipelines, with in-depth knowledge of data sharing in Snowflake and of Snowflake database, schema, and table structures.

Demonstrated ability to ensure high availability and fault tolerance by setting up Kubernetes clusters across multiple nodes.

Designed and developed logical and physical data models that utilize concepts such as Star Schema, Snowflake Schema and Slowly Changing Dimensions.

Expertise in using Airflow and Oozie to create, debug, schedule, and monitor ETL jobs.

Experience with partitioning and bucketing concepts in Hive; designed both managed and external tables in Hive to optimize performance.
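
As a rough illustration of the managed/external table and partitioning work above, the HiveQL below is run through a Hive-enabled SparkSession; the table names, columns, and locations are hypothetical.

    # Hedged sketch of external vs. managed Hive tables with partitioning and bucketing.
    # Assumes a Hive-enabled SparkSession named `spark`; all names are placeholders.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS events_ext (
            event_id STRING,
            payload  STRING
        )
        PARTITIONED BY (event_date STRING)
        STORED AS PARQUET
        LOCATION 's3a://example-bucket/events/'
    """)

    spark.sql("""
        CREATE TABLE IF NOT EXISTS events_managed (
            event_id STRING,
            payload  STRING
        )
        PARTITIONED BY (event_date STRING)
        CLUSTERED BY (event_id) INTO 32 BUCKETS
        STORED AS ORC
    """)

    # Dynamic-partition insert from a staging view
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
        INSERT INTO TABLE events_managed PARTITION (event_date)
        SELECT event_id, payload, event_date FROM staging_events
    """)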

Experienced in configuring and administering the Hadoop Cluster using major Hadoop Distributions like Apache Hadoop and Cloudera.

Hands-on experience in handling database issues and connections with SQL and NoSQL databases such as MongoDB, HBase, and SQL Server. Created Java applications to handle data in MongoDB and HBase.

TECHNICAL SKILLS

Big Data Technologies: Hadoop MapReduce, HDFS, Sqoop, Pig, Hive, HBase, Oozie, Flume, NiFi, Kafka, ZooKeeper, YARN, Apache Spark, Mahout, Spark MLlib.

Databases: Oracle, MySQL, SQL Server, MongoDB, Cassandra, DynamoDB, PostgreSQL, Teradata, Cosmos.

Programming: Python, PySpark, Scala, Java, C, C++, Shell script, Perl script, SQL

Cloud Technologies: AWS, Microsoft Azure, GCP

Frameworks: Django REST framework, MVC, Hortonworks

Tools: PyCharm, Eclipse, Visual Studio, SQL*Plus, SQL Developer, TOAD, SQL Navigator, Query Analyzer, SQL Server Management Studio, SQL Assistant, Postman

Versioning tools: SVN, Git, GitHub

Operating Systems: Windows 7/8/XP/2008/2012, Ubuntu Linux, MacOS

Network Security: Kerberos

Database Modelling: Dimension Modeling, ER Modeling, Star Schema Modeling, Snowflake Modeling

Monitoring Tool: Apache Airflow

Visualization/ Reporting: Tableau, ggplot2, matplotlib, SSRS and Power BI

Machine Learning Techniques: Linear & Logistic Regression, Classification and Regression Trees, Random Forest, Associative rules, NLP and Clustering.

Professional Experience

Sr. Data Engineer July 2021 to Present

Apple - Sunnyvale, CA

Responsibilities:

Implemented solutions utilizing advanced AWS components (EMR, EC2, etc.) integrated with Big Data/Hadoop distribution frameworks (Hadoop YARN, MapReduce, Spark, Hive, etc.).

Designed and implemented Azure infrastructure solutions using Azure Resource Manager (ARM) templates and the Azure CLI.

Created on-demand tables on S3 files using Lambda functions and AWS Glue with Python and PySpark.
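
A hedged sketch of how such an on-demand table might be registered: a Lambda handler that creates an S3-backed Parquet table in the Glue Data Catalog. The database, table, column, and bucket names are assumptions for illustration.

    # Hedged sketch: Lambda handler registering an S3-backed Parquet table in the
    # AWS Glue Data Catalog. Database, table, and bucket names are illustrative.
    import boto3

    glue = boto3.client("glue")

    def lambda_handler(event, context):
        glue.create_table(
            DatabaseName="analytics_db",                    # assumed catalog database
            TableInput={
                "Name": "orders_ondemand",                  # assumed table name
                "TableType": "EXTERNAL_TABLE",
                "Parameters": {"classification": "parquet"},
                "StorageDescriptor": {
                    "Columns": [
                        {"Name": "order_id", "Type": "string"},
                        {"Name": "amount", "Type": "double"},
                    ],
                    "Location": "s3://example-bucket/orders/",
                    "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
                    "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
                    "SerdeInfo": {
                        "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
                    },
                },
            },
        )
        return {"status": "table registered"}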

Proficient in designing and implementing data models to support business requirements and improve data integrity and efficiency.

Experience with Azure DevOps for continuous integration and continuous deployment (CI/CD) pipelines.

Integrated Kubernetes with CI/CD pipelines, automating the deployment process using tools like Jenkins, GitLab CI/CD, or CircleCI.

Strong understanding of container orchestration principles and experience in scaling, updating, and monitoring containerized applications using Kubernetes.

Designed and implemented data extraction, transformation, and loading (ETL) processes with Talend to meet project requirements.

Experience in implementing and managing Azure Active Directory (Azure AD) for identity and access management.

Proficient in Apache Spark and Databricks, including data processing, manipulation, and analysis using PySpark.

Involved in code migration of a quality monitoring tool from AWS EC2 to AWS Lambda and built logical datasets to administer quality monitoring on Snowflake warehouses.

Installed and configured Apache Airflow for the S3 bucket and Snowflake data warehouse and created DAGs to run Airflow workflows.
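
A minimal Airflow DAG sketch for this kind of S3-to-Snowflake schedule; the two Python callables are hypothetical placeholders rather than the production tasks.

    # Hedged Airflow sketch: a daily DAG wiring an S3 extract to a Snowflake load.
    # The callables are placeholders; the real logic lived in the project code.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def pull_from_s3(**context):
        ...  # e.g. download the day's files with boto3 (placeholder)

    def load_into_snowflake(**context):
        ...  # e.g. run COPY INTO via the Snowflake connector (placeholder)

    with DAG(
        dag_id="s3_to_snowflake_daily",
        start_date=datetime(2021, 7, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="pull_from_s3", python_callable=pull_from_s3)
        load = PythonOperator(task_id="load_into_snowflake", python_callable=load_into_snowflake)
        extract >> load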

Implemented robust security measures within AWS GovCloud, including encryption, access control, and multi-factor authentication, ensuring the protection of sensitive data and compliance with ITAR requirements.

Collaborated with cross-functional teams to design and implement high-availability and disaster-recovery solutions specific to AWS GovCloud, resulting in high uptime for mission-critical applications.

Hands-on experience with Docker for containerization, creating Docker images, and optimizing image sizes for efficient deployments.

In-depth understanding of Kubernetes concepts such as namespaces, labels, selectors, and resource management.

Loaded the data into Spark RDD and performed in-memory data computation to generate the output response.

Proficient in working with Parquet, a columnar storage file format designed for big data processing frameworks like Apache Hadoop and Apache Spark.

Leveraged Dremio’s query acceleration capabilities to optimize SQL queries, improve performance, and reduce query execution times.

Developed Java-based database connectors and drivers for various databases (MySQL, PostgreSQL, Oracle) to enable seamless interaction between the data processing applications and database systems, ensuring data consistency and integrity.

Used Airbyte to optimize data integration pipelines for improved performance, scalability, and efficiency.

Proficient in working with Apache Iceberg, a table format that supports schema evolution, versioning, and time travel queries, enabling easy and efficient data schema changes.

Experienced in building and optimizing data pipelines in Databricks, leveraging Spark SQL and DataFrame APIs.

Experience in leveraging Databricks for machine learning and implementing scalable ML models.

Implemented ETL workflows on Databricks, integrating various data sources and transforming raw data into meaningful insights using Apache Spark libraries.

Knowledge of Databricks clusters and their configurations for optimal performance and scalability.

Automated resulting scripts and workflow using Apache Airflow and shell scripting to ensure daily execution in production.

Queried both Managed and External tables created by Hive using Impala.

Monitored and controlled Local disk storage and Log files using Amazon CloudWatch.

Played a key role in dynamic partitioning and Bucketing of the data stored in Hive Metadata.

Involved in extracting large volumes of data and analyzing complex business logic to derive business-oriented insights and recommend/propose new solutions to the business in Excel reports.

Experienced in performance tuning of Spark applications: setting the right batch interval time, the correct level of parallelism, and memory tuning.

Encoded and decoded JSON objects using PySpark to create and modify data frames in Apache Spark.
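
A small PySpark example of the JSON encode/decode pattern above; the schema and sample record are assumptions, and an existing SparkSession named spark is assumed.

    # Hedged sketch: parse a JSON string column into struct fields, then serialize back.
    # Assumes an existing SparkSession named `spark`; schema and data are illustrative.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    schema = StructType([
        StructField("order_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    raw = spark.createDataFrame([('{"order_id": "A1", "amount": 19.99}',)], ["payload"])

    # Decode: JSON string -> typed columns
    decoded = raw.withColumn("data", F.from_json("payload", schema)).select("data.*")

    # Encode: typed columns -> JSON string
    encoded = decoded.withColumn("payload", F.to_json(F.struct("order_id", "amount")))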

Implemented Dremio-based data virtualization solution that reduced query response times by 50% and improved data accessibility for business analysts.

Expertise in data virtualization using Dremio; created virtual datasets and provided a unified view of data from disparate sources.

Developed ETL jobs to automate real-time data retrieval from Salesforce.com and suggested best methods for data replication from Salesforce.com.

Engineered end-to-end data pipelines for processing and storing large volumes of data in Azure Data Lake Storage.

Designed and optimized data partitioning strategies in Azure Data Lake Storage for efficient data retrieval and storage cost reduction.

Experience in integrating Databricks with data sources such as Azure Data Lake Storage, Azure Blob Storage, and AWS S3.

Developed and maintained robust, scalable web applications using Django adhering to the Model-View-Controller Architectural pattern.

Built efficient backend services in Django handling authentication, authorization, and user management functionalities.

Performed data cleaning, feature scaling, and feature engineering using the pandas and NumPy packages in Python.
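
An illustrative pandas/NumPy cleaning and feature-scaling snippet in the spirit of the bullet above; the file and column names are placeholders.

    # Hedged pandas/NumPy sketch: basic cleaning, a derived feature, and min-max scaling.
    # File and column names are placeholders.
    import numpy as np
    import pandas as pd

    df = pd.read_csv("transactions.csv")

    # Cleaning: drop duplicates and fill missing numeric values with the median
    df = df.drop_duplicates()
    df["amount"] = df["amount"].fillna(df["amount"].median())

    # Feature engineering and min-max scaling
    df["log_amount"] = np.log1p(df["amount"])
    df["amount_scaled"] = (df["amount"] - df["amount"].min()) / (
        df["amount"].max() - df["amount"].min()
    )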

Experience in leveraging Redis data structures like strings, hashes, lists, sets, and sorted sets to optimize data storage and retrieval for specific use cases.

Automated unit tests to ensure code coverage and validate expected behavior, enhancing the stability and reliability of the code base.

Ensured cross-platform compatibility by performing functional tests on different operating systems and mobile devices.

Strong understanding of Redis caching techniques, including implementing cache strategies, key expiration, eviction policies, and cache invalidation mechanisms to improve application performance.
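
A hedged redis-py sketch of the cache-aside pattern with key expiration described above; the key layout, TTL, and the stubbed database lookup are assumptions.

    # Hedged sketch: cache-aside lookup with a TTL so stale entries expire on their own.
    # `fetch_customer_from_db` is a stand-in stub for the real data access call.
    import json
    import redis

    r = redis.Redis(host="localhost", port=6379, db=0)

    def fetch_customer_from_db(customer_id):
        # Placeholder for the real database lookup
        return {"customer_id": customer_id}

    def get_customer(customer_id, ttl_seconds=300):
        key = f"customer:{customer_id}"
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)                       # cache hit
        record = fetch_customer_from_db(customer_id)        # cache miss: go to the source
        r.set(key, json.dumps(record), ex=ttl_seconds)      # store with expiration
        return record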

Strong knowledge of Python libraries and frameworks such as NumPy, pandas, Flask, Django, and TensorFlow.

Familiarity with object-oriented programming (OOP) principles and design patterns in Python.

Experience in working with APIs and integrating external services into Python applications.

Proficient in data manipulation and analysis using pandas, including data cleaning, transformation, and aggregation.

Used Informatica Power Center for extraction, transformation, and loading (ETL) of data in the data warehouse.

Environment: Spark RDD, AWS Glue, Apache Kafka, Amazon S3, Java, SQL, Spark 1.6/2.0 (PySpark, MLlib), AWS Cloud, AWS GovCloud, Azure Data Lake, Azure, Databricks, ETL, NumPy, SciPy, pandas, scikit-learn, Seaborn, NLTK, EMR, EC2, Amazon RDS, Data Lake, Kubernetes, Docker, Python, Cloudera Stack, HBase, Hive, Impala, Pig, NiFi, Spark Streaming, Elasticsearch, Logstash, Apache Parquet, Apache Iceberg, Kibana, JAX-RS, Spring, Hibernate, Apache Airflow, Oozie, RESTful API, JSON, JAXB, XML, WSDL, MySQL, Talend, Cassandra, MongoDB, HDFS, ELK/Splunk, Athena, Tableau, Redshift, Scala, Snowflake.

Syntel/Cuna Mutual - Madison, WI October 2019 to June 2021

Sr. AWS Data Engineer

RESPONSIBILITIES:

Explored Spark to improve the performance and optimization of existing algorithms in Hadoop, using SparkContext, Spark SQL, DataFrames, and Spark on YARN.

Involved in file movements between HDFS and AWS S3, worked extensively with S3 buckets in AWS, and converted all Hadoop jobs to run in EMR by configuring the cluster according to the data size.

Wrote Spark applications for data validation, cleansing, transformations, and custom aggregations; imported data from different sources into Spark RDDs for processing, developed custom aggregate functions using Spark SQL, and performed interactive querying.

Worked on data pipeline creation to convert incoming data to a common format, prepare data for analysis and visualization, migrate between databases, share data processing logic across web apps, batch jobs, and APIs, and consume large XML, CSV, and fixed-width files; created data pipelines in Kafka to replace batch jobs with real-time data.
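
One hedged way to express the Kafka-based replacement of a batch job, shown here with Spark Structured Streaming rather than the exact production code; the broker address, topic, and output paths are placeholders, and an existing SparkSession named spark is assumed.

    # Hedged sketch: stream a Kafka topic and land it in a common Parquet format.
    # Broker, topic, and paths are placeholders; assumes a SparkSession named `spark`.
    events = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "incoming-events")
        .load()
    )

    parsed = events.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

    query = (
        parsed.writeStream.format("parquet")
        .option("path", "s3a://example-bucket/common-format/events/")
        .option("checkpointLocation", "s3a://example-bucket/checkpoints/events/")
        .start()
    )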

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs and Scala.

Used Sqoop for importing and exporting data between RDBMS and HDFS.

Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.

Created AWS Glue job for archiving data from Redshift tables to S3 (online to cold storage) as per data retention requirements and involved in managing S3 data layers and databases including Redshift and Postgres.

Processed the web server logs by developing multi-hop Flume agents using an Avro sink, loaded the data into MongoDB for further analysis, and worked on MongoDB NoSQL data modeling, tuning, disaster recovery, and backup.

Developed a Python script to load the CSV files into the S3 buckets.

Experienced in building and optimizing data pipelines in Databricks, leveraging Spark SQL and DataFrame APIs.

Collaborated with cross-functional teams to design and implement high-availability and disaster-recovery solutions specific to AWS GovCloud, resulting in high uptime for mission-critical applications.

Implemented ETL workflows on Databricks, integrating various data sources and transforming raw data into meaningful insights using Apache Spark libraries.

Developed and maintained robust, scalable web applications using Django adhering to the Model-View-Controller Architectural pattern.

Built efficient backend services in Django handling authentication, authorization, and user management functionalities.

Developed Java-based database connectors and drivers for various databases (MySQL, PostgreSQL, Oracle) to enable seamless interaction between the data processing applications and database systems, ensuring data consistency and integrity.

Experience in leveraging Databricks for machine learning and implementing scalable ML models.

Implemented efficient data ingestion processes to bring structured and unstructured data into Azure Data Lake Storage.

Implemented data encryption and ensured data security using Azure Data Lake encryption and Azure Key Vault integration.

Developed data processing workflows using Azure Data Factory for ETL operations, ensuring data quality, transformation, and enrichment.

Knowledge of Databricks clusters and their configurations for optimal performance and scalability.

Created AWS S3 buckets, performed folder management in each bucket, and managed logs and objects within each bucket.

Worked with different file formats like JSON, Avro, and Parquet and compression techniques like Snappy; developed Python code for different tasks, dependencies, SLA watchers, and time sensors for each job for workflow management and automation using the Airflow tool.

Extensive hands-on experience with Percona MySQL, including installation, configuration, and ongoing maintenance.

Proven ability to optimize MySQL databases for higher performance, including query optimization, indexing strategies, and server parameter tuning.

Managed the deployment and ongoing maintenance of Percona MySQL databases, ensuring high availability and optimal performance for critical applications.

Expertise in implementing high-availability solutions such as MySQL replication, clustering, and failover mechanisms to ensure database reliability.

Developed shell scripts for adding dynamic partitions to the Hive stage table, verifying JSON schema changes of source files, and verifying duplicate files in the source location.

Worked with importing metadata into Hive using Python and migrated existing tables and applications to work on AWS cloud (S3).

Integrated Hadoop into traditional ETL, accelerating the extraction, transformation, and loading of massive structured and unstructured data.

Involved in writing scripts in Oracle, SQL Server, and Netezza databases to extract data for reporting and analysis; worked on importing and cleansing high-volume data from various sources like DB2, Oracle, and flat files into SQL Server.

Managed containers using Docker by writing Dockerfiles, set up automated builds on Docker Hub, and installed and configured Kubernetes.

Worked extensively with importing metadata into Hive and migrated existing tables and applications to work on Hive and the AWS cloud, making the data available in Athena and Snowflake.

Extensively used Stash/Bitbucket for code control and worked on AWS components such as Airflow, Elastic MapReduce (EMR), Athena, and Snowflake.

Environment: Spark, AWS, AWS GovCloud, Azure Data Lake, EC2, EMR, Hive, Java, SQL Workbench, Tableau, Kibana, Sqoop, Spark SQL, Spark Streaming, Scala, Python, Hadoop (Cloudera Stack), Informatica, NVIDIA Clara, Jenkins, Docker, Hue, Netezza, Kafka, HBase, HDFS, Pig, Oracle, ETL, AWS S3, AWS Glue, Git, Grafana.

Value Payment System – Nashville, TN January 2018 to September 2019

Big Data Engineer

Responsibilities:

Created Spark jobs by writing RDDs in Python and created data frames in Spark SQL to perform data analysis and stored in Azure Data Lake.

Engineered robust data ingestion pipelines using Azure Data Factory to efficiently bring diverse data sources into Azure Data Lake Storage.

Implemented optimized data storage solutions within Azure Data Lake, including file formats, partitioning, and compression techniques, reducing storage costs and improving query performance.

Configured Spark Streaming to receive real-time data from Apache Kafka and store the streamed data in HDFS using Scala.

Developed Spark applications using Kafka and implemented an Apache Spark data processing project to handle data from various RDBMS and streaming sources.

Created various data pipelines using Spark, Scala and SparkSQL for faster processing of data.

Designed batch processing jobs using Apache Spark to increase speed compared to that of MapReduce jobs.

Wrote Spark SQL and embedded the SQL in Scala files to generate JAR files for submission to the Hadoop cluster.

Developed data pipeline using Flume to ingest data and customer histories into HDFS for analysis.

Executed Spark SQL operations on JSON, transformed the data into a tabular structure using data frames, and wrote the data to Hive and HDFS.

Worked with the Hive data warehouse infrastructure: creating tables, distributing data by implementing partitioning and bucketing, and writing and optimizing HQL queries.

Created Hive tables as per requirements, either internal or external, defined with appropriate static or dynamic partitions and bucketing for efficiency.

Used Hive as an ETL tool for event joins, filters, transformations, and pre-aggregations.

Involved in moving all log files generated from various sources to HDFS for further processing through Kafka.

Extracted real-time data using Kafka and Spark Streaming by creating DStreams, converting them into RDDs, processing them, and storing the results in HDFS.

Used the Spark SQL Scala interface, which automatically converts RDDs of case classes into schema RDDs.

Extracted source data from Sequential files, XML files, CSV files, transformed and loaded it into the target Data warehouse.

Solid understanding of NoSQL databases (MongoDB and Cassandra).

Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Spark SQL, and Scala; extracted large datasets from Cassandra and Oracle servers into HDFS and vice versa using Sqoop.

Involved in Migrating the platform from Cloudera to EMR platform.

Developed analytical component using Scala, Spark and Spark Streaming.

Worked on developing ETL processes to load data from multiple data sources to HDFS using FLUME and performed structural modifications using HIVE.

Provided technical solutions on MS Azure HDInsight, Hive, HBase, MongoDB, Telerik, Power BI, Spotfire, Tableau, Azure SQL Data Warehouse data migration techniques using BCP and Azure Data Factory, and fraud prediction using Azure Machine Learning.

Environment: Hadoop, Hive, Azure Data Lake, Kafka, Snowflake, Spark, Scala, HBase, Cassandra, JSON, XML, UNIX Shell Scripting, Cloudera, MapReduce, Power BI, ETL, MySQL, NoSQL

Big Data Engineer

Renee Systems Inc-Hyderabad September 2016 to November 2017

Responsibilities:

Collaborated with business user's/product owners/developers to contribute to the analysis of functional requirements.

Implemented Spark SQL queries that combine hive queries with Python programmatic data manipulations supported by RDDs and data frames.

Used Kafka streams to configure Spark Streaming to get information and then store it in HDFS.

Extracted the real-time feed using Spark Streaming, converted it to RDDs, processed the data as data frames, and saved the data in HDFS.

Developed Spark scripts and UDFs using Spark SQL queries for data aggregation and querying, and wrote data back into RDBMS through Sqoop.
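
A short PySpark illustration of registering a UDF for use in Spark SQL, as referenced above; the normalization logic and table name are assumptions, and an existing SparkSession named spark is assumed.

    # Hedged sketch: register a Python UDF and use it in a Spark SQL aggregation.
    # Assumes an existing SparkSession named `spark` and a registered `orders` view.
    from pyspark.sql.types import StringType

    def normalize_region(code):
        return (code or "").strip().upper()

    spark.udf.register("normalize_region", normalize_region, StringType())

    aggregated = spark.sql("""
        SELECT normalize_region(region) AS region, COUNT(*) AS order_count
        FROM   orders
        GROUP  BY normalize_region(region)
    """)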

Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs in Java for data cleaning and preprocessing.

Installed and configured Pig and wrote Pig Latin scripts.

Wrote MapReduce jobs using Pig Latin.

Worked on analyzing Hadoop clusters using different big data analytic tools including HBase database and Sqoop.

Worked on importing and exporting data from Oracle and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and report generation.

Created Hive tables and dynamically inserted data into them using partitioning and bucketing for EDW tables and historical metrics.

Experienced in handling large datasets using Partitions, Spark in Memory capabilities, Broadcasts in Spark, Effective & efficient Joins, Transformations, and others during the ingestion process itself.

Created ETL packages with different data sources (SQL Server, Oracle, Flat files, Excel, DB2, and Teradata) and loaded the data into target tables by performing different kinds of transformations using SSIS.

Designed and developed data integration programs in a Hadoop environment with the NoSQL data store Cassandra for data access and analysis.

Created partitions and bucketing across states in Hive to handle structured data, using Elasticsearch.

Performed Sqoop transfers of various files through HBase tables for processing data into several NoSQL DBs: Cassandra and MongoDB.

Environment: Hadoop, MapReduce, HDFS, Hive, Python, Kafka, HBase, Sqoop, NoSQL, Spark 1.9, PL/SQL, Oracle, Cassandra, MongoDB, ETL, MySQL

Data Analyst May 2015 to August 2016

Concept IT INC, Noida

Responsibilities:

Involved in designing physical and logical data model using ERwin Data modeling tool.

Designed the relational data model for operational data store and staging areas, Designed Dimension & Fact tables for data marts.

Extensively used ERwin data modeler to design Logical/Physical Data Models, relational database design.

Created Stored Procedures, Database Triggers, Functions and Packages to manipulate the database and to apply the business logic according to the user's specifications.

Created Triggers, Views, Synonyms and Roles to maintain integrity plan and database security.

Created database links to connect to other servers and access the required information.

Integrity constraints, database triggers and indexes were planned and created to maintain data integrity and to facilitate better performance.

Used Advanced Querying for exchanging messages and communicating between different modules.

Performed system analysis and design for enhancements; tested forms, reports, and user interaction.

Environment: Oracle 9i, SQL* Plus, PL/SQL, ERwin, TOAD, Stored Procedures.


