
Machine Learning Data Engineer

Location: Chennai, Tamil Nadu, India
Posted: February 26, 2025


Resume:

PAVAN KALYAN PUNUGUPATI

Mobile: 234-***-**** ************@*****.***

SUMMARY

Data Engineer with 3+ years of experience delivering data-driven solutions that improve data processing efficiency, accuracy, and utility. Specializes in cloud technologies, data warehousing, and machine learning to drive measurable business outcomes. Key highlights include:

• Cloud Platforms & AWS Services: Extensive experience with AWS services such as Redshift, S3, RDS, Lambda, and Glue, including the migration of on-premises databases to cloud platforms. Proficient in managing cloud resources using AWS CloudFormation, EC2, CloudWatch, and Elastic Load Balancing.

• Data Warehousing & ETL: Skilled in designing and maintaining scalable data warehouse solutions using AWS Redshift and RDS. Expert in building and optimizing ETL pipelines with AWS Glue, ensuring seamless data integration and management with AWS Data Lake and Lambda.

• Data Analysis & Machine Learning: Expertise in using Python, R, SQL, Hive, PySpark, and Spark SQL for data mining, cleansing, and analysis. Hands-on experience developing machine learning models for classification, regression, clustering, and decision trees.
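
A minimal illustration of the kind of classification modeling described above, using scikit-learn on a toy dataset (the library choice and data are assumptions, not the production setup):

    # Toy decision-tree classifier; illustrative only, not the production model.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))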

• Programming & Development: Proficient in web application development using Python, Django, Flask, C++, HTML, CSS, and JavaScript. Strong in writing optimized SQL queries, stored procedures, triggers, and views for efficient data management.

• Containerization & CI/CD: Experience with containerization tools such as Docker and Kubernetes, and build automation with CI/CD tools such as Jenkins, Maven, and Apache Ant.

• Web Services & APIs: Extensive experience in developing web services (SOAP and REST) using Python, optimizing data processing workflows.

• Methodologies: Skilled in Agile (Scrum) and Test-Driven Development (TDD), delivering scalable, innovative data solutions.

TECHNICAL SKILLS

Operating Systems: Windows, Mac OS, Linux (CentOS, Debian, Ubuntu)
Programming Languages: Python, R, C, C++

Web Technologies: HTML/HTML5, CSS/CSS3, XML, jQuery, JSON, Bootstrap, Angular
Python Libraries/Packages: NumPy, SciPy, Boto, Pickle, PySide, PyTables, DataFrames, Pandas, Matplotlib, SQLAlchemy, HTTPLib2, Urllib2, BeautifulSoup, PyQuery

Statistical Analysis Skills: A/B Testing, Time Series Analysis, Markov Chains
IDEs: PyCharm, PyScripter, Spyder, PyStudio, PyDev, IDLE, NetBeans, Sublime Text, Visual Studio Code
Machine Learning and Analytical Tools: Supervised Learning (Linear Regression, Logistic Regression, Decision Trees, Random Forests, SVM, Classification), Unsupervised Learning (Clustering, KNN, Factor Analysis, PCA), Natural Language Processing, Google Analytics, Fiddler, Tableau

Cloud Computing: AWS, Azure, Rackspace, OpenStack, Redshift, AWS Glue
AWS Services: Amazon EC2, Amazon S3, Amazon SimpleDB, Amazon MQ, Amazon ECS, AWS Lambda, Amazon SageMaker, Amazon RDS, Elastic Load Balancing, Amazon Elasticsearch, Amazon SQS, AWS Identity and Access Management (IAM), Amazon CloudWatch, Amazon EBS, AWS CloudFormation

Databases/Servers: MySQL, SQLite3, Cassandra, Redis, PostgreSQL, CouchDB, MongoDB, Teradata, Apache HTTP Server 2.0, Nginx, Tomcat, JBoss, WebLogic

ETL: Informatica 9.6, DataStage, SSIS

Web Services/Protocols: TCP/IP, UDP, FTP, HTTP/HTTPS, SOAP, REST/RESTful
Version Control: Git, GitHub, SVN, CVS

Build and CI Tools: Docker, Kubernetes, Maven, Gradle, Jenkins, Hudson, Bamboo
SDLC/Testing Methodologies: Agile, Waterfall, Scrum, TDD

CERTIFICATIONS

• Excel: Managing and Analyzing Data from LinkedIn Learning.

• SQL Essential Training from LinkedIn Learning.

WORK EXPERIENCE

Client: Elevance Health, Indianapolis, IN

Data Engineer [September 2023 – Present]

• Developed a data platform from scratch, participating in requirement gathering, analysis, and documentation of business requirements to ensure alignment with business goals.

• Migrated on-premises database structures to cloud environments such as AWS Redshift, ensuring efficient data transfer and storage.

• Created and optimized data pipelines for Kafka clusters, processed data using Spark Streaming, and utilized AWS Glue to manage incremental data loads into S3 staging and persistence areas, enhancing data processing efficiency.
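
A minimal sketch of this Kafka-to-S3 staging flow using PySpark Structured Streaming; the broker, topic, schema, and bucket names are hypothetical placeholders:

    # Stream events from Kafka, parse the JSON payload, and land them in S3 staging.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType

    spark = SparkSession.builder.appName("kafka-to-s3-staging").getOrCreate()

    schema = StructType([
        StructField("event_id", StringType()),
        StructField("amount", DoubleType()),
    ])

    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
        .option("subscribe", "claims-events")              # hypothetical topic
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("e"))
        .select("e.*")
    )

    query = (
        events.writeStream
        .format("parquet")
        .option("path", "s3://example-bucket/staging/claims/")  # hypothetical bucket
        .option("checkpointLocation", "s3://example-bucket/checkpoints/claims/")
        .start()
    )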

• Developed REST APIs using Python with Flask and Django frameworks, integrating diverse data sources including Java, JDBC, RDBMS, Shell scripting, spreadsheets, and text files, ensuring seamless data accessibility and integration.
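
A minimal sketch of a REST endpoint of this kind in Flask; SQLite stands in for the actual RDBMS source, and the table and fields are hypothetical:

    # Read-only REST endpoint over a relational source.
    from flask import Flask, jsonify
    import sqlite3  # stand-in for the JDBC/RDBMS source

    app = Flask(__name__)

    @app.route("/members/<member_id>", methods=["GET"])
    def get_member(member_id):
        conn = sqlite3.connect("members.db")  # hypothetical database file
        row = conn.execute(
            "SELECT id, name, plan FROM members WHERE id = ?", (member_id,)
        ).fetchone()
        conn.close()
        if row is None:
            return jsonify({"error": "not found"}), 404
        return jsonify({"id": row[0], "name": row[1], "plan": row[2]})

    if __name__ == "__main__":
        app.run()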

• Worked with Hadoop architecture, managing Hadoop daemons including the NameNode, DataNode, JobTracker, TaskTracker, and ResourceManager to ensure optimal performance of the Hadoop ecosystem.

• Utilized AWS Data Pipeline to automate and schedule data loads from S3 into Redshift, streamlining ETL workflows and ensuring timely and accurate data processing.
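
A sketch of the load step such a pipeline automates, expressed as a Redshift COPY issued from Python via psycopg2; the cluster endpoint, table, bucket, and IAM role are hypothetical:

    # Load staged Parquet files from S3 into a Redshift table.
    import psycopg2

    COPY_SQL = """
        COPY analytics.claims
        FROM 's3://example-bucket/staging/claims/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
        FORMAT AS PARQUET;
    """

    with psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical
        dbname="dev", user="etl_user", password="***", port=5439,
    ) as conn:
        with conn.cursor() as cur:
            cur.execute(COPY_SQL)  # commits on clean exit of the with-block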

• Developed and optimized data ingestion processes, using tools like Hive, Pig, and MapReduce for efficient data transformation and loading into Data Warehouses.

• Performed data extraction, aggregation, and consolidation within AWS Glue, optimizing data flows for more efficient processing and analytics.
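
A sketch of this kind of Glue extraction-and-aggregation step; it runs inside an AWS Glue job, and the catalog database, table, and output path are hypothetical:

    # Pull a catalog table into a DataFrame, aggregate, and write a mart.
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext
    from pyspark.sql import functions as F

    glue_ctx = GlueContext(SparkContext.getOrCreate())

    claims = glue_ctx.create_dynamic_frame.from_catalog(
        database="staging", table_name="claims"  # hypothetical catalog entries
    ).toDF()

    daily = claims.groupBy("claim_date").agg(F.sum("amount").alias("total_amount"))
    daily.write.mode("overwrite").parquet("s3://example-bucket/marts/daily_claims/")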

• Scheduled and automated jobs using tools like Crontab, RunDeck, Control-M, and Oozie, improving operational efficiency and ensuring timely execution of data tasks across systems.

• Built Cassandra queries for CRUD operations, and used Bootstrap for designing responsive HTML page layouts, improving the user experience for web applications.
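
A minimal sketch of the Cassandra CRUD layer using the Python cassandra-driver; the keyspace, table, and values are hypothetical:

    # Basic create/read/delete operations against a Cassandra keyspace.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("app_keyspace")  # hypothetical keyspace

    session.execute(
        "INSERT INTO users (id, name) VALUES (%s, %s)", ("u1", "Pavan")
    )
    row = session.execute("SELECT id, name FROM users WHERE id = %s", ("u1",)).one()
    session.execute("DELETE FROM users WHERE id = %s", ("u1",))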

• Developed full-stack applications using Python with Django, integrating JavaScript, Bootstrap, Cassandra, MySQL, and HTML5/CSS to create dynamic and interactive web applications.

• Used Sqoop for data import/export operations (e.g., copying data from HDFS) and developed Spark code, Spark-SQL, and Spark Streaming for fast testing and processing of data.

• Analyzed SQL scripts and implemented data processing solutions using PySpark for efficient handling and analysis of large datasets.

• Utilized JSON and XML SerDes for serialization and deserialization, efficiently managing data in Hive tables and enhancing data integration processes.

• Utilized SparkSQL to load and process JSON data, created Schema RDDs, and optimized data handling and querying by loading data into Hive tables.
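
A minimal sketch of this JSON-to-Hive flow with SparkSQL; the input path, view, and table names are hypothetical (Schema RDDs correspond to DataFrames in current Spark):

    # Load JSON, clean it via SQL, and persist the result as a Hive table.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.enableHiveSupport()
        .appName("json-to-hive").getOrCreate()
    )

    raw = spark.read.json("s3://example-bucket/raw/events/")  # hypothetical path
    raw.createOrReplaceTempView("events_raw")

    cleaned = spark.sql(
        "SELECT event_id, CAST(amount AS DOUBLE) AS amount "
        "FROM events_raw WHERE event_id IS NOT NULL"
    )
    cleaned.write.mode("overwrite").saveAsTable("analytics.events")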

• Developed PySpark-based data processing tasks, including reading from external sources, merging datasets, performing data enrichment, and loading data into target destinations.

• Integrated AWS services like S3 and RDS to host static/media files and databases in the cloud, enhancing data accessibility and scalability.
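
A sketch of serving static/media files from S3 via boto3; the bucket and key names are hypothetical:

    # Upload an asset to S3 and mint a time-limited download URL.
    import boto3

    s3 = boto3.client("s3")
    s3.upload_file("static/report.css", "example-app-assets", "static/report.css")

    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "example-app-assets", "Key": "static/report.css"},
        ExpiresIn=3600,  # link valid for one hour
    )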

• Developed applications in a Linux environment, using essential commands and Jenkins for continuous integration, and deployed projects via Git version control.

• Used Docker for continuous delivery in a highly scalable environment, integrated with Nginx for efficient load balancing and system reliability.

• Developed MongoDB-based applications, storing data in JSON format, and created interactive dashboards with Python, Bootstrap, CSS, and JavaScript to present data insights effectively.
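
A minimal sketch of the MongoDB storage layer behind such dashboards, using pymongo; the connection string, database, collection, and document are hypothetical:

    # Store and retrieve JSON-shaped metric documents.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    dashboards = client["analytics"]["dashboard_metrics"]  # hypothetical collection

    dashboards.insert_one({"metric": "daily_active_users", "value": 1250})
    latest = dashboards.find_one({"metric": "daily_active_users"})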

Client: Cipla, Bangalore, India

Junior Data Engineer [March 2020 – July 2022]

• Developed the full frontend and backend modules in Python using the Django web framework.
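
A minimal sketch of a Django view of the kind this involved; the model, template, and field names are hypothetical:

    # List view rendering the most recent records into a template.
    from django.shortcuts import render
    from .models import Policy  # hypothetical model

    def policy_list(request):
        policies = Policy.objects.order_by("-created_at")[:50]
        return render(request, "policies/list.html", {"policies": policies})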

• Designed and developed the UI of the website using HTML, AJAX, CSS, and JavaScript.

• Used Bootstrap CSS to develop responsive web applications.

• Designed ETL Process using Informatica to load data from flat files and Excel files to the target Oracle Data Warehouse database.

• Designed and developed web services using XML and jQuery.

• Built various graphs for business decision-making using the Python matplotlib library.
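
A minimal sketch of the kind of decision-support chart built with matplotlib; the figures are illustrative only:

    # Simple bar chart of quarterly revenue for business review decks.
    import matplotlib.pyplot as plt

    quarters = ["Q1", "Q2", "Q3", "Q4"]
    revenue = [1.2, 1.5, 1.4, 1.9]  # hypothetical figures, in millions

    plt.bar(quarters, revenue)
    plt.title("Quarterly Revenue")
    plt.ylabel("Revenue ($M)")
    plt.savefig("quarterly_revenue.png")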

• Developed applications in a UNIX environment, working fluently with the related shell commands.

• Used NumPy for numerical analysis, specifically for insurance premium calculations.
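
A minimal sketch of a vectorized premium computation in NumPy; the rating factors are hypothetical, not the actual pricing model:

    # Vectorized premium adjustment across a batch of policies.
    import numpy as np

    base_premium = np.array([500.0, 650.0, 720.0])
    age_factor = np.array([1.0, 1.2, 1.5])
    risk_loading = 0.08  # hypothetical flat loading

    premiums = base_premium * age_factor * (1 + risk_loading)
    print(premiums.round(2))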


