
Machine Learning Data Engineer

Location:
Delaware, OH
Posted:
April 12, 2025


Resume:

LAKSHMI PRASANNA VANKAM

Mobile No: +1-614-***-****   Email: *******************@*****.***

PROFESSIONAL SUMMARY:

Professional with around 7 years of experience in Python, PySpark, Databricks, Apache Airflow, StreamSets, SQL, AWS, machine learning, and Java, with exposure to the healthcare, P&C insurance, telecommunications, and legal domains.

EXPERIENCE SUMMARY

• Certified as a Databricks Data Engineer Associate

• 4+ years of strong programming skills in Python

• Around 3 years of experience in Databricks

• Experienced in Delta Lake and data warehousing processes

• Experienced in handling terabyte-scale data loads using Databricks workflows and SQL warehouses, and in scheduling them

• Worked with Databricks vendor engineers several times to solve multiple complex issues around cost optimization

• Good knowledge of different Apache Airflow operators and of building data pipelines with them

• Good knowledge of Apache Airflow Datasets

• Worked on Teradata NOS to extract data from TDV to AWS S3

• Experienced in working with DevOps tools such as Terraform and Jenkins

• Experienced in setting up a new CI/CD pipeline deployed on Kubernetes

• Experienced with Informatica PowerCenter for loading data from CSV files into databases

• Developed a process to deploy new Airflow DAGs dynamically

• Good experience with Amazon S3, SNS, SQS, and Lambda

• Experienced in building dimension and fact tables

• Good experience with Kafka streaming

• Knowledge of Alteryx for merging data from diverse data sources

• Good experience creating data pipelines in StreamSets

• Experienced in extracting data from various sources, such as databases and file-based systems, and loading it into Delta Lake/S3 buckets

• Performed data analysis to choose data loading mechanisms

• Used PySpark optimization techniques to improve job performance

• Experienced working in an Agile development model

• Experienced in working with GitHub

• Experienced in production support and production deployment

• Demonstrated strong technical and analytical skills

• Created test cases on Python code using the Pytest library

EDUCATION

2006 - 2010: B.Tech. [Computer Science], 76%, SK University, Anantapur, Andhra Pradesh, India

IT SKILLS

Project-acquired skills: Kibana, OpenShift/Kubernetes, InfluxDB, Kafka, Telegraf, StreamSets, Jenkins, Terraform, AWS Glue, Alteryx, Apache Airflow, Informatica PowerCenter

Languages: Python, SQL, PySpark, Machine Learning

Platforms & Services: Databricks, Amazon S3, Amazon SNS, SQS, Delta Lake

Databases: Oracle, InfluxDB, MongoDB, MySQL, Teradata Vantage

ORGANIZATIONAL EXPERIENCE

April 2024 - Present with Cigna-Evernorth Services Inc., Polaris, OH, USA
April 2023 - April 2024 with Solomon Page, New York, NY, USA
March 2022 - April 2023 with Dataquest Corp., Columbus, OH, USA
June 2019 - June 2020 with Terralogic Software Solutions, Bangalore
Oct 2010 - June 2013 with Tech Mahindra, Hyderabad

KEY PROJECTS

Project: TDV to Delta Lake ingestion - Cigna (April 2023 - Present)
Role: Data Engineer

Environment: Python, SQL, PySpark, Teradata, Databricks, AWS S3, Glue, Athena, Airflow, Terraform
Details:

We deal with Cigna Corporate Warehouse data. We extract data from Teradata, load it into Delta Lake, and maintain another copy of the data in Delta Lake to make it available to the downstream teams.

We developed a dynamic data ingestion framework to load data from Teradata into Databricks. We extract data from Teradata using JDBC, EMR, or TDV NOS, land it in an S3 bucket as Parquet or CSV, and then load it from S3 into Delta Lake using Databricks workflows (job clusters) and SQL warehouse clusters. All of these tasks are orchestrated with Apache Airflow pipelines.
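
The S3-to-Delta step of such a pipeline can be sketched minimally in PySpark as below; the bucket path and table name are hypothetical placeholders, and a real incremental run would typically use MERGE rather than an overwrite.

from pyspark.sql import SparkSession

# On Databricks a SparkSession is already available as `spark`;
# getOrCreate() keeps the sketch runnable elsewhere too.
spark = SparkSession.builder.getOrCreate()

# Hypothetical landing location written by the Teradata extract (Parquet here).
landing_path = "s3://example-landing-bucket/corporate_warehouse/orders/"
df = spark.read.parquet(landing_path)

# Write the extract into a managed Delta table for downstream consumers.
(df.write
   .format("delta")
   .mode("overwrite")
   .saveAsTable("corporate_warehouse.orders"))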

My Responsibilities:

• Developed and deployed a dynamic data ingestion framework in Python to load table data into Databricks as soon as the Teradata load finishes.

• Worked on data extraction and loading.

• Worked on Databricks workflow optimizations.

• Worked on cost optimization and reduced DBUs spent by 50% by monitoring workloads and moving them to different job clusters and to SQL warehouse clusters.

• Developed data pipelines using Apache Airflow.

• Used multiple Airflow operators as part of developing the data pipelines.

• Developed Airflow Datasets to signal DAG completion to downstream teams (see the sketch after this list).

• Worked on different Delta table properties to improve the read and write performance of Delta tables.

• Created Databricks workflow jobs using Terraform scripts and Airflow DAGs.

• Worked on several Delta Lake features such as liquid clustering, OPTIMIZE, and VACUUM.

• Developed and enhanced various Python and PySpark scripts.

• Worked to resolve data mismatches between Teradata and Delta Lake.

• Worked on a POC to connect Snowflake and Databricks through Alteryx to load data.

• Analyzed data mismatches and discussed them with the Teradata teams.

• Monitored production jobs and their DBU consumption.

• Tested all enhancements on different tables in the dev and test environments.

• Monitored the performance of each enhancement in all environments.

• Loaded CSV files into Teradata using Informatica PowerCenter, applying multiple transformations; this data is in turn loaded into Databricks using our framework.

• Worked on dynamically creating and deploying new Airflow DAGs at deployment time.

• Worked on Databricks Unity Catalog to implement data security and data access for downstream teams.

• Gained familiarity with HIPAA compliance standards.

• Created a new CI/CD pipeline with the help of an application team.
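
As a rough illustration of the Dataset-based handoff mentioned above, an Airflow (2.4+) producer/consumer pair can look like the sketch below. The DAG IDs, Dataset URI, and schedule are hypothetical, and placeholder tasks stand in for the real pipeline's work.

from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

# Hypothetical URI identifying the Delta table produced by the ingestion DAG.
orders_delta = Dataset("s3://example-bucket/delta/corporate_warehouse/orders")

with DAG("teradata_to_delta_ingestion", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as producer:
    # Marking the final task with `outlets` emits a Dataset event on success.
    load_to_delta = EmptyOperator(task_id="load_to_delta", outlets=[orders_delta])

with DAG("downstream_consumer", start_date=datetime(2024, 1, 1),
         schedule=[orders_delta], catchup=False) as consumer:
    # This DAG runs only after the producer signals that the Dataset was updated.
    start_processing = EmptyOperator(task_id="start_processing")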

Project: Hercules - Accenture - Nationwide Insurance (March 2022 - April 2023)
Role: Data Engineer

Environment: Python, SQL, Oracle, MySQL, PySpark, Databricks, Amazon S3, SNS, SQS, StreamSets
Details:

We deal with call, ARL, and IVR data. We perform data load, harmonization, and curation on this data.
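
A minimal sketch of what the curation step can look like in PySpark is shown below; the table and column names are illustrative only, not the project's actual schema. A fact table is built by joining harmonized data to a dimension and aggregating.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

calls = spark.table("harmonized.calls")        # hypothetical harmonized table
dim_agent = spark.table("curated.dim_agent")   # hypothetical dimension table

# Join harmonized calls to the agent dimension and aggregate into a daily fact.
fact_calls = (calls
    .join(dim_agent, on="agent_id", how="left")
    .groupBy("agent_key", F.to_date("call_ts").alias("call_date"))
    .agg(F.count("*").alias("call_count"),
         F.avg("handle_time_sec").alias("avg_handle_time_sec")))

(fact_calls.write.format("delta").mode("overwrite")
    .saveAsTable("curated.fact_calls_daily"))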

My Responsibilities:

• Worked on all three phases: data load, harmonization, and curation.

• Worked on data analysis to decide on loading mechanisms.

• Used Databricks Delta Lake, PySpark, Python, Amazon S3, SNS, and SQS for bulk and incremental loads, and StreamSets for incremental and CDC data.

• As part of the harmonization process, built an STTM (data mapping) document by interacting with source teams and developed dimension tables.

• Created curation/fact tables from the initial data tables and harmonized tables and loaded data into them.

• Developed several SQL queries and complex PySpark scripts to load data as part of harmonization and curation.

• Worked on the DHF framework.

• Created and monitored harmonization and curation jobs in both dev and prod.

• Performed unit testing on the data loaded in all phases.

• Developed highly optimized PySpark code that improved job performance roughly tenfold.

Project: Anomaly Detection - Terralogic Software Solutions (June 2019 - June 2020)
Role: ML Engineer

Environment: Python, feature selection techniques, SimpleImputer, LabelEncoder, Autoencoder, PCA, InfluxDB, Kibana, Kafka, Telegraf, OpenShift, Django, Flask
Details:

As part of this project, the data ingestion team sources LTE call data from the client, transforms the data, stores it in InfluxDB, and sends it to Telegraf. Telegraf in turn uses its Kafka plugin to publish to topics; these topics are then consumed by the execution and training models to find anomalies in the call data. Result sets are stored in Elasticsearch and reported through Kibana dashboards.
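
The preprocessing and PCA-based anomaly scoring described here can be sketched roughly as follows; the input file, column names, and the 99th-percentile threshold are illustrative assumptions, and an autoencoder would play the same role as the PCA reconstruction used below.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.decomposition import PCA

df = pd.read_csv("lte_call_records.csv")   # hypothetical input extract

# Fill nulls and scale the numeric features; label-encode the categorical ones.
num_cols = ["setup_time_ms", "drop_rate", "throughput_mbps"]
cat_cols = ["cell_id"]
df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
for col in cat_cols:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# Score anomalies by PCA reconstruction error.
feat_cols = num_cols + cat_cols
pca = PCA(n_components=2).fit(df[feat_cols])
reconstructed = pca.inverse_transform(pca.transform(df[feat_cols]))
error = np.mean((df[feat_cols].to_numpy() - reconstructed) ** 2, axis=1)
df["is_anomaly"] = error > np.percentile(error, 99)   # threshold tuned via review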

• Worked on AWS S3 using the Python boto3 module to fetch JSON data from an S3 bucket, transform it into tabular data, and load it into a Teradata database.

• Developed Python code using unsupervised machine learning algorithms such as Autoencoder and PCA.

• Worked on the data processing module by cleaning the data (removing or filling nulls using SimpleImputer), scaling the numerical columns using StandardScaler, and encoding the categorical columns using LabelEncoder.

• Implemented two Kafka consumers in the execution module using Python multithreading, because the consumers have to listen to both topics continuously (see the sketch after this list).

• Compared the performance of PySpark with the Pandas module and checked the compatibility of PySpark with Kafka, Telegraf, and InfluxDB.

• Monitored the model's predictions in Elasticsearch and Kibana and set the threshold value for anomalies.

• Explored InfluxDB for handling time-series data, such as connecting to InfluxDB from Python code and inserting time-series data into it.

• Explored Kafka to send data from one module to another.

• Explored the Kafka and InfluxDB plugins for Telegraf, installing them and using them in our application.

• Monitored all the pods in OpenShift.

• Created dashboards in Kibana using the Elasticsearch anomaly data.

• Developed a Python RESTful application using the Flask library to scrape a website, search it using Selenium, and store the desired details in MongoDB.

• Wrote test cases for the training, execution, and data processing modules using the PyTest framework.

• Handled the application end to end when presenting it to the client.
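
A rough sketch of that multithreaded consumer setup is shown below; the broker address, topic names, and the use of the kafka-python package are assumptions, and the handler is a placeholder for the model-execution logic.

import json
import threading

from kafka import KafkaConsumer

def handle_record(topic, record):
    # Placeholder for the model-execution logic (scoring the call record).
    print(topic, record)

def consume(topic):
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        auto_offset_reset="latest",
    )
    for message in consumer:        # blocks, so each topic gets its own thread
        handle_record(topic, message.value)

threads = [threading.Thread(target=consume, args=(topic,))
           for topic in ("lte_call_metrics", "lte_call_events")]
for t in threads:
    t.start()
for t in threads:
    t.join()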

Project: Thomson Reuters Legal (ANZ) - Tech Mahindra (Oct 2010 - Jun 2013)
Role: Java Developer

Environment: Java 1.6, Struts, Servlets, Hibernate, JSP, Unix
Details:

Worked on the ALJI and LOLA applications and on the Westlaw Australia application, which are part of the Thomson Reuters Legal (ANZ) project. ALJI (Australian Legal Journal Index) is one of the editorial and build systems in the Thomson Reuters Legal (ANZ) project. ALJI deals with the journals of Australia, which are promoted to the application systems LOLA (Legal Online) and Westlaw Australia. The ALJI functionality is mainly written in Java, and the content flow is in XML format. The Lucene framework is used to promote the content.

The Legal Online application is also part of the Thomson Reuters Legal (ANZ) project. It hosts international information such as cases, journals, and current awareness. The Legal Online site lets users search content and browse a predefined directory structure to access documents directly, email directories to their personal mailboxes, get alerts when new content is loaded into the system, print documents, and save documents to their personal drives.

The Westlaw Australia application is similar to the Legal Online application, but it is built using a different framework called the Multiple Application Framework (MAF).

My Responsibilities:

• Worked on sub-modules such as ALJI and LOLA, and worked on the Westlaw Australia application.

• Worked on styling the web pages as per user requirements.

• Worked on major enhancements, adding further functionality to the existing process.

• Worked on bug fixes and issues.

• Maintained the day-to-day activities sheet.

• Gave the customer updates, via the JIRA tool, on the issues being handled.

• Performed initial testing and requested business users to perform final testing.

• Checked the log files to analyze issues using PuTTY or WinSCP.

• Worked on application servers for content-related issues.

• Pulled updated code daily from TFS and SVN and committed code changes to TFS and SVN.


