
Data Engineer Machine Learning

Location:
Monroe, NC
Posted:
July 05, 2024


Resume:

RAHEEM SD

Phone: 510-***-****

Email: ad60li@r.postjobfree.com

PROFESSIONAL SUMMARY:

10+ years of IT development experience, including experience in the data engineering ecosystem and related technologies.

Good understanding of Apache Spark, Kafka, Storm, Talend, RabbitMQ, Elasticsearch, Apache Solr, Splunk, and BI tools such as Tableau.

Knowledge of Hadoop administration activities using Cloudera Manager and Apache Ambari.

Experience working with Cloudera, Amazon Web Services (AWS), Microsoft Azure, and Hortonworks.

Worked on import and export of data using Sqoop between RDBMS and HDFS.

Good knowledge of containers, Docker, and Kubernetes as the runtime environment for CI/CD systems to build, test, and deploy applications.

Integrating BigQuery with other Google Cloud Platform services (e.g., Dataflow, Dataprep, and Data Studio) and third-party tools for data integration, processing, visualization, and reporting.

Created machine learning models using Python and scikit-learn.

Utilized Apache Airflow to design, schedule, and monitor complex data pipelines, ensuring timely execution and error handling.

Designed and implemented Elasticsearch indexes to efficiently store and retrieve structured and unstructured data, enabling fast and accurate search capabilities.

Identified and resolved performance bottlenecks through query profiling, index optimization, and shard management, improving search responsiveness and resource utilization.

Hands-on experience loading data (log files, XML, JSON) into HDFS using Flume/Kafka.

Used Python packages such as NumPy, Pandas, Matplotlib, and Plotly for exploratory data analysis.

Hands-on experience with cloud technologies such as Azure HDInsight, Azure Data Lake, AWS EMR, Athena, Glue, and S3.

Good knowledge in using Apache NiFi to automate the data movement between different Hadoop systems.

Experience in performance tuning by using Partitioning, Bucketing and Indexing in Hive.

Experience with software development tools such as JIRA, Git, and SVN.

Designing and optimizing relational database schemas in MySQL for efficient storage, retrieval, and querying of structured data.

Developed APIs to facilitate seamless integration of AI models into production environments.

As an Azure DevSecOps engineer, created and maintained Azure DevOps organizations and self-hosted agents.

Worked on containerization to optimize the CI/CD (continuous integration/continuous deployment) workflow as part of a group effort.

Experience in SSIS project-based deployments in Azure cloud.

Proficient in using data modeling tools such as ERwin, Power Designer, or similar tools to create and manage complex data models.

Implemented robust automated testing using Python frameworks like Pytest and Robot, ensuring comprehensive software quality assurance.

Capable of using Amazon S3 to support data transfer over SSL, with data encrypted automatically once uploaded.

Familiarity with DBT (Data Build Tool) for managing and orchestrating data transformation workflows, enhancing data quality and maintainability.

Developed interactive dashboards and visualizations using Amazon QuickSight to provide real-time insights into business performance metrics.

Direct experience in developing microservices, REST APIs, and web services, with proficiency in .Net, Azure, and Ocelot.

Designed serverless applications using AWS Lambda to reduce infrastructure management overhead.

Developed SSIS packages to automate data extraction from various banking systems, ensuring data accuracy and consistency.

Extensive experience in Java SE and Java EE, including features of the latest Java versions.

Proficient in creating compelling data visualizations using tools such as Tableau and Power BI.

TECHNICAL SKILLS:

Hadoop and Big Data Technologies

HDFS, MapReduce, Flume, Sqoop, Pig, Hive, Morphline, Kafka, Oozie, Spark, NiFi, Zookeeper, Elasticsearch, Tableau, Apache Solr, Snowflake, Talend, Cloudera Manager, RStudio, Confluent, Grafana

NoSQL

HBase, Couchbase, MongoDB, Cassandra

Programming and Scripting Languages

C, SQL, Python, C++, Shell scripting, R

Web Services

XML, SOAP, REST API

Databases

Oracle, DB2, MS-SQL Server, MySQL, MS-Access, Teradata

Web Development Technologies

JavaScript, CSS, CSS3, HTML, HTML5, Bootstrap, XHTML, jQuery, PHP

Operating Systems

Windows, Unix (Red Hat Linux, CentOS, Ubuntu), macOS

IDE / Development Tools

Eclipse, NetBeans, IntelliJ, RStudio

Build Tools

Maven, Scala Build Tool (SBT), Ant

EDUCATION:

Bachelor's in Computer Science, 2013

Master's in Computer Science, Northwest Missouri State University, 2017

PROFESSIONAL EXPERIENCE:

Sr. Data Engineer

Capital One – Plano, TX Nov 2021 – Present

Responsibilities:

Involved in analyzing business requirements and prepared detailed specifications that follow project guidelines required for project development.

Responsible for data extraction and data ingestion from different data sources into S3 by creating ETL pipelines using Spark and Hive.

Designing and optimizing database schemas in ClickHouse for efficient storage and querying of analytical data.

Used PySpark for DataFrames, ETL, data mapping, transformation, and loading in a complex, high-volume environment.
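
As an illustration of this kind of PySpark DataFrame ETL, the following minimal sketch reads raw data, applies mapping and cleansing transformations, and writes a curated output; the S3 paths, column names, and transformations are assumed placeholders, not the actual pipeline.

# Minimal PySpark ETL sketch; paths and columns are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: raw JSON landed in a (hypothetical) S3 prefix
raw_df = spark.read.json("s3a://example-raw-bucket/transactions/")

# Transform: basic cleansing, type casting, and column mapping
curated_df = (
    raw_df
    .filter(F.col("amount").isNotNull())
    .withColumn("amount", F.col("amount").cast("double"))
    .withColumn("txn_date", F.to_date("txn_ts"))
    .select("txn_id", "account_id", "txn_date", "amount")
)

# Load: write partitioned Parquet for downstream consumers
curated_df.write.mode("overwrite").partitionBy("txn_date").parquet(
    "s3a://example-curated-bucket/transactions/"
)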

Designed and executed regression testing suites to ensure that changes to existing code or data pipelines do not introduce unintended consequences or errors.

Implemented data validation and verification checks to ensure data completeness, consistency, and accuracy throughout the data lifecycle.

Monitoring Kubernetes clusters and workloads using built-in metrics, logs, and third-party monitoring solutions like Prometheus and Grafana.

Designing and developing ETL jobs and workflows using AWS Glue's serverless, fully managed data integration service.

Built robust ETL processes with Django, transforming raw data into valuable insights for business intelligence and analytics.

Integrated Lambda functions with other AWS services like API Gateway, S3, DynamoDB, SQS, SNS, and RDS.

Extensively worked with PySpark/Spark SQL for data cleansing and generating DataFrames and RDDs.

Coordinated with other team members to write and generate test scripts and test cases for numerous user stories.

Collaborated with data scientists and software engineers to integrate generative AI solutions into existing systems and applications.

Implemented NLP models for text generation, achieving human-like fluency and coherence in generated content.

Designing and optimizing data schemas in Google BigQuery for efficient storage, querying, and analysis of large-scale datasets.

Loading data into Amazon Redshift from various sources, such as Amazon S3, RDS, DynamoDB, and other relational databases, using COPY commands, AWS Glue, or other ETL tools.
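
A hedged sketch of issuing one such COPY from S3 into Redshift through psycopg2 follows; the cluster endpoint, credentials, IAM role ARN, table, and S3 prefix are placeholders.

# Sketch: load a Redshift table from S3 with COPY, issued via psycopg2.
# Endpoint, credentials, IAM role, and object names are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="***",
)
copy_sql = """
    COPY public.transactions
    FROM 's3://example-curated-bucket/transactions/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy-role'
    FORMAT AS PARQUET;
"""
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift performs the load server-side
conn.close()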

Defining and scheduling Hadoop MapReduce, Hive, Pig, and Spark jobs using Oozie's XML-based workflow definitions.

Defining task dependencies and workflows using Luigi's Python-based API to orchestrate ETL processes.
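
The sketch below shows the Luigi pattern being described, with one task requiring another; the task names, paths, and toy transformation are illustrative.

# Luigi dependency sketch; names, paths, and logic are illustrative.
import datetime
import luigi

class ExtractOrders(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"data/raw/orders_{self.date}.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("order_id,amount\n1,10.0\n")  # stand-in for a real extract

class TransformOrders(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return ExtractOrders(date=self.date)  # declares the upstream dependency

    def output(self):
        return luigi.LocalTarget(f"data/curated/orders_{self.date}.csv")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read().upper())  # placeholder transformation

if __name__ == "__main__":
    luigi.build([TransformOrders(date=datetime.date.today())], local_scheduler=True)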

Building efficient Docker images by leveraging multi-stage builds, optimizing layer caching, and minimizing image size.

Creating and maintaining MySQL database instances, configuring server settings, and managing database users and permissions.

Configuring and managing PostgreSQL instances, including installation, configuration, tuning, and monitoring for optimal performance and reliability.

Designing and orchestrating complex ETL workflows using Directed Acyclic Graphs (DAGs) in Apache Airflow.
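
A minimal sketch of such a DAG, assuming Airflow 2.x; the schedule, task callables, and IDs are illustrative.

# Airflow DAG sketch (Airflow 2.x); schedule and callables are illustrative.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(**context):
    print("pull data from the source system")

def transform(**context):
    print("apply business transformations")

def load(**context):
    print("write curated data to the warehouse")

with DAG(
    dag_id="example_etl_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load  # the dependency chain defines the DAG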

Used Pandas to calculate moving averages and RSI scores for stocks and loaded the results into the data warehouse.
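
A small sketch of those calculations with Pandas, assuming a DataFrame with a 'close' price column; the column name and the 14-period window are assumptions.

# Moving average and RSI sketch; the 'close' column and window are assumed.
import pandas as pd

def add_indicators(prices: pd.DataFrame, window: int = 14) -> pd.DataFrame:
    out = prices.copy()
    out["moving_avg"] = out["close"].rolling(window=window).mean()

    # Classic RSI: average gain versus average loss over the window
    delta = out["close"].diff()
    gain = delta.clip(lower=0).rolling(window=window).mean()
    loss = (-delta.clip(upper=0)).rolling(window=window).mean()
    out["rsi"] = 100 - 100 / (1 + gain / loss)
    return out

# Toy example
df = pd.DataFrame({"close": [100, 101, 99, 102, 103, 101, 104, 105,
                             103, 106, 107, 105, 108, 109, 110, 108]})
print(add_indicators(df).tail())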

Worked on EMR clusters of AWS for processing Big Data across a Hadoop Cluster of virtual servers.

Developed Spark Programs for Batch Processing.

Developed Spark code in Python using PySpark/Spark SQL for faster testing and processing of data.

Involved in design and analysis of the issues and providing solutions and workarounds to the users and end-clients.

Designed and built data processing applications using Spark on an AWS EMR cluster that consume data from AWS S3 buckets, apply the necessary transformations, and store the curated, business-ready datasets in the Snowflake analytical environment.

Developed functionality to perform auditing and threshold checks for error handling for smooth and easier debugging and data profiling.

Integrating AWS Glue with AWS services (e.g., S3, Redshift, RDS, DynamoDB) for seamless data integration and processing.

Developing custom Luigi tasks and targets to interact with data sources, filesystems, and external services.

Built a data quality framework to run data rules that generate reports and send daily email notifications of business-critical job successes and failures to business users.

Used Spark to build tables that require multiple computations and non-equi joins.

Scheduled various Spark jobs to run daily and weekly.

Configuring and managing Apache Airflow clusters, schedulers, executors, and workers for optimal performance and scalability.

Developed RESTful APIs with Django Rest Framework (DRF) to facilitate data access and manipulation across various services and applications.
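
As a sketch of the DRF pattern described, assuming an existing Django project and a hypothetical Dataset model (the app, model, and fields are not from the original):

# DRF sketch; the Dataset model, app name, and fields are hypothetical.
from rest_framework import routers, serializers, viewsets
from myapp.models import Dataset  # hypothetical model in an existing Django app

class DatasetSerializer(serializers.ModelSerializer):
    class Meta:
        model = Dataset
        fields = ["id", "name", "row_count", "updated_at"]

class DatasetViewSet(viewsets.ModelViewSet):
    queryset = Dataset.objects.all()
    serializer_class = DatasetSerializer

# urls.py: exposes /datasets/ for list, create, retrieve, update, delete
router = routers.DefaultRouter()
router.register(r"datasets", DatasetViewSet)
urlpatterns = router.urls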

Created ETL pipelines using Python APIs, AWS Glue, Terraform, and GitLab to consume data from different source systems (Smartsheet, Quickbase, Google Sheets) into Snowflake.

Modelled Hive partitions extensively for faster data processing.

Configuring Oozie coordinators and workflows to orchestrate ETL tasks and dependencies in Hadoop ecosystems.

Experience with Alation data catalog for metadata management and governance, facilitating data discovery and collaboration across the organization.

Integrating ClickHouse with other data processing and visualization tools (e.g., Apache Kafka, Apache Spark, Grafana) for data ingestion, processing, and analysis.

Used Bitbucket to collaborate with the other team members.

Involved in Agile methodologies, daily scrum meetings and sprint planning.

Environment: Apache Airflow, Spark, Hive, ClickHouse, AWS Glue, AWS S3, AWS Redshift, AWS EMR, Luigi, Docker, MySQL, PostgreSQL, Pandas, Snowflake, AWS services (S3, Redshift, RDS, DynamoDB), Apache Kafka, Grafana, Python, Terraform, GitLab, Oozie, Bitbucket, Agile methodologies.

Sr. Data Engineer

Citizens Bank – Rhode Island (Remote) Sep 2018 – Oct 2021

Responsibilities:

Developed and managed Azure Data Factory pipelines that extracted data from various data sources, transformed it according to business rules, and loaded it into an Azure SQL database.

Developed and managed Databricks Python scripts that utilized PySpark and consumed APIs to move data into an Azure SQL database.

Created a new data quality check framework project in Python that utilized Pandas.
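
A minimal sketch of the kind of Pandas-based checks such a framework runs; the rules and column names here are illustrative, not the actual framework.

# Pandas data quality check sketch; rules and columns are illustrative.
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> dict:
    results = {
        "row_count_nonzero": len(df) > 0,
        "no_null_keys": bool(df["customer_id"].notna().all()),
        "no_duplicate_keys": not df["customer_id"].duplicated().any(),
        "amount_non_negative": bool((df["amount"] >= 0).all()),
    }
    results["all_passed"] = all(results.values())
    return results

df = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [10.5, 0.0, 99.9]})
print(run_quality_checks(df))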

Implemented source control and development environments for Azure Data Factory pipelines utilizing Azure Repos.

Optimized dashboard performance and user experience by applying best practices in data visualization design and layout.

Implemented advanced analytical functions and calculations within QuickSight to uncover trends, patterns, and outliers in large datasets.

Configuring Luigi schedulers, workers, and resources to optimize task execution and resource utilization.

Integrating Airflow with cloud services (e.g., AWS, GCP, Azure) for seamless data ingestion, processing, and storage.

Implementing Docker in CI/CD pipelines to automate testing, building, and deploying containerized applications with tools like Jenkins or GitLab CI/CD.

Monitoring and managing Oozie job execution, logs, and job history to track workflow progress and performance.

Utilizing Amazon Redshift Spectrum to query data directly from Amazon S3, enabling seamless integration with data lakes and external storage.

Providing documentation, training, and support to users and teams on PostgreSQL best practices, troubleshooting, and performance optimization.

Collaborated with external vendors and partners to integrate third-party data sources and tools, expanding the capabilities and functionality of the QuickSight platform.

Implemented data governance policies and security measures within QuickSight to ensure compliance with regulatory requirements and protect sensitive information.

Utilized Salesforce Apex to implement various enhancements in Salesforce.

Creating and managing AWS Glue crawlers to automatically discover, catalog, and infer schemas from various data sources.

Created Power BI datamarts and reports for various business stakeholders.

Developed and maintained ETL packages in the form of IBM DataStage jobs and SSIS packages.

Debugging and troubleshooting Kubernetes deployments using kubectl commands, container logs, and cluster events to identify and resolve issues.

Loading data into BigQuery from various sources, such as Google Cloud Storage, Cloud SQL, Google Sheets, and other databases, using ingestion methods like batch loading, streaming inserts, or Dataflow pipelines.
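
A hedged sketch of one such batch load from Google Cloud Storage using the official BigQuery Python client; the project, dataset, table, and URI are placeholders.

# BigQuery batch load sketch; project, table, and URI are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # infer the schema from the files
)

load_job = client.load_table_from_uri(
    "gs://example-bucket/exports/orders_*.csv",
    "example-project.analytics.orders",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
print(client.get_table("example-project.analytics.orders").num_rows)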

Designed and coded a utility utilizing C# and .NET to create a reports dashboard from ETL metrics tables.

Designed, tuned, and documented an SSAS cube that acted as an OLAP data source for a dashboard that tracked process guidance concepts throughout the company, utilizing Visual Studio 2010.

Created a Microsoft Excel file that acted as an aggregate report on company vacation hours using the PowerPivot plugin.

Implemented machine learning models in Python to perform predictive analytics for risk assessment and fraud detection in financial transactions.
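
A sketch of such a model in scikit-learn; the features and labels below are synthetic stand-ins, not real transaction attributes.

# Fraud-scoring sketch with scikit-learn; data is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))                            # stand-in features
y = (X[:, 0] + rng.normal(size=1000) > 1.5).astype(int)   # synthetic fraud label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model = RandomForestClassifier(n_estimators=200, class_weight="balanced")
model.fit(X_train, y_train)

scores = model.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, scores))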

Developed a Windows PowerShell script that acted as an email error notification system for an internal ETL management framework for SSIS packages.

Extending Airflow's functionality with plugins and extensions to support specialized use cases and industry-specific requirements.

Experienced in the creation of a data lake by extracting data in different file formats (JSON, Parquet, CSV) and from RDBMS sources into Azure Data Lake.

Integrating MySQL with other data storage and processing systems (e.g., Hadoop, Spark, Kafka) using connectors and APIs for data ingestion and processing.

Ingested data into Azure Storage or Azure SQL, Azure DW and processed the data in Azure Databricks.

Worked on the design and creation of complex SSIS (DTSX) packages.

Developing data links and data loading scripts, including data transformations, reconciliations and accuracies.

Worked on several Azure storage systems, such as Blob Storage and Data Lake.

Skilled in conducting in-depth data analysis using Tableau, including trend analysis, forecasting, cohort analysis, and segmentation to identify patterns, trends, and outliers in data sets.

Integrating Oozie with Hadoop ecosystem components (e.g., HDFS, YARN, Hive and Spark) for data processing and analytics.

Developed and managed data ingestion pipelines with Kinesis Data Streams, ensuring reliable and scalable data flow from various sources.
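
A small boto3 sketch of publishing a record to one such stream; the region, stream name, and payload are placeholders.

# Kinesis producer sketch with boto3; stream name and payload are placeholders.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

record = {"transaction_id": "txn-001", "amount": 42.50}
kinesis.put_record(
    StreamName="example-transactions-stream",
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["transaction_id"],  # controls shard assignment
)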

Converted older SSIS ETL packages to a new design that made use of a custom ETL management framework.

Employed data visualization techniques such as bar charts, line charts, scatter plots, histograms, and pie charts to effectively represent different types of data and insights.

Worked on Parameterized reports, Linked reports, Ad hoc reports, Drilldown reports and Sub reports. Designed and developed stored procedures, queries and views necessary to support SSRS reports.

Environment: Azure Data Factory, Databricks, PySpark, Luigi, Airflow, Docker, AWS Glue, Amazon Redshift Spectrum, PostgreSQL, QuickSight, ClickHouse, Power BI, IBM DataStage, SSIS, Kubernetes, BigQuery, SSAS, PowerPivot, PowerShell, T-SQL, Azure Synapse, Azure Data Lake, MySQL, Tableau, Oozie, Snowflake, SSRS.

Azure Data Engineer

AT&T – Dallas, TX Dec 2016 – Jul 2018

Responsibilities:

Analyze, design, and build modern data solutions using Azure PaaS services to support visualization of data.

Developed Spark code in Python in Databricks notebooks.

Performance tuning of Sqoop, Hive, and Spark jobs.

Implemented a proof of concept to analyze streaming data using Apache Spark with Python; used Maven/SBT to build and deploy the Spark programs.

Monitoring Luigi task execution, progress, and dependencies to track workflow status and performance.

Created an Application Interface Document for the downstream team to create a new interface to transfer and receive files through Azure Data Share.

Configuring and orchestrating AWS Glue jobs and triggers to automate data transformation and loading processes.

Utilizing PostgreSQL's support for foreign data wrappers (FDWs) to integrate with external data sources and databases for federated querying and data access.

Creating pipelines, data flows, and complex data transformations and manipulations using ADF and PySpark with Databricks.

Automating deployment and version control of Airflow DAGs using CI/CD pipelines and source code management tools.

Writing and optimizing SQL queries in BigQuery, leveraging features like standard SQL, nested and repeated fields, and user-defined functions (UDFs) to transform and analyze data.

Ingested data in mini-batches and performed RDD transformations on those mini-batches using Spark Streaming to run streaming analytics in Databricks.
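
A minimal sketch of mini-batch processing with the Spark Streaming (DStream) API; the socket source and word-count transformation are illustrative, not the production job.

# Spark Streaming (DStream) sketch; source and transformation are illustrative.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-sketch")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second mini-batches

lines = ssc.socketTextStream("localhost", 9999)  # placeholder source
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)  # RDD transformation per mini-batch
)
counts.pprint()

ssc.start()
ssc.awaitTermination()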

Implementing data quality checks, error handling, and retries to ensure reliability and integrity in Oozie workflows.

Providing training, documentation, and support to users and stakeholders on using Apache Airflow effectively for ETL orchestration.

Implemented Spark using Scala and Spark SQL for faster testing and processing of data.

Optimizing AWS Glue job performance and resource utilization by tuning parameters, partitioning data, and optimizing queries.

Worked on complex U-SQL for data transformation, table loading, and report generation.

Used Docker containers with Docker Swarm or Kubernetes to automate deployment, scaling, and load balancing.

Involved in developing the Confidential data lake and building the Confidential data cube on a Microsoft Azure HDInsight cluster.

Designed and published Power BI visualizations and dashboards to various business teams for business use and decision making.

Integrating Docker with version control systems (e.g., Git) to manage Dockerfile and Docker Compose configurations.

Integrating Luigi with cloud platforms and storage services for seamless data integration and processing.

Created Linked service to land the data from Caesars SFTP location to Azure Data Lake.

Using Git to update existing versions of the Hadoop PySpark script to its new model.

Environment: Azure, Spark SQL, Python, Teradata, Sqoop, Hive, Apache Spark, Maven, SBT, ETL, Luigi, Azure Data Share, Talend, Linux, AWS Glue, Amazon Redshift, Kubernetes, CI/CD, Jenkins, GitLab CI/CD, ArgoCD, PostgreSQL, ADF, PySpark, Databricks, Airflow, BigQuery, Tableau, Scala, Spark Streaming, Oozie, Talend ESB, Azure HDInsight, Spark Core, Power BI, Docker, Azure Data Lake, Snowflake DB, Informix, Sybase, Git.

Python/Hadoop Developer

Couth InfoTech Pvt Ltd – Hyderabad, India Apr 2015 – Jul 2016

Responsibilities:

Developed Spark SQL applications, migrated big data from Teradata to Hadoop, and reduced memory utilization in Teradata analytics.

Gathered requirements and led the team in developing the big data environment and migrating Spark ETL logic.

Involved in requirement gathering from business analysts and participated in discussions with users and functional analysts on business logic implementation.

Responsible for end-to-end Spark SQL design and development to meet requirements; advised the business on Spark SQL best practices while ensuring the solution met business needs.

Debugging Docker containers using logs, container inspection, and remote debugging tools to troubleshoot issues and errors.

Provided documentation and training to users, including user guides, documentation manuals, and training sessions, to ensure effective utilization of Tableau dashboards.

Configuring Kubernetes storage solutions such as persistent volume claims (PVCs) and storage classes to provide persistent storage for stateful applications.

Implementing Amazon Redshift security best practices, including IAM authentication, VPC security groups, and encryption-at-rest to protect sensitive data and ensure compliance with regulatory requirements.

Implementing data quality checks, data validation rules, and error handling mechanisms in AWS Glue scripts and jobs.

Analyzed large amounts of data sets to determine optimal way to aggregate and report on it.

Created views on top of Hive tables and provided them to customers for analytics.

Analyzed the Hadoop cluster and different big data analytic tools, including Pig, HBase, and Sqoop; worked with Linux systems and RDBMS databases on a regular basis to ingest data using Sqoop.

Collected and aggregated large amounts of web log data from different sources such as web servers, mobile, and network devices using Apache Flume and stored the data in HDFS for analysis.

Developed UNIX shell scripts to load large numbers of files into HDFS from the Linux file system.

Developed custom InputFormats in MapReduce jobs to handle custom file formats and convert them into key-value pairs.

Involved in creating Hive tables, loading data, writing Hive queries, and creating partitions and buckets for optimization.

Environment: Python, Hadoop, Spark SQL, Power BI, Teradata, Docker, Tableau, Kubernetes, Oozie, Amazon Redshift, PostgreSQL, Hive, AWS Glue, Sqoop, Pig, HBase, Linux, RDBMS, Apache Flume, Hortonworks Data Platform, MapReduce, Unix shell scripting.

SQL/ SSIS Developer

Adran, Hyderabad, India Jun 2013 – Mar 2015

Responsibilities:

Created SSIS packages to import and export data from Excel and text files and transformed data from various data sources using OLE DB connections.

Designed and implemented ETL processes based on SQL, T-SQL, stored procedures, triggers, views, tables, user-defined functions, and security using SQL Server 2012 and SQL Server Integration Services.

Wrote Triggers and Stored Procedures to capture updated and deleted data from OLTP systems.

Identified slow-running queries, optimized stored procedures, and tested applications for performance and data integrity using SQL Profiler.

Implementing query caching and result caching in BigQuery to accelerate query execution and reduce costs for repeated queries.

Extensively worked with the SSIS tool suite; designed and created mappings using various SSIS transformations like OLE DB Command, Conditional Split, Lookup, Aggregate, Multicast, and Derived Column.

Configured the SQL mail agent to send automatic emails on errors and developed complex reports using multiple data providers, defined objects, and charts.

Querying and manipulating multidimensional cube data through MDX Scripting.

Developed drill through, drill down, parameterized, cascaded and sub-reports using SSRS.

Responsible for ongoing maintenance of and change management for existing reports, and for optimizing SSRS report performance.

Environment: SSIS packages, SQL Server 2012, SQL Server Integration Services (SSIS), SQL Server, T-SQL, stored procedures, triggers, SQL Profiler, BigQuery, OLE DB, SSIS transformations (OLE DB Command), SQL mail agent, MDX scripting, SSAS (SQL Server Analysis Services), SSRS (SQL Server Reporting Services), KPIs.


