Nikhil Gupta
Database Developer
Nikhil is a senior data engineer with over four years of experience building highly scalable, data-intensive applications. He grasps new concepts quickly, works comfortably across varied applications and tech stacks, and manages clients and stakeholders at every level of the hierarchy. Beyond his technical depth and presentation skills, Nikhil's most striking quality is his commitment to delivering high-quality solutions.
Preferred Environment
Linux, Ubuntu, Git, PyCharm, Sublime Text 3
The most amazing...
...thing I've built is a BI product that translates data into insights with written explanations, scaling from 0 to 12 clients in less than two years.
Work Experience
Data Engineer II
Zepto
- Designed and implemented an event-driven pipeline triggered by file uploads.
- Implemented an end-to-end change data capture (CDC) pipeline capturing data from source tables in real time.
- Optimized the existing Amazon Redshift cluster for better performance and to prevent frequent shutdowns.
- Architected the FMCG Dynamic Pricing Engine and designed the data flow to automate price changes for FMCG items on the Zepto app. The project's impact is estimated at around 0.2 million INR, as current revenue leakage will be minimized.
- Led the end-to-end development of an in-house streaming pipeline, achieving a 10-second SLA while transmitting 100MB per second end to end. The pipeline involved Debezium, Kafka, Kafka Connect, PostgreSQL, ClickHouse, and Apache Pinot.
Senior Data Engineer
Xpressbees
- Designed an in-house scheduling framework that reduced Amazon MWAA costs by 50% by leveraging Snowflake tasks.
- Reduced engineering time and effort with the framework, which served as a self-service scheduling tool for the analytics and client MIS teams to schedule custom SQL queries and stored procedures.
- Oversaw the design and implementation of this framework end to end.
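As a sketch of the self-service pattern above, a small helper might template user-submitted SQL into a Snowflake task. The function name, defaults, and cron schedule below are illustrative, not the framework's actual interface.

```python
# Illustrative sketch of how a self-service scheduler could wrap arbitrary
# SQL or a stored-procedure call in a Snowflake task. All names and the
# default cron expression are hypothetical.
def render_snowflake_task(name: str, sql: str, warehouse: str,
                          cron: str = "0 * * * * UTC") -> str:
    """Render a CREATE TASK statement wrapping a user-submitted SQL body."""
    return (
        f"CREATE OR REPLACE TASK {name}\n"
        f"  WAREHOUSE = {warehouse}\n"
        f"  SCHEDULE = 'USING CRON {cron}'\n"
        f"AS\n"
        f"  {sql.rstrip(';')};"
    )

print(render_snowflake_task("mis_daily", "CALL refresh_mis_report()", "etl_wh"))
```

Because Snowflake tasks run inside the warehouse, a helper like this removes the need for an external scheduler such as Amazon MWAA for simple cadences.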
Data Engineer
PepsiCo (via Toptal)
- Built the entire data pipeline for their billing dashboard that helped PepsiCo track costs across different cloud vendors and services.
- Collected costing and tagging data from AWS, Azure, Snowflake, and Datadog and streamlined it into the final data model; the presentation layer was built using ThoughtSpot.
- Created GitHub Actions workflows for different environments that built a Docker image packaging all the data build tool (dbt) models and pushed it to the Amazon ECR repository.
- Wrote an Airflow DAG that ran this Docker image on a cadence to execute the dbt models in production. Wrote data transformation logic using dbt.
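A minimal sketch of the cadence step above, assuming the DAG task simply shells out to `docker run` against the packaged image; the image name, tag, and environment names are hypothetical, not the project's actual values.

```python
# Hypothetical sketch of the command an Airflow task might issue to execute
# the dbt models packaged in the Docker image. Image name, tag, and target
# environments are placeholders.
def build_dbt_run_command(env, image="dbt-models", tag="latest"):
    """Build a `docker run` command that executes dbt against a given target."""
    if env not in {"dev", "staging", "prod"}:
        raise ValueError(f"unknown environment: {env}")
    return [
        "docker", "run", "--rm",
        f"{image}:{tag}",
        "dbt", "run", "--target", env,
    ]

print(build_dbt_run_command("prod"))
```

Keeping the dbt project inside the image means the same artifact promoted through ECR runs identically in every environment, with only the `--target` flag changing.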
Senior Data Engineer
Xpressbees
- Wrote data transformations using SQL to onboard tables from new data sources onto the data platform.
- Built DAG scripts to schedule hourly data loads from the source databases (PostgreSQL, MySQL, and MongoDB) into the analytical-layer tables of the analytics data warehouse.
- Created Debezium Kafka connector configuration files to set up change data capture (CDC) from source databases into the data lake.
- Conducted code reviews and mentored junior data engineers in the team.
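The CDC setup above can be sketched as a minimal Debezium PostgreSQL connector configuration, as it might be posted to the Kafka Connect REST API. Every hostname, credential, and table name here is a placeholder, not a production value.

```python
import json

# Minimal Debezium PostgreSQL connector configuration (illustrative only).
# All connection details and table names below are placeholders.
connector_config = {
    "name": "orders-cdc-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "source-db.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "orders",
        "database.server.name": "orders_server",
        # Only stream the tables needed downstream in the data lake.
        "table.include.list": "public.orders,public.order_items",
    },
}

print(json.dumps(connector_config, indent=2))
```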
Data Engineer
vPhrase
- Ingested approximately 80GB of daily Phrazor product and plugin usage data into an Amazon S3 data lake from multiple client and in-house servers.
- Imported the data from the data lake into the Snowflake data warehouse for transformations and analytics.
- Composed ingestion and transformation scripts in SQL to load data from raw and staging tables into the analytical layer tables, eventually used for analytics by the product manager and CTO.
- Wrote the Airflow DAG scripts using Python and orchestrated the entire pipeline.
Data Engineer
vPhrase
- Designed the end-to-end ETL pipeline for a financial client to power their stocks and mutual fund recommendation algorithm.
- Ingested data from third-party vendor databases, using Debezium, Kafka, and Kafka Connect for the CDC into the S3 data lake.
- Wrote the cleaning, transformation, and data processing scripts using Python and Spark to calculate around 100-150 financial KPIs.
- Orchestrated the entire ETL pipeline using Airflow running on an Amazon EC2 instance.
Data Engineer
vPhrase
- Built a BI product called Phrazor from scratch; it grew from zero to 12 full-time clients, with over 200 licenses, in three years.
- Modeled and designed the product's back-end data model and knowledge base, powering analytics on the user's reports and dashboards.
- Led the design and implementation of formulae using Spark and Pandas, crunching data to calculate industry-specific KPIs for users' reports.
- Designed a multi-level drill-down feature to diagnose sudden drops or growths in KPIs.
- Created and maintained unit tests to cover 90% of the codebase.
Experience
ETL Pipeline for Stocks and Mutual Funds Recommendation System
Built an end-to-end ETL pipeline that supplied data to a stocks and mutual funds recommendation algorithm for a leading trading firm in India.
DATA SOURCES
The project required historical data from that client, data from the client's third-party vendors, and data from various APIs.
TECH STACK AND OVERVIEW
Data was ingested using Debezium, Kafka, and Kafka Connect for CDC from vendor databases into the S3 data lake. Data cleaning and transformation were done using Python and Spark, and data processing involved calculating around 100-150 financial KPIs on top of the ingested data. The cleaning and processing pipeline was orchestrated using Apache Airflow running on an EC2 instance.
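As a plain-Python illustration of the KPI-processing step (the production jobs ran on Spark), two of the simpler kinds of financial KPIs might look like this; the specific KPI choices and column shapes are hypothetical examples.

```python
# Plain-Python stand-ins for the kind of financial KPIs the Spark jobs
# computed over ingested price data. The KPI set here is illustrative.
def daily_returns(prices):
    """Percentage change between consecutive closing prices."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]

def simple_moving_average(prices, window):
    """Trailing simple moving average over `window` observations."""
    return [sum(prices[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(prices))]

closes = [100.0, 102.0, 101.0, 104.0]
print(daily_returns(closes))
print(simple_moving_average(closes, 2))
```

In the real pipeline, logic like this would be expressed as Spark window functions so the 100-150 KPIs could be computed in parallel across the full instrument universe.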
Phrazor Product Usage Analytics Pipeline
Built an end-to-end ETL pipeline to ingest clickstream data from client and internal servers for analytics.
DATA SOURCES
Ingested approximately 80-100GB of daily Phrazor product and plugin usage data into an Amazon S3 data lake from multiple client and in-house servers.
OVERVIEW
Cleaned and transformed the raw data and transferred it from S3 to the Snowflake data warehouse for further analytics. The entire pipeline was orchestrated using Apache Airflow running on an Amazon EC2 instance.
IMPACT
This pipeline helped the product managers make smarter product decisions, run A/B tests, and analyze how users use the platform. This layer also supplied clean data to the data scientists for advanced analytics.
IEEE-CIS Fraud Detection Kaggle Competition
Given data about each credit card transaction, the solution had to identify fraudulent transactions. It was a classic example of highly skewed data, with the positive class making up less than 1% of the entire dataset.
Tested several data imbalance handling techniques and ultimately went with hard negative mining. My team and I engineered many features that helped us reach the top 7%.
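A toy sketch of the hard negative mining idea, assuming model scores for each example are already available; the data, scores, and `keep_ratio` default below are illustrative, not the competition solution itself.

```python
# Toy sketch of hard negative mining on an imbalanced dataset: keep every
# positive example, but retain only the negatives the current model finds
# hardest (highest predicted fraud score). Scores are stand-ins for a real
# model's predictions.
def mine_hard_negatives(scores, labels, keep_ratio=0.25):
    """Return indices of all positives plus the top-scoring negatives."""
    negatives = [i for i, y in enumerate(labels) if y == 0]
    positives = [i for i, y in enumerate(labels) if y == 1]
    # Hardest negatives = those the model most confidently mislabels as fraud.
    hard = sorted(negatives, key=lambda i: scores[i], reverse=True)
    n_keep = max(1, int(len(negatives) * keep_ratio))
    return sorted(positives + hard[:n_keep])

labels = [0, 0, 0, 0, 1]
scores = [0.1, 0.9, 0.2, 0.05, 0.8]  # hypothetical fraud probabilities
print(mine_hard_negatives(scores, labels))
```

Retraining on this reduced set focuses the model on the decision boundary instead of the overwhelming mass of easy negatives.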
Skills
Languages
Python 3, SQL, Snowflake SQL
Libraries/APIs
Pandas, NumPy, Beautiful Soup, PySpark, Amazon EC2 API, GitHub API
Tools
Apache Airflow, Microsoft Power BI, Spark SQL, GitHub, Tableau, Looker, Git, PyCharm, Sublime Text 3, Amazon Elastic MapReduce (EMR), Terraform, AWS Glue
Paradigms
ETL, Business Intelligence (BI), Unit Testing, Agile, Database Design, Dimensional Modeling, Data Science, Continuous Delivery (CD), Continuous Integration (CI), DevOps
Storage
PostgreSQL, MySQL, Databases, Relational Databases, Redshift, Data Lakes, Database Modeling, Data Pipelines, JSON, Database Architecture, Amazon S3 (AWS S3), MongoDB, Azure SQL, Datadog
Other
Data Warehousing, Data Engineering, Data Warehouse Design, Data Visualization, Data Analysis, Database Analytics, ETL Tools, Data, Big Data Architecture, ETL Development, Data Modeling, ELT, Big Data, Data Analytics, Database Schema Design, Data Cleaning, Data Processing, Star Schema, Pipelines, Data Extraction, Query Optimization, Web Scraping, Back-end Development, Dashboards, Reports, Information Visualization, Data Transformation, Message Queues, Database Optimization, Data Architecture, Debezium, CDC, Machine Learning, EDA, EMR, Parquet, APIs, Consumer Packaged Goods (CPG), Serverless, Full-stack, CI/CD Pipelines, Change Data Capture, Real Estate, Data Manipulation, Amazon RDS
Frameworks
Apache Spark
Platforms
Apache Kafka, Amazon EC2, Amazon Web Services (AWS), Oracle, Linux, Ubuntu, Azure, Docker, AWS Lambda
Education
Bachelor's Degree in Computer Science
Mumbai University - Mumbai, India
Certifications
Data Science
GreyAtom School of Data Science