Nikhil Gupta
Database Developer
Nikhil is a senior data engineer with over four years of experience building highly scalable, data-intensive applications. He grasps new concepts quickly, works comfortably across varied applications and tech stacks, and manages clients and stakeholders at every level of the hierarchy. Beyond his technical depth and presentation skills, Nikhil's most striking quality is his commitment to delivering high-quality solutions.
Preferred Environment
Linux, Ubuntu, Git, PyCharm, Sublime Text 3
The most amazing...
...thing I've built is a BI product that translates data into insights with written explanations, scaling from 0 to 12 clients in less than two years.
Work Experience
Data Engineer II
Zepto
- Designed and implemented an event-driven pipeline triggered by file uploads.
- Implemented an end-to-end change data capture (CDC) pipeline capturing data from source tables in real time.
- Optimized the existing Amazon Redshift cluster for better performance and to prevent frequent shutdowns.
- Architected the FMCG Dynamic Pricing Engine and designed the data flow to automate price changes for FMCG items on the Zepto app. The project's impact is estimated at around 0.2 million INR, as current revenue leakage will be minimized.
- Led the end-to-end development of an in-house streaming pipeline, achieving a 10-second SLA while transmitting 100MB per second end to end. The pipeline involved Debezium, Kafka, Kafka Connect, PostgreSQL, ClickHouse, and Apache Pinot.
Senior Data Engineer
Xpressbees
- Designed an in-house scheduling framework that reduced Amazon MWAA costs by 50% by leveraging Snowflake tasks.
- Reduced engineering time and effort with the framework, which served as a self-service scheduling tool for the analytics and client MIS teams to schedule custom SQL queries and stored procedures.
- Oversaw the design and implementation of this framework end to end.
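As a sketch of the self-service pattern above, a small helper might template user-submitted SQL into a Snowflake task. The function name, defaults, and cron schedule below are illustrative, not the framework's actual interface.

```python
# Illustrative sketch of how a self-service scheduler could wrap arbitrary
# SQL or a stored-procedure call in a Snowflake task. All names and the
# default cron expression are hypothetical.
def render_snowflake_task(name: str, sql: str, warehouse: str,
                          cron: str = "0 * * * * UTC") -> str:
    """Render a CREATE TASK statement wrapping a user-submitted SQL body."""
    return (
        f"CREATE OR REPLACE TASK {name}\n"
        f"  WAREHOUSE = {warehouse}\n"
        f"  SCHEDULE = 'USING CRON {cron}'\n"
        f"AS\n"
        f"  {sql.rstrip(';')};"
    )

print(render_snowflake_task("mis_daily", "CALL refresh_mis_report()", "etl_wh"))
```

Because Snowflake tasks run inside the warehouse, a helper like this removes the need for an external scheduler such as Amazon MWAA for simple cadences.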
Data Engineer
PepsiCo (via Toptal)
- Built the entire data pipeline for their billing dashboard that helped PepsiCo track costs across different cloud vendors and services.
- Collected costing and tagging data from AWS, Azure, Snowflake, and Datadog and streamlined it into the final data model; the presentation layer was built using ThoughtSpot.
- Created GitHub Actions workflows for different environments that built a Docker image packaging all the data build tool (dbt) models and pushed it to the Amazon ECR repository.
- Wrote an Airflow DAG that ran this Docker image on a cadence to execute the dbt models in production. Wrote data transformation logic using dbt.
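A minimal sketch of the cadence step above, assuming the DAG task simply shells out to `docker run` against the packaged image; the image name, tag, and environment names are hypothetical, not the project's actual values.

```python
# Hypothetical sketch of the command an Airflow task might issue to execute
# the dbt models packaged in the Docker image. Image name, tag, and target
# environments are placeholders.
def build_dbt_run_command(env, image="dbt-models", tag="latest"):
    """Build a `docker run` command that executes dbt against a given target."""
    if env not in {"dev", "staging", "prod"}:
        raise ValueError(f"unknown environment: {env}")
    return [
        "docker", "run", "--rm",
        f"{image}:{tag}",
        "dbt", "run", "--target", env,
    ]

print(build_dbt_run_command("prod"))
```

Keeping the dbt project inside the image means the same artifact promoted through ECR runs identically in every environment, with only the `--target` flag changing.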
Senior Data Engineer
Xpressbees
- Wrote data transformations using SQL to onboard tables from new data sources onto the data platform.
- Built DAG scripts to schedule hourly data loads from the source databases (PostgreSQL, MySQL, and MongoDB) into the analytical-layer tables of the analytics data warehouse.
- Created Debezium Kafka connector configuration files to set up change data capture (CDC) from source databases into the data lake.
- Conducted code reviews and mentored junior data engineers in the team.
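The CDC setup above can be sketched as a minimal Debezium PostgreSQL connector configuration, as it might be posted to the Kafka Connect REST API. Every hostname, credential, and table name here is a placeholder, not a production value.

```python
import json

# Minimal Debezium PostgreSQL connector configuration (illustrative only).
# All connection details and table names below are placeholders.
connector_config = {
    "name": "orders-cdc-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "source-db.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "orders",
        "database.server.name": "orders_server",
        # Only stream the tables needed downstream in the data lake.
        "table.include.list": "public.orders,public.order_items",
    },
}

print(json.dumps(connector_config, indent=2))
```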
Data Engineer
vPhrase
- Ingested approximately 80GB of daily Phrazor product and plugin usage data into an Amazon S3 data lake from multiple client and in-house servers.
- Imported the data from the data lake into the Snowflake data warehouse for transformations and analytics.
- Composed ingestion and transformation scripts in SQL to load data from raw and staging tables into the analytical layer tables, eventually used for analytics by the product manager and CTO.
- Wrote the Airflow DAG scripts using Python and orchestrated the entire pipeline.
Data Engineer
vPhrase
- Designed the end-to-end ETL pipeline for a financial client to power their stocks and mutual fund recommendation algorithm.
- Ingested data from third-party vendor databases, using Debezium, Kafka, and Kafka Connect for the CDC into the S3 data lake.
- Wrote the cleaning, transformation, and data processing scripts using Python and Spark to calculate around 100-150 financial KPIs.
- Orchestrated the entire ETL pipeline using Airflow running on an Amazon EC2 instance.
Data Engineer
vPhrase
- Built a BI product called Phrazor from scratch; it grew from zero to 12 full-time clients, with over 200 licenses, in three years.
- Modeled and designed the product's back-end data model and knowledge base, powering analytics on the user's reports and dashboards.
- Led the design and implementation of formulae using Spark and Pandas, crunching data to calculate industry-specific KPIs for users' reports.
- Designed a multi-level drill-down feature to diagnose sudden drops or growths in KPIs.
- Created and maintained unit tests to cover 90% of the codebase.
Experience
ETL Pipeline for Stocks and Mutual Funds Recommendation System
Built an end-to-end ETL pipeline that supplied data to a stocks and mutual funds recommendation algorithm for a leading trading firm in India.
DATA SOURCES
The project required historical data from that client, data from the client's third-party vendors, and data from various APIs.
TECH STACK AND OVERVIEW
Data was ingested using Debezium, Kafka, and Kafka Connect for CDC from vendor databases into the S3 data lake. Data cleaning and transformation were done using Python and Spark, and data processing involved calculating around 100-150 financial KPIs on top of the ingested data. The cleaning and processing pipeline was orchestrated using Apache Airflow running on an EC2 instance.
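As a plain-Python illustration of the KPI-processing step (the production jobs ran on Spark), two of the simpler kinds of financial KPIs might look like this; the specific KPI choices and column shapes are hypothetical examples.

```python
# Plain-Python stand-ins for the kind of financial KPIs the Spark jobs
# computed over ingested price data. The KPI set here is illustrative.
def daily_returns(prices):
    """Percentage change between consecutive closing prices."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]

def simple_moving_average(prices, window):
    """Trailing simple moving average over `window` observations."""
    return [sum(prices[i - window + 1 : i + 1]) / window
            for i in range(window - 1, len(prices))]

closes = [100.0, 102.0, 101.0, 104.0]
print(daily_returns(closes))
print(simple_moving_average(closes, 2))
```

In the real pipeline, logic like this would be expressed as Spark window functions so the 100-150 KPIs could be computed in parallel across the full instrument universe.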
Phrazor Product Usage Analytics Pipeline
Built an end-to-end ETL pipeline to ingest clickstream data from client and internal servers for analytics.
DATA SOURCES
Ingested approximately 80-100GB of daily Phrazor product and plugin usage data into an Amazon S3 data lake from multiple client and in-house servers.
OVERVIEW
Cleaned and transformed the raw data and transferred it from S3 to the Snowflake data warehouse for further analytics. The entire pipeline was orchestrated using Apache Airflow running on an Amazon EC2 instance.
IMPACT
This pipeline helped the product managers make smarter product decisions, run A/B tests, and analyze how users use the platform. This layer also supplied clean data to the data scientists for advanced analytics.
IEEE-CIS Fraud Detection Kaggle Competition
Given data about each credit card transaction, the solution had to identify fraudulent transactions. It was a classic example of highly skewed data, with the positive class making up less than 1% of the entire dataset.
Tested several data imbalance handling techniques and ultimately went with hard negative mining. My team and I engineered many features that helped us reach the top 7%.
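A toy sketch of the hard negative mining idea, assuming model scores for each example are already available; the data, scores, and `keep_ratio` default below are illustrative, not the competition solution itself.

```python
# Toy sketch of hard negative mining on an imbalanced dataset: keep every
# positive example, but retain only the negatives the current model finds
# hardest (highest predicted fraud score). Scores are stand-ins for a real
# model's predictions.
def mine_hard_negatives(scores, labels, keep_ratio=0.25):
    """Return indices of all positives plus the top-scoring negatives."""
    negatives = [i for i, y in enumerate(labels) if y == 0]
    positives = [i for i, y in enumerate(labels) if y == 1]
    # Hardest negatives = those the model most confidently mislabels as fraud.
    hard = sorted(negatives, key=lambda i: scores[i], reverse=True)
    n_keep = max(1, int(len(negatives) * keep_ratio))
    return sorted(positives + hard[:n_keep])

labels = [0, 0, 0, 0, 1]
scores = [0.1, 0.9, 0.2, 0.05, 0.8]  # hypothetical fraud probabilities
print(mine_hard_negatives(scores, labels))
```

Retraining on this reduced set focuses the model on the decision boundary instead of the overwhelming mass of easy negatives.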
Skills
Languages
Python 3, SQL, Snowflake SQL
Libraries/APIs
Pandas, NumPy, Beautiful Soup, PySpark, Amazon EC2 API, GitHub API
Tools
Apache Airflow, Microsoft Power BI, Spark SQL, GitHub, Tableau, Looker, Git, PyCharm, Sublime Text 3, Amazon Elastic MapReduce (EMR), Terraform, AWS Glue
Paradigms
ETL, Business Intelligence (BI), Unit Testing, Agile, Database Design, Dimensional Modeling, Data Science, Continuous Delivery (CD), Continuous Integration (CI), DevOps
Storage
PostgreSQL, MySQL, Databases, Relational Databases, Redshift, Data Lakes, Database Modeling, Data Pipelines, JSON, Database Architecture, Amazon S3 (AWS S3), MongoDB, Azure SQL, Datadog
Other
Data Warehousing, Data Engineering, Data Warehouse Design, Data Visualization, Data Analysis, Database Analytics, ETL Tools, Data, Big Data Architecture, ETL Development, Data Modeling, ELT, Big Data, Data Analytics, Database Schema Design, Data Cleaning, Data Processing, Star Schema, Pipelines, Data Extraction, Query Optimization, Web Scraping, Back-end Development, Dashboards, Reports, Information Visualization, Data Transformation, Message Queues, Database Optimization, Data Architecture, Debezium, CDC, Machine Learning, EDA, EMR, Parquet, APIs, Consumer Packaged Goods (CPG), Serverless, Full-stack, CI/CD Pipelines, Change Data Capture, Real Estate, Data Manipulation, Amazon RDS
Frameworks
Apache Spark
Platforms
Apache Kafka, Amazon EC2, Amazon Web Services (AWS), Oracle, Linux, Ubuntu, Azure, Docker, AWS Lambda
Education
Bachelor's Degree in Computer Science
Mumbai University - Mumbai, India
Certifications
Data Science
GreyAtom School of Data Science