Akshay Sehgal

I am a Data Scientist with 7 years of experience, currently working as a lead (General Manager, SME-1) at Reliance Industries, where I design, train and deploy ML models powering enterprise-scale platforms and products. I lead a team of about 10 data scientists working closely with full-stack developers, product teams and DevOps to map out and productionize AI/ML architectures for cloud-based applications related to employee management systems and services. Some of my projects include geo-spatial route matching, distributed virtual assistants, recommendation engines, document semantic matching, anomaly detection on image data and natural language querying of databases.

Previously, I headed strategy and development of data-powered products at iPredictt Data Labs, a startup I co-founded. My career in data science began at Mu-Sigma, a pure-play analytics firm. I have worked across domains including the HR, advertising, retail and entertainment industries.

I have experience with supervised learning methods such as generalized linear models, decision trees, ensemble models (StackNet, XGBoost, random forests), support vector machines (SVM), probabilistic models and deep learning, as well as unsupervised models such as K-means clustering, Gaussian mixture models, hierarchical DBSCAN for geo-spatial data, PCA, deep belief networks, RBMs and self-organizing maps. This includes applying NLP-based models such as seq2seq models with attention, word embeddings (fastText, word2vec, GloVe), LSTMs/GRUs and 1D convolutional networks. I enjoy exploring GANs, DeepDream networks, reinforcement learning, genetic algorithms, computer vision and network analysis. I have frequently worked with the Django and Flask frameworks deployed via ngrok or Docker on Amazon EC2 and Azure VMs, as well as serverless deployment using AWS Lambda (via Zappa).

Outside work I explore metaphysics, epistemology, theoretical physics, amateur mathematics and graphic design. I have been a guitarist for over 10 years now.

I am a frequent contributor on Digital Vidhya, Code Gladiators and Kaggle. I have substantial experience working with technical and non-technical clients as well as top management, and have spoken at multiple tech events across India. I also have 3 technology patents under my name (201721005644, 201621034521, 201621034522).


Natural Language querying of databases

Lead Data Scientist, Reliance Industries

Building a Python framework which allows natural language querying on small- to medium-scale databases, using seq2seq neural networks to translate a natural language query into SQL. The model predicts the select and condition columns, conditions and aggregations needed in the SQL query, which is then run on the given database. The result is passed through natural language generation to respond to the user with an answer to the query.
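
As a minimal illustration of the last step, here is how the predicted components could be assembled into a SQL string (a toy sketch with hypothetical table, column and operator names, not the framework itself):

    # Illustrative only: assemble SQL from the pieces a seq2seq model predicts
    # (select column, aggregation, where conditions). Names are hypothetical.
    def build_sql(table, select_col, agg=None, conditions=None):
        select_expr = f"{agg}({select_col})" if agg else select_col
        sql = f"SELECT {select_expr} FROM {table}"
        if conditions:
            clauses = [f"{col} {op} '{val}'" for col, op, val in conditions]
            sql += " WHERE " + " AND ".join(clauses)
        return sql

    # e.g. "how many employees joined after 2017 in Mumbai?"
    print(build_sql("employees", "employee_id", agg="COUNT",
                    conditions=[("join_year", ">", "2017"), ("city", "=", "Mumbai")]))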

Tools used: Python, NLP, word embeddings, Seq2SQL, seq2seq with attention, SimpleNLG, Xsql framework, Keras, scikit-learn.

Dec 2018 – Ongoing

JD-CV matching algorithm for candidate shortlisting

Lead Data Scientist, Reliance Industries

Building a CV sourcing and shortlisting platform that gives hiring managers a ranked list of profiles matching a requirement. Profiles are enriched using multiple data sources and parsed to extract education, experience, skill sets, project and personal information. This is followed by document clustering to obtain the relevant domain cluster, and document similarity (ranking) algorithms to match the JD document to profiles. A reinforcement learning layer is being added to capture and personalise hiring manager preferences and behaviours while enforcing company standards and requirements.
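
A minimal sketch of the document similarity (ranking) idea using gensim's Doc2Vec, with toy CV texts standing in for parsed profiles (not the production pipeline):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus of parsed CV texts; real profiles are far richer
    cvs = ["python machine learning nlp keras",
           "accounting audit taxation sap",
           "deep learning computer vision opencv python"]
    corpus = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(cvs)]
    model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

    # Embed the JD and rank CVs by cosine similarity to it
    jd = "looking for a python deep learning engineer"
    jd_vec = model.infer_vector(jd.split())
    print(model.dv.most_similar([jd_vec], topn=3))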

Tools used: Python, NLP, word embeddings, t-SNE, Doc2Vec, PCA, spaCy, fuzzy matching, GMM, document classification using LSTMs, reinforcement learning, Keras.

August 2018 – Ongoing

Distributed Virtual Assistant Development Toolkit

Lead Data Scientist, Reliance Industries

Building a Python-based tool that allows non-technical users to design, train and deploy closed-domain virtual assistants through a GUI. The bots are then integrated into a meta-model that allows intermediate intent switching to an intent on another bot deployed on a different server. The tool also lets users integrate APIs at any point in the conversation (to assist the user by fetching data, validating inputs against a database, or completing a transaction on a service such as travel bookings, leave/regularization systems or HR queries). Integration with live systems and applications is ongoing.
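
The intent-switching meta-model can be pictured as a simple router that forwards a recognised intent to whichever bot owns it. The sketch below is purely hypothetical (made-up intents and endpoints) and shows only the dispatch decision:

    # Hypothetical routing table: each closed-domain bot registers its intents
    BOT_REGISTRY = {
        "apply_leave":   "http://hr-bot.internal/parse",      # placeholder URLs
        "book_travel":   "http://travel-bot.internal/parse",
        "payslip_query": "http://payroll-bot.internal/parse",
    }

    def route(intent, message):
        endpoint = BOT_REGISTRY.get(intent)
        if endpoint is None:
            return "Sorry, I can't help with that yet."
        # In production this would be an HTTP call to the owning bot,
        # possibly on another server; here we only show the dispatch
        return f"Forwarding '{message}' to {endpoint}"

    print(route("apply_leave", "I want to take Friday off"))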

Tools used: Python, NLP, NLG, RASA framework, entity extraction, Markov chains, LSTM based neural networks, Django, Docker, nginx, Keras, Scikit-learn.

June 2018 – Ongoing

Course Recommendation Engine for Reliance LMS

Lead Data Scientist, Reliance Industries

Productionized a course recommendation engine for 30,000+ employees which integrates various businesses at the user end and Reliance's learning partners at the content end. Utilized employee demographics and organisational data to create multiple recommendation systems integrated via a multi-armed bandit architecture to personalise each user's experience. Matrix decompositions, fuzzy logic, collaborative filtering, association models, context clustering and reinforcement learning were used.
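
A minimal epsilon-greedy sketch of the multi-armed bandit layer that chooses between candidate recommenders (recommender names here are illustrative, not the production components):

    import random

    arms = {"collaborative_filtering": [], "content_based": [], "popularity": []}
    EPSILON = 0.1  # fraction of requests used for exploration

    def pick_arm():
        if random.random() < EPSILON or not any(arms.values()):
            return random.choice(list(arms))                                   # explore
        return max(arms, key=lambda a: sum(arms[a]) / max(len(arms[a]), 1))    # exploit

    def update(arm, reward):     # reward could be a click or a course enrolment
        arms[arm].append(reward)

    chosen = pick_arm()
    update(chosen, reward=1)
    print(chosen, arms)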

Tools used: Python, text analysis, NLP, collaborative filtering, SVD, search strategies, multi-armed bandits, reinforcement learning, scikit-learn.

Oct 2017 – May 2018

Employee Car-pooling service using Geo-Spatial clustering

Lead Data Scientist, Reliance Industries

Designed an unsupervised model over the employee address database to create geo-spatial clusters based on density of residence across the map, and used polygon matching with dynamic programming to calculate the delta between driver and passenger routes. This was followed by a route optimization algorithm using network analysis of the graph of clusters, and then a matchmaking model for route matching which estimated polygon similarity between the optimal (estimated) routes of the passengers and the car driver. This model is currently being housed in a B2B employee services module called Share-a-Ride.
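
A tiny sketch of the density-based geo-spatial clustering step using the hdbscan library with a haversine metric (toy coordinates, not employee data):

    import numpy as np
    import hdbscan

    # Toy home coordinates as (lat, lon) in degrees
    coords = np.array([[19.07, 72.87], [19.08, 72.88], [19.21, 72.85],
                       [28.61, 77.20], [28.63, 77.22]])

    # Haversine distance works on radians; clusters follow residential density
    clusterer = hdbscan.HDBSCAN(min_cluster_size=2, metric='haversine')
    labels = clusterer.fit_predict(np.radians(coords))
    print(labels)   # -1 marks noise points outside any dense cluster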

Tools used: Python, Google Enterprise API, hierarchical DBSCAN, dynamic programming, network centralities and route optimization, polygon similarity techniques.

Sep 2017 – Oct 2018

Expression & Empathy Detection

Lead Data Scientist, Reliance Industries

A two-part module: expression detection using image processing applied over a live camera feed (interviews) with OpenCV, and an NLP-based empathy detection algorithm over text data (emails, Skype, communities) using an SVM trained on a 60-million-tweet dataset categorized by types of empathy and emotion. This module is being housed in various upcoming systems which improve quality of hire and employee services as part of the PMS 2.0 project.
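
The expression detection part starts from face detection on the camera feed; here is a minimal OpenCV sketch of that first step (the actual expression models sit on top of this):

    import cv2

    # Haar cascade face detector shipped with OpenCV
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    cap = cv2.VideoCapture(0)          # live camera feed
    ret, frame = cap.read()
    if ret:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        print(f"Detected {len(faces)} face(s) in the frame")
    cap.release()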

Tools used: Python, NLP, Word2Vec, Naive Bayes, SVM, OpenCV, TensorFlow.

Mar 2018 – June 2018

Viewer interest prediction on Rental Listings on Renthop (Kaggle)

Kaggler, ranked top 7% globally

The objective was to predict how popular an apartment rental listing would be based on the listing content: text description, photos, number of bedrooms, price, etc. The data comes from renthop.com, an apartment listing website. I created an ensemble model using XGBoost wrapped in a cross-validator, stacked over KazAnova's StackNet with random forest and SVM. Features included the basic listing fields, simple calculated features, constructed features over manager_id using tf-idf, clustered longitude-latitude positions and, finally, the "magic" feature. Model iterations were done with parameter tuning followed by averaging and taking the geometric mean of predictions. The evaluation metric was log loss, and my best model placed in the top 7% of the global Kaggle leaderboard.
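
A stripped-down sketch of the XGBoost-with-cross-validation piece, scored on log loss (synthetic data stands in for the engineered listing features):

    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score

    # Stand-in data; the competition used engineered listing features instead
    X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6)

    model = xgb.XGBClassifier(n_estimators=300, max_depth=6,
                              learning_rate=0.05, subsample=0.8)
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_log_loss")
    print("mean log loss:", -scores.mean())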

Tools used: Python, NLTK, SVM, K-Means, Random Forest, XGBoost with Cross Validation, StackNet by KazAnova.

Mar 2017 – May 2017

Recruitment decision making tool called Careerletics Enterprise

Head of DS Products, iPredictt

Careerletics Enterprise is an intelligent platform for recruiters which assists them with pre-hire decision making and reduces the hiring lifecycle from a few weeks to a few minutes. It assists a recruiter by parsing resume data, quantifying candidate metrics, calculating relevance against a job description and ranking candidates by a metric called the employability score. First, an exhaustive database linking industries and functions to skill sets, companies, job positions and colleges was created by applying natural language processing over roughly half a million resume documents (with no fixed template). This database was then used to identify qualification, skills and experience information from user resumes via a parser, coupled with a chatbot to collect missing candidate information directly. Next, a stacked model for filtering, relevance matching and competitive ranking was developed. Candidates ultimately selected by the recruiter are captured and used as feedback for the self-learning algorithm to adjust parameter weights. The platform and algorithm are patented under iPredictt Data Science Labs.
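
The relevance-matching idea can be illustrated with a simple tf-idf cosine ranking (a toy stand-in; the production stack layered parsing, filtering and a learned ranking model on top):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    resumes = ["java spring microservices aws",
               "recruitment hr operations payroll",
               "python machine learning nlp"]
    jd = "hiring a python nlp engineer with machine learning experience"

    vec = TfidfVectorizer()
    matrix = vec.fit_transform(resumes + [jd])
    jd_vec, cv_vecs = matrix[len(resumes):], matrix[:len(resumes)]

    # Rank resumes by cosine similarity to the JD
    scores = cosine_similarity(jd_vec, cv_vecs).ravel()
    print(scores.argsort()[::-1], scores)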

Tools used: Python, NLTK, Expectation maximization, Gaussian mixture model, Gradient Boosting, PCA.

Jul 2016 – Aug 2017

Analysis of Political Affiliation and Sentiment over Social Media

Lead Data Scientist, iPredictt

The objective was to understand the sentiment of a popular Indian news network with respect to different political parties on Twitter and Facebook, and to compare the sentiment of competing news networks against it. Tweepy and web scraping were used to pull data from Twitter and Facebook, followed by data cleaning, feature generation and NLP treatment to generate a sentiment report. The analysis covered comparing political party affiliation, quantifying shared sentiment across newsgroups, detecting targeted negative propaganda over social media and forecasting topic-wise sentiment on Twitter.
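
For a flavour of the sentiment step, here is a minimal lexicon-based pass using NLTK's VADER on made-up example posts (a stand-in for the project's actual NLP treatment):

    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)
    sia = SentimentIntensityAnalyzer()

    tweets = ["Great coverage of the election results by the channel!",
              "This report is biased and misleading."]
    for t in tweets:
        # compound score runs from -1 (negative) to +1 (positive)
        print(t, "->", sia.polarity_scores(t)["compound"])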

Tools used: Python, Tweepy, NLTK, Topic Modelling, Sentiment Analysis.

Jan 2016 – May 2016

Optimize Ad Exchange networks for increasing campaign value

Lead Data Scientist, iPredictt

The objective was to create a platform for a Mobile Ad Exchange startup with a ₹60 crore turnover to optimize ad campaign time and direction, which involved selecting the right publisher for an advertising campaign as a function of time of day, conversion rates, customer target category and network type. Variable importance was calculated via decision trees to categorize publisher efficiency and analyze trends better, while click probability for cookie IDs was calculated by building a logistic regression model. The campaign statistics were visualized using charts and Sankey diagrams on an R-Shiny server.
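
A minimal Python illustration of the click-probability idea (the project itself was built in R; the features and values below are made up):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import OneHotEncoder

    # Hypothetical categorical features: time of day, network type, target category
    X_raw = [["morning", "wifi", "gaming"], ["night", "3g", "news"],
             ["evening", "wifi", "news"], ["morning", "3g", "gaming"]]
    y = [1, 0, 1, 0]                          # click / no click

    enc = OneHotEncoder(handle_unknown="ignore")
    X = enc.fit_transform(X_raw)

    clf = LogisticRegression().fit(X, y)
    print(clf.predict_proba(enc.transform([["morning", "wifi", "news"]]))[:, 1])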

Tools used: R, R-Shiny, Decision trees, Random Forest, Logistic regression.

Jul 2015 – Dec 2015

Supply Chain network optimization and planning

Senior Decision Scientist, Mu-Sigma

The client was a Fortune 50 multinational computer technology giant. The project objective was to analyze backlogs and develop a network flow optimization model for the Americas, EMEIA and Asia logistics teams to enhance the efficiency of their respective supply chains. A model was built on 3 years of backlog data with stage-wise and SKU-wise flows from manufacturing to fulfillment centers/customers. Missing data were imputed using decision trees, followed by linear programming to minimize an objective function over the number of backlogs in each network. The resulting model was visualized using Tableau and shared with 1,000+ stakeholders and executives across the Singapore, Austin, Hong Kong, London, Korea and India offices.
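
A toy linear programming sketch with SciPy to show the flow-minimisation idea (two plants, one fulfillment center; the real model covered multi-stage, SKU-wise flows):

    from scipy.optimize import linprog

    # Decision variables: x1 (plant A -> DC), x2 (plant B -> DC)
    cost = [4, 6]                               # per-unit shipping cost
    A_eq = [[1, 1]]; b_eq = [100]               # demand at the DC must be met
    bounds = [(0, 70), (0, 80)]                 # plant capacities

    res = linprog(c=cost, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    print(res.x, res.fun)                       # optimal shipments and total cost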

Tools used: SQL, R, Decision Trees, Linear Programming, Tableau.

Nov 2014 – Mar 2015

Theoretical Win prediction for customers of a Casino Giant

Senior Decision Scientist, Mu-Sigma

A Fortune 500 casino giant used certain business rules to calculate ADT (accumulated daily theoretical win) for each of their customers to decide the category of their marketing spend, with extremely low accuracy (32%). The objective was to build a regression model to predict ADT values for customers based on gambling spend, wins and other visit information. An ensemble model was created based on analysis of variation in the target variable (ADT). One segment of the customer population (primarily low-spend customers) was handled using generalized linear models, while the remaining segment (primarily high-spend customers) was handled using 11 separate support vector machine classification models, for which the target variable was bucketed into spend categories instead of using a continuous ADT value. The accuracy of this model was much higher than the base model (53%). The exercise was followed by creating a financial modelling simulator using these predictions to generate best- and worst-case profit/loss scenarios over variable marketing spend.
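
A small sketch of the bucketing idea for the high-spend segment: the continuous ADT target is cut into spend categories and an SVM classifier is fit on them (synthetic stand-in data, not client data):

    import numpy as np
    import pandas as pd
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))                       # stand-in visit features
    adt = np.abs(X @ np.array([200, 50, 120, 80]) + rng.normal(scale=30, size=500))

    # Bucket the continuous target into spend categories, then classify
    y = pd.cut(adt, bins=[-1, 100, 300, 600, np.inf],
               labels=["low", "mid", "high", "vip"]).astype(str)
    clf = SVC(kernel="rbf").fit(X, y)
    print(clf.predict(X[:5]))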

Tools used: Python, ANOVA, K-means clustering, Support vector machines, Monte-Carlo simulation.

Jun 2014 – Oct 2014

Driver analysis for market cannibalisation

Decision Scientist, Mu-Sigma

The second-largest toy manufacturer brand showed a quarter-on-quarter ROI decline of 20%, which amplified further during the latest holiday season, and a clear understanding of the prime causes of this decline was required. A five-dimension deterministic model was created to analyze parameters calculated through web analytics. This model was then passed through regression analysis to generate estimates for each parameter as a proxy for its contribution to the sales decline. A major finding was that the decline was primarily due to cannibalization by a new brand the client had launched themselves, but for a higher age category. This allowed them to take major decisions in time to stabilize the curve to around an 8% decline in the following quarter, and also influenced the launch dates of their upcoming brands.

Tools used: R, Deterministic modeling, Web analytics, Generalized regression models.

Dec 2013 – May 2014

Customer Segmentation and Targeting for retail products

Decision Scientist, Mu-Sigma

The client was the world's biggest home improvement retail company. The objective was to create customer segments based on behavioral traits, spend patterns and volatility in purchase categories, allowing the client to understand and target customers better. Customer segmentation on transaction data was done using RFM segmentation, followed by item-based and user-based collaborative filters to create purchase category recommendations for customized targeting. This directly affected the client's top line in specific departments such as gardening and home repair.
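
A minimal pandas sketch of the RFM computation on toy transactions (recency, frequency and monetary value per customer):

    import pandas as pd

    tx = pd.DataFrame({
        "customer": ["a", "a", "b", "c", "c", "c"],
        "date": pd.to_datetime(["2013-06-01", "2013-07-10", "2013-05-02",
                                "2013-07-01", "2013-07-15", "2013-07-20"]),
        "amount": [120, 80, 40, 200, 150, 90],
    })
    snapshot = tx["date"].max() + pd.Timedelta(days=1)

    rfm = tx.groupby("customer").agg(
        recency=("date", lambda d: (snapshot - d.max()).days),
        frequency=("date", "count"),
        monetary=("amount", "sum"),
    )
    print(rfm)   # these scores feed the segmentation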

Tools used: SQL, Excel, R, RFM Segmentation, Collaborative Filtering.

May 2013 – Nov 2013

Real Time in-store traffic analysis using Brickstream

Decision Scientist, Mu-Sigma

The client was the world's biggest home improvement retail company. They were running a pilot with Brickstream, a video analytics platform which uses aisle camera footage to create virtual trip lines and dwell zones. Exhaustive weekly reports were created from the data collected by Brickstream-enabled cameras. Trip line analysis allowed the client to predict traffic hours in real time and align store associates accordingly for the coming days and weeks, improving resource management. Dwell analysis enabled the client to understand customer dwell times at specific aisle positions, informing decisions on shelf space management.

Tools used: Brickstream, SQL, R, video processing.

Nov 2012 – Apr 2013

Naive Bayes and its different classifiers: how do they work?

Mumbai, 22nd Jan 2019
#DataScience #NaiveBayes #Notebook

In this notebook I try to explore the intuition behind the probabilistic modelling technique called Naive Bayes and its various classifiers. Understanding the implementation of each of these classifiers is important, as each comes with its own assumptions on top of the naive assumption of the algorithm itself. A deeper look at the implementation of these classifiers and the mathematics behind them can shed more light on the intuition behind this widely used algorithm, which forms a foundation for a large number of more complex supervised classification methods....
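
For a quick taste of the comparison (illustrative only, with iris as a stand-in dataset):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

    # Each variant makes a different assumption about feature distributions
    X, y = load_iris(return_X_y=True)
    for clf in (GaussianNB(), MultinomialNB(), BernoulliNB()):
        score = cross_val_score(clf, X, y, cv=5).mean()
        print(type(clf).__name__, round(score, 3))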



Handling missing data (like a boss!)

New Delhi, 28th Jun 2017
#DataScience #MissingData #Notebook

Missing data is the nemesis of every data scientist, especially those still new to the field. We are all fascinated by new algorithms and don't miss a single chance to apply them to every dataset we can get our hands on. But, alas, missing data becomes a major barrier to that dream unless it can be handled properly. I learnt my lesson long ago, and while I know people who are wizards at data handling (one function to rule them all), I fortunately/unfortunately prefer simple, short steps to handle missing data so that each step can be verified. With what little Python programming I can muster, I present to you the standard and advanced missing data handling techniques I have learnt over the years....
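
A small taste of the simple, step-by-step style the notebook follows (toy dataframe with assumed column names):

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                       "city": ["Mumbai", "Delhi", None, "Delhi"],
                       "spend": [1200, 800, np.nan, 950]})

    print(df.isna().sum())                                   # step 1: inspect what is missing

    df["age"] = df["age"].fillna(df["age"].median())         # numeric: median
    df["city"] = df["city"].fillna(df["city"].mode()[0])     # categorical: mode

    # or the scikit-learn way for numeric columns
    df[["spend"]] = SimpleImputer(strategy="mean").fit_transform(df[["spend"]])
    print(df)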



Visualizations for EDA in Python

New Delhi, 15th Apr 2017
#DataScience #Visualizations #Seaborn #Notebook

Exploring distributions, correlations and data variability is among the most important tasks before feature engineering begins; one may consider it the first major project stage a data scientist needs to be able to perform. A thorough exploration has not only helped me understand the data at hand, but also form basic notions about the ballparks and behavior of the given features. Ballparks help a data scientist avoid logical errors during feature engineering. In this notebook I detail the most effective way I have found to generate charts in a standardized way for understanding data during the exploration phase of a project. These primarily use the grid method to create a canvas and then plot charts by category. In a sense, it's similar to data grouping in a visual way....
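
A minimal example of the grid approach with seaborn's FacetGrid (using seaborn's built-in tips dataset as a stand-in):

    import matplotlib.pyplot as plt
    import seaborn as sns

    tips = sns.load_dataset("tips")
    # One canvas, one panel per category combination
    g = sns.FacetGrid(tips, col="time", row="smoker")
    g.map(plt.hist, "total_bill", bins=15)
    plt.show()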



Group by & Aggregate using Pandas

New Delhi, 25th Mar 2017
#DataScience #DataGrouping #Notebook

Data grouping is probably the most used concept in the field of data analysis. Almost every scripting language builds its foundation on grouping data by the categories of a multi-dimensional variable. A data scientist uses this for summarizing data for analysis as well as for changing the level at which data is useful for a model. For example, transaction-level data needs to be summarized to the customer level before predicting customer spend. Use cases like these are where languages like SQL are very useful with their GROUP BY clauses. However, Python isn't far behind: pandas provides a large variety of methods which do much more than standard SQL grouping. Combined with the aggregate methods, this gives a data scientist a strong grasp over data handling....
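
A small example of rolling transaction-level rows up to the customer level with pandas (toy data, assumed column names):

    import pandas as pd

    tx = pd.DataFrame({"customer": ["a", "a", "b", "b", "b"],
                       "category": ["toys", "books", "toys", "toys", "games"],
                       "amount": [250, 120, 90, 300, 60]})

    # Equivalent in spirit to a SQL GROUP BY with aggregates
    summary = tx.groupby("customer").agg(
        total_spend=("amount", "sum"),
        avg_ticket=("amount", "mean"),
        n_categories=("category", "nunique"),
    )
    print(summary)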