LLM-Powered Form Completion Engine
- Engineered a RAG pipeline powered by the Llama 3 8B LLM to automate maintenance request processing via an interactive form completion engine.
- Leveraged GPT-4o to generate synthetic data.
- The RAG pipeline employs LlamaIndex abstractions: a custom Document object, a VectorStoreIndex, a QueryEngine, and a Retriever (see the sketch below).
- Tools used: Python (PyTorch, Replicate, OpenAI API, LlamaIndex), Llama 3 8B, Mistral 7B.
- NOTE: The project is incomplete: the feedback feature hasn't been fully implemented, and the Streamlit app code is still in development. The real data was removed for privacy reasons.
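A minimal sketch of how these LlamaIndex pieces fit together, assuming Llama 3 8B is served through Replicate and embeddings default to the OpenAI API; the model slug, metadata fields, and sample request are illustrative, not the project's actual configuration:

```python
# Illustrative sketch only; model slug, metadata, and query are assumptions.
from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.llms.replicate import Replicate

# Serve Llama 3 8B through Replicate (slug assumed; embeddings default to OpenAI).
Settings.llm = Replicate(model="meta/meta-llama-3-8b-instruct")

# Wrap each maintenance request in a custom Document with metadata.
docs = [
    Document(
        text="Tenant reports a leaking faucet in unit 4B.",
        metadata={"request_id": "0001", "category": "plumbing"},
    )
]

# Build the index, retriever, and query engine on top of the documents.
index = VectorStoreIndex.from_documents(docs)
retriever = index.as_retriever(similarity_top_k=3)
query_engine = index.as_query_engine()

# Ask the engine to fill a form field from retrieved context.
print(query_engine.query("Which category should this request be filed under?"))
```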
Aerial Object Detection: Differentiating Drones, Birds, and Airplanes
- Curated a diverse dataset of 2000+ drone, bird, and airplane images/videos from various sources.
- Augmented the dataset with varied lighting conditions, angles, cropping, shear, and background noise to simulate real-world conditions.
- Fine-tuned a YOLOv8 model (sketched below) that achieved 91% precision, 89% recall, and 89% mAP across all classes.
- Utilized Ray Tune to search for optimal hyperparameters and MLflow to track the results for different model configurations.
- Validated the model using a test set comprising unseen aerial footage from different environments.
- Tools: Python (OpenCV, PyTorch), YOLOv8 (computer vision model), Roboflow (data preprocessing, augmentation, and annotation service)
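A brief fine-tuning sketch using the Ultralytics API; the base checkpoint, dataset config path, and training settings are assumptions:

```python
# Sketch of YOLOv8 fine-tuning; paths and settings are illustrative assumptions.
from ultralytics import YOLO

# Start from pretrained COCO weights and fine-tune on the aerial dataset.
model = YOLO("yolov8s.pt")
model.train(
    data="aerial.yaml",  # hypothetical dataset config: drone/bird/airplane classes
    epochs=100,
    imgsz=640,
)

# Evaluate on the held-out split of unseen aerial footage.
metrics = model.val(split="test")
print(metrics.box.map)  # mAP50-95
```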
Prediction on Test Set Images:
Confusion Matrix:
F1 vs. Confidence Curve:
Enhancing NanoGPT via Squentropy Loss and Hyperparameter Tuning
- Report of Findings, Website, Poster
- Formulated a custom hybrid squentropy loss function (cross-entropy on the true class plus squared loss on the incorrect-class logits) to minimize empirical risk via PyTorch methods (sketched below)
- Developed a script to calculate model perplexity (the exponential of the average cross-entropy loss, measuring how “perplexed” a model is when predicting its next token; lower perplexity indicates better performance)
- Conducted large-scale hyperparameter tuning via distributed data parallel (DDP) training to attain 1.8 loss and 3.9 perplexity
- Hyperparameters tuned include learning rate, dropout rate, and number of layers
- Tools: Python (NumPy, PyTorch, pickle), AWS
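A minimal PyTorch sketch of the two pieces above, assuming the squentropy form from Hui et al. (2023), i.e. cross-entropy on the true class plus a squared penalty on the incorrect-class logits; the exact weighting used in the project is an assumption:

```python
import torch
import torch.nn.functional as F

def squentropy_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: (batch, num_classes); targets: (batch,) class indices."""
    ce = F.cross_entropy(logits, targets)
    # Zero out the true-class logit, then penalize the squared incorrect logits.
    true_class = F.one_hot(targets, logits.size(-1)).bool()
    sq = logits.masked_fill(true_class, 0.0).pow(2).sum(dim=-1).mean()
    return ce + sq

# Perplexity is the exponential of the mean cross-entropy (per-token loss).
def perplexity(mean_ce_loss: torch.Tensor) -> torch.Tensor:
    return torch.exp(mean_ce_loss)
```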
Poster:
Formula and Code:
Top 10 Property Essentials
- Background: The company managed many properties, each with 10 areas that inspectors were required to assess; inspectors left comments on the condition of each area.
- Result: Designed a Power BI dashboard using Python, Snowflake, and unsupervised learning algorithms to analyze 20,000+ facilities, communicating insights to non-technical stakeholders.
- Applied key phrase extraction, a sentiment classifier, and visualizations to uncover insights from 5,000+ inspector comments (pipeline sketched below), enabling data-driven decisions and driving business improvements.
- Analysts can filter the results by market, region, property type, property manager, and more to visualize the performance of selected facilities.
- Tools used: Python (NumPy, pandas, NLTK, sentence-transformers, scikit-learn), Snowflake (SQL), Power Query, and Power BI.
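A hedged sketch of the comment-analysis steps: NLTK's VADER stands in for the sentiment classifier and a sentence-transformers checkpoint for the embedding model; both choices, and the sample comments, are assumptions:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from sentence_transformers import SentenceTransformer

nltk.download("vader_lexicon", quiet=True)

# Toy stand-ins for inspector comments.
comments = [
    "Roof membrane is cracked and leaking near the HVAC unit.",
    "Landscaping is well maintained and the entrance looks great.",
]

# Score each inspector comment (compound score in [-1, 1]).
sia = SentimentIntensityAnalyzer()
sentiments = [sia.polarity_scores(c)["compound"] for c in comments]

# Embed comments for clustering and key phrase grouping downstream.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(comments)
print(sentiments, embeddings.shape)
```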
Workflow Diagrams:
Diagram of sentiment encoding process
Diagram of key phrase extraction process
Power BI Dashboard:
NOTE: I no longer have access to the data warehouse, so the Power BI dashboard shows an error (PowerBIDashboard.png). I have attached an image of what the missing visuals should look like (check App.png).
Power BI Dashboard with error
Image of the missing Power BI visual (created using Plotly and Dash packages)
Operational Expense (OpEx) Variance Analysis
- Background: The company set a budget for each operational expense (OpEx). If the actual expense for the quarter missed the budget, an accountant would leave a comment explaining the reason behind the “variance” (budget miss). I was tasked with categorizing the general reasons behind these variances.
- Result: Applied unsupervised learning algorithms (NLP and clustering) to create 10 evenly distributed categories that classified 20,000+ OpEx variance comments, allowing accountants and analysts to understand the reasons behind OpEx variances and prevent recurrences.
- Significantly reduced the manual categorization workload and achieved ~78% accuracy on a hold-out set.
- Developed a Python pipeline (pandas, NLTK, sentence-transformers, scikit-learn) that embedded comments into vectors with a BERT-based sentence encoder and applied K-Means clustering to categorize the OpEx variance comments (sketched below).
- Launched a local Plotly Dash app to visualize the key phrases in each category, helping analysts summarize the content of the 20,000+ comments.
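A minimal sketch of the embed-then-cluster step; the encoder checkpoint and toy comments are assumptions, and the toy example uses 2 clusters where the real project used 10 over the full comment set:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Stand-ins for the 20,000+ variance comments.
comments = [
    "Elevator repair invoice arrived after quarter close.",
    "Snow removal costs exceeded budget due to severe weather.",
    "Unbudgeted emergency plumbing repair in Q3.",
    "Timing difference: landscaping contract billed early.",
]

# Embed each comment with a BERT-based sentence encoder (checkpoint assumed).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(comments)

# Cluster the embeddings into categories (the project used n_clusters=10).
kmeans = KMeans(n_clusters=2, random_state=42)
labels = kmeans.fit_predict(X)
print(labels)
```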
Workflow Diagram:
Movie Recommender System
- Engineered a content-based recommender system using NLP that takes in a movie and recommends 10 similar movies based on movie attributes and rating.
- Considered movies’ description, cast, director, genre, and average rating to suggest movies that the user would enjoy.
- Conducted preprocessing and feature engineering to ensure data quality and enhance model performance.
- Employed CountVectorizer and the cosine similarity metric to shortlist similar movies, then retrieved the 10 highest-rated movies from that shortlist to suggest to the user (sketched below).
- Tools used: Python (NumPy, pandas, NLTK, scikit-learn).
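A toy sketch of the similarity step; the "tag" strings stand in for the combined description/cast/director/genre features, and the titles are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each movie's combined attributes as one bag-of-words string (toy data).
movies = {
    "Inception": "sci-fi thriller heist dream nolan dicaprio",
    "Interstellar": "sci-fi space time travel nolan mcconaughey",
    "The Notebook": "romance drama gosling mcadams",
}

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(list(movies.values()))

# Pairwise cosine similarity between the count vectors.
sim = cosine_similarity(X)

# Movies most similar to "Inception", excluding itself; the real system
# would then keep the 10 highest-rated titles from this shortlist.
titles = list(movies)
ranked = sorted(zip(titles, sim[0]), key=lambda t: -t[1])[1:]
print(ranked)
```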
Flight Delays Prediction
- Report of Findings
- Constructed a machine learning pipeline to predict the severity of a flight’s delay (1M+ flights).
- Performed EDA, feature engineering, cross validation, and hyperparameter tuning (GridSearchCV) to optimize an XGBoost model that achieved 94% accuracy (tuning sketched below).
- Tools used: Python (NumPy, pandas, scikit-learn incl. GridSearchCV, LightGBM, XGBoost).
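A hedged sketch of the tuning step: GridSearchCV wrapped around an XGBoost classifier; the random stand-in data and the parameter grid are assumptions, not the project's actual features or grid:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))      # stand-in for engineered flight features
y = rng.integers(0, 2, size=500)   # stand-in for delay-severity labels

# Cross-validated grid search over a small illustrative parameter grid.
grid = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid={"max_depth": [3, 6], "learning_rate": [0.05, 0.1]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```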
Report of Findings:
Metrics:
Netflix Stock Price Prediction
- Assembled an LSTM Neural Network to predict the closing price of Netflix stock using the last 60 days of time series data.
- LSTM Neural Network contained 3 LSTM layers of 50 neurons each, followed by 3 dropout layers (20%) and 1 dense layer (sketched below).
- Tuned the batch size to 32 and the number of training epochs to 50 in order to achieve optimal performance (~4.5 RMSE).
- Tools used: Python (NumPy, pandas, Matplotlib, TensorFlow, scikit-learn).
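A Keras sketch of the stated architecture, assuming each LSTM layer is followed by its own 20% dropout layer and the 60-day window feeds a single price feature; the optimizer and loss are assumptions:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, Dropout, Dense

model = Sequential([
    Input(shape=(60, 1)),             # 60-day lookback, 1 feature (close price)
    LSTM(50, return_sequences=True),
    Dropout(0.2),
    LSTM(50, return_sequences=True),
    Dropout(0.2),
    LSTM(50),
    Dropout(0.2),
    Dense(1),                         # next-day closing price
])
model.compile(optimizer="adam", loss="mean_squared_error")
model.summary()
```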
Prediction:
Lead Scoring Model
- Developed a lead scoring model using supervised machine learning algorithms to predict the probability that a loan (“lead”) is funded with 93% accuracy.
- Cross-validated and compared several models; the LightGBM classifier performed best (model selection sketched below).
- Led to a simulated profit of ~$300,000 (in our scenario, a converted lead yields $120 and pursuing a lead costs ~$15 in time and effort).
- Tools used: Python (NumPy, pandas, PyCaret, LightGBM, scikit-learn).
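A short PyCaret sketch of the comparison step; the CSV path, target column name, and candidate list are assumptions:

```python
import pandas as pd
from pycaret.classification import compare_models, predict_model, setup

leads = pd.read_csv("leads.csv")  # hypothetical file with a "funded" label

# PyCaret handles preprocessing, cross validation, and the model leaderboard.
setup(data=leads, target="funded", session_id=42)
best = compare_models(include=["lightgbm", "rf", "lr"])

# Hold-out predictions; predicted probabilities serve as lead scores.
predictions = predict_model(best)
```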
ROC Curve: