Projects
Insurance Risk Analysis Dashboard
Tools: Tableau, R, Statistical Analysis
Ran EDA and hypothesis testing in R and created an interactive dashboard analyzing insurance premium data to identify key demographic risk factors affecting policy pricing.
Key Achievements:
- Analyzed 1,338 insurance records
- Identified high-risk customer segments
- Provided actionable recommendations for premium pricing
- Improved risk assessment accuracy through statistical modeling
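As a rough illustration of the hypothesis-testing step (the project itself ran in R), a minimal Python sketch follows; the `insurance.csv` file name and the `smoker`/`charges` columns are assumptions based on the typical structure of this dataset, not confirmed from the project files.

```python
# Illustrative two-sample t-test: do premiums differ across a
# demographic split? File and column names are hypothetical;
# the project's actual analysis was done in R.
import pandas as pd
from scipy import stats

df = pd.read_csv("insurance.csv")  # assumed 1,338-row dataset

smokers = df.loc[df["smoker"] == "yes", "charges"]
non_smokers = df.loc[df["smoker"] == "no", "charges"]

# Welch's t-test (unequal variances) on mean charges between groups.
t_stat, p_value = stats.ttest_ind(smokers, non_smokers, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```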
Here is a visual walkthrough of my Tableau dashboard.
Project Files
- View R Script
- View Dashboard
- GitHub Repository
Customer Segmentation Dashboard
Tools: Tableau, R
Overview
K-Means clustering analysis to identify distinct customer segments based on spending score.
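The clustering itself was done in R; a minimal Python sketch of the same idea is shown below, where the `customers.csv` file and the `annual_income`/`spending_score` column names are assumptions.

```python
# K-Means sketch: cluster customers on income and spending score.
# k=6 follows the write-up; file and column names are hypothetical.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("customers.csv")
X = StandardScaler().fit_transform(df[["annual_income", "spending_score"]])

km = KMeans(n_clusters=6, n_init=10, random_state=42)
df["cluster"] = km.fit_predict(X)

# Average spending score per cluster, e.g. to spot high-potential segments.
print(df.groupby("cluster")["spending_score"].mean())
```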
Key Findings
- Identified 6 distinct customer clusters
- Premium Young Professionals show the highest spending potential
Dashboard Features
- Interactive scatter plot visualization
- Cluster-specific insights
- Gender-based spending analysis
- Summary with key metrics
Screenshots
Here is a visual walkthrough of my Tableau dashboard.

Project Files
Programming Language Popularity Forecast
Tools: Python
Overview
This project analyzes and forecasts the popularity trends of three major programming languages, C++, Java, and Python, using time series forecasting techniques. The analysis employs SARIMA (Seasonal Autoregressive Integrated Moving Average) modeling to predict future popularity based on historical data patterns.
Key Methodology
Data Preprocessing: Stationarity tests using the Augmented Dickey–Fuller (ADF) test and data transformation.
Seasonal Decomposition: Breaking down the time series into trend, seasonal, and residual components.
Model Selection: SARIMA(1,1,1)(1,1,1,52) was chosen for C++ and Python, and SARIMA(2,1,1)(1,1,1,52) for Java, based on model diagnostics and performance metrics (a fitting sketch follows this list).
Validation: An analysis of residuals was conducted.
Accuracy: Comparison of 12-week forecasts against actual observed values.
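A minimal sketch of the fitting step with statsmodels, assuming a weekly popularity series; the file name, column name, and index layout are assumptions.

```python
# SARIMA(1,1,1)(1,1,1,52) on a weekly popularity series, as chosen for
# C++ and Python in this project. "popularity.csv" is hypothetical.
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.statespace.sarimax import SARIMAX

y = pd.read_csv("popularity.csv", index_col=0, parse_dates=True)["python"]

# Stationarity check before fitting (d=1, D=1 handle the differencing).
adf_stat, p_value, *_ = adfuller(y.dropna())
print(f"ADF p-value: {p_value:.4f}")

fit = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 1, 52)).fit(disp=False)
print(fit.summary())

forecast = fit.get_forecast(steps=12).predicted_mean  # 12-week horizon
```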
Appropriateness of the Chosen Models
Python: From the plot of the fitted values, the model captures both the general trend and the seasonal fluctuations effectively; the strong alignment between actual and fitted values indicates the model parameters are well calibrated.
The residual plot shows the residuals fluctuating around zero throughout the time period, indicating no systematic bias, and the residuals appear to have relatively constant variance.
The ACF of the residuals shows most values falling within the confidence bands, indicating the residuals are approximately white noise and confirming the model has adequately captured the temporal dependencies.
Java: The fitted values closely track the original Java data.
The residuals fluctuate randomly around zero, which supports model adequacy and suggests the SARIMA model has captured the underlying structure.
The ACF of the residuals shows no significant autocorrelation, confirming that the model has adequately captured the temporal dependencies.
Model diagnostics support the chosen specification, and the ACF analysis validates the absence of remaining autocorrelation.
C++: The residual analysis plot shows values oscillating around zero with no clear pattern, indicating a good model fit.
ACF of residuals: the autocorrelation values fall within the confidence bands, confirming the residuals are approximately white noise.
The fitted-vs-actual comparison shows the model captures both trend and seasonal variations well.
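The residual checks described above can be reproduced along these lines, assuming `fit` is the fitted SARIMAX results object from the earlier sketch; the Ljung–Box test is a standard white-noise check added here for illustration.

```python
# Residual diagnostics: time plot, ACF, and a Ljung-Box white-noise test.
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.stats.diagnostic import acorr_ljungbox

resid = fit.resid.dropna()

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
axes[0].plot(resid)                   # should oscillate around zero
axes[0].set_title("Residuals over time")
plot_acf(resid, lags=52, ax=axes[1])  # bars should stay inside the bands
plt.show()

# High p-values here indicate no remaining autocorrelation.
print(acorr_ljungbox(resid, lags=[12, 24], return_df=True))
```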
Project Files
Online Retail Country Revenue Analysis (Hadoop & R)
Project Overview This project analyzes sales data from a UK-based online retail company (2009–2011) to evaluate revenue performance across countries. The goal is to support strategic international market expansion decisions using distributed data processing.
Dataset
Source: UCI / Kaggle Online Retail II Dataset
Records include invoice numbers, product codes, quantities, prices, and country of shipment.
Technologies Used
Hadoop (HDFS)
MapReduce (R)
R (data processing & visualization)
ggplot2
Methodology
Loaded raw transactional data into HDFS.
Cleaned data by removing:
Records with null country values
Records with non-positive quantities
Implemented a MapReduce job in R to compute total revenue per country (an illustrative mapper/reducer sketch follows this list).
Sorted results in descending order to identify top markets.
Visualized the top 5 countries by revenue using a bar chart.
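The MapReduce job was written in R; for illustration, an equivalent mapper/reducer pair for Hadoop Streaming is sketched below in Python. The tab-separated field positions are assumptions and would need to match the real schema.

```python
# mapper.py - emit (country, revenue) for each cleaned transaction line.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    try:
        quantity = float(fields[3])   # assumed column positions
        price = float(fields[5])
        country = fields[7]
    except (IndexError, ValueError):
        continue                      # skip malformed records
    if country and quantity > 0:
        print(f"{country}\t{quantity * price}")
```

```python
# reducer.py - sum revenue per country (input arrives sorted by key).
import sys

current, total = None, 0.0
for line in sys.stdin:
    country, revenue = line.rstrip("\n").split("\t")
    if country != current:
        if current is not None:
            print(f"{current}\t{total:.2f}")
        current, total = country, 0.0
    total += float(revenue)
if current is not None:
    print(f"{current}\t{total:.2f}")
```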
Key Insights
Revenue is highly concentrated in a small number of European countries.
Top-performing markets represent strong candidates for further investment.
Lower-performing countries may require targeted marketing or operational review.
Business Value This analysis demonstrates how distributed computing can support data-driven international expansion strategies in retail.
Screenshots

Project Files
- View my clean data file
- View my mapper file
- View my reducer file
- View my visualisation file
Product Demand Categorisation Using Hive
Objective To classify products based on demand levels using distributed SQL analytics in Hive, supporting inventory and supply chain decision-making.
Approach
Loaded cleaned transactional data into Hive using an external table schema.
Calculated total quantity sold per product using Hive aggregation queries.
Developed a custom Hive UDF (logic sketched after this list) to categorise products into:
High Demand
Medium Demand
Low Demand
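The classification rule itself is simple enough to sketch; the script below is written in Python (runnable via Hive's TRANSFORM clause) purely for illustration, and the 1000/100 quantity thresholds are assumptions rather than the project's actual cut-offs.

```python
# categorise.py - classify products by total quantity sold.
# Reads "product_id<TAB>total_quantity" rows, e.g. via Hive TRANSFORM.
import sys

def demand_category(total_qty: int) -> str:
    # Hypothetical thresholds; the project's real cut-offs may differ.
    if total_qty >= 1000:
        return "High Demand"
    if total_qty >= 100:
        return "Medium Demand"
    return "Low Demand"

for line in sys.stdin:
    product_id, qty = line.rstrip("\n").split("\t")
    print(f"{product_id}\t{demand_category(int(qty))}")
```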
Business Value
High-demand products can be prioritised for restocking.
Medium-demand products can be supported through targeted promotions.
Low-demand products can be reviewed for discontinuation or bundling strategies.
Key Skills Demonstrated
Hive DDL & querying
User Defined Functions (UDFs)
Big data analytics
Inventory strategy support
View my project file
Customer Feedback Text Pattern Analysis (Pig & Hive)
Project Overview This project analyzes customer review text from Amazon products to uncover recurring complaint patterns and satisfaction drivers. The analysis focuses on identifying frequently occurring keywords in non-neutral reviews to support product and customer experience improvements.
Dataset
Source: Amazon Product Review Sample (Kaggle)
Fields include review title, review text, product name, and star rating.
Technologies Used
Hadoop (HDFS)
Apache Pig
Hive
Methodology
Loaded raw review data into HDFS.
Used Apache Pig to extract review titles and filter out neutral reviews (rating = 3).
Tokenised review titles and computed word frequencies to identify common themes (sketched after this list).
Loaded processed outputs into Hive and created a view of the top 10 most frequent words.
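The tokenise-and-count step is the core of the pipeline; a standalone Python sketch of the same logic is below (the actual job used Pig's TOKENIZE and GROUP), with the file and column names assumed.

```python
# Word frequencies over non-neutral review titles, mirroring the Pig job.
import re
from collections import Counter
import pandas as pd

df = pd.read_csv("reviews.csv")         # hypothetical file and columns
non_neutral = df[df["rating"] != 3]     # drop neutral (3-star) reviews

counts = Counter()
for title in non_neutral["title"].dropna():
    counts.update(re.findall(r"[a-z']+", title.lower()))

for word, n in counts.most_common(10):  # top 10, as in the Hive view
    print(f"{word}\t{n}")
```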
Key Insights
Frequently occurring terms highlight common customer pain points such as product quality, packaging, or usability.
Text pattern analysis provides scalable insight into customer sentiment without manual review.
Business Value This approach enables marketing and product teams to prioritise improvements based on real customer feedback trends, improving satisfaction and reducing returns.
Project Files
- View my filtered reviews Pig file
- View my filtered reviews output file
- View my word frequency output file
- View my project file
Customer Retention Modelling with PostgreSQL and MADlib
Project Overview This project builds a customer churn prediction model for a telecom provider using PostgreSQL and MADlib. The objective is to predict customer churn using in-database machine learning and translate results into actionable retention strategies.
Dataset
Source: Telco Customer Churn Dataset (Kaggle)
Features include customer demographics, account details, service usage, and churn status.
Technologies Used
PostgreSQL
MADlib
Logistic Regression
Methodology
Imported raw churn data into PostgreSQL and created a cleaned analytical table.
Trained a logistic regression model using MADlib with tenure, contract type, and monthly charges as predictors (a standalone sketch follows this list).
Evaluated model performance using accuracy metrics and confusion matrix queries.
Interpreted model coefficients to identify churn drivers.
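For readers without a MADlib installation, the same model can be approximated outside the database; a minimal Python sketch with the same three predictors follows (the column names match the public Kaggle dataset but should be treated as assumptions).

```python
# Logistic regression on churn using tenure, contract type, and monthly
# charges - a standalone approximation of the in-database MADlib model.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

df = pd.read_csv("telco_churn.csv")    # hypothetical file name
X = pd.get_dummies(df[["tenure", "Contract", "MonthlyCharges"]],
                   columns=["Contract"], drop_first=True)
y = (df["Churn"] == "Yes").astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```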
Key Insights
Short-tenure customers on flexible contracts exhibit higher churn risk.
Higher monthly charges are associated with increased likelihood of churn.
Long-term contracts significantly reduce churn probability.
Business Value This approach enables scalable, database-native churn prediction, allowing analysts to update models and generate predictions directly on live customer data.
View my project file
Forecasting Monthly Tourist Arrivals Using SARIMA
Project Overview This project forecasts monthly tourist arrivals for SafariVista Lodge using time series methods to support operational planning. The analysis compares Holt-Winters and SARIMA models to determine the most suitable forecasting approach for strongly seasonal tourism data.
Dataset
Monthly tourist arrivals from January 2001 to December 2020
Variables: Date, Number of Visits
Methodology
Parsed and indexed time series data.
Performed seasonal decomposition to identify trend and recurring seasonal patterns.
Implemented Holt-Winters (Triple Exponential Smoothing) as a baseline model.
Tested for stationarity and fitted a Seasonal ARIMA (SARIMA) model.
Compared models using forecast accuracy metrics and residual diagnostics (a comparison sketch follows this list).
Generated 12-month forecasts for operational planning.
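A compact sketch of the comparison with statsmodels is below; the file and column names are assumptions, and the SARIMA order shown is a stand-in for whatever specification the project selected.

```python
# Holt-Winters baseline vs. SARIMA on monthly tourist arrivals,
# holding out the final 12 months to compare forecast accuracy.
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.statespace.sarimax import SARIMAX

visits = pd.read_csv("arrivals.csv", index_col="Date",
                     parse_dates=True)["Visits"]
visits = visits.asfreq("MS")           # explicit monthly frequency
train, test = visits[:-12], visits[-12:]

hw = ExponentialSmoothing(train, trend="add", seasonal="mul",
                          seasonal_periods=12).fit()
sarima = SARIMAX(train, order=(1, 1, 1),
                 seasonal_order=(1, 1, 1, 12)).fit(disp=False)

for name, fc in [("Holt-Winters", hw.forecast(12)),
                 ("SARIMA", sarima.forecast(12))]:
    mape = ((fc - test).abs() / test).mean() * 100
    print(f"{name}: MAPE = {mape:.2f}%")
```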
Key Findings
The data exhibits strong annual seasonality and a non-stationary trend.
Holt-Winters captures seasonality but underperforms during structural changes.
SARIMA provides superior forecast accuracy and better residual diagnostics, making it more suitable for this dataset.
Business Value The SARIMA forecasts enable improved staffing schedules, supply ordering, and activity planning by anticipating seasonal fluctuations in tourist demand.
Stock Market Volatility Forecasting Using ARCH and GARCH
Project Overview This project models and forecasts stock return volatility for EduNext Technologies using ARCH and GARCH models. The objective is to quantify market risk and support risk-adjusted investment decisions under market uncertainty.
Dataset
Daily stock prices including open, close, high, low, and trading volume
Log returns computed from closing prices
Methodology
Tested stock prices for random walk behaviour and detrended the series where required.
Transformed prices into log returns to ensure stationarity.
Tested for ARCH effects to detect time-varying volatility.
Fitted ARCH(1) and ARCH(2) models to capture short-term volatility dependence.
Implemented GARCH(1,1) to model persistent volatility clustering (a minimal sketch follows this list).
Evaluated models using Ljung–Box and Engle ARCH diagnostics.
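A minimal sketch with the `arch` package, assuming daily closing prices; the file and column names are assumptions.

```python
# GARCH(1,1) on daily log returns, with an ARCH-effects pretest.
import numpy as np
import pandas as pd
from statsmodels.stats.diagnostic import het_arch
from arch import arch_model

prices = pd.read_csv("edunext.csv", index_col=0,  # hypothetical file
                     parse_dates=True)["Close"]
returns = 100 * np.log(prices / prices.shift(1)).dropna()

# Engle's ARCH-LM test: a low p-value signals time-varying volatility.
lm_stat, lm_pvalue, *_ = het_arch(returns)
print(f"ARCH-LM p-value: {lm_pvalue:.4f}")

garch = arch_model(returns, vol="GARCH", p=1, q=1).fit(disp="off")
print(garch.summary())

variance_forecast = garch.forecast(horizon=5)  # 5-day variance forecast
```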
Key Findings
Stock returns exhibit volatility clustering, a hallmark of financial time series.
ARCH models capture short-term variance dynamics but decay too quickly.
GARCH(1,1) outperforms ARCH models by capturing long-run volatility persistence.
Business Value Accurate volatility forecasts enable better risk management, pricing of financial instruments, and informed investment strategy decisions.
Course Recommendation System Using NLP and Machine Learning
Project Overview This project develops a text-based course recommender system that suggests suitable career paths based on a learner’s written narrative. The system leverages natural language processing and machine learning to align user interests with relevant course categories.
Dataset
Career and course descriptions
Labeled course categories
Technologies Used
Python
NLP (text preprocessing, TF-IDF)
Scikit-learn
Machine Learning classifiers
Cosine Similarity
Methodology
Cleaned and denoised raw text data using standard NLP preprocessing techniques.
Transformed textual data into numerical features using TF-IDF vectorization.
Addressed class imbalance using resampling techniques.
Trained multiple ML models using 5-fold cross-validation.
Evaluated models using accuracy, precision, recall, and F1-score.
Selected the best-performing model based on F1-score.
Implemented a cosine similarity-based recommender to suggest the top 10 relevant courses from user input (sketched below).
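The recommender step reduces to TF-IDF plus cosine similarity; a minimal sketch is below, with the course file and field names as assumptions.

```python
# Recommend the top 10 courses whose descriptions best match a
# learner's written narrative, via TF-IDF and cosine similarity.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

courses = pd.read_csv("courses.csv")  # hypothetical: 'title', 'description'
vec = TfidfVectorizer(stop_words="english")
course_matrix = vec.fit_transform(courses["description"])

def recommend(narrative: str, k: int = 10) -> pd.DataFrame:
    """Return the k courses closest to the user's narrative."""
    sims = cosine_similarity(vec.transform([narrative]), course_matrix)[0]
    top = sims.argsort()[::-1][:k]
    return courses.iloc[top].assign(similarity=sims[top])

print(recommend("I enjoy statistics and building predictive models"))
```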
Key Findings
Proper text preprocessing significantly improves classification performance.
Balancing datasets improves recall for underrepresented course categories.
Cosine similarity effectively matches learner narratives to relevant career paths.
Business Value This system enables scalable, data-driven career guidance, supporting learners in making informed education investment decisions.
News Topic Modelling Using Web Scraping and NLP
Project Overview This project analyses trending news topics by scraping articles from News24 and applying natural language processing and topic modelling techniques. The objective is to identify latent themes in current news coverage and understand topic distributions across articles.
Dataset
Ten scraped News24 articles
Fields: article ID, title, raw content, cleaned content, sentiment polarity
Technologies Used
Python
Web scraping (BeautifulSoup / Requests)
NLP (text preprocessing, sentiment analysis)
TF & TF-IDF
Latent Dirichlet Allocation (LDA)
pyLDAvis
Methodology
Scraped news article URLs and built a structured dataset.
Cleaned and denoised article text and computed sentiment polarity.
Performed term frequency analysis and generated word clouds.
Built TF-IDF matrices to identify salient terms.
Applied LDA topic modelling to extract four dominant topics (a minimal sketch follows this list).
Visualised topic distributions and relationships using pyLDAvis.
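The topic-modelling step, sketched with scikit-learn; the article file and column name are assumptions, and the pyLDAvis visualisation is omitted for brevity.

```python
# LDA over the scraped articles: four topics, top words per topic.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

articles = pd.read_csv("news24_articles.csv")["cleaned_content"]
vec = CountVectorizer(stop_words="english", max_df=0.9)
dtm = vec.fit_transform(articles)

lda = LatentDirichletAllocation(n_components=4, random_state=42).fit(dtm)

terms = vec.get_feature_names_out()
for i, comp in enumerate(lda.components_):
    top = comp.argsort()[::-1][:8]  # eight most heavily weighted terms
    print(f"Topic {i}: {', '.join(terms[j] for j in top)}")
```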
Key Insights
Topic modelling reveals dominant themes such as politics, economy, crime, and social issues.
TF-IDF highlights distinctive vocabulary separating topics.
pyLDAvis improves interpretability of topic overlap and salience.
Business Value This analysis enables media organisations to track trending topics, guide editorial focus, and support content strategy using automated text analytics.