This repository contains a revised and modular version of my final project for IST 418: Big Data Analytics at Syracuse University. Originally implemented as a single notebook, the codebase has now been refactored into multiple notebooks for improved readability and modularity.
The project focuses on segmenting retail customers using demographic and behavioral data to enhance marketing strategies through clustering and recommendation systems. By leveraging PySpark, K-Means Clustering, FP-Growth, and a hybrid recommendation algorithm, this project transforms raw customer data into actionable insights for personalized marketing and inventory management.
- Code/
  - customer_segmentation.ipynb: final pipeline (clustering + recommendations)
  - EDA.ipynb: exploratory data analysis and visualizations
  - preprocessing.ipynb: data cleaning and feature engineering
  - artifacts/: saved Spark model artifacts
- Data/
  - marketing_campaign.csv: original dataset
  - preprocessed_segmenting_data_csv/: output CSVs from PySpark
- requirements.txt: packages needed to run the code
- Segment customers based on demographic and behavioral traits using K-Means clustering
- Visualize and interpret clusters using PCA
- Develop recommendation systems to:
  - Suggest products to customers based on personal and group behavior
  - Recommend deals tailored to cluster characteristics
  - Highlight high-demand items using FP-Growth
- Provide actionable marketing and inventory insights
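To make the "personal and group behavior" idea concrete, here is a minimal sketch of one way a hybrid score could be built. It is an assumption for illustration, not the notebook's actual algorithm: the function name `hybrid_recommend`, the `alpha` weight, and the linear blend of normalized spend are all hypothetical.

```python
from typing import Dict, List


def hybrid_recommend(
    personal: Dict[str, float],
    cluster: Dict[str, float],
    alpha: float = 0.5,
    top_n: int = 3,
) -> List[str]:
    """Rank product categories by a blend of personal and cluster-level spend.

    Assumption: both inputs map category name -> spend, and a simple linear
    blend is used. alpha=1.0 uses only the customer's own history;
    alpha=0.0 uses only the cluster profile.
    """

    def normalize(scores: Dict[str, float]) -> Dict[str, float]:
        # Convert raw spend into shares so both sources are on the same scale.
        total = sum(scores.values())
        return {k: (v / total if total else 0.0) for k, v in scores.items()}

    p, c = normalize(personal), normalize(cluster)
    blended = {
        cat: alpha * p.get(cat, 0.0) + (1 - alpha) * c.get(cat, 0.0)
        for cat in set(p) | set(c)
    }
    # Highest blended share first; ties are broken alphabetically.
    return sorted(blended, key=lambda cat: (-blended[cat], cat))[:top_n]
```

Normalizing to shares before blending keeps a high-spending customer from drowning out the cluster signal; other weighting schemes would be equally reasonable.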
The dataset was sourced from Kaggle, originally published by Dr. Omar Romero-Hernandez. It contains customer information from a Portuguese retail campaign, with 2,240 observations and 29 features, including:
- Demographic: Age, Marital Status, Education, Income
- Behavioral: Purchase history by product category, deal acceptance, customer loyalty
- Engagement: Days since last purchase, enrollment date
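As a rough illustration of how features like these might be derived from a raw row, the sketch below computes an age and a total spend. The column names (`Year_Birth`, `MntWines`, and so on) are assumptions about the CSV header and are not confirmed by this README, and `engineer_features` is a hypothetical helper, not a function from the notebooks.

```python
from typing import Dict

# Assumed column names; check them against the header of marketing_campaign.csv.
SPEND_COLUMNS = (
    "MntWines",
    "MntFruits",
    "MntMeatProducts",
    "MntFishProducts",
    "MntSweetProducts",
    "MntGoldProds",
)


def engineer_features(row: Dict[str, str], reference_year: int) -> Dict[str, float]:
    """Derive Age and TotalSpend from one raw CSV row (values arrive as strings)."""
    age = reference_year - int(row["Year_Birth"])
    total_spend = sum(float(row[col]) for col in SPEND_COLUMNS)
    return {"Age": float(age), "TotalSpend": total_spend}
```

Rows read with `csv.DictReader` can be passed in directly; the same two derived columns could also be expressed as PySpark column expressions in preprocessing.ipynb.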
This project was developed using Python 3.11.0.
- Clone the Repository
  Run the following commands in your terminal:

      git clone https://github.com/your-username/customer-segmentation.git
      cd customer-segmentation

- Set Up Your Environment
  Install required Python packages:

      pip install -r requirements.txt

- Configure Java and PySpark
  PySpark requires Java 17+ and Apache Spark.
  - Install Java 17 using Homebrew:

        brew install openjdk@17

  - Add the following lines to your shell config file (~/.zshrc or ~/.bash_profile):

        export JAVA_HOME="/opt/homebrew/opt/openjdk@17"
        export PATH="$JAVA_HOME/bin:$PATH"

  - Install Apache Spark:

        brew install apache-spark

  - Add Spark to your shell config file (~/.zshrc or ~/.bash_profile), adjusting the version number in the path to match the release Homebrew installed:

        export SPARK_HOME="/opt/homebrew/Cellar/apache-spark/4.0.0/libexec"
        export PATH="$SPARK_HOME/bin:$PATH"

  - Apply changes with whichever command matches your shell config file:

        source ~/.zshrc
        source ~/.bash_profile

  - Verify installation:

        spark-submit --version
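
Before opening the notebooks, it can help to confirm from Python that the environment variables set above are actually visible to the interpreter. The helper below is a small sketch; `missing_spark_env` is a hypothetical name, not part of this repository.

```python
import os
from typing import List, Mapping, Optional


def missing_spark_env(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or empty.

    Defaults to the current process environment; a mapping can be passed
    in instead, which makes the check easy to exercise without a real setup.
    """
    env = os.environ if env is None else env
    return [name for name in ("JAVA_HOME", "SPARK_HOME") if not env.get(name)]
```

Calling `missing_spark_env()` in a fresh terminal or notebook kernel should return an empty list once the shell config has been sourced.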
- preprocessing.ipynb: Cleans and transforms the raw data, performs feature engineering, and exports preprocessed data.
- EDA.ipynb: Generates visualizations and summary statistics to explore customer patterns.
- customer_segmentation.ipynb: Runs K-Means clustering, evaluates the clusters with silhouette scores, visualizes them with PCA, and implements three recommendation algorithms.
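
The clustering step in customer_segmentation.ipynb can be sketched roughly as follows. This is an illustration under assumptions rather than the notebook's exact code: the function names, the candidate `k_values`, the standardization step, and the column names `raw_features` and `features` are all hypothetical. PySpark is imported inside the function so the small selection helper can be used without a Spark installation.

```python
from typing import Dict, Iterable, Sequence


def pick_best_k(scores: Dict[int, float]) -> int:
    """Return the k with the highest silhouette score (ties go to the smaller k)."""
    return min(scores, key=lambda k: (-scores[k], k))


def silhouette_by_k(df, feature_cols: Sequence[str],
                    k_values: Iterable[int] = (2, 3, 4, 5),
                    seed: int = 42) -> Dict[int, float]:
    """Fit K-Means for each k on standardized features and score each fit."""
    # Imported lazily; requires a working PySpark and Java setup.
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.evaluation import ClusteringEvaluator
    from pyspark.ml.feature import StandardScaler, VectorAssembler

    # Pack the numeric columns into one vector column, then standardize it
    # so that large-valued columns such as income do not dominate distances.
    assembled = VectorAssembler(
        inputCols=list(feature_cols), outputCol="raw_features"
    ).transform(df)
    scaled = StandardScaler(
        inputCol="raw_features", outputCol="features", withMean=True, withStd=True
    ).fit(assembled).transform(assembled)

    evaluator = ClusteringEvaluator(
        featuresCol="features", predictionCol="prediction", metricName="silhouette"
    )
    scores = {}
    for k in k_values:
        model = KMeans(k=k, seed=seed, featuresCol="features").fit(scaled)
        scores[k] = evaluator.evaluate(model.transform(scaled))
    return scores
```

With a preprocessed DataFrame `df`, `pick_best_k(silhouette_by_k(df, ["Income", "Age"]))` would return the candidate k with the best silhouette; the two column names there are placeholders.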
- Customers were segmented into three groups based on behavior and demographics:
  - Balanced Mid-Spenders
  - Budget-Conscious Deal Seekers
  - Affluent Digital Loyalists
- Association Rule Mining highlighted that wine and meat are core product drivers.
- Cluster-based and hybrid recommendations allow for personalized marketing strategies.
- Tailored deal recommendations can enhance customer engagement and loyalty.
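
For readers curious how association rules like the wine and meat finding can be mined, here is a hedged sketch using Spark's FP-Growth. It assumes a basket is the set of categories a customer spent money on; that definition, the function names, and the support and confidence thresholds are illustrative assumptions, not values taken from the notebook.

```python
from typing import Dict, List, Sequence


def basket_from_spend(spend_by_category: Dict[str, float]) -> List[str]:
    """Treat every category with positive spend as an item in the basket."""
    return sorted(cat for cat, amount in spend_by_category.items() if amount > 0)


def mine_rules(spark, baskets: Sequence[List[str]],
               min_support: float = 0.2, min_confidence: float = 0.6):
    """Fit FP-Growth on the baskets; return (frequent itemsets, association rules)."""
    # Imported lazily; requires a working PySpark and Java setup.
    from pyspark.ml.fpm import FPGrowth

    # FP-Growth expects one row per basket with an array column of unique items.
    df = spark.createDataFrame(
        [(i, items) for i, items in enumerate(baskets)], ["customer_id", "items"]
    )
    model = FPGrowth(
        itemsCol="items", minSupport=min_support, minConfidence=min_confidence
    ).fit(df)
    return model.freqItemsets, model.associationRules
```

Both returned objects are Spark DataFrames, so calling `.show()` on them inside a notebook lists the itemsets and the antecedent and consequent of each rule.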
- pyspark
- matplotlib
- pandas
- seaborn
Tyler Katz
B.S. in Applied Data Analytics, Class of 2026, Syracuse University
This project is licensed under the MIT License. See the LICENSE file for details.