This repository contains personal projects I completed during my CS Master's degree and in my free time.
This directory contains my entry for the Kaggle Store Sales - Time Series Forecasting competition, which involves predicting store sales using Corporación Favorita's Ecuadorian grocery retail data. The project centers on refining time series forecasting skills with TensorFlow's deep neural network capabilities. After extensive Exploratory Data Analysis (EDA) and careful data preparation, I established a baseline XGBoost model for feature importance and performance benchmarking. I then moved to univariate time series forecasting, adapting the data for deep learning models incorporating LSTM and CNN layers within TensorFlow. The highlight was a hybrid DNN model whose LSTM-CNN architecture outperformed the XGBoost baseline, achieving 0.087 RMSLE and 0.86 R-squared on the global validation set and 0.9 RMSLE on the competition's leaderboard. These results underscore the efficacy of the LSTM-CNN hybrid for time series forecasting. This notebook is publicly accessible on Kaggle and welcomes comments and feedback from the community.
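For illustration, a minimal sketch of a CNN-LSTM hybrid of the kind described above (not the competition model; the window length, layer sizes, and feature count are placeholders):

```python
import tensorflow as tf

WINDOW = 30      # placeholder look-back window (days)
N_FEATURES = 1   # univariate series

# The Conv1D layer extracts local patterns from the window, the LSTM models
# longer-range dependencies, and a dense head produces the one-step forecast.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, N_FEATURES)),
    tf.keras.layers.Conv1D(filters=32, kernel_size=3, padding="causal", activation="relu"),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```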
This directory encompasses my entry for the Kaggle Natural Language Processing with Disaster Tweets competition. The challenge revolved around constructing a machine learning model adept at distinguishing genuine disaster-related tweets from others. My primary focus throughout this competition was to enhance my proficiency in leveraging TensorFlow for training Deep Neural Networks specifically tailored for natural language processing in text classification. I accomplished a commendable accuracy surpassing 80% with my model. This notebook chronicles my meticulous dataset preprocessing, specifically tailored to suit Deep Neural Network (DNN) training. Initial experimentation involved employing Long Short-Term Memory (LSTM) and training the embedding layer within the network. However, due to the limited size of the training data, this approach led to overfitting. To address this issue, I integrated a 200-dimensional Twitter variant of Stanford’s GloVe embeddings. I explored diverse architectures and ultimately opted for a straightforward LSTM-CNN hybrid model. To further enhance the model’s performance, I fine-tuned the hyperparameters of the LSTM-CNN hybrid using Bayesian Optimization with Keras Tuner. I’m delighted to share that this notebook is openly accessible on Kaggle, and I eagerly invite comments and feedback from the community.
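A sketch of how pre-trained GloVe vectors can be loaded into a frozen Keras Embedding layer (the file name, tokenizer, and vocabulary handling are illustrative, not the notebook's exact code):

```python
import numpy as np
import tensorflow as tf

EMBED_DIM = 200  # 200-d Twitter GloVe variant

def load_glove(path, word_index, embed_dim=EMBED_DIM):
    """Build an embedding matrix aligned with a Keras tokenizer's word_index."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            if len(values) != embed_dim:
                continue  # skip malformed lines
            vectors[word] = np.asarray(values, dtype="float32")
    matrix = np.zeros((len(word_index) + 1, embed_dim))
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix

# embedding_matrix = load_glove("glove.twitter.27B.200d.txt", tokenizer.word_index)
# embedding_layer = tf.keras.layers.Embedding(
#     input_dim=embedding_matrix.shape[0], output_dim=EMBED_DIM,
#     weights=[embedding_matrix], trainable=False)  # frozen to limit overfitting
```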
This folder contains my submission for the Kaggle Digit Recognizer competition, which involved accurately identifying digits from a dataset of tens of thousands of handwritten images. My primary objective was to refine my skills in using TensorFlow to train deep Convolutional Neural Networks (CNNs) for image classification. My model achieved over 99% accuracy, placing in the top 30% of participants. The network consisted of four convolutional layers and two fully connected layers, including the output layer. I one-hot encoded the labels, used categorical cross-entropy as the loss function, and predicted the class with the highest probability. After empirical testing, I chose the Adam optimizer over RMSProp due to its better performance. To improve robustness, I added dropout and early stopping to mitigate overfitting, and a learning rate decay on plateaus for better convergence during training. This notebook is publicly accessible on Kaggle and welcomes comments and feedback from the community.
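The regularisation and scheduling choices above map onto standard Keras callbacks; a hedged sketch (filter counts, dropout rate, and patience values are placeholders, not the submitted model):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.4),                     # dropout against overfitting
    tf.keras.layers.Dense(10, activation="softmax"),  # one-hot labels -> softmax output
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=3),  # LR decay on plateau
]
# model.fit(x_train, y_train, validation_split=0.1, epochs=50, callbacks=callbacks)
```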
This folder contains my submission for the Kaggle Forecasting Mini-Course Sales competition, which involved forecasting sales on synthetically generated time series datasets. Unfortunately, this post revealed that the dataset had data-construction errors in the test set used to calculate the leaderboard scores: all countries had different base levels for the target feature in the train set but the same base level in the test set. There was no way to predict this other than probing the public leaderboard, and no way to adjust for it other than shifting predictions up or down to match the flawed test data. I did not shift my predictions, because that would no longer be predictive modelling, but some contestants did so to score high on the leaderboard. As a result, while the leaderboard scores may be unreliable for assessing the developed models, I enjoyed practising time-feature engineering and experimenting with Prophet and CatBoost for forecasting with time series data. This notebook is publicly available and open for comments on Kaggle.
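For reference, a minimal Prophet fit-and-forecast loop of the kind practised here (the data frame, seasonality settings, and horizon are placeholders; 'ds' and 'y' column names follow Prophet's convention):

```python
import pandas as pd
from prophet import Prophet

# Prophet expects a data frame with a 'ds' (date) and 'y' (target) column.
train = pd.DataFrame({
    "ds": pd.date_range("2020-01-01", periods=730, freq="D"),
    "y": range(730),  # placeholder target series
})

m = Prophet(yearly_seasonality=True, weekly_seasonality=True)
m.fit(train)
future = m.make_future_dataframe(periods=90)  # forecast 90 days ahead
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```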
This folder contains my submission for the Kaggle ConnectX Reinforcement Learning competition. The ConnectX competition involves training an AI agent to play Connect4; contestants submit their agent as a Python file for evaluation on a leaderboard that is updated dynamically to include scores from all submissions. In the ConnectX_LeaderScore_Apprx300.ipynb notebook, I developed an AI agent that currently holds a position of approximately 300 on the leaderboard. The agent was trained with the Stable-Baselines3 PPO algorithm for 1 million timesteps in the Gymnasium environment, using linear learning-rate decay. I wrote a function that predicts the agent's actions dynamically by passing the trained action network parameters (transforms, weights, biases) to an equivalent deep neural network built in PyTorch, which returns the predictions. This notebook is publicly available and open for comments on Kaggle.
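A rough sketch of the weight-transfer idea, assuming the trained policy's linear layer weights and biases have been exported as numpy arrays; the layer sizes, activation choice, and function names are hypothetical, not the notebook's exact code:

```python
import numpy as np
import torch
import torch.nn as nn

def build_policy_net(layer_sizes, weights, biases):
    """Rebuild a feed-forward action network from exported parameters.

    layer_sizes: e.g. [42, 64, 64, 7] (flattened board -> hidden -> hidden -> columns)
    weights/biases: lists of numpy arrays in (out_features, in_features) layout,
    copied from the trained Stable-Baselines3 policy.
    """
    layers = []
    for i in range(len(layer_sizes) - 1):
        linear = nn.Linear(layer_sizes[i], layer_sizes[i + 1])
        linear.weight.data = torch.tensor(weights[i], dtype=torch.float32)
        linear.bias.data = torch.tensor(biases[i], dtype=torch.float32)
        layers.append(linear)
        if i < len(layer_sizes) - 2:
            layers.append(nn.Tanh())  # placeholder hidden activation
    return nn.Sequential(*layers)

def act(obs, net):
    """Return the greedy action (column index) for a flattened board observation."""
    with torch.no_grad():
        logits = net(torch.tensor(np.asarray(obs, dtype=np.float32)))
    return int(torch.argmax(logits).item())
```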
This folder contains notebooks I created for exploratory data analysis and visualisation of interesting datasets obtained from Kaggle. AI,ML,DS,BigData_Jobs_Analysis.ipynb: I analyzed the Scraped Data on AI, ML, DS & Big Data Jobs dataset to gather useful insights on job prospects in AI, ML, DS, and Big Data. This notebook is publicly available and open for comments on Kaggle.
This folder contains notebooks I submitted to the Kaggle Titanic Machine Learning competition, which is about developing a model that makes binary classifications from tabular data on the Titanic's passengers. Kaggle_Titanic_Ensembles(Top_6%).ipynb was my first notebook; I achieved a ranking within the top 6% with a 0.79425 leaderboard score by using an ensemble of SVC, RFC, XGBoost, and CatBoost classifiers. Kaggle_Titanic_Final_RFC_SVC_XGB.ipynb is my second notebook; I used RFC, XGB, and SVC classifiers, and my SVC managed a 0.79186 leaderboard score. In the second notebook I performed extensive data exploration with visualisation and feature engineering, used a combination of statistical and wrapper methods for feature evaluation and selection, and utilised GridSearchCV with exhaustive feature selection to finalise the feature lists and determine starting parameters for the models. This notebook is publicly available and open for comments on Kaggle.
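As one possible way to combine the four model families named above (the notebooks' exact ensembling strategy and tuned hyperparameters may differ), a soft-voting sketch:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Soft voting averages predicted probabilities; hyperparameters are placeholders.
ensemble = VotingClassifier(
    estimators=[
        ("svc", make_pipeline(StandardScaler(), SVC(C=1.0, probability=True))),
        ("rfc", RandomForestClassifier(n_estimators=300, random_state=0)),
        ("xgb", XGBClassifier(n_estimators=300, learning_rate=0.05)),
        ("cat", CatBoostClassifier(iterations=300, verbose=0)),
    ],
    voting="soft",
)
# ensemble.fit(X_train, y_train)
# predictions = ensemble.predict(X_test)
```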
I completed this project as part of an assignment in one of my modules during my Masters. The project involved training an AI agent to play the Flappy Bird game using deep reinforcement learning. I used Stable Baselines 3 to train neural networks that control the bird in the Flappy Bird environment, applying the actor-critic Proximal Policy Optimisation (PPO) algorithm to train both a vector agent and an image-stack agent. For the vector agent, I used the FlappyBird-v0 environment and the MlpPolicy, with the other PPO hyperparameters provided to us in the assignment. For the image-stack agent, I used the FlappyBird-rgb-v0 environment, resized the images to 64×128, converted them to grey-scale, and stacked four frames together, using the CnnPolicy and again the other PPO hyperparameters provided in the assignment. The vector agent model had 10,179 trainable weights and biases, while the image-stack agent model had 4,955,619. Training therefore took a very long time, and even after 500,000 timesteps the models were undertrained; I did not continue training due to computational limitations. The best performance, evaluated over 30 episodes after training for 500,000 timesteps, was a mean reward of 11.83 for the image-stack agent.
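A minimal sketch of the image-stack setup, assuming the flappy_bird_gymnasium package registers the environment and using gymnasium 0.29-style wrapper names (hyperparameters and versions are assumptions, not the assignment's exact configuration):

```python
import gymnasium as gym
import flappy_bird_gymnasium  # assumed dependency; registers FlappyBird-rgb-v0
from gymnasium.wrappers import GrayScaleObservation, ResizeObservation
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack

def make_env():
    env = gym.make("FlappyBird-rgb-v0")
    env = ResizeObservation(env, (64, 128))         # downscale frames to 64x128
    env = GrayScaleObservation(env, keep_dim=True)  # single grey channel
    return env

venv = VecFrameStack(DummyVecEnv([make_env]), n_stack=4)  # stack four frames
model = PPO("CnnPolicy", venv, verbose=1)
# model.learn(total_timesteps=500_000)
```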
I created an interactive dashboard using Vega-Lite for visual storytelling as part of an assignment for my CS master's degree. The aim was to let users explore the relationship between the countries that deploy nuclear devices, the purposes of the detonations, the methods of deployment, the explosion yields, and the locations of the explosions. To use the tool, download the HTML file in this folder and open it in a browser. The dataset used for this visualisation tool was adapted from the 2019 Tidy Tuesday Collection dataset. A recording of my visual storytelling with this dashboard is available on YouTube.
I completed this project as part of an assignment in one of my modules during my Masters. The project involved building a deep convolutional neural network to recognise monkey species from disparate images. The monkey species classification task from Kaggle involved identifying ten species of monkeys across varied image types, and my task was to build an accurate classification model for this scenario using different CNN models. I first trained a LeNet-style CNN, then replaced the convolutional layers with the VGG-16 model (pre-trained on the ImageNet dataset), which improved the weighted average accuracy score to 0.72.
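A hedged transfer-learning sketch of the VGG-16 variant (input size, head layers, and dropout rate are placeholders, not the assignment model):

```python
import tensorflow as tf

# VGG-16 convolutional base pre-trained on ImageNet, frozen, with a small
# dense head for the 10 monkey species.
base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=(224, 224, 3))
base.trainable = False

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```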
I completed this project as part of an assignment in one of my modules during my Masters. The project involved creating a gradient-boosting regressor for continuous prediction problems by inheriting from the appropriate scikit-learn base classes. The implementation allowed subsampling of the training rows and included a learning rate hyperparameter. I tested my self-built algorithm against scikit-learn's AdaBoost regressor on the bag-of-words and sentence-embedding representations of the commonlitreadabilityprize dataset. After tuning the hyperparameters for both models, my self-built gradient boosting regressor showed slightly better prediction performance than scikit-learn's AdaBoost algorithm.
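A minimal sketch of the general pattern, not my assignment code: a squared-error gradient booster with row subsampling and a learning rate, inheriting from BaseEstimator and RegressorMixin:

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.tree import DecisionTreeRegressor

class SimpleGradientBoostingRegressor(BaseEstimator, RegressorMixin):
    """Squared-error gradient boosting with row subsampling and a learning rate."""

    def __init__(self, n_estimators=100, learning_rate=0.1, subsample=1.0,
                 max_depth=3, random_state=None):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.subsample = subsample
        self.max_depth = max_depth
        self.random_state = random_state

    def fit(self, X, y):
        rng = np.random.default_rng(self.random_state)
        X, y = np.asarray(X), np.asarray(y, dtype=float)
        self.init_ = y.mean()                  # constant initial prediction
        self.estimators_ = []
        pred = np.full(y.shape, self.init_)
        n = len(y)
        for _ in range(self.n_estimators):
            residuals = y - pred               # negative gradient of squared error
            idx = rng.choice(n, size=max(1, int(self.subsample * n)), replace=False)
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X[idx], residuals[idx])   # fit the next tree on a row subsample
            self.estimators_.append(tree)
            pred += self.learning_rate * tree.predict(X)
        return self

    def predict(self, X):
        X = np.asarray(X)
        pred = np.full(X.shape[0], self.init_)
        for tree in self.estimators_:
            pred += self.learning_rate * tree.predict(X)
        return pred
```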
I created this notebook to gain insight into the common traits employers currently look for in data analytics roles by examining a set of job ads posted on LinkedIn.
I implemented a multi-layer neural network from scratch in Python for one of my modules during my Masters. The implemented network was a shallow multi-layer neural network with one hidden layer, and it could take any number of inputs, outputs, and hidden units. Learning was implemented via backpropagation using batch gradient descent. The network was designed to be flexible: it can handle regression and classification tasks and use different activation functions for the hidden and output units depending on the user's choice. For classification, the network used a leaky rectified linear unit activation for the hidden layer; for the output layer, it used a sigmoid activation for binary classification and a softmax activation for multiclass classification, with a log loss cost function to monitor training performance. For regression, the output was not activated, and the user could choose the hidden layer activation from a sigmoid function, hyperbolic tangent function, leaky rectified linear unit function, or binary step function; the network used the squared error cost function to monitor training performance. The implemented network was tested on two classification tasks, predicting the XOR function and letter recognition, and one regression task, approximating a sine function. Test results are available in separate notebooks. Source of the dataset for letter recognition
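To make the forward and backward passes concrete, here is a compact sketch of a one-hidden-layer network for binary classification with leaky-ReLU hidden units, a sigmoid output, log loss, and full-batch gradient descent (a simplified illustration, not my assignment implementation):

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary_mlp(X, y, n_hidden=8, lr=0.5, epochs=5000, seed=0):
    """X: (n, d) inputs, y: (n, 1) labels in {0, 1}. Returns learned parameters."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.5, size=(d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.5, size=(n_hidden, 1)); b2 = np.zeros(1)
    for _ in range(epochs):
        # forward pass
        z1 = X @ W1 + b1; a1 = leaky_relu(z1)
        z2 = a1 @ W2 + b2; a2 = sigmoid(z2)
        # backward pass: gradients of the mean log loss
        dz2 = (a2 - y) / n
        dW2 = a1.T @ dz2; db2 = dz2.sum(axis=0)
        dz1 = (dz2 @ W2.T) * leaky_relu_grad(z1)
        dW1 = X.T @ dz1; db1 = dz1.sum(axis=0)
        # batch gradient descent update
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2

# Example: fit the XOR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1, W2, b2 = train_binary_mlp(X, y, n_hidden=4)
print(sigmoid(leaky_relu(X @ W1 + b1) @ W2 + b2).round(2))
```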
This is a combination of two projects I completed for one of my modules during my Masters to gain practical experience with the big data management tools Hadoop and Spark. For the Hadoop tasks, I loaded three books into HDFS and wrote map-reduce jobs in Python to analyse the books. For the PySpark tasks, I analysed a Wordle game dataset extracted from Twitter using Spark's multiprocessing framework with RDDs and DataFrames.
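For flavour, the general RDD pattern used in this style of analysis (a word-count sketch with a placeholder HDFS path, not the assignment code):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordle-tweets").getOrCreate()
sc = spark.sparkContext

# RDD-style word count over a text file in HDFS (path is a placeholder).
counts = (
    sc.textFile("hdfs:///data/wordle_tweets.txt")
      .flatMap(lambda line: line.lower().split())
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
      .sortBy(lambda kv: kv[1], ascending=False)
)
print(counts.take(10))
```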
I built time-related feature engineering pipelines with Python to predict traffic volumes. I completed this project as part of an assignment in one of my modules during my Masters. I performed time-related feature engineering by encoding time features with cyclic_spline_transformer and used a wrapper strategy with sequential forward selection for feature selection. I built a predictive model using Stochastic Gradient Descent regression with a polynomial transformation and achieved 85% accuracy in predicting traffic volumes.
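A sketch of periodic spline encoding for cyclic time features using scikit-learn's SplineTransformer (column indices, knot counts, and the omission of the polynomial interaction step are assumptions, not the assignment pipeline):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer, StandardScaler

def periodic_spline(period, n_splines):
    """Periodic spline encoder for a cyclic time feature such as hour-of-day."""
    return SplineTransformer(
        degree=3,
        n_knots=n_splines + 1,
        knots=np.linspace(0, period, n_splines + 1).reshape(-1, 1),
        extrapolation="periodic",
        include_bias=True,
    )

# Assumed column order: [hour_of_day, day_of_week]; other columns pass through.
time_encoder = ColumnTransformer(
    transformers=[
        ("hour", periodic_spline(24, 12), [0]),
        ("weekday", periodic_spline(7, 3), [1]),
    ],
    remainder="passthrough",
)
model = make_pipeline(time_encoder, StandardScaler(), SGDRegressor(max_iter=2000))
# model.fit(X_train, y_train)
```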
I completed these tasks as part of an assignment in one of my modules during my Masters. This folder contains two notebooks. In Data_Warehousing_practice.ipynb, I created a database and wrote a set of OLAP queries to explore the data. I then defined a fact constellation schema diagram for the data warehouse with facts, dimensions, and measures, addressed the steps for specific OLAP operations needed to answer a particular query, and used PostgreSQL with Python and its libraries to define a set of functions to operate the data warehouse. In Association_Rule_Mining_Practice.ipynb, I cleaned and transformed an online retail dataset for association rule mining, then mined association rules using the Apriori and FP-growth algorithms.
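A small Apriori sketch on toy basket data (the mlxtend library and the support/confidence thresholds are assumptions; the notebook's data and parameters differ):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Toy baskets standing in for the online retail transactions.
baskets = [["bread", "milk"], ["bread", "butter", "milk"], ["butter", "jam"],
           ["bread", "butter"], ["bread", "milk", "jam"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(baskets).transform(baskets), columns=te.columns_)

frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```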
I implemented a small database management system in Bash for one of my modules during my Masters. Data exchange between client and server for data entry and queries was provided through pipes. The system supported concurrent execution and protected against synchronisation problems with the effective use of semaphores. It could create a database, create a table, and insert rows into tables in existing databases, and it allowed querying tables in a target database, either as a whole or by column indexes, to display content.
This folder contains small scripts I write in my free time to solve various coding challenges on Codewars.