ML Process Data Quality


Hello, everyone! Welcome back to my Medium blog. Today, I’m excited to delve into the intricate world of machine learning, taking you through the comprehensive journey of developing a machine learning model, from inception to its practical application for users. In this project, I’ll be building upon my previous work, so if you haven’t had a chance to read about it yet, I highly recommend checking it out here. In this new venture, our primary goal is to classify data quality, an essential aspect of data analysis and machine learning.

Project Background

Data quality refers to the accuracy, consistency, completeness, reliability, and relevance of data within a dataset. Inaccurate, incomplete, or inconsistent data can lead to flawed analysis and misguided decision-making. Poor data quality can arise from various sources such as people, systems, and processes (data entry errors, system glitches, gaps in standard operating procedures, and more). Ensuring data quality is crucial for obtaining meaningful insights from data analysis.

Data quality issues include:

  • Inaccurate Data
  • Missing Data
  • Inconsistent Data
  • Duplicate Data
  • Outliers
  • Invalid Data

The goal of this project is to develop a machine learning-based solution that automatically identifies and rectifies data quality issues within a company’s dataset, ensuring that the data adheres to predefined rules and governance standards. By addressing data quality concerns, the project aims to enhance the accuracy and reliability of insights derived from data analysis, ultimately leading to more informed and effective decision-making processes.

Problem Statement: In the context of this project, we are presented with a dataset containing various attributes related to individuals, such as gender, industry, age, education, marital status, and more. The dataset also includes an “Error” label indicating whether there are errors associated with the data for each individual. The goal of this project is to develop a predictive model that can accurately classify whether an individual’s data contains errors or not. The challenge lies in dealing with missing values and imbalanced data, and in selecting appropriate features to make accurate predictions.
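The post doesn’t show the preprocessing code at this stage, but as a hedged sketch of how these two challenges are commonly tackled with scikit-learn, something along these lines could work. The file name, imputation strategies, and class weighting are my illustrative assumptions, not the author’s exact pipeline; the column names come from the data description below.

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier

# Placeholder path; the post does not state the actual file name.
df = pd.read_csv("data_quality.csv")

# Fill missing categorical values with the most frequent category.
cat_cols = ["GENDER", "Bidang_Usaha", "Pekerjaan",
            "EDUCATION_CODE", "MARITAL_STATUS", "CREATION_SOURCE"]
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])

# Fill missing numeric values with the median.
df[["Age", "Month"]] = SimpleImputer(strategy="median").fit_transform(df[["Age", "Month"]])

# class_weight="balanced" re-weights the loss to counter the imbalanced "Error" label.
model = DecisionTreeClassifier(class_weight="balanced", random_state=42)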

Objective:

The main objective of this project is to build a machine-learning model that can predict whether an individual’s data contains errors or not.

Data Description

Range Index: The dataset has a total of 363,869 rows or entries, ranging from index 0 to 363,868.

Data columns (total 9 columns): The dataset consists of 9 columns in total.

  • GENDER: This column represents the gender of individuals.
  • Bidang_Usaha: This column corresponds to the industry or sector of individuals’ business or occupation.
  • Age: This column contains the age of individuals.
  • Month: This column represents a numeric value, possibly indicating a particular month.
  • Pekerjaan: This column corresponds to individuals’ occupations.
  • EDUCATION_CODE: This column likely represents an education code or level.
  • MARITAL_STATUS: This column represents the marital status of individuals.
  • CREATION_SOURCE: This column may indicate the source of creation or registration of the data.
  • Error: This column contains a numeric value indicating whether the data for each individual contains errors; it serves as the target label for this project.
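To make the description above concrete, here is a minimal sketch of loading and inspecting the dataset with pandas. The file name is a placeholder, since the post does not state the actual path.

import pandas as pd

df = pd.read_csv("data_quality.csv")  # placeholder file name

# Should report 363,869 entries (RangeIndex 0 to 363868) and the 9 columns above.
df.info()

# A quick look at the target distribution hints at the class imbalance
# mentioned in the problem statement.
print(df["Error"].value_counts(normalize=True))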

After completing the modeling phase in our last project, the next step on our journey is to transform our machine learning model into a user-friendly interface. In this endeavor, we have employed the decision tree algorithm. Are you familiar with decision trees?

image from Chirag Sehra on Medium

The Root Node: Where It All Begins

At the top of our decision tree, we have the root node, “Age < 30.” This is like the first question your friend asks when you’re deciding, “Is this person fit?”

Branching Out: Making Choices

Now, picture branches sprouting from the root node. Each branch represents a possible decision based on the answer to that first question. If the age is under 30, you go down one branch; if it’s not, you follow a different one. It’s like your friend saying, “If yes, let’s ask the next question: does the person eat a lot of pizza? If not, we can already call them fit.”

Leaf Nodes: The Final Call

Eventually, you’ll reach the “Leaf Nodes.” These are like your final decisions. Once all the yes-or-no questions have been answered, the leaf you land on gives the final call: fit or not.

But here’s a little secret: decision trees can sometimes get carried away and overthink things, just like we can overcomplicate simple decisions. That’s why we have techniques to keep them in check, like pruning (trimming unnecessary branches) and ensemble methods such as Random Forests and Gradient Boosted Trees.
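The training code itself isn’t shown in the post; here is a minimal sketch of how such a pruned decision tree could be trained with scikit-learn. The one-hot encoding, the split, and the max_depth value are my illustrative choices, not necessarily the author’s.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("data_quality.csv")  # placeholder file name

# One-hot encode the categorical columns so the tree can consume them.
X = pd.get_dummies(df.drop(columns=["Error"]))
y = df["Error"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# max_depth acts as a simple form of pruning, stopping the tree from overgrowing.
tree = DecisionTreeClassifier(max_depth=8, random_state=42)
tree.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, tree.predict(X_test)))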

Our model achieves a 99% accuracy score. Now, the next step on our path is to save it as a .pkl (Python pickle) file. Additionally, we’ll explore how to reload it so we can harness its predictive prowess on entirely new data.

image by Author
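The screenshot isn’t reproduced here, so as a stand-in, this is a minimal sketch of the usual pickle save-and-reload step; the file and variable names are my assumptions, not the author’s exact code.

import pickle

# 'tree' is the fitted classifier from the training sketch above.
with open("model.pkl", "wb") as f:
    pickle.dump(tree, f)

# Later, e.g. inside the Streamlit app, reload the model from disk.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# 'X_new' stands for new data prepared with the same encoding as training.
predictions = model.predict(X_new)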

After that, we define the input fields the model will use for prediction.

image by Author
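Again, the screenshot isn’t reproduced here; the sketch below shows what such input fields might look like in Streamlit. The widget choices, option lists, and file names are my assumptions about what main_app.py could contain.

# main_app.py -- illustrative sketch, not the author's exact code
import pickle
import pandas as pd
import streamlit as st

st.title("Data Quality Error Prediction")

# One input widget per feature from the data description.
gender = st.selectbox("GENDER", ["M", "F"])
bidang_usaha = st.text_input("Bidang_Usaha")
age = st.number_input("Age", min_value=0, max_value=120, value=30)
month = st.number_input("Month", min_value=1, max_value=12, value=1)
pekerjaan = st.text_input("Pekerjaan")
education = st.text_input("EDUCATION_CODE")
marital = st.selectbox("MARITAL_STATUS", ["Single", "Married", "Other"])
source = st.text_input("CREATION_SOURCE")

if st.button("Predict"):
    row = pd.DataFrame([{
        "GENDER": gender, "Bidang_Usaha": bidang_usaha, "Age": age,
        "Month": month, "Pekerjaan": pekerjaan, "EDUCATION_CODE": education,
        "MARITAL_STATUS": marital, "CREATION_SOURCE": source,
    }])
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)
    # In practice the row must be encoded exactly as during training.
    st.write("Prediction:", model.predict(row)[0])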

Once we’ve defined each parameter and function using Streamlit, we can run the code using the following command:

streamlit run main_app.py

And then, we can directly access the app on localhost, enabling us to make predictions on the data.

For example, when we perform data prediction using those values, the prediction results may reveal data quality issues based on user input. This is particularly noticeable in the ‘age’ field, where outliers are detected. The model succeeds in distinguishing good-quality data from data with quality issues.

In conclusion, our journey through this project underscores the utility of machine learning in identifying data quality issues. By following a systematic data preprocessing pipeline, conducting EDA, and leveraging appropriate modeling techniques, we were able to detect anomalies and data errors effectively.

Machine learning empowers us to not only make predictions but also serves as a valuable tool for data quality assessment and enhancement. Through its ability to recognize patterns and anomalies, it aids in ensuring that our data is reliable and suitable for building robust models. In essence, machine learning proves to be a versatile ally in the pursuit of high-quality data and more accurate predictions.

Future Works🚀

[1] Exploring Other Algorithms: While we focused on Decision Tree and Logistic Regression, there are numerous other algorithms available in the machine learning toolbox. Exploring models like Support Vector Machines, Neural Networks, or Gradient Boosting can provide a broader perspective on the problem.

[2] Ensemble Methods: Experimenting with ensemble methods, such as Random Forest or Gradient Boosting, could potentially further boost the predictive capabilities of our models. Ensemble methods combine multiple models to provide more robust and accurate predictions.
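As a pointer for that future work, swapping the single tree for a Random Forest in scikit-learn is a small change; the hyperparameters here are illustrative only.

from sklearn.ensemble import RandomForestClassifier

# Many trees trained on bootstrapped samples; their predictions are averaged,
# which usually reduces the overfitting a single deep tree is prone to.
forest = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)
# forest.fit(X_train, y_train)  # same split as in the earlier training sketch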

Github Repository Here


This project was meant to be uploaded and deployed on AWS. Unfortunately, due to software and hardware limitations that prevent running containers with Docker, the project concludes with the Streamlit implementation for prediction. I am actively working on resolving these issues; it is a work in progress. Thanks.

Thank you for reading. I hope you gained valuable insights from this blog. I’m a work in progress, striving to improve further.
