Fake Job prediction with Machine Learning

João Pedro dos Santos
Published in Dev Genius
5 min read · Feb 10, 2022

In Machine Learning, classification problems are probably the first ones people meet when they start studying the field. I am starting my journey as a Data Scientist, so for practice, today we are going to use Scikit-learn, the well-known Machine Learning library that provides dozens of built-in models. For our data, we are going to use the Fake Jobs dataset from Kaggle to show some of what I have learned so far. The idea is to predict whether a job posted on the internet is real or fake. We are going to cover the basics of data analysis, data cleaning, modeling, and hyperparameter tuning.

So let's get started. Our first step is to import some packages that will be useful while we work with the data, and to download the dataset from Kaggle.

# Import tools we need
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Import data
data = pd.read_csv("fake_job_postings.csv")
data.head()

As we can see, our data has some NaN values, and this is a problem because most Scikit-learn models do not accept missing values. Let's check how many NaN values each column has.

data.isna().sum()

We can solve this with a range of techniques, and one of them is to fill missing data with zeros.

data = data.fillna(0)
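To make the two calls above concrete, here is a minimal sketch on a tiny made-up DataFrame. The column names and values are hypothetical and are not taken from the Kaggle file. It counts the missing values per column and then fills them with zeros.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the job postings.
# The text column is kept as object dtype so that filling it with 0 is allowed.
toy = pd.DataFrame({
    "title": pd.Series(["Engineer", None, "Analyst"], dtype="object"),
    "years_required": [2.0, np.nan, np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Replace every missing value with 0
toy_filled = toy.fillna(0)
print(toy_filled)
```

Keep in mind that filling a text column with 0 mixes numbers and strings in the same column; a placeholder string such as "missing" is a common alternative.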

As we can see in the graphic below, there is a big discrepancy in the amount of data for each label: around 20 times more data for label 0 ("Real Job") than for label 1 ("Fake Job").
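The counts behind a comparison like this can be computed with value_counts. The sketch below uses a made-up label column with a 20-to-1 ratio; on the real data the same call would be applied to data["fraudulent"].

```python
import pandas as pd

# Hypothetical labels: 20 real postings (0) for every fake one (1)
labels = pd.Series([0] * 20 + [1] * 1, name="fraudulent")

# Number of rows per label
counts = labels.value_counts()
print(counts)

# How many real postings there are for each fake posting
ratio = counts.loc[0] / counts.loc[1]
print(f"Real postings per fake posting: {ratio:.1f}")
```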

There are a variety of techniques to solve this issue. One of them is called "undersampling": decreasing the number of samples from label 0 until it is the same size as label 1.

# Shuffle the Dataset
shuffled_df = data.sample(frac=1, random_state=4)

# Get only fraudulent values
fraudulent_df = shuffled_df.loc[shuffled_df["fraudulent"] == 1]

# Get only non-fraudulent values
non_fraudulent_df = shuffled_df.loc[shuffled_df["fraudulent"] == 0].sample(n=len(fraudulent_df), random_state=42)

# Concat the fraudulent and non-fraudulent values in one dataset
df = pd.concat([fraudulent_df, non_fraudulent_df])
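The same idea can be wrapped in a small reusable function. This is only a sketch: undersample is a hypothetical helper name, and the toy DataFrame below is made up to show that the classes end up the same size.

```python
import pandas as pd

def undersample(frame, label_col, seed=42):
    """Downsample every class to the size of the smallest class."""
    smallest = frame[label_col].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=seed)
        for _, group in frame.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in separate blocks
    return pd.concat(parts).sample(frac=1, random_state=seed)

# Hypothetical toy data: 10 real postings and 2 fake ones
toy = pd.DataFrame({
    "job_id": range(12),
    "fraudulent": [0] * 10 + [1] * 2,
})

balanced = undersample(toy, "fraudulent")
print(balanced["fraudulent"].value_counts())
```

The trade-off is that undersampling throws away most of the majority class, so the model sees far fewer real postings than the dataset contains.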

We have already solved two problems in our data, but there’s still one left. Some columns of our data are not numerical, and this will be a problem for training our machine learning model soon.

df.info()

We can see that non-numerical data have the type “object”.

Let's turn these columns into the categorical format, which assigns a numeric "code" to each distinct value in a column.

# Turn into categorical
for label, content in df.items():
    if pd.api.types.is_string_dtype(content):
        df[label] = content.astype("category").cat.as_ordered()

# Turn categorical variables into numbers
for label, content in df.items():
    if not pd.api.types.is_numeric_dtype(content):
        df[label] = pd.Categorical(content).codes + 1
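To see what these codes look like, here is a minimal sketch on a made-up column (the values are hypothetical). pandas sorts the distinct values and numbers them starting from 0, and the + 1 shifts them to start from 1.

```python
import pandas as pd

# Hypothetical column of employment types
employment = pd.Series(["Full-time", "Part-time", "Full-time", "Contract"])

categorical = pd.Categorical(employment)
# Distinct values, sorted alphabetically
print(list(categorical.categories))

# Shift the codes so that they start at 1 instead of 0
codes = categorical.codes + 1
print(list(codes))
```

Missing values receive the code -1, so adding 1 maps them to 0 and keeps every code non-negative.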

As we can see, our data is now in numerical format, and we can start modeling after splitting it into a train and a test set.

# Split data into features (X) and label (y), then into train and test set
X = df.drop("fraudulent", axis=1)
y = df["fraudulent"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
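A possible refinement, shown here as a sketch on synthetic data rather than the job postings, is to pass stratify and random_state to train_test_split. The first keeps the class proportions identical in both sets, and the second makes the split reproducible. The array names below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced data: 50 samples per class, 3 features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 50 + [1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(X_tr.shape, X_te.shape)
# Class counts in the test set
print(np.bincount(y_te))
```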

For the modeling part, let's start by checking which algorithms perform best on our data, and then choose the best one and tune it to get the best accuracy possible.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Put models in a dictionary
models = {"KNN": KNeighborsClassifier(),
          "LinearSVC": LinearSVC(),
          "NaiveBayes": GaussianNB(),
          "Logistic Regression": LogisticRegression(),
          "Random Forest": RandomForestClassifier()}

# Create function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    # Random seed for reproducible results
    np.random.seed(42)
    # Make a dictionary to keep model scores
    model_scores = {}
    # Loop through models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(X_train, y_train)
        # Evaluate the model and store its score in model_scores
        model_scores[name] = model.score(X_test, y_test)
    return model_scores

fit_and_score(models=models, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test)
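A single train/test split can make one model look better than another by chance. As a hedged alternative, the sketch below compares two of the models with 5-fold cross-validation. It runs on synthetic data from make_classification, not on the job postings, and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the job posting features
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

cv_scores = {}
for name, model in candidates.items():
    # Mean accuracy over 5 folds
    cv_scores[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

print(cv_scores)
```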

We got around 90% accuracy on our test data with RandomForestClassifier, which is pretty good. But we can still tune the model's hyperparameters to see if it performs better on our data. There are many ways to tune hyperparameters, but we are going to use RandomizedSearchCV, which lets us provide a range of values for each hyperparameter, randomly samples combinations from those ranges, and keeps the combination with the best cross-validated score.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Candidate values for each hyperparameter
n_estimators = [int(x) for x in np.linspace(start=100, stop=1000, num=10)]
# 'auto' behaved like 'sqrt' for classifiers and was removed in scikit-learn 1.3
max_features = ['sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

# Base model to tune (default settings assumed)
rf = RandomForestClassifier()
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)
rf_random = rf_random.fit(X_train, y_train)

# Best random model
preds_best_random = rf_random.predict(X_test)
print(f"Best model accuracy: {accuracy_score(y_test, preds_best_random)}")
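After the search finishes, it can be useful to look at which combination won. The sketch below is a scaled-down search on synthetic data (the grid, sizes, and variable names are assumptions chosen to keep it fast) that reads the best_params_ and best_score_ attributes of the fitted search.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Deliberately small grid so the example runs quickly
small_grid = {
    "n_estimators": [10, 20, 30],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=small_grid,
    n_iter=5,   # try 5 random combinations out of the 27 possible
    cv=3,       # score each combination with 3-fold cross-validation
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

Note that best_score_ is the mean cross-validated score on the training data, so it will usually differ a little from the accuracy measured on the held-out test set.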

We got almost 92% accuracy! This is good. Searching over more hyperparameters in RandomizedSearchCV could have made the result better or worse; it depends a lot on the problem.

I finished my first full machine learning implementation tutorial here on Medium, sounds good…

Resources that helped me a lot:

Complete Machine Learning & Data Science Bootcamp 2022

Hyperparameter Tuning the Random Forest in Python
