Transcription of: Fake Job Listing Detection Tutorial | Python Projects | Data Science Training | Edureka
There are a lot of job advertisements on the internet, even on reputed job advertising sites, which never seem fake, but after selection the so-called recruiters start asking for money and bank details. Many candidates fall into their trap and lose a lot of money, and sometimes their current job as well. So it is better to identify whether a job advertisement posted on a site is real or fake. Let's learn to develop a project on fake job listing detection.

Let's see the agenda for today's session. We will first start with understanding the project and the project motive. Next we will understand what natural language processing is and how to create a natural language processing pipeline. Moving further, we'll discuss algorithms that you can use for this project, along with the environment and tools needed for it. Then we will finally see how to execute this project. Kindly take a moment to subscribe to us and don't forget to hit the bell icon to never miss an update from the Edureka YouTube channel. Also, if you wish to learn more about machine learning, you can visit our course, the link to which is given in the description box below.

So let's start with understanding the project. Identifying fake jobs manually is very difficult and almost impossible, hence we can apply machine learning to train a model for fake job classification. It can be trained using the data of previous real and fake job advertisements so that it can identify a fake job accurately. In this project we'll first pre-process the raw data by tokenizing it and removing the stop words. This pre-processed data will later be used to extract features using techniques like term frequency and inverse document frequency, and to train the model we'll be using a classification algorithm.

Moving ahead, let's see the project motive. Nowadays there are a lot of job scams because of unemployment. There are a lot of websites that connect a recruiter to a suitable candidate, and sometimes fake recruiters post a job on a portal with the motive of getting money. This problem occurs with many job portals; people later shift to a new portal in search of a real job, but the fake recruiters join this portal as well. Hence, in today's world it is important to distinguish real jobs from fake ones.

For this project we'll be making use of natural language processing, so let's understand what it is. Natural language processing, or NLP, sits at the intersection of computer science, human language, and artificial intelligence. It is a technology that is used by machines to understand, analyze, manipulate, and interpret human languages. It helps developers organize knowledge for performing tasks such as translation, automatic summarization, named entity recognition, speech recognition, relationship extraction, and topic segmentation. Syntax and semantic analysis are the two main techniques used with natural language processing. Syntax is the arrangement of words in a sentence to make grammatical sense; NLP uses syntax to assess meaning from language based on grammatical rules, and syntax techniques include parsing, word segmentation, sentence breaking, and stemming. Semantics involves the meaning behind words; NLP applies algorithms to understand the meaning and structure of sentences, and techniques that NLP uses with semantics include word sense disambiguation, named entity recognition, and natural language generation.

Moving ahead, let's understand how to build an NLP pipeline. The first step is sentence segmentation. Sentence segmentation is the first step in building the NLP pipeline: it breaks a piece of text into separate sentences.
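(Not from the video itself: a minimal sketch of this step, assuming spaCy and its small English model en_core_web_sm are installed, might look like the following; the sample text is made up.)

```python
# Minimal sketch of sentence segmentation with spaCy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Many job portals host postings from fake recruiters. "
        "After selection, these recruiters ask candidates for money and bank details.")
doc = nlp(text)

# doc.sents yields one span per detected sentence
for sentence in doc.sents:
    print(sentence.text)
```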
The second step is word tokenization. A word tokenizer is used to break a sentence into separate words, or tokens. The third step is stemming. Stemming is used to normalize words into their base or root form; for example, celebrates, celebrated, and celebrating all originate from the single root word celebrate. The big problem with stemming is that it sometimes produces a root word which may not have any meaning. The fourth step is lemmatization. Lemmatization is quite similar to stemming; it is used to group the different inflected forms of a word into its lemma. The main difference between stemming and lemmatization is that lemmatization produces a root word which has a meaning. The fifth step is identifying stop words. In English there are a lot of words that appear very frequently, like "is", "and", "the", and "a"; NLP pipelines flag these words as stop words, and stop words must be filtered out before doing any statistical analysis. The sixth step is dependency parsing. Dependency parsing is used to find how all the words in a sentence are related to each other. The seventh step is POS tagging. POS stands for parts of speech, which include noun, verb, adverb, and adjective; a POS tag indicates how a word functions in meaning as well as grammatically within a sentence, and a word can have one or more parts of speech based on the context in which it is used. The eighth step is named entity recognition. Named entity recognition is the process of detecting named entities such as a person's name, a movie name, an organization's name, or a location. The ninth step is chunking. Chunking is used to collect individual pieces of information and group them into bigger phrases.
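(To make these steps concrete, here is an illustrative sketch, not the video's code, that runs tokenization, lemmatization, stop-word flagging, POS tagging, and named entity recognition with spaCy; the sample sentence is invented.)

```python
# Illustrative sketch of the pipeline steps above, assuming spaCy and
# the en_core_web_sm model are available.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Google is hiring data scientists in New York and celebrating its growth.")

for token in doc:
    if token.is_punct:
        continue
    print(
        token.text,      # word tokenization
        token.lemma_,    # lemmatization: base form that has a meaning
        token.pos_,      # part-of-speech tag
        token.is_stop,   # True for stop words such as "is" and "and"
    )

# Named entity recognition: person, organization, and location names
for ent in doc.ents:
    print(ent.text, ent.label_)
```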
Moving ahead, let's see the algorithms you can use for this project. The first algorithm we are going to discuss is the support vector machine. A support vector machine is a supervised machine learning algorithm which can be used for both classification and regression challenges; however, it is mostly used for classification problems. In the support vector machine algorithm we plot each data item as a point in n-dimensional space, with the value of each feature being the value of a particular coordinate, and then we perform classification by finding the hyperplane that best differentiates the two classes. Let's understand it with an example. As you can see in the figure, the data set can be separated easily with the help of a line called a decision boundary, but there can be several decision boundaries that divide the data points without any error; in this case we have boundaries A, B, and C. So how do you choose the one boundary that separates the data well? Here's the tip: the best decision boundary is the one that has maximum distance from the nearest points of the two classes. The region that defines the closest points around the decision boundary is known as the margin, which is why the decision boundary of a support vector machine model is known as the maximum margin classifier, or the maximum margin hyperplane. But how do you deal with inseparable, non-linear data? In some cases a flat hyperplane is not very efficient; in those cases the support vector machine uses a kernel trick to transform the input into a higher-dimensional space, where it becomes easier to segregate the points. Let us take a look at the SVM kernels. An SVM kernel basically adds more dimensions to a low-dimensional space to make it easier to segregate the data; it converts an inseparable problem into a separable problem by adding more dimensions. Using the kernel trick, a support vector machine is implemented in practice by a kernel, and the kernel trick helps build a more accurate classifier. Let us look at the different kernels in the support vector machine. The first one is the linear kernel. A linear kernel can be used as a normal dot product between any two given observations; the product between two vectors is the sum of the multiplication of each pair of input values. The second one is the polynomial kernel. It is a more generalized form of the linear kernel and can distinguish curved or non-linear input spaces. The third one is the radial basis function kernel. The radial basis function kernel is commonly used in support vector machine classification; it can map the space into infinite dimensions.

The second algorithm that you can use is random forest. Random forest is a flexible, easy-to-use machine learning algorithm that produces a great result most of the time, even without hyperparameter tuning. It is also one of the most used algorithms because of its simplicity and diversity. As the name suggests, this algorithm randomly creates a forest of several trees; generally, the more trees in the forest, the more robust the forest looks. Similarly, in a random forest classifier, the higher the number of trees in the forest, the greater the accuracy of the results. In simple words, random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction; the forest it builds is a collection of decision trees trained with the bagging method. Let's understand this with an example. Let's say you're looking to buy a house but you're unable to decide which one to buy, so you consult a few agents and they give you a list of parameters that you should consider before buying a house. The list includes the price of the house, the locality, the number of bedrooms, parking space, and available facilities. These parameters are known as predictor variables, which are used to find a response variable. Random forest is an ensemble of decision trees; it randomly selects a set of parameters and creates a decision tree for each set of chosen parameters. Take a look at the images given here: there are three decision trees, and each decision tree takes only three parameters from the entire data set. Each decision tree predicts the outcome based on the respective predictor variables used in that tree, and finally the random forest combines the results from all the decision trees. In simple words, after creating multiple decision trees using this method, each tree votes for a class; in this case each decision tree will choose whether or not a house is bought, and the class receiving the most votes by simple majority is termed the predicted class.
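(As a hedged illustration, not part of the video, of how either algorithm can be plugged in with scikit-learn: the data below is synthetic, and the tutorial itself uses random forest on the real job data later.)

```python
# Sketch comparing the two candidate classifiers on synthetic data.
# Assumes scikit-learn is installed; the dataset here is artificial.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Support vector machine with an RBF kernel (the kernel trick described above)
svm_clf = SVC(kernel="rbf")
svm_clf.fit(X_train, y_train)
print("SVM accuracy:", svm_clf.score(X_test, y_test))

# Random forest: an ensemble of decision trees combined by majority vote
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)
print("Random forest accuracy:", rf_clf.score(X_test, y_test))
```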
Now let's see the environment and tools you need for this project. The language we'll be using to develop this project is Python. The reason for choosing Python is that it is currently the most popular programming language for machine learning, and Python code is easily readable by humans, which makes it easier to build models for machine learning. The major libraries we'll be using include pandas, NumPy, Matplotlib, scikit-learn, and spaCy. Pandas stands for Python Data Analysis Library; pandas can be used to import data and create a Python object with rows and columns, and it can also be used to write data into a file using various commands. Pandas can be used for viewing, selecting, filtering, and analyzing the data as well. NumPy provides multi-dimensional array and matrix data structures; it can be used to perform Fourier transformations and mathematical operations on arrays, such as statistical and algebraic operations. Matplotlib is an amazing visualization library; it can be used to create interactive graphs, charts, and maps. Scikit-learn is a very beneficial library, as it provides a large number of useful tools which make implementing machine learning in Python a lot easier; it gives you the ability to use many supervised and unsupervised algorithms just by importing them from the library. SpaCy is an open-source software library for advanced natural language processing, written in Python and Cython.

Now that we have an overview of the project, let's start with its implementation. To implement this project using Python as the programming language, I'll be using Anaconda Jupyter. So let's start by first installing wordcloud; after this, let's install spaCy. Moving ahead, let's import all the necessary libraries; these are all the libraries we'll need for this project, so let's import them. Moving ahead, let's import our data set. This is the name of the file of our data set; I've already uploaded it in Anaconda Jupyter, hence I don't need the whole path, I'm just using the file name. Let's see what the data set looks like: these are the variables that we have in our data set. Let's also check the shape of our data set; here we have 17,880 rows and 18 columns. Let's also check how many null values we have and which columns in our data set have null values. We have many columns here with a large number of null values, so let's just drop those columns from our data set. These are the columns that we'll be dropping; checking our data set again, you can see that the columns we listed have been deleted. Now, to handle the rest of the missing values in the columns that we haven't dropped but that still have null values, let's do one thing: instead of NaN, let's fill them with a blank. We haven't deleted every column that has null values, because some of these variables don't have many; for example, the department variable still has null values, but we haven't deleted it because it is definitely going to help us in our future prediction. So now let's fill NaN with a blank. Moving ahead, let's compare the number of fraudulent and non-fraudulent job postings. For visualization let's use seaborn, which is a great visualization library for Python. As you can see, the majority of the jobs posted are non-fraudulent, while some of the jobs posted are fraudulent. Let's check the exact counts in our data set: 17,014 jobs are non-fraudulent, while 866 jobs are fraudulent.
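(A hedged sketch of these loading and cleaning steps follows; it is not the video's exact notebook. The file name "fake_job_postings.csv" and the dropped column names are placeholders, to be replaced with your actual file and with whichever columns your null-value check flags.)

```python
# Hedged sketch of the loading and cleaning steps described above.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("fake_job_postings.csv")   # placeholder file name
print(df.shape)                              # e.g. (17880, 18) in the tutorial's data
print(df.isnull().sum())                     # count of missing values per column

# Drop the columns that are mostly empty (hypothetical choice of names)
columns_to_drop = ["salary_range", "job_id"]
df = df.drop(columns=columns_to_drop)

# Replace the remaining NaN values with blanks instead of deleting rows
df = df.fillna("")

# Compare fraudulent and non-fraudulent postings with seaborn
sns.countplot(x="fraudulent", data=df)
plt.show()
print(df["fraudulent"].value_counts())       # roughly 17,014 real vs 866 fake
```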
Moving ahead, let's visualize the number of jobs posted against the experience they were expecting. Required experience is one of the variables that we have in our data set, and what we need to do is remove all the blanks that are present in that particular column. What this function does is build a dictionary: it creates a key-value pair for each category present in the required experience column, with the count of the number of jobs posted for that category. As you can see, this is the key-value pair that we have; let's visualize it now. You can see that the maximum number of jobs posted required mid-senior-level experience, after that we have entry level, then associate, and so on. Moving ahead, let's also visualize the number of jobs posted by country. Here in our data set we have a column named location, but it specifies the country, the state, and the city as well, and we want only the country, so let's split it and create another column for just the country. The column we had was location, and as we saw, the country was in the first place, that is, the zeroth position, after which we had the state and then the city. Since we want the country, we take index zero, create a column named country in our data set, and store the value there. Looking at the data set, you can see that a country column has been created. Now the data set has a large number of countries, but let's visualize the number of jobs posted for just the top 14 countries. Similar to what we did before, we create a dictionary of key-value pairs and then visualize it. Here we have the top 14 countries and their counts, and you can see that the maximum number of jobs posted online were for the United States. Moving ahead, let's also visualize the education level and the number of jobs posted for each education level. Again we create a key-value pair, or dictionary, and then visualize the data, considering only the top 7 education levels. Here we have the key-value pairs; visualizing them, you can see that in the maximum number of jobs the education level required was a bachelor's degree. Moving ahead, let's see the titles of the jobs posted which were not fraudulent, looking at the top 10 most commonly used titles; these were the most common titles used when the jobs posted were not fraudulent. Similarly, let's see the titles used when the job posted was fraudulent; these were the top 10 titles used for fraudulent jobs. Now let's do one thing: let's combine all the text columns of our data set into one single column called text. We will be left with only two columns: one indicating whether or not the job posted was fraudulent, and a text column which contains all the text from the combined columns. We'll delete the rest of the columns. While combining the text we have included only these columns, as they are the most important columns from our data set, so the rest will be deleted. Looking at our data set now, as I said, we are left with only two columns: one indicating whether or not the job posted is fraudulent and the other being the text column. Now let's create a word cloud based on the text for fraudulent jobs and non-fraudulent jobs respectively. Here we store all the fraudulent job text in one variable and the non-fraudulent job text in another. Moving ahead, let's create a word cloud.
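(A hedged sketch of combining the text columns and drawing the two word clouds, not the video's exact code: the list of columns to merge is a placeholder, and df is the cleaned data frame from the earlier step.)

```python
# Hedged sketch of building the combined text column and the word clouds.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

text_columns = ["title", "description", "requirements"]   # hypothetical subset
df["text"] = df[text_columns].astype(str).agg(" ".join, axis=1)
df = df[["text", "fraudulent"]]                            # keep only two columns

fraud_text = " ".join(df.loc[df["fraudulent"] == 1, "text"])
real_text = " ".join(df.loc[df["fraudulent"] == 0, "text"])

# One word cloud per class; larger words appear more often in that class
for label, corpus in [("fraudulent", fraud_text), ("non-fraudulent", real_text)]:
    wc = WordCloud(width=800, height=400, background_color="white").generate(corpus)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(label)
    plt.show()
```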
Here we have a word cloud of the text that appears in the fraudulent job postings. Similarly, let's create a word cloud for the text in the non-fraudulent jobs; as you can see, that word cloud has also been created. To work with English words in natural language processing we need to install some more packages, so let's install those; they have been installed successfully. Now let's start pre-processing our data. First let's create a list of punctuation marks, and after that a list of stop words. Then let's load the English tokenizer, tagger, parser, and word vectors. Moving ahead, let's create a tokenizer function. Inside it we create a token object, which is used to create documents with linguistic annotations; then we lemmatize each token and convert it into lowercase, then remove all the stop words, and finally return the processed list of tokens. Next, let's build a custom transformer using spaCy: here we are cleaning the text by removing the spaces and converting the text into lowercase, so we are cleaning the text of our data set. Next we are going to use the TF-IDF vectorizer. It is a variation on the plain count vectorizer: a count vectorizer gives equal weight to all the words, that is, each word becomes a column and, for each document, the value is 1 if the word is present in the document and 0 otherwise. On top of that information, TF-IDF also captures how important a word is to a document, and this is where the concepts of term frequency and inverse document frequency play a role. Now let's apply it to our text and see what main_df looks like: these are all the words present in our data set, along with their weighted frequencies. Moving ahead, let's split our data set into training and test sets. In x we include all the columns, where the columns are nothing but the words, and only the last column, which indicates whether or not the job posted is fraudulent, goes into y; just the fraudulent column stays in y. Let's split the data set 70/30: 70 percent will be for training and 30 percent for testing. Let's check the shape of the data that has been split into four parts; here you can see the shapes of x_train, y_train, x_test, and y_test. Now, the algorithm that we'll be applying to this data set is random forest, so let's import it; we'll be using the random forest algorithm to train our model. Here we have x_train and y_train, and we are going to create a model. Now that we have trained the model, let's see how it predicts based on the data that is present in x_test. Let's just look at x_test once so you can understand what exactly is in it: this is how x_test looks, and it does not have the last column, that is, the fraudulent column indicating whether or not the posting is fraudulent. Based on this data, the model is going to predict whether or not each given posting is fraudulent. So we have already trained our model using random forest; now it's time to test it. We provide the model with the x_test values, and based on them it predicts the possible outcomes, that is, whether or not each posting is fraudulent. Let's see the prediction now; the predictions are stored in the pred variable.
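(For reference, here is a hedged end-to-end sketch of the preprocessing, TF-IDF, train/test split, random forest, and evaluation steps described above and below. Variable and function names are illustrative rather than the video's exact notebook, and df is the two-column data frame built earlier.)

```python
# Hedged sketch of the modeling pipeline; assumes spaCy, scikit-learn,
# and the en_core_web_sm model are installed, and df has "text" and "fraudulent".
import string

import spacy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load("en_core_web_sm")   # English tokenizer, tagger, parser, word vectors
punctuations = string.punctuation
stop_words = STOP_WORDS

def spacy_tokenizer(sentence):
    """Lemmatize, lowercase, and drop stop words and punctuation."""
    doc = nlp(sentence)
    tokens = [tok.lemma_.lower().strip() for tok in doc]
    return [tok for tok in tokens if tok not in stop_words and tok not in punctuations]

# TF-IDF features: term frequency weighted by inverse document frequency
tfidf = TfidfVectorizer(tokenizer=spacy_tokenizer)
X = tfidf.fit_transform(df["text"])
y = df["fraudulent"]

# 70 percent of the data for training, 30 percent for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))
```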
Let's compare how well it has predicted: the prediction accuracy is going to be stored in the variable score. What we are doing here is nothing but comparing the actual values present in y_test with the predictions made by the random forest. The accuracy that we get here is 96.7 percent, approximately 97 percent, which is a good enough accuracy. Let's also create a confusion matrix to understand the model's accuracy better: first we create a classification report, and along with it a confusion matrix. In the classification report you can see all the values, which helps you better understand how the model performs, and apart from that we also have the confusion matrix. Overall the accuracy seems good; you can always try some other algorithm, like the support vector machine we discussed before, and check whether it provides better accuracy than this. So this is how the project works. If you have any queries, please feel free to mention them in the comment section below. With this I close this session; thank you, and have a great day. I hope you have enjoyed listening to this video. Please be kind enough to like it, and you can comment any of your doubts and queries and we will reply to them at the earliest. Do look out for more videos in our playlist, and subscribe to the Edureka channel to learn more. Happy learning!