Build your own Resume Scanner Using Python

Let’s think of a fictional scenario. You were looking for a job change and started browsing a job portal. Suddenly, a job description caught your attention. It looked like a perfect fit! It seemed the role was created just for you. Quickly, you uploaded your resume and applied. You were so sure you would get a call soon for an interview. But unfortunately, that call never came! Does that ring a bell? Has that happened to you? For me, it’s “Been there, done that”! (a lot!)

Well, finding a job is a complex process. There are hurdles at different levels. Sadly, I never realized why I wasn’t getting that first call until a few days ago, when I called the recruiter immediately after submitting for a role. She said, “Your profile is a 36% match to this role and I will call you back!” I had no idea what she was talking about! Then I learned that our resumes are scanned by an automated NLP program before they ever reach a hiring manager, or even catch a human eye. So the first obstacle has a name: the Applicant Tracking System (ATS).

I am sure different companies use resume scanners of different complexities. I want a simple one – my very own resume scanner. So this post is all about creating your own resume scanner – A program to see how well your resume matches a specific job description.

Approach:

I want to create a Python program that returns the percentage match between a resume and a job description. I will also create a word cloud from the job description so that we get a clear view of all the important keywords.

Install & Import Libraries:

First, we are going to install and import the libraries required for this project.

Now, resumes do not come in a fixed file format; they can be .pdf, .doc, or .docx. So our first challenge is to read the resume and convert it to plain text. For this, we can use two Python modules: pdfminer and docx2txt. These modules extract text from .pdf and .doc/.docx files.

pip install pdfminer
pip install docx2txt

Let’s import all the libraries required for this project.

# PDF resume
import io
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
# Docx resume
import docx2txt
# Word cloud
import re
import operator
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
set(stopwords.words('english'))  # load the English stop word list (needs the NLTK 'stopwords' corpus)
from wordcloud import WordCloud
from nltk.probability import FreqDist
import matplotlib.pyplot as plt
# Match score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
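
Note: word_tokenize and the stop word list rely on NLTK corpora that are not shipped with the package itself. If you have never downloaded them before, a one-time download is needed; a minimal sketch:

import nltk

# one-time downloads of the NLTK data used above:
# 'punkt' for word_tokenize and 'stopwords' for the English stop word list
nltk.download('punkt')
nltk.download('stopwords')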

Reading the Resume:

Here, I will create two different functions: one to read resumes in .pdf format and another to read resumes in .docx format. Both functions return the text of the resume.

Read PDF Resume:

def read_pdf_resume(pdf_doc):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)
    with open(pdf_doc, 'rb') as fh:
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            page_interpreter.process_page(page)
        text = fake_file_handle.getvalue()
    # close open handles
    converter.close()
    fake_file_handle.close()
    if text:
        return text

Read Word Resume:

def read_word_resume(word_doc):
    resume = docx2txt.process(word_doc)
    text = str(resume)
    text = text.replace("\n", "")
    if text:
        return text
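
A quick sanity check of the two readers (the file names below are just placeholders; use whatever resume files you actually have on disk):

# hypothetical file names - replace with your own resume files
pdf_text = read_pdf_resume('my_resume.pdf')
docx_text = read_word_resume('my_resume.docx')
print(pdf_text[:200])  # first 200 characters of the extracted text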

Create a Wordcloud with Keywords

How about a graphical image that displays the keywords in the job description? I have always been a big fan of word clouds. When you scan a job description, you may miss a few skills that the role demands. Maybe you have experience with those skills but forgot to add them to your resume. A word cloud will flash those keywords for a quick review.

Clean the Job Description:

Before creating a word cloud, I usually clean the text to remove punctuation, numbers, and stop words, since those don’t carry much meaning in a word cloud.

def clean_job_description(jd):
    ''' a function to clean the job description text before building the word cloud '''
    ## Clean the Text
    # lowercase
    clean_jd = jd.lower()
    # remove punctuation
    clean_jd = re.sub(r'[^\w\s]', '', clean_jd)
    # remove leading/trailing spaces
    clean_jd = clean_jd.strip()
    # remove numbers
    clean_jd = re.sub('[0-9]+', '', clean_jd)
    # tokenize
    clean_jd = word_tokenize(clean_jd)
    # remove stop words
    stop = stopwords.words('english')
    clean_jd = [w for w in clean_jd if not w in stop]
    return clean_jd
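
For example, on a one-line snippet of a (made-up) job description, the function returns a list of lowercase keyword tokens with punctuation, numbers, and stop words removed:

sample_jd = "3+ years of experience with Python, SQL, and A/B testing."
print(clean_job_description(sample_jd))
# ['years', 'experience', 'python', 'sql', 'ab', 'testing']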

Create a word cloud:

Now, it’s time to create the image.

def create_word_cloud(jd):
    corpus = jd
    fdist = FreqDist(corpus)
    #print(fdist.most_common(100))
    words = ' '.join(corpus)
    words = words.split()

    # create an empty dictionary
    data = dict()
    # get the frequency of each word, where the word is the key and the count is the value
    for word in words:
        word = word.lower()
        data[word] = data.get(word, 0) + 1
    # sort the dictionary in reverse order so the most used terms come first
    data = dict(sorted(data.items(), key=operator.itemgetter(1), reverse=True))
    word_cloud = WordCloud(width=800, height=800,
                           background_color='white', max_words=500)
    word_cloud.generate_from_frequencies(data)

    # plot the WordCloud image
    plt.figure(figsize=(10, 8), edgecolor='k')
    plt.imshow(word_cloud, interpolation='bilinear')
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()
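
If you also want to keep the image on disk, WordCloud can write the rendered cloud straight to a file. One optional line inside create_word_cloud, right after generate_from_frequencies (the file name is just an example):

# optional: also save the cloud as an image (example file name)
word_cloud.to_file('jd_wordcloud.png')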

Get Job Description and Resume Match Score

Now, we are at the final part of our project. To get a score of how the resume matches a specific job description, I am going to use a Cosine Similarity metric. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. The smaller the angle, the higher the cosine similarity. In this context, the two vectors are arrays containing the words of two documents.

Now, a commonly used approach to matching similar documents is based on counting the maximum number of common words between the documents. But there is a problem with this approach: as the size of the documents increases, the number of common words tends to increase even if the documents talk about different topics.

Cosine similarity is advantageous because even if two similar documents are far apart by Euclidean distance because of their size (say, the word ‘python’ appears 50 times in one document and 2 times in the other), they can still have a small angle between them. And the smaller the angle, the higher the similarity.
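
To make this concrete, here is a tiny, self-contained sketch (with made-up sentences) of what CountVectorizer and cosine_similarity do to a pair of texts; the function below applies exactly the same two calls to the resume and the job description:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# two toy "documents" - purely illustrative
docs = ["python sql dashboards and a b testing",
        "python dashboards reporting stakeholder communication"]

# turn each document into a vector of word counts...
count_matrix = CountVectorizer(stop_words='english').fit_transform(docs)

# ...and measure the cosine of the angle between the two count vectors;
# cosine_similarity returns a 2x2 matrix, and [0][1] is doc 0 vs doc 1
score = cosine_similarity(count_matrix)[0][1]
print(round(score * 100, 2))  # percentage match between the two toy texts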

Okay, so let’s create a function to find the match score!

def get_resume_score(text):
    cv = CountVectorizer(stop_words='english')
    count_matrix = cv.fit_transform(text)
    # print the similarity score
    print("\nSimilarity Scores:")

    # get the match percentage
    matchPercentage = cosine_similarity(count_matrix)[0][1] * 100
    matchPercentage = round(matchPercentage, 2)  # round to two decimal places

    print("Your resume matches about " + str(matchPercentage) + "% of the job description.")

Test Resume Scanner:

Finally, it is time to get a score! I am using my personal resume, copied into the same folder so that the program can read it. Next, I grabbed a Data Analyst job description from a job portal; let’s see how well my profile matches this specific role.

What you'll do:
The role involves partnering very closely with multiple PMs, Engineers, Test Managers and Business Partner to elevate the site experience for the verticals on Walmart.
Analyze click stream data to understand how customers are interacting with the site. Uncover user pain points and help in building inspirational experiences.
Provide and supports the implementation of product solutions
Provide data driven insights and deliver recommendations that address opportunities for product improvements
Provide analytical support to Product Managers
Ensure accuracy of data capture strategy
A/B Test: Test variations on messaging or features.
Display dashboards: Visualize data with templated or custom reports. Create effective reporting and dashboards.
Measure: Measure engagement by feature
A self-starter: Can drive projects with minimal guidance
Strong communicator: You effectively synthesize, visualize, and communicate your ideas to others

You’ll sweep us off our feet if…
You’re able to use metrics to improve performance
You’re excited about solving complex challenges
You’re customer-centric in spirit and in execution
You’re comfortable influencing others, leading teams, managing stakeholders, and communicating clearly
You have a test and learn mentality and an agile way of working to improve your product

Let’s run all the functions created above and get a score!

if __name__ == '__main__':
    extn = input("Enter File Extension: ")
    #print(extn)
    if extn == "pdf":
        resume = read_pdf_resume('Resume_OindrilaSen.pdf')
    else:
        resume = read_word_resume('test_resume.docx')

    job_description = input("\nEnter the Job Description: ")

    ## Get a Keywords Cloud
    clean_jd = clean_job_description(job_description)
    create_word_cloud(clean_jd)

    ## Get a Match score
    text = [resume, job_description]
    get_resume_score(text)

My Goodness!

Similarity Scores:
Your resume matches about 26.82% of the job description.

Today, I got an answer to all my speculation. So, the takeaway for today: if a job description looks like a good fit, run this program first and check where your resume stands. The resume scanner can tell you a story – a real one!

I have uploaded my Jupyter Notebook for the resume scanner program to my GitHub.

Also, if you are looking for some other project ideas, take a look at my projects below:

Deep Learning Model to Generate Text using Keras LSTM

Build and deploy a multi-page Flask application on Heroku

Text Analytics on #coronavirus trends in Twitter using Python

Thank you for reading this article. I hope it’s helpful to you all! If you enjoyed this article and found it helpful, please leave some claps to show your appreciation.

Thank You!

