Job Quality Extractor
Overview
This package is designed to extract and analyze job quality measures from job adverts using natural language processing techniques. The package identifies job quality aspects based on sentence similarity and pre-defined target phrases, helping to classify and quantify job quality indicators from large datasets of job descriptions. This work was funded by the Economic Statistics Centre of Excellence.
What dimensions of job quality do you extract?
The term "job quality" refers to aspects of a job that affect worker wellbeing, for example how much the job pays and whether the contract is permanent. Most research on job quality rightly focuses on data from the employee's point of view, using surveys, interviews or, more recently, online reviews.
We took as our starting point CIPD's seven dimensions of job quality:
- pay and benefits
- contract (elsewhere called terms of employment)
- work-life balance
- job design and the nature of work
- relationships at work
- employee voice
- health and wellbeing
We added a further category, ‘barriers to access’, to our taxonomy, so that dimensions of job quality that directly impact marginalised groups could be gathered together. We made one more addition, “atmosphere, culture and environment”, which fits under “Social support and cohesion” and which we took from Sleeman 2024. Our taxonomy of job quality can be seen here.
Installation
To install the package, run
pip install git+https://github.com/nestauk/dap_job_quality.git
Quickstart
To extract dimensions of job quality from one or more job adverts, use the extract_job_quality() function. This function takes a dataframe of job adverts as input (one row per advert) and returns:
- A dataframe with the job adverts split into sentences; each sentence is labelled 0 or 1 according to whether it is related to job quality, and sentences labelled 1 are also matched to the taxonomy.
- A concise dict that maps the ID of each advert to the target phrases it was matched to.
Example usage:
from dap_job_quality.pipeline.find_job_quality import JobQuality
import pandas as pd
# Initialize JobQuality class
job_quality = JobQuality()
job_quality.load()
# Example job adverts dataframe
job_adverts = pd.DataFrame(
    [
        {'id': 123, 'description': '[This is a job advert. It has many benefits such as a pension scheme and a cycle to work scheme.]'},
        {'id': 234, 'description': '[This is a job advert for a bank job. There are free childcare vouchers. We also offer a yearly bonus and generous salary.]'}
    ]
)
# Extract job quality
jq_df_filtered, job_id_to_target_phrase = job_quality.extract_job_quality(
    job_adverts, id_col="id", text_col="description"
)
The output dataframe jq_df_filtered should look like this:
id | description | clean_description | job_quality_label | sentences_split | ngrams | target_phrase | cosine_similarity | subcategory |
---|---|---|---|---|---|---|---|---|
123 | This is a job advert. It has many benefits su... | This is a job advert. It has many benefits suc... | LABEL_1 | It has many benefits such as a pension scheme ... | a cycle to work | Cycle to work | 0.965111 | PERKS |
123 | This is a job advert. It has many benefits su... | This is a job advert. It has many benefits suc... | LABEL_1 | It has many benefits such as a pension scheme ... | many benefits such as | benefits | 0.874949 | PERKS |
123 | This is a job advert. It has many benefits su... | This is a job advert. It has many benefits suc... | LABEL_1 | It has many benefits such as a pension scheme ... | such as a pension | pension | 0.821573 | COMP |
123 | This is a job advert. It has many benefits su... | This is a job advert. It has many benefits suc... | LABEL_1 | It has many benefits such as a pension scheme ... | a pension scheme and | pension scheme | 0.964935 | COMP |
234 | This is a job advert for a bank job. There ar... | This is a job advert for a bank job. There are... | LABEL_1 | There are free childcare vouchers. | There are free childcare vouchers. | childcare vouchers | 0.838904 | CARING |
234 | This is a job advert for a bank job. There ar... | This is a job advert for a bank job. There are... | LABEL_1 | We also offer a yearly bonus and generous salary. | bonus and generous salary. | compensation | 0.576268 | COMP |
234 | This is a job advert for a bank job. There ar... | This is a job advert for a bank job. There are... | LABEL_1 | We also offer a yearly bonus and generous salary. | a yearly bonus and | performance bonus | 0.618560 | COMP |
Meanwhile, the more concise output, job_id_to_target_phrase, should look like this:
{
    123: ['Cycle to work', 'benefits', 'pension', 'pension scheme'],
    234: ['childcare vouchers', 'compensation', 'performance bonus']
}
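Because the concise output is a plain dict of advert ID to target phrases, it can be summarised with a few lines of standard Python. The sketch below is only an illustration: it hard-codes the example dict shown above rather than calling the package, and the names phrase_to_job_ids and advert_counts are made up for this example.

```python
# Copied from the example output above; in practice this would be the
# second value returned by extract_job_quality().
job_id_to_target_phrase = {
    123: ["Cycle to work", "benefits", "pension", "pension scheme"],
    234: ["childcare vouchers", "compensation", "performance bonus"],
}

# Invert the mapping: target phrase -> list of advert IDs that mention it.
# set() guards against a phrase being listed twice for the same advert.
phrase_to_job_ids = {}
for job_id, phrases in job_id_to_target_phrase.items():
    for phrase in set(phrases):
        phrase_to_job_ids.setdefault(phrase, []).append(job_id)

# Number of adverts in which each target phrase appears.
advert_counts = {phrase: len(ids) for phrase, ids in phrase_to_job_ids.items()}

for phrase in sorted(advert_counts):
    print(phrase, advert_counts[phrase], phrase_to_job_ids[phrase])
```

On a large dataset, the same counts give a quick view of which job quality indicators are mentioned most often.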
How does it work?
The pipeline comprises 4 basic steps:
1. Clean the text minimally, then separate the advert into sentences
2. Classify the sentences as either relating to job quality (e.g. "We are a friendly supportive team") or not relating to job quality (e.g. "You must have a friendly supportive demeanour"). More detail on the classifier here and in the README.
3. Chunk the sentences into shorter n-grams
4. Match the sentence chunks to the taxonomy (more detail on steps 3 and 4 here)
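To make the four steps concrete, here is a toy sketch that uses only the standard library. It is not the package's implementation, and several parts are assumptions: the classifier is replaced by a placeholder that keeps every sentence, the 4-word window is inferred from the ngrams column in the example output, the similarity is a simple bag-of-words cosine standing in for the package's sentence similarity, and TARGET_PHRASES and the 0.7 threshold are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical mini-taxonomy; the phrases are taken from the example output.
TARGET_PHRASES = ["pension scheme", "cycle to work", "childcare vouchers"]


def split_sentences(text):
    """Step 1: collapse whitespace, then split naively after ., ! or ?"""
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def is_job_quality(sentence):
    """Step 2 placeholder: the real pipeline uses a trained classifier.
    This stand-in keeps every sentence."""
    return True


def chunk(sentence, n=4):
    """Step 3: slide a window of n words over the sentence.
    Short sentences are kept whole (an assumption of this sketch)."""
    words = sentence.split()
    if len(words) <= n:
        return [sentence]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def bag_of_words(text):
    """Lower-cased word counts, used here instead of sentence embeddings."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match(text, threshold=0.7):
    """Step 4: keep (chunk, target phrase, similarity) triples whose
    similarity reaches the (hypothetical) threshold."""
    matches = []
    for sentence in split_sentences(text):
        if not is_job_quality(sentence):
            continue
        for ngram in chunk(sentence):
            for phrase in TARGET_PHRASES:
                score = cosine(bag_of_words(ngram), bag_of_words(phrase))
                if score >= threshold:
                    matches.append((ngram, phrase, round(score, 3)))
    return matches


advert = (
    "This is a job advert. It has many benefits such as "
    "a pension scheme and a cycle to work scheme."
)
results = match(advert)
for row in results:
    print(row)
```

The scores from this sketch will not match the cosine_similarity values in the example table, because word counts are a much cruder representation than the one the package uses. The shape of the result is the same, though: each chunk is paired with its closest target phrases.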