
Matching text to the job quality taxonomy

Taxonomy development

We conducted an initial analysis to assess (a) which, if any, dimensions of job quality might manifest in job adverts, and (b) what language would be used to express these. We then created a taxonomy that comprised:

  • The higher-level dimension of job quality, eg “Job design and nature of work”
  • Sub-categories within that dimension, taken from the CIPD Good Work Index 2023 and Measuring Good Work, eg “career progression”, “learning and development”, “sense of purpose”
  • The most common phrases in job adverts relating to these sub-categories. For example, for the sub-category “learning and development”, we included the phrases “CPD” (Continuous Professional Development), “learning and development” and “training”.

This means that not all of the CIPD's dimensions and subdimensions of job quality are in our taxonomy, because some do not appear in, or cannot be inferred from, job adverts. For example, trade union existence and activity is not something that is typically mentioned in job adverts. Similarly, the subdimension "use of skills" refers to whether an employee is in employment that makes use of their specific skillset, so it depends on the individual and cannot be inferred from the job advert.

You can find the final taxonomy here.

Using a different taxonomy

We developed a taxonomy that fits our purposes, but you can use your own - though we cannot guarantee performance, as our pipeline has been evaluated against the specific taxonomy we developed. To use your own taxonomy:

  1. Modify the getter function get_keywords() so that it points to your taxonomy file. This should be a dataframe (saved as .csv, .parquet or similar) with the columns dimension, sub_category and target_phrase, where target_phrase contains the strings that will be embedded and compared to the job advert text.
  2. Evaluate the performance of the pipeline against your taxonomy. This can be done using the scripts in dap_job_quality/analysis/mapping_evaluation/ and the notebook dap_job_quality/notebooks/Evaluation.ipynb.
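As a rough sketch of step 1, a replacement taxonomy file needs those three columns before it can be embedded. The helper below is illustrative only (load_taxonomy is a hypothetical name, not the project's get_keywords(), and the real pipeline likely reads the file into a dataframe):

```python
import csv
import io

# Columns the pipeline expects in the taxonomy file (per the docs above)
REQUIRED_COLUMNS = {"dimension", "sub_category", "target_phrase"}

def load_taxonomy(csv_text: str) -> list[dict]:
    """Hypothetical loader: parse a taxonomy CSV and check its columns."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    missing = REQUIRED_COLUMNS - set(rows[0].keys())
    if missing:
        raise ValueError(f"taxonomy is missing columns: {sorted(missing)}")
    return rows

sample = """dimension,sub_category,target_phrase
Job design and nature of work,learning and development,CPD
Job design and nature of work,learning and development,training
"""
taxonomy = load_taxonomy(sample)
print(len(taxonomy))  # → 2
```

A file missing any of the three columns would fail here before reaching the embedding step.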

Mapping approach

The overall approach is to chunk an input sentence into pieces that are then compared to target phrases from the taxonomy.

  1. Initial cleaning: The input sentence is split into smaller pieces on characters that commonly indicate a list, eg ":" (see split_text()). Digits are also replaced with 'X' because, for our purposes, the exact number is not important: for example, we would like our pipeline to treat "25 days of annual leave" and "30 days of annual leave" as the same.

  2. Sentence chunking: Then, if the sentence is 6 words or fewer, it is kept whole; otherwise, a rolling window of 4 words is applied. We will refer to these smaller chunks as ngrams - mostly they will be 4 words long, but some may be shorter and some may be as long as 6 words. We chose this rolling window size as it is similar to the lengths of the target phrases we match to in the next steps.

  3. Embedding: Both the ngrams and the target phrases from the taxonomy are embedded using the pretrained model all-MiniLM-L6-v2. This model was chosen because it is relatively small, optimised for sentence similarity tasks, and has been pretrained on a large corpus of text.

  4. Cosine similarity: The cosine similarity between the ngrams and the target phrases is calculated. If the similarity is above a certain threshold, the ngram is considered a match to the target phrase.
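Steps 1 and 2 can be sketched as follows. The function names (normalise, split_on_list_markers, make_ngrams) and the exact set of split characters are illustrative, not the pipeline's actual implementation (the real splitting lives in split_text()):

```python
import re

def normalise(text: str) -> str:
    # Replace digits with 'X' so eg "25 days" and "30 days" embed identically
    return re.sub(r"\d", "X", text)

def split_on_list_markers(text: str) -> list[str]:
    # Illustrative stand-in for split_text(): split on characters that
    # commonly introduce a list, such as ":" or ";"
    return [piece.strip() for piece in re.split(r"[:;]", text) if piece.strip()]

def make_ngrams(sentence: str, window: int = 4, keep_whole_at: int = 6) -> list[str]:
    # Sentences of 6 words or fewer stay whole; longer ones get a
    # rolling 4-word window
    words = sentence.split()
    if len(words) <= keep_whole_at:
        return [sentence]
    return [" ".join(words[i:i + window]) for i in range(len(words) - window + 1)]

pieces = split_on_list_markers(normalise("Benefits: 25 days of annual leave"))
print(pieces)                  # ['Benefits', 'XX days of annual leave']
print(make_ngrams(pieces[1]))  # ['XX days of annual leave'] (5 words, kept whole)
```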

Different subcategories within the taxonomy have different thresholds:

| Category   | Cosine similarity threshold |
| ---------- | --------------------------- |
| CAREER     | 0.6                         |
| FLEX_HOURS | 0.65                        |
| HOURS      | 0.6                         |
| FLEX_LOC   | 0.6                         |
| LEAVE      | 0.65                        |
| CONTRACT   | 0.5                         |
| LOC        | 0.6                         |
| OTHER      | 0.55                        |

These thresholds are intended to optimise the precision vs. recall trade-off for each subcategory. See the notebook Evaluation.ipynb for plots of precision vs. recall at different thresholds for different categories. We focused on "CAREER", "FLEX_HOURS", "HOURS", "FLEX_LOC", "LEAVE", "CONTRACT" and "LOC" in the evaluation because they were prioritised for downstream analysis; for all remaining categories, the notebook shows that 0.55 is the most reasonable threshold value, as it is the median of the thresholds for those categories.
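Steps 3 and 4 amount to an embed-then-threshold comparison. The sketch below uses toy 2-dimensional vectors in place of real embeddings (the pipeline uses 384-dimensional all-MiniLM-L6-v2 embeddings); the function names are illustrative:

```python
import math

# Per-subcategory thresholds from the table above; 0.55 is the
# fallback for categories that were not individually evaluated
THRESHOLDS = {
    "CAREER": 0.6, "FLEX_HOURS": 0.65, "HOURS": 0.6, "FLEX_LOC": 0.6,
    "LEAVE": 0.65, "CONTRACT": 0.5, "LOC": 0.6, "OTHER": 0.55,
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_match(ngram_vec: list[float], target_vec: list[float], sub_category: str) -> bool:
    # An ngram matches a target phrase when similarity clears the
    # subcategory's threshold
    threshold = THRESHOLDS.get(sub_category, THRESHOLDS["OTHER"])
    return cosine_similarity(ngram_vec, target_vec) >= threshold

# Toy vectors standing in for embeddings of an ngram and a target phrase
print(is_match([1.0, 0.2], [1.0, 0.1], "LEAVE"))  # → True (similarity ≈ 0.995)
print(is_match([1.0, 0.0], [0.3, 1.0], "LEAVE"))  # → False (similarity ≈ 0.29)
```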

  5. Finding the best match: Often, multiple ngrams within a sentence will be matched to the same subcategory. In this case, the ngram with the highest cosine similarity is chosen as the best match. This is demonstrated in the figure below:

[Figure: mapping approach]
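The best-match step can be sketched as follows: given several (sub_category, ngram, similarity) matches from one sentence, keep only the highest-scoring ngram per subcategory. The similarity values below are illustrative, not real pipeline output:

```python
def best_match_per_subcategory(matches):
    # matches: iterable of (sub_category, ngram, cosine_similarity) tuples
    best = {}
    for sub_category, ngram, similarity in matches:
        # Keep only the highest-similarity ngram seen for each subcategory
        if sub_category not in best or similarity > best[sub_category][1]:
            best[sub_category] = (ngram, similarity)
    return best

matches = [
    ("LEAVE", "XX days of annual", 0.71),
    ("LEAVE", "days of annual leave", 0.83),
    ("CAREER", "excellent career progression opportunities", 0.66),
]
print(best_match_per_subcategory(matches))
# → {'LEAVE': ('days of annual leave', 0.83),
#    'CAREER': ('excellent career progression opportunities', 0.66)}
```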