To extract skills from job adverts we took an approach of training a named entity recognition (NER) model to predict which parts of job adverts were skills (“skill entities”) and which were experiences (“experience entities”).
To train the NER model we needed labelled data. First we created a random sample of job adverts and got them into a form needed for labelling using Label Studio. More about this labelling process can be found in the
There are 3 entity labels in our training data:
The user interface for this labelling task looks like:
We tried our best to label from the start to end of each individual skill, starting at the verb (if given):
Sometimes it wasn’t easy to label individual skills, for example an earlier part of the sentence might be needed to define the later part. An example of this is “Working in a team and on an individual basis” - we could label “Working in a team” as a single skill, but “on an individual basis” makes no sense without the “Working” word. In these situations we labelled the whole span as multi skills:
Sometimes there were no entities to label:
EXPERIENCE labels will often be followed by the word “experience” e.g. “insurance experience”, and we included some qualifications as experience, e.g. “Electrical qualifications”.
For the current NER model, 5641 entities in 375 job adverts from our dataset of job adverts were labelled; 354 are multiskill, 4696 are skill, and 608 were experience entities. 20% of the labelled entities were held out as a test set to evaluate the models.