The ExtractSkills class

class ojd_daps_skills.pipeline.extract_skills.extract_skills.ExtractSkills(config_name='extract_skills_toy', local=True, verbose=True, multi_process=False)

Class to extract skills from job adverts and map them to a skills taxonomy.

Parameters:
  • config_name (str) – The file name of the config file to be used, defaults to “extract_skills_toy”

  • local (bool) – Whether to load data from local files (True; files not found locally are downloaded from a public source) or from Nesta’s private S3 bucket (False; requires access), defaults to True

  • verbose (bool) – Whether to limit the number of logging messages (True) or not (False, good for debugging), defaults to True

  • multi_process (bool) – Whether to use multiprocessing (True) or not (False), defaults to False

ExtractSkills.load(taxonomy_embedding_file_name: Optional[str] = None, prev_skill_matches_file_name: Optional[str] = None, hard_labelled_skills_name: Optional[str] = None, hier_name_mapper_file_name: Optional[str] = None)

Loads the necessary datasets (formatted taxonomy, hard labelled skills, previously matched skills, taxonomy embeddings), the JobNER skills extraction class and the SkillMapper skill mapper class.

Parameters:
  • taxonomy_embedding_file_name (str, optional) – The relative path to a taxonomy embedding file if it exists. If left unset the embeddings will be generated when the code is run. Defaults to None.

  • prev_skill_matches_file_name (str, optional) – The relative path to a previous skill matches file if it exists. Defaults to None.

  • hard_labelled_skills_name (str, optional) – The relative path to a hard labelled skills file if it exists. Defaults to None.

  • hier_name_mapper_file_name (str, optional) – The relative path to a hierarchy name mapper file if it exists. Defaults to None.
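
The constructor and load() are typically used together, so a minimal sketch of that setup step may help. It uses only the class path, parameters and defaults documented above; the helper name build_extractor is hypothetical, and the import is deferred so the snippet can be read or imported without the package installed.

```python
from typing import Optional


def build_extractor(
    config_name: str = "extract_skills_toy",
    taxonomy_embedding_file_name: Optional[str] = None,
):
    """Create an ExtractSkills instance and load its datasets and models.

    build_extractor is a hypothetical helper, not part of the library.
    """
    # Deferred import: the package is only needed when the helper is called.
    from ojd_daps_skills.pipeline.extract_skills.extract_skills import (
        ExtractSkills,
    )

    es = ExtractSkills(
        config_name=config_name,
        local=True,           # read local files rather than the private S3 bucket
        verbose=True,
        multi_process=False,
    )
    # Leaving taxonomy_embedding_file_name as None means the embeddings
    # are generated when the code is run, as described above.
    es.load(taxonomy_embedding_file_name=taxonomy_embedding_file_name)
    return es
```

Calling build_extractor() with no arguments should correspond to the documented defaults; the other load() arguments are left at None here for brevity.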

ExtractSkills.extract_skills(job_adverts_skills: Union[str, List[str]], format_skills=False)

Extract skills from job adverts using a trained NER model and map them to a taxonomy; this combines get_skills and map_skills. Experiences will also be extracted, but not mapped to a taxonomy. It can also take a list of skills as input and map them to a taxonomy if format_skills is set to True.

Parameters:
  • job_adverts_skills (str or list of strings) – The text of a job advert, a list of job adverts texts, or a list of skills (if format_skills=True)

  • format_skills (bool) – If the input is a list of skills (rather than job adverts), this must be set to True so that they are formatted correctly, defaults to False.

Returns:

A list of dictionaries, one per job advert, containing the skill and experience entities and, for every skill entity, where it maps to in the taxonomy. The output combines both multiskill and skill entities together in the “SKILL” key. Each dictionary is in the format {‘SKILL’: [(skill_entity, (taxonomy_skill_name, taxonomy_skill_id)), …], ‘EXPERIENCE’: […]}

Return type:

list of dictionaries for each job advert.
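
To make the return structure concrete, here is a small sketch that flattens it into one row per matched skill. The sample data is hand-written in the documented shape and is not real model output; the taxonomy names and IDs are placeholders, and flatten_matches is a hypothetical helper.

```python
# Hand-written sample in the documented shape; NOT real model output.
# Taxonomy names and IDs below are placeholders.
sample_output = [
    {
        "SKILL": [
            ("Microsoft Excel", ("example taxonomy skill", "example-id-1")),
            ("communication", ("another taxonomy skill", "example-id-2")),
        ],
        "EXPERIENCE": ["2 years of experience"],
    },
    {"SKILL": [], "EXPERIENCE": []},
]


def flatten_matches(results):
    """Return (advert_index, skill_entity, taxonomy_name, taxonomy_id) rows."""
    rows = []
    for advert_index, advert in enumerate(results):
        # Each SKILL item pairs the extracted entity with its taxonomy match.
        for entity, (name, skill_id) in advert.get("SKILL", []):
            rows.append((advert_index, entity, name, skill_id))
    return rows


rows = flatten_matches(sample_output)
for row in rows:
    print(row)
```

Experience entities are left out of the rows because, as noted above, they are not mapped to the taxonomy.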

ExtractSkills.get_skills(job_adverts: Union[str, List[str]])

Predict skill, multiskill and experience entities in the input job adverts using the NER model. Multiskill entities will be split up and converted into individual skill entities where possible.

Parameters:

job_adverts (str or list of strings) – The text of a job advert or a list of job adverts texts

Returns:

A list of entities extracted from each job advert in the form of dictionaries {“SKILL”: [“Microsoft Excel”], “MULTISKILL”: [], “EXPERIENCE”: []}

Return type:

list, the length is equal to the number of job adverts inputted
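
As an illustration of working with this output, the sketch below counts entities per label across adverts. The first sample entry is the example dictionary from the description above; the second is a hand-written placeholder, and count_entities is a hypothetical helper rather than a library function.

```python
from collections import Counter

# Sample predictions in the documented shape; hand-written, not model output.
sample_entities = [
    {"SKILL": ["Microsoft Excel"], "MULTISKILL": [], "EXPERIENCE": []},
    {"SKILL": ["Python", "SQL"], "MULTISKILL": [], "EXPERIENCE": ["3 years"]},
]


def count_entities(predictions):
    """Count how many entities of each label appear across all adverts."""
    counts = Counter()
    for advert in predictions:
        for label, entities in advert.items():
            counts[label] += len(entities)
    return counts


counts = count_entities(sample_entities)
print(dict(counts))
```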

ExtractSkills.map_skills(predicted_skills: Union[List[dict], List[str]])

Map skills from job advert(s) to a skills taxonomy. If predicted_skills is a list of skills, it will be formatted accordingly to be mapped to a skills taxonomy. All multiskill entities will be mapped in the same way as skill entities are.

Parameters:

predicted_skills (list of strings or a list of dicts) – A list of skill entities either in the form of a list of strings (assumed to be from the same job advert) or a list of the dictionaries outputted from the get_skills function.

Returns:

A list of dictionaries, one per job advert, containing the skill and experience entities and, for every skill entity, where it maps to in the taxonomy. Multiskill entities are treated as skill entities, and the output combines them together as one. Each dictionary is in the format {‘SKILL’: [(skill_entity, (taxonomy_skill_name, taxonomy_skill_id)), …], ‘EXPERIENCE’: […]}

Return type:

list of dictionaries for each job advert.
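
When the raw entities are needed as well as the taxonomy matches, the two stages can be run separately. The following is a sketch only: extract_in_two_steps is a hypothetical helper, and es is assumed to be an ExtractSkills instance on which load() has already been called.

```python
def extract_in_two_steps(es, job_adverts):
    """Run entity prediction and taxonomy mapping as separate steps.

    es is assumed to be a loaded ExtractSkills instance.
    """
    # Step 1: NER predictions, one dictionary per job advert.
    predicted = es.get_skills(job_adverts)
    # Step 2: map the predicted entities to the taxonomy.
    mapped = es.map_skills(predicted)
    return predicted, mapped
```

Keeping the intermediate predictions makes it possible to inspect what the NER model found before any mapping is applied.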

ExtractSkills.format_skills(skills: List[str]) → List[dict]

Format list of skills from a single job advert to be in the format needed for mapping to a taxonomy. Also applies the multiskill splitting to any skills predicted to be multiskills.

Parameters:

skills (str or list of strings) – A list of skills/multiskills from the job advert or a single skill

Returns:

The skills arranged into the format [{“SKILL”: […], “MULTISKILL”: […], “EXPERIENCE”: []}]

Return type:

a list of length 1 containing a dictionary
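
The documented shape can be checked with a few lines of plain Python, which may be useful when hand-building input for map_skills. This is a sketch based solely on the format described above; looks_formatted and the sample values are hypothetical.

```python
EXPECTED_KEYS = {"SKILL", "MULTISKILL", "EXPERIENCE"}


def looks_formatted(obj) -> bool:
    """Check that obj matches the documented format_skills output shape."""
    # Documented return type: a list of length 1 containing a dictionary.
    if not isinstance(obj, list) or len(obj) != 1:
        return False
    entry = obj[0]
    if not isinstance(entry, dict) or not EXPECTED_KEYS <= set(entry):
        return False
    # Every documented key should hold a list of entities.
    return all(isinstance(entry[key], list) for key in EXPECTED_KEYS)


# Hand-written example in the documented shape, not real output.
example = [{"SKILL": ["Microsoft Excel"], "MULTISKILL": [], "EXPERIENCE": []}]
print(looks_formatted(example))
```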