🖇️ NLP Link
NLP Link finds the most similar word (or words) in a reference list to an inputted word. For example, if you are trying to find which word is most similar to 'puppies' from a reference list of ['cats', 'dogs', 'rats', 'birds']
, nlp-link will return 'dogs'.
Another functionality of this package is using the linking methodology to find the SOC code most similar to an inputted job title. More on this here.
🔨 Usage
Install the package using pip:
pip install nlp-link
Basic usage
Match two lists in python:
from nlp_link.linker import NLPLinker
nlp_link = NLPLinker()
# list inputs
comparison_data = ['cats', 'dogs', 'rats', 'birds']
input_data = ['owls', 'feline', 'doggies', 'dogs','chair']
nlp_link.load(comparison_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)
Which outputs:
input_id input_text link_id link_text similarity
0 0 owls 3 birds 0.613577
1 1 feline 0 cats 0.669633
2 2 doggies 1 dogs 0.757443
3 3 dogs 1 dogs 1.000000
4 4 chair 0 cats 0.331178
Extended usage
Match using dictionary inputs (where the key is a unique ID):
from nlp_link.linker import NLPLinker
nlp_link = NLPLinker()
# dict inputs
comparison_data = {'a': 'cats', 'b': 'dogs', 'd': 'rats', 'e': 'birds'}
input_data = {'x': 'owls', 'y': 'feline', 'z': 'doggies', 'za': 'dogs', 'zb': 'chair'}
nlp_link.load(comparison_data)
matches = nlp_link.link_dataset(input_data)
# Top match output
print(matches)
Which outputs:
input_id input_text link_id link_text similarity
0 x owls e birds 0.613577
1 y feline a cats 0.669633
2 z doggies b dogs 0.757443
3 za dogs b dogs 1.000000
4 zb chair a cats 0.331178
Output several most similar matches using the top_n
argument (format_output
needs to be set to False for this):
from nlp_link.linker import NLPLinker
nlp_link = NLPLinker()
comparison_data = {'a': 'cats', 'b': 'dogs', 'c': 'kittens', 'd': 'rats', 'e': 'birds'}
input_data = {'x': 'pets', 'y': 'feline'}
nlp_link.load(comparison_data)
matches = nlp_link.link_dataset(input_data, top_n=2, format_output=False)
# Top match output
print(matches)
# Format output for ease of reading
print({input_data[k]: [comparison_data[r] for r, _ in v] for k,v in matches.items()})
Which will output:
{'x': [['b', 0.8171109], ['a', 0.7650396]], 'y': [['a', 0.6696329], ['c', 0.5778763]]}
{'pets': ['dogs', 'cats'], 'feline': ['cats', 'kittens']}
The drop_most_similar
argument can be set to True if you don't want to output the most similar match - this might be the case if you were comparing a list with itself. For this you would run nlp_link.link_dataset(input_data, drop_most_similar=True)
.