What NLP Has To Say About Radiologists

Lance Reinsmith
3 min read · Jul 13, 2020
Photo by Umanoide on Unsplash

Natural language processing (NLP) uses computers to analyze human language. SpaCy is a powerful open-source natural language processing library for Python that takes much of the heavy lifting out of NLP.

Below is a quick, tongue-in-cheek example that uses spaCy’s word vectors to show briefly how NLP can capture semantic relationships between words.

To install, use either pip:

$ pip install -U spacy

or conda:

$ conda install -c conda-forge spacy

Once installed, you will need to download a language model. For this exercise, I will use the large core English model (this takes a while to download):

$ python -m spacy download en_core_web_lg

Now, you’re all set to start analyzing the English language!

You can characterize text in many ways using NLP. One such way is word vectorization, in which text is converted into numeric values that computers can manipulate more easily. In particular, the spaCy model used here represents each word as a 300-dimensional vector. While it is difficult for me to conceptualize vectors in 300-dimensional space, it is much easier for computers to do so!

Let’s start with some imports:

import spacy
from scipy import spatial

SpaCy stores the word vectors as NumPy arrays, but this all happens behind the scenes. The spatial module from SciPy provides the cosine distance function we need to compare the angle between vectors, which shows how similar two words are (smaller angle = more similar).
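To make the "smaller angle = more similar" idea concrete, here is a minimal pure-Python sketch of cosine distance, meaning one minus the cosine of the angle between two vectors, which is the quantity spatial.distance.cosine returns. The function name cosine_distance and the 3-dimensional toy vectors are made up for illustration; they are not real spaCy values.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); near 0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy vectors (assumed values, only for illustration)
doctor = [1.0, 2.0, 0.0]
nurse = [2.0, 4.0, 0.0]    # same direction as doctor, different length
banana = [0.0, 0.0, 3.0]   # perpendicular to doctor

print(cosine_distance(doctor, nurse))    # very close to 0: same direction
print(cosine_distance(doctor, banana))   # 1.0: perpendicular vectors
```

Note that the length of a vector does not matter here, only its direction, which is why doctor and nurse come out as essentially identical in this toy setup.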

Now, load the English language model we downloaded before:

nlp = spacy.load('en_core_web_lg')

Let’s think of some English language words to play around with. As a radiologist, I have been accused of not seeing patients and not having much of a personality (two claims I dispute), but let’s see what NLP thinks!

We’ll first load the words, ‘physician’, ‘stethoscope’, and ‘personality’ from the NLP vocabulary of over 1.3 million words and get the vectors for each. We’ll also load the word ‘xray.’

word1 = nlp.vocab['physician'].vector
word2 = nlp.vocab['stethoscope'].vector
word3 = nlp.vocab['personality'].vector
word4 = nlp.vocab['xray'].vector

Since these variables all store mathematical representations of 300-dimensional vectors, we can do simple arithmetic with them. To see what happens when we have a physician, take away his/her stethoscope and personality and give him/her a double dose of xrays, we can calculate:

new_calculated_vector = word1 - word2 - word3 + 2*word4
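This arithmetic works element by element because spaCy hands back NumPy arrays. As a rough sketch of what happens under the hood, the same calculation can be written out with plain Python lists. The 3-dimensional values below are hypothetical stand-ins for the real 300-dimensional vectors.

```python
# Hypothetical 3-dimensional stand-ins (assumed values, not real spaCy vectors)
physician = [4.0, 3.0, 1.0]
stethoscope = [1.0, 3.0, 0.0]
personality = [3.0, 0.0, 0.0]
xray = [0.0, 1.0, 4.0]

# Element-wise version of: word1 - word2 - word3 + 2*word4
result = [p - s - q + 2 * x
          for p, s, q, x in zip(physician, stethoscope, personality, xray)]

print(result)  # [0.0, 2.0, 9.0]
```

With plain lists, an expression like 2*xray would repeat the list rather than scale it, which is one reason the NumPy representation is convenient.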

This new_calculated_vector is not a “word” itself, but rather a vector that we can compare to the vectors of other words. To do this, let’s loop through the entire lexicon and compute the cosine distance between each word vector and our newly calculated vector.

similarities = []
for word in nlp.vocab:
    if word.has_vector and word.is_alpha and word.is_lower:
        similarities.append((word,
            spatial.distance.cosine(new_calculated_vector, word.vector)))

Above, we first created an empty list of word similarities. We then cycled through each word in the entire vocabulary. We skipped words that have no vector, that aren’t made up entirely of letters, or that contain uppercase letters. Finally, we added a tuple to the list consisting of the word and its cosine distance from the calculated vector (a smaller distance means greater similarity).
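The same filter-and-collect pattern can be tried without downloading a model. The sketch below uses a made-up toy vocabulary of plain strings and 2-dimensional vectors, with Python’s str.isalpha() and str.islower() standing in (as an assumption, roughly analogous) for spaCy’s is_alpha and is_lower attributes. The names toy_vocab, target, and cosine_distance are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy vocabulary (assumed values, not real spaCy data)
toy_vocab = {
    "xray": [0.0, 1.0],
    "Radiology": [0.1, 1.0],   # contains uppercase: skipped by islower()
    "covid19": [0.2, 0.9],     # contains digits: skipped by isalpha()
    "guitar": [1.0, 0.0],
    "empty": None,             # no vector: skipped
}
target = [0.0, 2.0]

similarities = []
for word, vector in toy_vocab.items():
    if vector is not None and word.isalpha() and word.islower():
        similarities.append((word, cosine_distance(target, vector)))

print([word for word, _ in similarities])  # ['xray', 'guitar']
```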

We can then sort this list of tuples by cosine distance in ascending order (remember, smaller distance = more similar).

sorted_similarities = sorted(similarities, key=lambda item: item[1])
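Because only the few closest words are needed, a full sort is not strictly required. As an alternative sketch, heapq.nsmallest from the standard library picks the n smallest items directly. The (word, distance) pairs below are made-up placeholders that use plain strings instead of spaCy lexemes; they are not real output.

```python
import heapq

# Hypothetical (word, cosine distance) pairs standing in for the real list
similarities = [
    ("banana", 0.95),
    ("radiology", 0.41),
    ("xray", 0.12),
    ("guitar", 0.88),
    ("mri", 0.35),
    ("xrays", 0.20),
]

# Take the three entries with the smallest distance, most similar first
top_three = heapq.nsmallest(3, similarities, key=lambda item: item[1])

for word, distance in top_three:
    print(word, distance)
```

For a list this small the difference is negligible; with a vocabulary of hundreds of thousands of words, avoiding the full sort may save a little time, though the sorted() approach above is perfectly adequate.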

Finally, we can print out the five words whose vectors are most similar to the vector we calculated by “physician minus stethoscope minus personality plus 2*xray.”

for top_similar_word in sorted_similarities[:5]:
    print(top_similar_word[0].text)

Our results:

xray
xrays
mri
radiologist
radiology

So, it appears that NLP agrees that, as a radiologist, I should recede to my dark room and stay cut off from the outside world! Joking aside, this brief demonstration shows just a fraction of the power of NLP and its ability to analyze language.

Full code:

import spacy
from scipy import spatial

nlp = spacy.load('en_core_web_lg')

word1 = nlp.vocab['physician'].vector
word2 = nlp.vocab['stethoscope'].vector
word3 = nlp.vocab['personality'].vector
word4 = nlp.vocab['xray'].vector

new_calculated_vector = word1 - word2 - word3 + 2*word4

similarities = []
for word in nlp.vocab:
    if word.has_vector and word.is_alpha and word.is_lower:
        similarities.append((word,
            spatial.distance.cosine(new_calculated_vector, word.vector)))

sorted_similarities = sorted(similarities, key=lambda item: item[1])

for top_similar_word in sorted_similarities[:5]:
    print(top_similar_word[0].text)
