spaCy is a self-proclaimed “Industrial-Strength Natural Language Processing” tool. This post will shed some light on its capabilities outside of named entity recognition.
Author
Lennard Berger
Published
September 28, 2023
spaCy describes itself as “Industrial-Strength Natural Language Processing”, and I can wholeheartedly agree. Over many years, spaCy has served me extremely well as a base component in a diverse set of projects. The top reasons to use spaCy have always been:
reliability
speed
ease of use
acceptable accuracy out-of-box
Recently, document retrieval systems have been getting a lot of attention. In light of this, I was curious to see how spaCy would hold up at this task. Usually, spaCy is the most well-documented project in the industry, but as with anything, there are edge cases.
spaCy pipelines are built around a shared embedding layer. The idea is that the underlying embedding can be used as the input to other components, such as the tagger, parser, attribute_ruler and lemmatizer (by default).
In the case of spaCy 3.6.0, two different base architectures are available:
tok2vec
transformer
This blog post deep-dives into the performance of the underlying embeddings.
spaCy’s homebrewed tok2vec
The model page of en_core_web_sm states that it uses tok2vec. What exactly is tok2vec, though? This particular model architecture was quite new to me. The documentation of the Tok2Vec component states:
The model to use. Defaults to HashEmbedCNN.
It turns out that this can default to HashEmbedCNN, but it doesn’t need to. Peeking into the config.cfg that ships with the model in site-packages reveals the implementation that is actually used:
spacy.Tok2Vec.v2 architecture
spacy.MultiHashEmbed.v2 embedding layer with width of 96 dimensions
spacy.MaxoutWindowEncoder.v2 encoding layer with width of 96 dimensions
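To make the lookup above reproducible, here is a minimal sketch that lists the architecture registered in each section of a pipeline config. The excerpt is hand-written for illustration and only contains the values mentioned in this post; a real config has many more keys. I am assuming the file is INI-like enough for Python's `configparser`, which may not hold for every real config. If the pipeline is loaded in spaCy, reading `nlp.config` is the more reliable route.

```python
import configparser

# Hand-written excerpt for illustration only; a real pipeline config has many more keys.
CONFIG_EXCERPT = """
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"

[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
"""


def summarize_architectures(config_text: str) -> dict[str, str]:
    """Map each section name to the architecture registered in it."""
    # interpolation=None keeps "${...}" references in real configs as raw text.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    return {
        section: parser[section]["@architectures"].strip('"')
        for section in parser.sections()
        if "@architectures" in parser[section]
    }


architectures = summarize_architectures(CONFIG_EXCERPT)
for section, name in architectures.items():
    print(f"{section}: {name}")
```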
With this knowledge it is possible to run a citation search. From the paper (Miranda et al. 2022) we quote:
To reduce the memory footprint, the default embedding layer in spaCy is a hash embeddings layer. It is a stochastic approximation of traditional embeddings that provides unique vectors for a large number of words without explicitly storing a separate vector for each of them. To be able to compute meaningful representations for both known and unknown words, hash embeddings represent each word as a summary of the normalized word form, subword information and word shape.
They go on to detail how the MultiHashEmbed layer can help reduce computational complexity, thus improving spaCy’s speed.
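To build some intuition for the quoted idea, here is a toy sketch of a hash embedding. It is not spaCy's implementation: the table size, the number of hashes, the use of MD5 and the random (untrained) table are all assumptions made for illustration. It also hashes only the word string itself, whereas the paper describes combining the normalized form, subword information and word shape.

```python
import hashlib
import random

WIDTH = 8       # vector width for this toy example (the model above uses 96)
NUM_ROWS = 50   # deliberately tiny table, far fewer rows than possible words
NUM_HASHES = 4  # number of seeded hashes per word (an assumption for this sketch)

# One small table of random vectors, standing in for learned parameters.
rng = random.Random(0)
TABLE = [[rng.uniform(-1.0, 1.0) for _ in range(WIDTH)] for _ in range(NUM_ROWS)]


def row_index(word: str, seed: int) -> int:
    """Hash a word together with a seed into a row of the table."""
    digest = hashlib.md5(f"{seed}:{word}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_ROWS


def hash_embed(word: str) -> list[float]:
    """Sum the rows selected by several differently seeded hashes."""
    vector = [0.0] * WIDTH
    for seed in range(NUM_HASHES):
        row = TABLE[row_index(word, seed)]
        vector = [a + b for a, b in zip(vector, row)]
    return vector


# Any string gets a vector, including words never seen before.
known = hash_embed("spaCy")
unknown = hash_embed("qwertzuiop")
```

Because every word is mapped to a combination of rows, the table can stay small no matter how large the vocabulary grows. Two words only end up with the same vector if all of their hashes collide.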
In the paper, they describe evaluations on 5 different datasets focusing on named entity recognition, most notably OntoNotes (see Weischedel et al. 2013) and CoNLL 2002 (see Sang and De Meulder 2003).
spaCy v3 & the advent of transformers
In 2021, Explosion released the next iteration of spaCy. Its most prominent feature is accessible and fast transformer-based pipelines. As with the tok2vec-based counterpart, specifics of the model are hard to come by in the documentation. The config.cfg lets us know that en_core_web_trf:
contains a spacy-transformers.TransformerModel.v3 of type roberta-base
was trained on spacy.Corpus.v1
uses a window of 128 tokens and a stride of 96 tokens, with a maxout layer, to produce 768-dimensional tensors
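How a window and a stride interact is easiest to see with numbers. The following is a minimal sketch, under the assumption that windows start every `stride` tokens and span at most `window` tokens; the function name is made up for this post and the exact edge-case handling in spacy-transformers may differ.

```python
def strided_windows(n_tokens: int, window: int = 128, stride: int = 96) -> list[tuple[int, int]]:
    """Return (start, end) token offsets of overlapping windows covering n_tokens."""
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            # The document is fully covered, so no further window is needed.
            break
        start += stride
    return spans


print(strided_windows(300))  # [(0, 128), (96, 224), (192, 300)]
```

With these settings, consecutive windows overlap by 32 tokens, so tokens near a window boundary are seen with context from both sides.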
Unlike for en_core_web_sm, however, there is no technical report on the performance of the model. The benchmarks page indicates that the model performed at a state-of-the-art level in 2020.
For comparison with the tok2vec model, they include OntoNotes and CoNLL 2002. en_core_web_trf outperforms en_core_web_sm on the classification task by 26.5% on OntoNotes (from 0.66 to 0.89) and by 19.4% on CoNLL 2002 (from 0.74 to 0.91).
How do spaCy models fare on different tasks?
If we want to know which tasks the spaCy base models can perform well out of the box, we should evaluate the models more rigorously. Fortunately, the Massive Text Embedding Benchmark, or MTEB (Muennighoff et al. 2022), can help us with this. Quoting the authors of MTEB:
MTEB is a massive benchmark for measuring the performance of text embedding models on diverse embedding tasks.
The core evaluation consists of 67 datasets across seven domains:
Classification
Clustering
Pair classification
Reranking
Retrieval
STS
Summarization
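Several of these task types, such as retrieval and STS, boil down to comparing embeddings by similarity. Here is a minimal sketch of ranking documents against a query, assuming cosine similarity as the measure. The vectors are toy values, not model output, and the function names are my own.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either one is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec: list[float], doc_vecs: list[list[float]]) -> list[int]:
    """Return document indices ordered from most to least similar to the query."""
    scores = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    return [idx for _, idx in sorted(scores, key=lambda pair: pair[0], reverse=True)]


query = [1.0, 0.0]
documents = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
ranking = rank_documents(query, documents)
print(ranking)  # [1, 2, 0]
```

An embedding model does well on such tasks only if semantically related texts end up close together in vector space, which is a different objective from tagging or entity recognition.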
Running spaCy’s models via MTEB
Running MTEB is fairly straightforward. They have a standard template to copy from. The most important part is implementing a custom model:
```python
def encode(self, sentences, batch_size=32, **kwargs):
    """
    Returns a list of embeddings for the given sentences.

    Args:
        sentences (`List[str]`): List of sentences to encode
        batch_size (`int`): Batch size for the encoding

    Returns:
        `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
    """
    if self.trf_model:
        return [
            np.mean([tensor.get() for tensor in doc._.trf_data.tensors[1]], axis=0)
            if len(doc._.trf_data.tensors) > 1
            else np.zeros(768, dtype=np.float32)
            for doc in self.nlp.pipe(sentences, batch_size=batch_size, disable=DISABLED_COMPONENTS, n_process=1)
        ]
    else:
        return [
            doc.vector if len(doc.vector) else np.zeros(96, dtype=np.float32)
            for doc in self.nlp.pipe(sentences, batch_size=batch_size, disable=DISABLED_COMPONENTS, n_process=-1)
        ]
```
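The core idea of this method is to collapse a variable number of vectors into one fixed-size embedding and to fall back to zeros when nothing is available. Here is a dependency-free sketch of that pooling step with toy values. The helper name `mean_pool` is hypothetical and not part of spaCy or MTEB.

```python
def mean_pool(vectors: list[list[float]], width: int) -> list[float]:
    """Average a list of equally sized vectors; return zeros if the list is empty."""
    if not vectors:
        # Mirrors the zero-vector fallback used for inputs without any vectors.
        return [0.0] * width
    count = len(vectors)
    # zip(*vectors) iterates over the vectors dimension by dimension.
    return [sum(column) / count for column in zip(*vectors)]


pooled = mean_pool([[1.0, 2.0], [3.0, 4.0]], width=2)
empty = mean_pool([], width=2)
print(pooled)  # [2.0, 3.0]
print(empty)   # [0.0, 0.0]
```

The fallback matters in a benchmark run because every input must yield a vector of the same size, even an empty string.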
Observations
en_core_web_trf is better at classification tasks than en_core_web_sm (albeit not by a large margin)
en_core_web_sm beats en_core_web_trf at all other tasks
The difference between bge-base-1.5, multilingual-e5-small and all-MiniLM-L12-v2 is minuscule
Both en_core_web_trf and en_core_web_sm consistently lag ten or more points behind the models used for comparison
The results are not surprising. If the spaCy authors have followed through on their paper and developed their models for NER classification tasks, it makes sense that they would not (inherently) do well on other tasks. What strikes me as odd is the fact that en_core_web_trf only marginally (+1.84) outperforms en_core_web_sm when run on a wide class of classification problems.
Conclusions
spaCy models are designed with NER in mind
The built-in models of spaCy can’t be used for retrieval tasks
If one wants to build components in spaCy for tasks other than classification, one should use a different Transformer
One needs to be wary, however, that fine-tuning other models may be detrimental to their performance. Careful evaluation is necessary before deploying a transformer other than the built-in one.
Finally, I would like to emphasize that this is in no way meant to belittle the monumental effort and usefulness of spaCy. It remains battle-tested software for its designed purpose.
Acknowledgments
This evaluation wouldn’t have been possible without compute that was graciously provided by YUKKA Lab AG. Further, I would like to thank Mingzhu Wu for all her helpful comments that made this evaluation more complete!
References
Miranda, Lester James, Ákos Kádár, Adriane Boyd, Sofie Van Landeghem, Anders Søgaard, and Matthew Honnibal. 2022. “Multi Hash Embeddings in spaCy.” arXiv Preprint arXiv:2212.09255.
Muennighoff, Niklas, Nouamane Tazi, Loïc Magne, and Nils Reimers. 2022. “MTEB: Massive Text Embedding Benchmark.” arXiv Preprint arXiv:2210.07316.
Sang, Erik F, and Fien De Meulder. 2003. “Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition.” arXiv Preprint cs/0306050.
Weischedel, Ralph, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, et al. 2013. “OntoNotes Release 5.0 LDC2013T19.” Linguistic Data Consortium, Philadelphia, PA 23: 170.