Learnings from AMLD — 3 Takeaways for Natural Language Processing

Natural Language Processing (NLP) has been booming since 2018, driven by models such as ELMo and BERT. In recent months, the field has made even more progress. But what does it all mean, and what can industry leverage? Here are three key takeaways from AMLD’s NLP track.

Dennis Meier
4 min readMar 1, 2020

Superstars: Transformer Models

Transformer models are all the rage these days. They have beaten the previously dominant long short-term memory (LSTM) networks on many NLP tasks, setting new states of the art. But what are they really about?

The key ingredient: parallelization. Unlike LSTM networks, Transformer models are not required to ingest sentences one word at a time. Instead, they can be fed complete sentences at once.

Leveraging parallel computing is their main advantage, since it allows for training on more data than ever before (i.e. ridiculous amounts of text). It is another case in point for the “Bitter Lesson”: more computation wins in the end, as also highlighted in my Takeaways from AMLD Keynotes post.
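To make the parallelism concrete, here is a minimal sketch of scaled dot-product self-attention, the core operation inside a Transformer, in plain NumPy. It is heavily simplified (no learned query/key/value projections, no multiple heads, no positional encodings), but it shows how all tokens of a sentence are processed in one matrix operation rather than one step at a time:

```python
import numpy as np

def self_attention(X):
    """Simplified scaled dot-product self-attention over a whole sentence.

    X: (seq_len, d) matrix of token embeddings. Every token attends to
    every other token in a single matrix multiplication -- no sequential
    loop, which is what makes Transformers so parallelizable compared
    to LSTMs.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)  # (seq_len, seq_len) token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over tokens
    return weights @ X  # context-mixed representations, all at once

# A toy "sentence" of 5 tokens with 8-dimensional embeddings:
sentence = np.random.rand(5, 8)
out = self_attention(sentence)
print(out.shape)  # (5, 8): every token updated in a single pass
```

An LSTM would need five sequential steps for this sentence; here the whole sequence is one batched computation, which is exactly what lets Transformers soak up huge training corpora.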

BERT and its siblings are rocking the NLP world.

Researchers are currently working on analyzing and improving the first generation of these Transformers. Then again, the topic is developing so quickly that it’s probably more like the second or third generation at this point. At AMLD, Lena Voita of Yandex Research presented her work on better understanding Transformers’ inner workings. She is only one of many researchers pushing in this direction. Better explanations of these models will help improve their training and effective usage in the future.

Transformer models can easily be adapted for cool prototypes and real-world use cases across many types of NLP tasks. The community is making further progress almost daily. For industry, it is thus worth keeping an eye on recent developments. But beware…

Takeaway 1: Keep an eye on state-of-the-art Transformer models and their applications in your environment.

Troublemakers: Adversarial Examples and Instability

Only a few years ago, image classifiers would still trip over single pixels. The computer vision community has worked hard to overcome these “adversarial examples” since the topic gained broad awareness around 2014.

Today, NLP models struggle in much the same way: a tyqo, a missing *** or even an emoticon 🙊 can trip up models and yield unexpected results. Since NLP boomed only more recently, it lags behind computer vision when it comes to stability in real-world applications.

At AMLD, Dominika Basaj from the WildNLP project pointed out a lack of unified robustness benchmarks and of corresponding tools (including libraries and large data sets) to tackle the problem.

At AMLD 2020, Dominika Basaj from WildNLP highlights problems with the current NLP superstars.

Despite their huge success, BERT and ELMo can struggle with “adversarial” examples, and their performance can drop considerably. It looks like the models are basically memorizing the training data and over-fitting to it to a certain degree. Things can turn sour as soon as slight variations in the input appear.

The recency of the NLP hype is one factor, but the problem of adversarial examples is also somewhat more difficult for NLP than it is for computer vision.

WildNLP proposes a solution in the form of a Python framework. It can automatically design adversarial examples, generate natural errors, and eventually perform adversarial robustness training. The Polish team isn’t the only one working on the problem: others are pushing in similar directions.
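As an illustration of the kind of perturbation such frameworks automate, here is a toy sketch that enumerates every single character-swap typo of a sentence; feeding each variant to a model and comparing its predictions against the clean input is a simple way to probe robustness. The function names are my own invention, not WildNLP’s actual API:

```python
def swap_typo(word, i):
    """Swap the characters at positions i and i+1 in a word."""
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def typo_variants(sentence):
    """All single-swap typo variants of a sentence.

    Each variant differs from the original by one adjacent-character
    swap -- the kind of 'natural' error that can flip a brittle
    model's prediction.
    """
    words = sentence.split()
    variants = []
    for wi, w in enumerate(words):
        for i in range(len(w) - 1):
            perturbed = words[:wi] + [swap_typo(w, i)] + words[wi + 1:]
            variants.append(" ".join(perturbed))
    return variants

print(typo_variants("great film")[:2])  # ['rgeat film', 'gerat film']
```

A robustness test would then assert that a classifier gives (roughly) the same answer on every variant as on the clean sentence.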

The community will increasingly have to employ these kinds of tools in order to bring NLP models to the stability levels required for contact with “real” humans (with potentially bad intentions).

Takeaway 2: Use adversarial training and tools such as WildNLP to build robust models that stand the test of the real world.

Helpers: Semantic Datasets

Another pillar for the ML community has been good, public data sets. They act as catalysts for progress (as do public competitions and pre-trained models). A classic example is ImageNet. Starting out as a public data set, it quickly evolved into an annual competition and catalyzed the boom of powerful computer vision algorithms.

In the NLP community, a similar role has been played by WordNet, started in the 1980s. It was followed by BabelNet, an effort started roughly 10 years ago that integrates dictionaries for different languages to make algorithms better at tasks such as translation.

Still, multi-language understanding remains a challenge. Roberto Navigli, professor at Sapienza University of Rome and co-founder of Babelscape, has been at the forefront of this research problem. He mentions that one problem with WordNet and BabelNet is their lack of syntagmatic knowledge.

Working on word sense disambiguation and SyntagNet, Roberto and his team tackle this problem by leveraging syntagmatic relations. Syntagmatic pairs are language elements that can be chained together (e.g. noun-adjective or verb-adverb combinations). Adding such information to data sets improves the performance of algorithms trained on them.
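As a toy illustration of the idea (my own sketch, not how SyntagNet is actually built), consider extracting adjacent adjective-noun and verb-adverb pairs from a POS-tagged sentence:

```python
def syntagmatic_pairs(tagged):
    """Extract adjacent (adjective, noun) and (verb, adverb) pairs
    from a POS-tagged sentence -- a toy version of the syntagmatic
    relations that SyntagNet records at scale.

    tagged: list of (word, pos_tag) tuples, using coarse tags like
    "ADJ", "NOUN", "VERB", "ADV".
    """
    pairs = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1 == "ADJ" and t2 == "NOUN":
            pairs.append((w1, w2))
        if t1 == "VERB" and t2 == "ADV":
            pairs.append((w1, w2))
    return pairs

tagged = [("the", "DET"), ("black", "ADJ"), ("cat", "NOUN"),
          ("runs", "VERB"), ("quickly", "ADV")]
print(syntagmatic_pairs(tagged))  # [('black', 'cat'), ('runs', 'quickly')]
```

Knowing that “black” pairs with “cat” (rather than, say, a financial sense of “black”) is exactly the kind of lexical-combination knowledge that helps disambiguate word senses.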

Roberto and his team just released SensEmBERT, which combines the best of the BERT world with the semantic world and outperforms the existing state of the art on multilingual word sense disambiguation. It was released at AAAI in February and looks very promising.

Takeaway 3: Combining Transformer models with semantic data can yield very good results, especially in multilingual domains.

It will be interesting to see what’s next in NLP, and what innovative solutions researchers and companies will come up with in the next few months! I’m staying tuned.

This is part of a series of summaries from AMLD 2020. Follow and check my other work for more articles on applied machine learning and advanced analytics.
