One of the major milestones in the NLP community is the release of BERT, developed by Google AI. Right after the release, we realized there was a lot we could do with BERT to improve the NLP components of the DeepPavlov framework (an open-source framework for developing virtual dialogue assistants and analyzing text, built on TensorFlow and Keras). You can read more about the BERT-based models of DeepPavlov in our recent posts.
Let’s take a closer look at Conversational BERT.
Initially, BERT was trained on Wikipedia in 104 languages (Multilingual BERT). In addition to the multilingual version, Google released an English BERT (based on English Wikipedia) and a Chinese BERT (based on Chinese Wikipedia). We integrated BERT-based models into DeepPavlov, achieving state-of-the-art results in many NLP tasks. Moreover, we trained RuBERT (a BERT model based on Russian Wikipedia).
However, the formal language of Wikipedia differs from the casual language of social networks. That’s why we decided to train BERT on social data.
Conversational BERT was trained on English Twitter, Reddit, DailyDialogues, OpenSubtitles, Debates, blogs, and Facebook news comments. We used this training data to build a vocabulary of English subtokens and took the English cased version of BERT-base as the initialization for English Conversational BERT.
As a result, Conversational BERT sets new state-of-the-art results on tasks that analyze social data.
How to use Conversational BERT in DeepPavlov
You can either use our new SOTA insult detection model based on Conversational BERT [Insults Detection dataset, English Conversational BERT model], or plug it into any BERT-based model by following the simple guide in the documentation.
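As a concrete illustration, here is a minimal sketch of loading the insult detection model through DeepPavlov’s Python API. It assumes a DeepPavlov 0.x-style install where `build_model` and the `configs.classifiers.insults_kaggle_bert` config are available; config names and the exact API may differ in your release, so check the documentation for your version.

```python
def classify_comments(texts):
    """Run DeepPavlov's insult detection classifier on a list of strings.

    Imports are done lazily so this module can be inspected without
    deeppavlov installed; building the model downloads pretrained
    weights on first use.
    """
    # Assumption: DeepPavlov 0.x API; in newer releases the config may
    # be referenced by name, e.g. build_model("insults_kaggle_bert").
    from deeppavlov import build_model, configs

    # download=True fetches the pretrained model files the first time
    model = build_model(configs.classifiers.insults_kaggle_bert, download=True)
    return model(texts)


if __name__ == "__main__":
    # Each input comment gets a label such as "Insult" / "Not Insult"
    print(classify_comments(["You are kind and helpful.",
                             "You are a complete idiot!"]))
```

Because the model is built once inside the function, for batch processing you would typically hoist `build_model` out of the loop and reuse the returned model object.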
So, that’s pretty much everything we wanted to tell you about our Conversational BERT. Stay tuned!