Published On: February 15, 2023
Categories: Legal, AI
In recent years, society has changed radically thanks to technological advances. Nowadays it is difficult to imagine life without our smartphones, the latest generation of computers, TV platforms, etc. Technology is here to make our lives easier, and that is synonymous with progress. Recently, technological research has focused on Artificial Intelligence (AI), a truly historic milestone. Developers are creating amazing functionalities that were inconceivable at the end of the 20th century. Among them, we can find:
Understanding human speech, as in assistants such as Siri or Alexa, which can perform actions based on our words.
Text translation, such as that offered by Google, which can provide a solid translation of whole text fragments.
Self-driving cars, which are autonomous vehicles capable of staying in a lane, avoiding obstacles or even parking.
This is just a very brief list of some of the functionalities that go hand in hand with AI. Today we are going to introduce a key concept in the context of Natural Language Processing (NLP), that is, the subfield of AI that deals with interactions between computers and human language. This is called Named Entity Recognition (NER). Can you imagine being able to extract the most important ideas from a document without having to read and understand the whole text? This is where NER comes in.
NER involves identifying the key information in a text and classifying it into a set of predefined categories. Let us take an example from the legal world. Imagine you need to know the details of an employment contract, such as the employer, the name of the employee, the salary or the duration of the contract. All this information could be scattered throughout the document, and you could spend a lot of time trying to find it. Wouldn’t it be great to have an application that automatically reads and processes your document and returns all this information to you? It would of course save a lot of time and effort. And now this functionality is at your fingertips thanks to Named Entity Recognition.
But how does NER work? In essence, NER is a two-step process. First, the algorithm must detect an entity in the text. Second, it must categorise the detected entity. Frequent categories are person (PER), location (LOC), organisation (ORG), currency (MON) or date (DATE). Let us illustrate this with an example:
As can be seen, the model starts from the sentence “My name is Galo Gonzalvo, I live in Valencia and I work at Bounsel S.L.” and extracts three entities: the person “Galo Gonzalvo”, the location “Valencia”, and the organisation “Bounsel S.L.”. Specifically, it is a NER-specific RoBERTa model developed by Facebook AI, trained on the CoNLL-2003 dataset, which contains more than 20,000 tagged entities in more than 20,000 sentences. You can test this model through the HuggingFace API by following this link.
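To make this two-step process concrete, here is a deliberately simple toy sketch in Python (not the RoBERTa model itself): step one detects candidate entities as runs of capitalised words, and step two categorises them with a hand-made lookup table. The table entries are taken from the example sentence; the detection heuristic and helper names are purely illustrative assumptions.

```python
import re

# Hand-made lookup table (assumption for illustration only; a real model
# learns categories from data rather than from a fixed table).
CATEGORIES = {
    "Galo Gonzalvo": "PER",
    "Valencia": "LOC",
    "Bounsel S.L.": "ORG",
}

def detect_candidates(text):
    # Step 1: detect candidate entities as runs of capitalised words.
    return re.findall(r"[A-Z][\w.]*(?:\s+[A-Z][\w.]*)*", text)

def categorise(candidates):
    # Step 2: categorise each candidate; unknown candidates are discarded.
    return [(c, CATEGORIES[c]) for c in candidates if c in CATEGORIES]

sentence = "My name is Galo Gonzalvo, I live in Valencia and I work at Bounsel S.L."
entities = categorise(detect_candidates(sentence))
print(entities)
```

Note how spurious capitalised candidates such as “My” and “I” are dropped in step two because the table does not know them; a trained model instead learns to ignore them from context.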
Let us now go into some detail. The first step is to detect an entity, and this is often done using Inside-outside-beginning (IOB) tagging. As shown in the example above, a given entity can be a single word (or token), but it can also be a combination of tokens. Here, I- and B- are prefixes attached to the tag of a detected entity, while an O tag indicates that a word does not belong to any entity. Specifically, the prefix I- indicates that a given token is inside a detected entity, and the prefix B- marks the beginning of an entity that immediately follows a previous entity, with no O tags between them. Let’s illustrate this with another example: “Maria is visiting her grandmother in the Canary Islands, Spain”:
Internally, IOB labelling would work as follows:
As can be seen, “Maria” as well as “Canary Islands” and “Spain” are correctly labelled. “Spain” contains a B- tag only to clearly differentiate it from “Canary Islands”, as both are locations but refer to different entities.
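A small Python sketch shows how IOB tags can be decoded back into entity spans. The tagging below follows the article's example: “Maria” carries an I- tag (no entity precedes it), “Spain” a B- tag to separate it from “Canary Islands”; treating the comma as a separate O token is an assumption about the tokenisation.

```python
def decode_iob(tokens, tags):
    """Merge IOB-tagged tokens back into (entity text, category) spans."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag == "O":
            if current:  # an O tag closes any open entity
                entities.append((" ".join(current), current_type))
                current, current_type = [], None
            continue
        prefix, entity_type = tag.split("-")
        # B- always opens a new entity; I- continues the current one,
        # or opens one if none is active (or the type changes).
        if prefix == "B" or not current or current_type != entity_type:
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], entity_type
        else:
            current.append(token)
    if current:
        entities.append((" ".join(current), current_type))
    return entities

tokens = ["Maria", "is", "visiting", "her", "grandmother", "in", "the",
          "Canary", "Islands", ",", "Spain"]
tags = ["I-PER", "O", "O", "O", "O", "O", "O",
        "I-LOC", "I-LOC", "O", "B-LOC"]
print(decode_iob(tokens, tags))
```

The decoder correctly returns the three entities — Maria (PER), Canary Islands (LOC) and Spain (LOC) — keeping the two locations separate thanks to Spain's B- tag.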
Up to this point, we have been illustrating how NER works using state-of-the-art deep learning models, such as RoBERTa. However, three main techniques can be used to build our NER model:
The first technique is the lookup table, and it is the simplest one. The idea is to create a database of entities and their corresponding categories. For example, one could create a lookup table with all the countries in the world, and every time the algorithm finds an occurrence of any of them in a piece of text, it would tag it as a country entity.
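A lookup-table NER fits in a few lines of Python. The table below is a tiny sample, not the full list of countries the article mentions, and the tokenisation is kept deliberately naive:

```python
# A tiny sample lookup table (a real one would list all countries).
COUNTRIES = {"Spain", "France", "Portugal", "Italy"}

def tag_countries(text):
    # Strip basic punctuation, then tag every token found in the table.
    tokens = text.replace(",", " ").replace(".", " ").split()
    return [(token, "COUNTRY") for token in tokens if token in COUNTRIES]

print(tag_countries("Maria travelled from Spain to France last summer."))
```

The obvious limitation is coverage: any entity missing from the table, or spelled differently, is silently ignored.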
The second technique is regular expressions, which are patterns with a given structure. For example, you could create a regular expression of the form “DD-MM-YYYY”, where D, M and Y are all digits, and each time this pattern appears in a text it will be labelled as a date entity.
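The DD-MM-YYYY pattern just described can be written directly with Python's built-in `re` module (the helper name and example sentence are illustrative):

```python
import re

# Pattern for dates of the form DD-MM-YYYY, as in the article's example.
DATE_PATTERN = re.compile(r"\b(\d{2})-(\d{2})-(\d{4})\b")

def tag_dates(text):
    # Tag every match of the date pattern as a DATE entity.
    return [(match.group(0), "DATE") for match in DATE_PATTERN.finditer(text)]

print(tag_dates("The contract starts on 01-03-2023 and ends on 28-02-2024."))
```

This works well precisely because dates have a rigid surface form; entities like person names have no such pattern, which is where the next technique comes in.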
The third technique, deep learning NER, is the next-generation NER and is based on AI principles. It uses a method called word embedding, which associates a numerical vector with a given word in order to capture the semantic and syntactic relationships between the various components of the text.
Now you may be wondering: if deep learning models are the most powerful, why do we need to consider lookup tables or regular expressions? Well, the answer is that deep learning models are robust only as long as training data are available. AI learns from experience, and therefore it is crucial to have a large enough dataset to capture the subtleties of a specific use case. In general, the usual procedure is to take a general NER model, such as RoBERTa, which is trained on a sufficiently large dataset, and then fine-tune it for a specific use case (e.g., legal documents, as we do in Bounsel 😉). To obtain the training data for such a case, lookup tables and regular expressions can be very useful, and they are a great way to start training your model for the problem at hand.
In addition, for some specific entities such as dates or money, which have very particular patterns, regular expressions have proven to be extremely useful. So it is clear that deep learning models are the most powerful and sophisticated NER technique, but it is worth bearing in mind that the other approaches can be a good starting point for refining models for more specific tasks.
I hope that after reading this article it is clearer how NER works and how useful it can be for extracting the most important entities from a given text. If you are still not convinced, I will conclude with some applications of NER techniques, which are used on a daily basis in many different fields. For example, they are widely used in human resources departments, which rely on these techniques to highlight key elements of CVs and speed up the recruitment process. NER is also used by search engines to extract the key components of a query and find relevant information much more easily; and article categorisation uses NER to extract the people, organisations and places involved in articles and automatically place them within a given category.
Here at Bounsel we apply AI techniques to make contracts more understandable, easier to read and closer to people. In particular, we train our own models on a specific corpus of legal documents to make them extremely robust for all tasks in the legal world. This saves a lot of time and effort for our clients, who can easily manipulate their legal documents within the Bounsel platform and use our NER models to identify key pieces of information, leveraging state-of-the-art AI techniques. If you want to learn more about the amazing AI functionalities included in Bounsel solutions, stay tuned to find out more!