Attention mechanism and transfer learning

There have been a lot of extremely significant models developed in the field of machine learning in recent years, and the natural language processing (NLP) domain is no exception.

It started in 2017 with the introduction of the Transformer architecture, which enabled highly complex and powerful NLP models. Thanks to the attention mechanism, these models could be trained in parallel, which wasn’t possible with the earlier RNN models.

Then, in early 2018, the ULMFiT NLP model came out. It benefited from the idea of transfer learning, which in turn gave it the possibility of fine-tuning NLP models using a relatively small amount of labeled data.

Both of these efforts paved the way for two of the most important and advanced models in modern NLP: BERT and GPT.

What is BERT?

The BERT model is the encoder part of the Transformer architecture, trained with the masked language modeling (MLM) and next sentence prediction (NSP) objectives.

During training, MLM forces the model to guess randomly hidden words in the input sequence. A desirable side effect is that the model learns the meaning of the whole input sentence, because in order to guess a hidden word, it must “look through” the entire sentence.
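
To make the masking idea concrete, here is a minimal sketch in plain Python. It is not BERT’s actual preprocessing code: the function name mask_tokens, the mask_prob parameter, and the toy sentence are our own assumptions, and BERT’s real recipe also sometimes swaps in a random token or leaves the chosen token unchanged, which we leave out here.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and remember what was hidden.

    Returns (masked_tokens, labels), where labels[i] holds the original
    token at every masked position and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(token)   # the model is trained to recover this token
        else:
            masked.append(token)
            labels.append(None)    # this position is ignored by the training loss
    return masked, labels

sentence = "the cat sat on the mat".split()
masked, labels = mask_tokens(sentence, mask_prob=0.5, seed=1)
```

To fill in a [MASK] position correctly, the model can only rely on the unmasked words on both sides, which is why this objective pushes it to read the whole sentence.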

The original BERT was initially trained on a corpus of general texts. Since then, numerous BERT variations have been created, but they all operate on the same fundamental idea.

What is GPT?

The GPT model is built from the decoder part of the Transformer architecture. On the one hand, like BERT, it has been pre-trained on a corpus of general texts, so it inherits the idea of transfer learning from ULMFiT. On the other hand, it boasts all the advantages of the Transformer model, led by the attention mechanism.

This is a “real” language model because its job is to predict the next word based on the given starting context. For this reason, it’s used for different tasks than BERT, such as text generation and summarization, which we’ll discuss later in this post.

Like BERT, GPT has had successors, such as GPT-2 and the recently introduced GPT-3. Although they differ in the number of parameters, the main architectural principles remain the same.

What is Hugging Face? What does Hugging Face do?

Numerous successors to BERT and GPT with new heads attached have been developed to solve various downstream tasks. However, each of these models was published with little attention to code standardization, so adapting them to new tasks wasn’t easy.

In order to solve this problem and bring standardization to the architecture of models and their applicability to further tasks, the Hugging Face library was introduced.

At the very beginning, it was just a startup that provided an open-source repository with implementations of popular Transformer-based models. Now, it’s an excellent source of over 50,000 pre-trained models with source code and heads dedicated to solving various downstream tasks.

The Hugging Face library includes important NLP and image processing datasets that are ready for immediate use. It also features the peripherals necessary to run the models, such as implementations of tokenizers and metrics along with the necessary documentation, usage examples, and even dedicated trainers for various tasks.

The Hugging Face Hub

Most of the NLP tasks can now be solved with the use of modern models from the Hugging Face library that offers the standardized, open-source code for implementation of the models and all the peripherals.

But that’s not all that Hugging Face provides. It also standardizes the typical NLP pipeline, which takes you through the entire process of fine-tuning a model on a particular dataset in well-defined steps.

Load the dataset

The first step is to load a dataset from the collection of over 5,000 available. Of course, we can also use our own corpus; how to connect it to the library is discussed later in this article.

A key feature of the datasets provided by the library is that they can be converted to Pandas or NumPy formats, both of which are widely used in the data science and machine learning community.

Tokenize the text

Once the text corpus is prepared, it must be fed into the model in some way. So we need to convert our text into a numerical representation, just as was done during the initial training process. To do this, we load the tokenizer that comes with the model and feed the whole corpus through it.

Most tokenizers are written in the Rust programming language, so the tokenization process doesn’t take much time. In addition, thanks to the standardization provided by the Hugging Face library, we can use different tokenizers in the same way, regardless of how they work internally.

Fine-tune the model

The model fine-tuning is the process where the pre-trained model is fed with the domain-specific data in order to make it work better in the dedicated environment. Different tasks require different tuning strategies.

Some involve applying a special head on top of the model and training with a new objective, such as cross entropy. In other cases, it’s sufficient for the model to see the new data while training with the same objective as in pre-training.

Regardless of the type of fine-tuning, the necessary heads and objectives are provided by the library in a highly standardized way. Once we know the task we want to perform, we select a predefined model with the appropriate head, and the rest is done in the same way for most tasks.

How to solve text classification issues with Hugging Face

We begin our exploration of the Hugging Face library with perhaps the most common task in NLP: text classification. Its popularity is related to the wide range of tasks for which it can be used.

For example, an email spam filter is essentially built using a mechanism that classifies a message as spam based on its contents.

A similar mechanism can be used to automatically detect hate speech, which can pinpoint dangerous Twitter or Facebook posts that administrators should pay attention to.

Another example might be a system that scans the entire Internet to check the customers’ general opinion of a product.

In our case, we’ll try to build a system that can classify news articles according to their topics: World, Sports, Business, and Sci/Tech. To do this, we’ll rely completely on the Hugging Face library.

First, we have to load the dataset from the Hugging Face Hub, then use the tokenizers implemented by this library.

The next step is loading the weights of the predefined model from the Hub into the corresponding model class that has a head for the classification task.

Finally, we will use the trainer class to fine-tune and evaluate the model.

1. Loading a dataset

The dataset we’re going to use is named “ag_news” in the Hugging Face Hub. In order to load it, we simply import the load_dataset function from the datasets library, then call it with the correct dataset name:
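
A minimal sketch of this step, assuming the datasets package is installed. The wrapper function load_ag_news is our own name, used only so that nothing is downloaded until it is called.

```python
def load_ag_news():
    """Download AG News (or reuse the local cache) and return a DatasetDict."""
    # Imported inside the function so that defining it needs no network access.
    from datasets import load_dataset

    return load_dataset("ag_news")
```

Calling ag_news = load_ag_news() gives the variable used in the rest of this section. From there, print(ag_news) shows the available splits, ag_news["train"] selects the training split, and ag_news["train"][0] returns its first example as a dictionary.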


This will start downloading the dataset to our local device, then load it into our local variable ag_news.

Now, we can look inside this variable:


We can see how our dataset is structured. It’s stored inside the DatasetDict class and is already split into train and test sets. In order to access each of the splits, we just index it by name using standard Python dictionary syntax:


Here we can observe that each object in the DatasetDict is then stored as the Dataset class. It’s a crucial class for the entire Hugging Face library, since it’s the required type that the trainer can accept for the model to be fine-tuned.

This means that if we want to use the trainer along with the custom corpus, we’ll have to wrap it into the Dataset class.

We can further access our Dataset object as just a list:


We can also check the column names or the features:


As a result, we can see the metadata on the elements of the dataset: the “text” property is of the string type while the label is represented as the ClassLabel. It’s a special class that provides the capabilities to manipulate these class labels and map the label integers to their names with the use of the int2str method.

We can also access our dataset in the Pandas DataFrame format, which makes exploratory data analysis simpler, since most of us are more familiar with it:


In order to see the label names we have, we can use the ClassLabel property int2str that maps the label integers to the proper names:
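
The sketch below shows one way to do this for a single split. It assumes that the split has an integer “label” column described by a ClassLabel feature, as is the case for ag_news; the helper name add_label_names and the label_name column are our own.

```python
def add_label_names(split):
    """Return one split as a Pandas DataFrame with a readable label_name column."""
    df = split.to_pandas()
    label_feature = split.features["label"]   # expected to be a ClassLabel
    # int2str accepts a list of integers and returns the matching class names.
    df["label_name"] = label_feature.int2str(df["label"].tolist())
    return df
```

For example, add_label_names(ag_news["train"]).head() would show the first few articles together with their topic names.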


We can now check whether the training dataset is balanced:
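
A small sketch of such a check in plain Python. The function name class_distribution and the toy labels are our own; on the real data, the labels would come from ag_news["train"]["label"].

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: share of examples}, sorted by label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}

toy_labels = [0, 1, 2, 3, 0, 1, 2, 3]
print(class_distribution(toy_labels))   # {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
```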


It turns out that it’s perfectly balanced, but this will rarely be the case with other datasets. When the classes are imbalanced, we have to take some precautions, such as data augmentation, oversampling the minority class, or undersampling the majority class.

You should also look at the distribution of sentence lengths across classes. The sentence length will be measured as the number of words in the sentence:
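
One way to compute these statistics without any plotting library is sketched below. The helper name words_per_class is our own, and words are simply split on whitespace.

```python
from collections import defaultdict
from statistics import mean

def words_per_class(texts, labels):
    """Return {label: {"mean": ..., "max": ...}} of whitespace-separated word counts."""
    lengths = defaultdict(list)
    for text, label in zip(texts, labels):
        lengths[label].append(len(text.split()))
    return {
        label: {"mean": mean(values), "max": max(values)}
        for label, values in sorted(lengths.items())
    }
```

On the real data, this could be called as words_per_class(ag_news["train"]["text"], ag_news["train"]["label"]).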


[Figure: word length per class]

As you can see, the average length of sentences in each class doesn’t exceed 50 words, with extreme values exceeding 200 words in the third class.

Our classifier will be created by fine-tuning the BERT language model. It accepts at most 512 tokens, and assuming that each word decomposes into about 1.5 tokens, in each case the number of tokens in the sentences will not exceed the maximum accepted by the model. If there were any outliers with more than 512 tokens, they could be truncated.
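
The back-of-the-envelope estimate above can be written down as a short sketch. The 1.5 tokens-per-word ratio is the rough assumption from the text, not a measured value, and the function names are our own.

```python
MAX_TOKENS = 512         # input limit of the model, as stated above
TOKENS_PER_WORD = 1.5    # rough rule of thumb, not a measured value

def estimated_tokens(n_words, tokens_per_word=TOKENS_PER_WORD):
    """Estimate how many tokens a text of n_words words will produce."""
    return n_words * tokens_per_word

def fits_in_model(n_words, max_tokens=MAX_TOKENS):
    """Check whether the estimated token count stays within the model limit."""
    return estimated_tokens(n_words) <= max_tokens

print(estimated_tokens(200))   # 300.0
```

Under this assumption, a 200-word text needs roughly 300 tokens, and the limit is only reached at around 340 words.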

The only real problem that could arise would be if there were many instances where the length after tokenization would exceed the required 512 tokens. In that case, we would have to treat them differently, but that is beyond the scope of this article.

2. Text tokenization

As mentioned earlier, one of the most important parts of the NLP pipeline is the tokenization of text, as the pre-trained models we’ll use were trained on numerical representations. For this reason, in order to perform the fine-tuning and inference procedure, we need to convert the text to the same representation on which the models were pretrained.

Most modern tokenizers use the subword tokenization technique, which involves splitting words into subword elements, then converting them into identifiers using predefined dictionaries.

This ensures that the dictionary isn’t too large, as rare words are modeled with smaller and more common subwords. This approach is robust to spelling errors, and the splitting rules can be learned from a corpus of data, so common words will be treated as a single token and rare words will be split, making the distribution of tokens in the dataset more balanced.
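
To illustrate the idea, here is a toy greedy longest-match splitter in the spirit of WordPiece. It is a sketch only, not the library’s implementation: the function name, the tiny vocabulary, and the handling of unknown words are our own assumptions.

```python
UNK = "[UNK]"

def split_into_subwords(word, vocab):
    """Greedy longest-match-first split; continuation pieces carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [UNK]   # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"token", "##ization", "##s", "class", "##ifier"}
print(split_into_subwords("tokenization", toy_vocab))   # ['token', '##ization']
```

A frequent word such as “token” stays in one piece, while the rarer “tokenization” is assembled from two entries that are already in the dictionary.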

The Hugging Face library implements several different subword tokenizers that come with different models:

  • the WordPiece used by the BERT encoder,
  • the BPE (Byte-Pair Encoding) used by the RoBERTa model,
  • the SentencePiece used by the XLM-RoBERTa encoder.

At this stage, it’s important to remember that there are different types of tokenizers. You need to make sure that the tokenizer you use is the one the model was trained with. Having said that, understanding how they differ is beyond the scope of this post.

We’ll use the AutoTokenizer class to tokenize the text, which requires a model ID from the Hugging Face Hub or a local path to the model to automatically load the appropriate tokenizer. Since we’ll be using the cased version of the BERT-base model, the code that loads the tokenizer is as follows:
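
A minimal sketch of this step, assuming the transformers package is installed. The wrapper function load_tokenizer and the MODEL_ID constant are our own names; “bert-base-cased” is the Hub identifier of the cased BERT-base model.

```python
MODEL_ID = "bert-base-cased"   # cased BERT-base, as chosen in the text

def load_tokenizer(model_id=MODEL_ID):
    """Load the tokenizer that matches the given model ID or local path."""
    # Imported inside the function so that defining it downloads nothing.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_id)
```

With tokenizer = load_tokenizer(), the same object can then be reused for every tokenization call that follows.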


Now the text can be tokenized in the following way. Later on, we’ll tokenize and extract features one batch at a time in order to save RAM:
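
A small sketch of such a call. The helper name tokenize_batch is our own, while padding and truncation are regular arguments of Hugging Face tokenizers.

```python
def tokenize_batch(tokenizer, texts):
    """Tokenize a list of strings in one call, padding and truncating as needed."""
    return tokenizer(texts, padding=True, truncation=True)
```

For instance, tokenize_batch(tokenizer, ag_news["train"][:8]["text"]) would tokenize the first eight articles and return their input IDs and attention masks.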


The padding parameter determines whether examples are padded to the length of the longest example in the current batch, while the truncation parameter allows the tokenizer to truncate examples longer than the maximum length, which, in this case, is 512 tokens.

3. Training the classifier

As mentioned earlier, the most effective way to train a classifier in NLP is to perform supervised fine-tuning of a pretrained encoder for the downstream task using domain-specific data.

The tokenized text is fed to a pre-trained encoder that returns an array of token embeddings. We get a separate embedding for each token.

So, how do we train the classifier? We can do this in two ways.

The first is to treat the model encoder as a feature extractor. The second is to fine-tune the entire model using an attached classifier head.

Feature extractor

In this case, we extract token embeddings from the given text with the use of the pretrained model. We treat the extracted token embeddings as the text representations and train a classifier directly on them.

We don’t change the weights of the encoder. This is a great approach when we don’t have a GPU available, as it’s much less computationally intensive.

First, we need to load the encoder that will be used as the feature extractor. Just as before, we use the Auto class, which, based on the model identifier or local path, is able to load the appropriate model.
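
A minimal sketch of loading the encoder, assuming the transformers package and PyTorch are installed. The wrapper function load_encoder and its defaults are our own.

```python
def load_encoder(model_id="bert-base-cased", device="cuda"):
    """Load the pretrained encoder and move it to the requested device."""
    # Imported inside the function so that defining it downloads nothing.
    from transformers import AutoModel

    model = AutoModel.from_pretrained(model_id, output_hidden_states=True)
    return model.to(device)
```

If no GPU is available, load_encoder(device="cpu") keeps everything on the CPU.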


Compared with loading the tokenizer, we pass one additional parameter to the AutoModel class, output_hidden_states, which makes the model return its hidden states, that is, the token embeddings. We also want to process our data on the GPU, so we transfer the model to the VRAM by calling to(“cuda”).

Since we want the feature extraction process to be as optimized as possible, we’ll process the texts in batches. We’ll tokenize the text first, then pass it to the model to embed the tokens.

We intend to classify the text, so we’ll only retrieve the embedding associated with the first [CLS] token. This is the embedding that can represent the whole sentence.

The entire process of retrieving sentence embeddings from the text given the model and tokenizer is implemented in the helper function:
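
One possible version of such a helper is sketched below, assuming PyTorch, NumPy, and a Hugging Face tokenizer and encoder. The function name extract_cls_embeddings and the default batch size are our own choices, not necessarily those of the original code.

```python
def extract_cls_embeddings(texts, tokenizer, model, device="cuda", batch_size=32):
    """Return an array with one [CLS] embedding per input text."""
    import numpy as np
    import torch

    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Tokenize only the current batch and move the tensors to the model's device.
        inputs = tokenizer(
            batch, padding=True, truncation=True, return_tensors="pt"
        ).to(device)
        with torch.no_grad():               # inference only, no gradients needed
            outputs = model(**inputs)
        # Keep the hidden state of the first token ([CLS]) for each text.
        cls_embeddings = outputs.last_hidden_state[:, 0, :]
        chunks.append(cls_embeddings.cpu().numpy())
    return np.concatenate(chunks, axis=0)
```

Because only one batch lives on the GPU at a time, memory use stays roughly constant no matter how large the corpus is.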


It’s also important to note that we used torch.no_grad() to disable the calculation of the gradient while calling the model in order to reduce memory consumption.

Also, since the model’s output is of the (batch_size, sequence_length, hidden_size) shape, and we want to get the hidden_output/token_embedding that’s associated with the first token in the sequence, we use the code snippet:


We call cpu().numpy() in order to move the embedding from the GPU to the CPU and cast it to the NumPy array format, as it’ll be easier to handle later on.

Now, we’re going to use our helper function to extract the sentence embeddings:


As we stated before, our embeddings are in the shape of (batch_size, 1, hidden_size). Because of that, we need to flatten them:
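
A small NumPy sketch of that flattening step, using an all-zero toy array in place of real embeddings. The helper name is our own, and 768 is the hidden size of BERT-base.

```python
import numpy as np

def flatten_embeddings(embeddings):
    """Collapse (n, 1, hidden_size) into (n, hidden_size)."""
    embeddings = np.asarray(embeddings)
    return embeddings.reshape(embeddings.shape[0], -1)

toy_embeddings = np.zeros((6, 1, 768))
print(flatten_embeddings(toy_embeddings).shape)   # (6, 768)
```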


We can now define our classifier and fit it with our embeddings. We will use the SVM from the scikit-learn library to define our simple model:
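
A hedged sketch of this step, assuming scikit-learn is installed. We use SVC with its default settings, since the exact configuration of the original model is not given; the helper names are our own.

```python
def fit_svm(train_embeddings, train_labels):
    """Fit a support vector classifier on sentence embeddings."""
    from sklearn.svm import SVC

    classifier = SVC()   # default hyperparameters, chosen only for illustration
    classifier.fit(train_embeddings, train_labels)
    return classifier

def evaluate_classifier(classifier, test_embeddings, test_labels):
    """Return a text report with precision, recall, and F1 per class."""
    from sklearn.metrics import classification_report

    predictions = classifier.predict(test_embeddings)
    return classification_report(test_labels, predictions)
```

Printing evaluate_classifier(...) on the test embeddings gives a per-class breakdown of the results.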


Clearly, the model is performing quite well, though it certainly can be better.


Previously, we treated our BERT language model as a feature extractor that created a sentence representation for later use in another machine learning model.

Now, we will attach a dense layer to the BERT language model as a classifier, and during training we’ll update the weights of not only the added layer, but the entire model.

We should create a dataset that can be accepted by the Hugging Face trainer class. In order to do so, we will first define the Dataset class inheriting from the Hugging Face Dataset class:


Then we have to convert the texts into Dataset objects for both the train and test corpus:
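
As an alternative that avoids a hand-written class, the sketch below tokenizes the splits with the map method of the datasets library. This is a swapped-in technique rather than the approach described above: the result is still made of Dataset objects, which the trainer accepts. The helper name tokenize_splits is our own.

```python
def tokenize_splits(dataset_dict, tokenizer):
    """Add tokenizer outputs (input IDs and so on) as new columns to every split."""
    def tokenize(batch):
        # Padding is left to the data collator at training time.
        return tokenizer(batch["text"], truncation=True)

    return dataset_dict.map(tokenize, batched=True)
```

After tokenized = tokenize_splits(ag_news, tokenizer), the train and test corpora are available as tokenized["train"] and tokenized["test"].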


Now, we’ll load the encoder with the classification head attached on top. The Hugging Face library provides us with a special Auto class for that: AutoModelForSequenceClassification.
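
A minimal sketch of loading such a model, assuming the transformers package is installed. The wrapper function load_classifier is our own; num_labels is set to 4 to match the four news topics.

```python
def load_classifier(model_id="bert-base-cased", num_labels=4):
    """Load the pretrained encoder with a freshly initialized classification head."""
    # Imported inside the function so that defining it downloads nothing.
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        model_id, num_labels=num_labels
    )
```

The encoder weights come from the Hub, while the classification head starts out randomly initialized and only becomes useful after fine-tuning.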


Then we define a helper function that returns the current performance of our model at the time of evaluation in terms of predefined scores:
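
A sketch of such a helper using NumPy only. The choice of accuracy and macro-averaged F1 is our assumption, since the scores used originally are not named; the trainer passes in an object that exposes predictions (the logits) and label_ids.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Return accuracy and macro-averaged F1 for one evaluation pass."""
    logits = np.asarray(eval_pred.predictions)
    labels = np.asarray(eval_pred.label_ids)
    predictions = np.argmax(logits, axis=-1)

    accuracy = float((predictions == labels).mean())

    # Macro F1: per-class F1, averaged over the classes present in the labels.
    f1_scores = []
    for cls in np.unique(labels):
        tp = np.sum((predictions == cls) & (labels == cls))
        fp = np.sum((predictions == cls) & (labels != cls))
        fn = np.sum((predictions != cls) & (labels == cls))
        f1_scores.append(2 * tp / (2 * tp + fp + fn))
    return {"accuracy": accuracy, "f1_macro": float(np.mean(f1_scores))}
```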


Now, we need to define the basic parameters of the training process and import the special Trainer class that performs the training, since it already implements the whole training loop.


We’re left with initializing a Trainer class with a predefined dataset, model, metrics, and training arguments. Once we’re done, that’s pretty much it.
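
A hedged sketch of how these pieces could be wired together, assuming the transformers package is installed. The helper name build_trainer, the output directory, and all hyperparameter values are illustrative assumptions, not the settings used to produce the results in this article.

```python
def build_trainer(model, tokenizer, train_dataset, eval_dataset, compute_metrics,
                  output_dir="bert-ag-news"):
    """Create a Trainer for fine-tuning a sequence classification model."""
    from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

    # Illustrative values only; tune them for your own hardware and data.
    args = TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=2,
        learning_rate=2e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        weight_decay=0.01,
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        # Pads every batch to the length of its longest example.
        data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
        compute_metrics=compute_metrics,
    )
```

With trainer = build_trainer(...), calling trainer.train() runs the fine-tuning and trainer.evaluate() returns the metrics on the evaluation set.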


Once the training has finished, let’s evaluate the model on the test set:


We can see that our model has significantly outperformed the previous classifier. This is because we tuned the whole model and adjusted not only the classification layer, but also the encoder itself, so that the embeddings produced can better discriminate topics.

Final thoughts on using Hugging Face to solve text classification issues

Hugging Face does come with some challenges. You’ll have to understand a few concepts and implementation details before you can use it fully.

However, you’ll be able to always use the same interface for the common problems encountered in different downstream tasks, such as text classification, machine translation, or named entity recognition.

Additionally, the library gives you the ability to use the most modern models and focus your effort on exploring the problems rather than implementation issues. As such, we think it’s well worth trying out.

Thank you for reading our tutorial on solving text classification issues with Hugging Face! We hope that it has helped you see all the advantages of using the Hugging Face library.

If you want to dive even deeper into the world of Hugging Face, we highly recommend taking a look at the Hugging Face website. You’ll find many offers of tutorials and community events there.

Afterwards, when you feel comfortable enough, you can try to build your own project and describe your findings on the Hugging Face blog. By teaching others, you’ll organize your own knowledge and contribute to sharing the idea of NLP and Hugging Face at the same time.
