Modeling

Adaptive Model

class farm.modeling.adaptive_model.BaseAdaptiveModel(prediction_heads)[source]

Bases: object

Base Class for implementing AdaptiveModel with frameworks like PyTorch and ONNX.

subclasses = {'AdaptiveModel': <class 'farm.modeling.adaptive_model.AdaptiveModel'>, 'ONNXAdaptiveModel': <class 'farm.modeling.adaptive_model.ONNXAdaptiveModel'>, 'ONNXWrapper': <class 'farm.modeling.adaptive_model.ONNXWrapper'>}
__init__(prediction_heads)[source]

Initialize self. See help(type(self)) for accurate signature.

classmethod load(**kwargs)[source]

Load the corresponding AdaptiveModel class (AdaptiveModel/ONNXAdaptiveModel) based on the files in the load_dir.

Parameters

kwargs – arguments to pass for loading the model.

Returns

instance of a model

logits_to_preds(logits, **kwargs)[source]

Get predictions from all prediction heads.

Parameters
  • logits (object) – logits, can vary in shape and type, depending on task

  • label_maps – Maps from label encoding to label string

  • label_maps – dict

Returns

A list of all predictions from all prediction heads

formatted_preds(logits, **kwargs)[source]

Format predictions for inference.

Parameters
  • logits (torch.tensor) – model logits

  • kwargs (object) – placeholder for passing generic parameters

Returns

predictions in the right format

connect_heads_with_processor(tasks, require_labels=True)[source]

Populates prediction head with information coming from tasks.

Parameters
  • tasks – A dictionary where the keys are the names of the tasks and the values are the details of the task (e.g. label_list, metric, tensor name)

  • require_labels – If True, an error will be thrown when a task is not supplied with labels.

Returns

class farm.modeling.adaptive_model.AdaptiveModel(language_model, prediction_heads, embeds_dropout_prob, lm_output_types, device, loss_aggregation_fn=None)[source]

Bases: torch.nn.modules.module.Module, farm.modeling.adaptive_model.BaseAdaptiveModel

PyTorch implementation containing all the modelling needed for your NLP task. Combines a language model and a prediction head. Allows for gradient flow back to the language model component.

__init__(language_model, prediction_heads, embeds_dropout_prob, lm_output_types, device, loss_aggregation_fn=None)[source]
Parameters
  • language_model (LanguageModel) – Any model that turns token ids into vector representations

  • prediction_heads (list) – A list of models that take embeddings and return logits for a given task

  • embeds_dropout_prob – The probability that a value in the embeddings returned by the language model will be zeroed.

  • embeds_dropout_prob – float

  • lm_output_types (list or str) – How to extract the embeddings from the final layer of the language model. When set to “per_token”, one embedding will be extracted per input token. If set to “per_sequence”, a single embedding will be extracted to represent the full input sequence. Can either be a single string, or a list of strings, one for each prediction head.

  • device – The device on which this model will operate. Either “cpu” or “cuda”.

  • loss_aggregation_fn (function) – Function to aggregate the loss of multiple prediction heads. Input: loss_per_head (list of tensors), global_step (int), batch (dict). Output: aggregated loss (tensor). Default is a simple sum: lambda loss_per_head, global_step=None, batch=None: sum(loss_per_head). However, you can pass more complex functions that depend on the current step (e.g. for round-robin style multitask learning) or the actual content of the batch (e.g. certain labels). Note: The loss at this stage is per sample, i.e. one tensor of shape (batch_size) per prediction head. A construction sketch follows below.
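
A minimal construction sketch for a binary text classification setup (model name, dropout value and label count are illustrative, not prescriptive):

import torch
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.language_model import LanguageModel
from farm.modeling.prediction_head import TextClassificationHead

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

language_model = LanguageModel.load("bert-base-cased")
prediction_head = TextClassificationHead(num_labels=2)

model = AdaptiveModel(
    language_model=language_model,
    prediction_heads=[prediction_head],
    embeds_dropout_prob=0.1,
    lm_output_types=["per_sequence"],  # one embedding per input sequence for classification
    device=device,
)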

fit_heads_to_lm()[source]

Iterates over each prediction head and ensures that its input dimensionality matches the output dimensionality of the language model. If it doesn't, the prediction head is resized to fit.

save(save_dir)[source]

Saves the language model and prediction heads. This will generate a config file and model weights for each.

Parameters

save_dir (Path) – path to save to

classmethod load(load_dir, device, strict=True, lm_name=None)[source]

Loads an AdaptiveModel from a directory. The directory must contain:

  • language_model.bin

  • language_model_config.json

  • prediction_head_X.bin (multiple prediction heads possible)

  • prediction_head_X_config.json

  • processor_config.json (config for transforming input)

  • vocab.txt (vocab file for the language model, turning text into WordPiece tokens)

Parameters
  • load_dir (Path) – location where adaptive model is stored

  • device (torch.device) – the device to which the model should be sent, either cpu or cuda

  • lm_name (str) – the name to assign to the loaded language model

  • strict (bool) – whether to strictly enforce that the keys loaded from saved model match the ones in the PredictionHead (see torch.nn.module.load_state_dict()). Set to False for backwards compatibility with PHs saved with older version of FARM.
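
A save/load round trip might look as follows, with model being an AdaptiveModel as constructed above (the directory path is illustrative):

from pathlib import Path
from farm.modeling.adaptive_model import AdaptiveModel

save_dir = Path("saved_models/my_model")  # illustrative path
model.save(save_dir)  # writes language model, prediction head(s) and their config files

# later: restore the model on the desired device
model = AdaptiveModel.load(save_dir, device="cpu")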

logits_to_loss_per_head(logits, **kwargs)[source]

Collect losses from each prediction head.

Parameters

logits (object) – logits, can vary in shape and type, depending on task.

Returns

The per sample, per prediction head loss whose first two dimensions have length n_pred_heads and batch_size

logits_to_loss(logits, global_step=None, **kwargs)[source]

Get losses from all prediction heads & reduce to single loss per sample.

Parameters
  • logits (object) – logits, can vary in shape and type, depending on task

  • global_step (int) – number of current training step

  • kwargs (object) – placeholder for passing generic parameters. Note: Contains the batch (as dict of tensors), when called from Trainer.train().

Return loss

torch.tensor that is the per sample loss (len: batch_size)

logits_to_preds(logits, **kwargs)[source]

Get predictions from all prediction heads.

Parameters
  • logits (object) – logits, can vary in shape and type, depending on task

  • label_maps – Maps from label encoding to label string

  • label_maps – dict

Returns

A list of all predictions from all prediction heads

prepare_labels(**kwargs)[source]

Label conversion to original label space, per prediction head.

Parameters

label_maps (dict[int:str]) – dictionary for mapping ids to label strings

Returns

labels in the right format

formatted_preds(logits, **kwargs)[source]

Format predictions for inference.

Parameters
  • logits (torch.tensor) – model logits

  • kwargs (object) – placeholder for passing generic parameters

Returns

predictions in the right format

forward(**kwargs)[source]

Push data through the whole model and return logits. The data will propagate through the language model and each of the attached prediction heads.

Parameters

kwargs – Holds all arguments that need to be passed to the language model and prediction head(s).

Returns

all logits as torch.tensor or multiple tensors.

forward_lm(**kwargs)[source]

Forward pass for the language model.

log_params()[source]

Logs parameters to the generic logger MlLogger

verify_vocab_size(vocab_size)[source]

Verifies that the model vocabulary fits the tokenizer vocabulary. They can diverge when a custom vocabulary has been added via tokenizer.add_tokens()

get_language()[source]
convert_to_transformers()[source]
classmethod convert_from_transformers(model_name_or_path, device, task_type)[source]
Load a (downstream) model from huggingface’s transformers format. Use cases:
  • continue training in FARM (e.g. take a squad QA model and fine-tune on your own data)

  • compare models without switching frameworks

  • use model directly for inference

Parameters
  • model_name_or_path

    local path of a saved model or name of a public one. Exemplary public names: distilbert-base-uncased-distilled-squad, deepset/bert-large-uncased-whole-word-masking-squad2

    See https://huggingface.co/models for full list

  • device – “cpu” or “cuda”

  • task_type – One of: ‘question_answering’, ‘text_classification’, ‘embeddings’. More tasks coming soon …

Returns

AdaptiveModel
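
For example, loading a public SQuAD-style QA model from the model hub into FARM (a sketch; the model name is one of the public examples above):

from farm.modeling.adaptive_model import AdaptiveModel

model = AdaptiveModel.convert_from_transformers(
    "deepset/bert-large-uncased-whole-word-masking-squad2",
    device="cpu",
    task_type="question_answering",
)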

convert_to_onnx(output_path, opset_version=11)[source]

Convert a PyTorch AdaptiveModel to ONNX.

The conversion is trace-based: it performs a forward pass on the model with an input batch.

Parameters
  • output_path (Path) – model dir to write the model and config files

  • opset_version (int) – ONNX opset version

Returns

class farm.modeling.adaptive_model.ONNXAdaptiveModel(onnx_session, prediction_heads, language, device='cpu')[source]

Bases: farm.modeling.adaptive_model.BaseAdaptiveModel

Implementation of ONNX Runtime for Inference of ONNX Models.

An existing PyTorch-based FARM AdaptiveModel can be converted to ONNX format using AdaptiveModel.convert_to_onnx(). The conversion is currently only implemented for Question Answering models.

For inference, this class is compatible with the FARM Inferencer.
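
A conversion and loading sketch, assuming model is a PyTorch AdaptiveModel with a QuestionAnsweringHead (the output directory is illustrative):

from pathlib import Path
from farm.modeling.adaptive_model import AdaptiveModel, ONNXAdaptiveModel

onnx_dir = Path("onnx_export")  # illustrative output directory

# export the PyTorch model (writes the ONNX model plus config files)
model.convert_to_onnx(onnx_dir)

# load the exported model with ONNX Runtime for inference
onnx_model = ONNXAdaptiveModel.load(onnx_dir, device="cpu")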

__init__(onnx_session, prediction_heads, language, device='cpu')[source]

Initialize self. See help(type(self)) for accurate signature.

classmethod load(load_dir, device, **kwargs)[source]

Load the corresponding AdaptiveModel class (AdaptiveModel/ONNXAdaptiveModel) based on the files in the load_dir.

Parameters

kwargs – arguments to pass for loading the model.

Returns

instance of a model

forward(**kwargs)[source]

Perform forward pass on the model and return the logits.

Parameters

kwargs – all arguments that need to be passed on to the model

Returns

all logits as torch.tensor or multiple tensors.

eval()[source]

Stub to make ONNXAdaptiveModel compatible with the PyTorch AdaptiveModel.

get_language()[source]

Get the language(s) the model was trained for.

Returns

str

class farm.modeling.adaptive_model.ONNXWrapper(language_model, prediction_heads, embeds_dropout_prob, lm_output_types, device, loss_aggregation_fn=None)[source]

Bases: farm.modeling.adaptive_model.AdaptiveModel

Wrapper Class for converting PyTorch models to ONNX.

As of torch v1.4.0, torch.onnx.export only supports passing positional arguments to the forward pass of the model. However, the AdaptiveModel’s forward takes keyword arguments. This class circumvents the issue by converting positional arguments to keyword arguments.

classmethod load_from_adaptive_model(adaptive_model)[source]
forward(*batch)[source]

Push data through the whole model and return logits. The data will propagate through the language model and each of the attached prediction heads.

Parameters

batch – Positional arguments holding everything that needs to be passed to the language model and prediction head(s); they are converted to keyword arguments internally.

Returns

all logits as torch.tensor or multiple tensors.

Language Model

Acknowledgements: Many of the modeling parts here come from the great transformers repository: https://github.com/huggingface/transformers. Thanks for the great work!

class farm.modeling.language_model.LanguageModel[source]

Bases: torch.nn.modules.module.Module

The parent class for any kind of model that can embed language into a semantic vector space. Practically speaking, these models read in tokenized sentences and return vectors that capture the meaning of sentences or of tokens.

subclasses = {'Albert': <class 'farm.modeling.language_model.Albert'>, 'Bert': <class 'farm.modeling.language_model.Bert'>, 'DistilBert': <class 'farm.modeling.language_model.DistilBert'>, 'Roberta': <class 'farm.modeling.language_model.Roberta'>, 'XLMRoberta': <class 'farm.modeling.language_model.XLMRoberta'>, 'XLNet': <class 'farm.modeling.language_model.XLNet'>}
forward(input_ids, padding_mask, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_scratch(model_type, vocab_size)[source]
classmethod load(pretrained_model_name_or_path, n_added_tokens=0, language_model_class=None, **kwargs)[source]

Load a pretrained language model either by

  1. specifying its name and downloading it

  2. or pointing to the directory it is saved in.

Available remote models:

  • bert-base-uncased

  • bert-large-uncased

  • bert-base-cased

  • bert-large-cased

  • bert-base-multilingual-uncased

  • bert-base-multilingual-cased

  • bert-base-chinese

  • bert-base-german-cased

  • roberta-base

  • roberta-large

  • xlnet-base-cased

  • xlnet-large-cased

  • xlm-roberta-base

  • xlm-roberta-large

  • albert-base-v2

  • albert-large-v2

  • distilbert-base-german-cased

  • distilbert-base-multilingual-cased

See all supported model variations here: https://huggingface.co/models

The appropriate language model class is inferred automatically from pretrained_model_name_or_path or can be manually supplied via language_model_class.

Parameters
  • pretrained_model_name_or_path (str) – The path of the saved pretrained model or its name.

  • language_model_class (str) – (Optional) Name of the language model class to load (e.g. Bert)
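
For instance (names and paths are illustrative):

from farm.modeling.language_model import LanguageModel

# infer the class from the model name ...
bert = LanguageModel.load("bert-base-cased")

# ... or load from a local directory, forcing the class explicitly
roberta = LanguageModel.load("some_dir/farm_model", language_model_class="Roberta")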

get_output_dims()[source]
freeze(layers)[source]

To be implemented

unfreeze()[source]

To be implemented

save_config(save_dir)[source]
save(save_dir)[source]

Save the model state_dict and its config file so that it can be loaded again.

Parameters

save_dir (str) – The directory in which the model should be saved.

formatted_preds(logits, samples, ignore_first_token=True, padding_mask=None, **kwargs)[source]

Extracts vectors from the language model (e.g. for sentence embeddings). Different pooling strategies and layers are available and will be determined from the object attributes extraction_layer and extraction_strategy. Both should be set via the Inferencer, e.g. Inferencer(extraction_strategy=’cls_token’, extraction_layer=-1)

Parameters
  • logits – Tuple of (sequence_output, pooled_output) from the language model. Sequence_output: one vector per token, pooled_output: one vector for whole sequence

  • samples – For each item in logits we need additional meta information to format the prediction (e.g. input text). This is created by the Processor and passed in here from the Inferencer.

  • ignore_first_token – Whether to ignore the first token in pooling operations (e.g. reduce_mean). Many models have a special token like [CLS] here that you don’t want to include in your average of token embeddings.

  • padding_mask – Mask for the padding tokens. Those will also not be included in the pooling operations to prevent a bias by the number of padding tokens.

  • kwargs – kwargs

Returns

list of dicts containing preds, e.g. [{“context”: “some text”, “vec”: [-0.01, 0.5 …]}]
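
A sketch of how this is typically driven via the Inferencer for embedding extraction. The exact Inferencer method names used here (Inferencer.load, inference_from_dicts) are assumptions about the FARM inference API and may differ between versions:

from farm.infer import Inferencer

# assumed API: set pooling strategy and layer when loading an embeddings Inferencer
inferencer = Inferencer.load(
    "bert-base-cased",
    task_type="embeddings",
    extraction_strategy="cls_token",
    extraction_layer=-1,
    gpu=False,
)
result = inferencer.inference_from_dicts(dicts=[{"text": "FARM embeds sentences."}])
# result: list of dicts such as [{"context": ..., "vec": [...]}]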

class farm.modeling.language_model.Bert[source]

Bases: farm.modeling.language_model.LanguageModel

A BERT model that wraps HuggingFace’s implementation (https://github.com/huggingface/transformers) to fit the LanguageModel class. Paper: https://arxiv.org/abs/1810.04805

__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

classmethod from_scratch(vocab_size, name='bert', language='en')[source]
classmethod load(pretrained_model_name_or_path, language=None, **kwargs)[source]

Load a pretrained model by supplying

  • the name of a remote model on s3 (“bert-base-cased” …)

  • OR a local path of a model trained via transformers (“some_dir/huggingface_model”)

  • OR a local path of a model trained via FARM (“some_dir/farm_model”)

Parameters

pretrained_model_name_or_path (str) – The path of the saved pretrained model or its name.

forward(input_ids, segment_ids, padding_mask, **kwargs)[source]

Perform the forward pass of the BERT model.

Parameters
  • input_ids (torch.Tensor) – The ids of each token in the input sequence. Is a tensor of shape [batch_size, max_seq_len]

  • segment_ids (torch.Tensor) – The id of the segment. For example, in next sentence prediction, the tokens in the first sentence are marked with 0 and those in the second are marked with 1. It is a tensor of shape [batch_size, max_seq_len]

  • padding_mask – A mask that assigns a 1 to valid input tokens and 0 to padding tokens of shape [batch_size, max_seq_len]

Returns

Embeddings for each token in the input sequence.

enable_hidden_states_output()[source]
disable_hidden_states_output()[source]
class farm.modeling.language_model.Albert[source]

Bases: farm.modeling.language_model.LanguageModel

An ALBERT model that wraps HuggingFace’s implementation (https://github.com/huggingface/transformers) to fit the LanguageModel class.

__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

classmethod load(pretrained_model_name_or_path, language=None, **kwargs)[source]

Load a language model either by supplying

  • the name of a remote model on s3 (“albert-base” …)

  • or a local path of a model trained via transformers (“some_dir/huggingface_model”)

  • or a local path of a model trained via FARM (“some_dir/farm_model”)

Parameters
  • pretrained_model_name_or_path – name or path of a model

  • language – (Optional) Name of language the model was trained for (e.g. “german”). If not supplied, FARM will try to infer it from the model name.

Returns

Language Model

forward(input_ids, segment_ids, padding_mask, **kwargs)[source]

Perform the forward pass of the Albert model.

Parameters
  • input_ids (torch.Tensor) – The ids of each token in the input sequence. Is a tensor of shape [batch_size, max_seq_len]

  • segment_ids (torch.Tensor) – The id of the segment. For example, in next sentence prediction, the tokens in the first sentence are marked with 0 and those in the second are marked with 1. It is a tensor of shape [batch_size, max_seq_len]

  • padding_mask – A mask that assigns a 1 to valid input tokens and 0 to padding tokens of shape [batch_size, max_seq_len]

Returns

Embeddings for each token in the input sequence.

enable_hidden_states_output()[source]
disable_hidden_states_output()[source]
class farm.modeling.language_model.Roberta[source]

Bases: farm.modeling.language_model.LanguageModel

A RoBERTa model that wraps HuggingFace’s implementation (https://github.com/huggingface/transformers) to fit the LanguageModel class. Paper: https://arxiv.org/abs/1907.11692

__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

classmethod load(pretrained_model_name_or_path, language=None, **kwargs)[source]

Load a language model either by supplying

  • the name of a remote model on s3 (“roberta-base” …)

  • or a local path of a model trained via transformers (“some_dir/huggingface_model”)

  • or a local path of a model trained via FARM (“some_dir/farm_model”)

Parameters
  • pretrained_model_name_or_path – name or path of a model

  • language – (Optional) Name of language the model was trained for (e.g. “german”). If not supplied, FARM will try to infer it from the model name.

Returns

Language Model

forward(input_ids, segment_ids, padding_mask, **kwargs)[source]

Perform the forward pass of the Roberta model.

Parameters
  • input_ids (torch.Tensor) – The ids of each token in the input sequence. Is a tensor of shape [batch_size, max_seq_len]

  • segment_ids (torch.Tensor) – The id of the segment. For example, in next sentence prediction, the tokens in the first sentence are marked with 0 and those in the second are marked with 1. It is a tensor of shape [batch_size, max_seq_len]

  • padding_mask – A mask that assigns a 1 to valid input tokens and 0 to padding tokens of shape [batch_size, max_seq_len]

Returns

Embeddings for each token in the input sequence.

enable_hidden_states_output()[source]
disable_hidden_states_output()[source]
class farm.modeling.language_model.XLMRoberta[source]

Bases: farm.modeling.language_model.LanguageModel

An XLM-RoBERTa model that wraps HuggingFace’s implementation (https://github.com/huggingface/transformers) to fit the LanguageModel class. Paper: https://arxiv.org/abs/1911.02116

__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

classmethod load(pretrained_model_name_or_path, language=None, **kwargs)[source]

Load a language model either by supplying

  • the name of a remote model on s3 (“xlm-roberta-base” …)

  • or a local path of a model trained via transformers (“some_dir/huggingface_model”)

  • or a local path of a model trained via FARM (“some_dir/farm_model”)

Parameters
  • pretrained_model_name_or_path – name or path of a model

  • language – (Optional) Name of language the model was trained for (e.g. “german”). If not supplied, FARM will try to infer it from the model name.

Returns

Language Model

forward(input_ids, segment_ids, padding_mask, **kwargs)[source]

Perform the forward pass of the XLMRoberta model.

Parameters
  • input_ids (torch.Tensor) – The ids of each token in the input sequence. Is a tensor of shape [batch_size, max_seq_len]

  • segment_ids (torch.Tensor) – The id of the segment. For example, in next sentence prediction, the tokens in the first sentence are marked with 0 and those in the second are marked with 1. It is a tensor of shape [batch_size, max_seq_len]

  • padding_mask – A mask that assigns a 1 to valid input tokens and 0 to padding tokens of shape [batch_size, max_seq_len]

Returns

Embeddings for each token in the input sequence.

enable_hidden_states_output()[source]
disable_hidden_states_output()[source]
class farm.modeling.language_model.DistilBert[source]

Bases: farm.modeling.language_model.LanguageModel

A DistilBERT model that wraps HuggingFace’s implementation (https://github.com/huggingface/transformers) to fit the LanguageModel class.

NOTE:

  • DistilBert doesn’t have token_type_ids, so you don’t need to indicate which token belongs to which segment. Just separate your segments with the separation token tokenizer.sep_token (or [SEP]).

  • Unlike the other BERT variants, DistilBert does not output the pooled_output. An additional pooler is initialized.

__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

classmethod load(pretrained_model_name_or_path, language=None, **kwargs)[source]

Load a pretrained model by supplying

  • the name of a remote model on s3 (“distilbert-base-german-cased” …)

  • OR a local path of a model trained via transformers (“some_dir/huggingface_model”)

  • OR a local path of a model trained via FARM (“some_dir/farm_model”)

Parameters

pretrained_model_name_or_path (str) – The path of the saved pretrained model or its name.

forward(input_ids, padding_mask, **kwargs)[source]

Perform the forward pass of the DistilBERT model.

Parameters
  • input_ids (torch.Tensor) – The ids of each token in the input sequence. Is a tensor of shape [batch_size, max_seq_len]

  • padding_mask – A mask that assigns a 1 to valid input tokens and 0 to padding tokens of shape [batch_size, max_seq_len]

Returns

Embeddings for each token in the input sequence.

enable_hidden_states_output()[source]
disable_hidden_states_output()[source]
class farm.modeling.language_model.XLNet[source]

Bases: farm.modeling.language_model.LanguageModel

An XLNet model that wraps HuggingFace’s implementation (https://github.com/huggingface/transformers) to fit the LanguageModel class. Paper: https://arxiv.org/abs/1906.08237

__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

classmethod load(pretrained_model_name_or_path, language=None, **kwargs)[source]

Load a language model either by supplying

  • the name of a remote model on s3 (“xlnet-base-cased” …)

  • or a local path of a model trained via transformers (“some_dir/huggingface_model”)

  • or a local path of a model trained via FARM (“some_dir/farm_model”)

Parameters
  • pretrained_model_name_or_path – name or path of a model

  • language – (Optional) Name of language the model was trained for (e.g. “german”). If not supplied, FARM will try to infer it from the model name.

Returns

Language Model

forward(input_ids, segment_ids, padding_mask, **kwargs)[source]

Perform the forward pass of the XLNet model.

Parameters
  • input_ids (torch.Tensor) – The ids of each token in the input sequence. Is a tensor of shape [batch_size, max_seq_len]

  • segment_ids (torch.Tensor) – The id of the segment. For example, in next sentence prediction, the tokens in the first sentence are marked with 0 and those in the second are marked with 1. It is a tensor of shape [batch_size, max_seq_len]

  • padding_mask – A mask that assigns a 1 to valid input tokens and 0 to padding tokens of shape [batch_size, max_seq_len]

Returns

Embeddings for each token in the input sequence.

enable_hidden_states_output()[source]
disable_hidden_states_output()[source]

Prediction Head

class farm.modeling.prediction_head.PredictionHead[source]

Bases: torch.nn.modules.module.Module

Takes word embeddings from a language model and generates logits for a given task. Can also convert logits to loss and logits to predictions.

classmethod create(prediction_head_name, layer_dims, class_weights=None)[source]

Create subclass of Prediction Head.

Parameters
  • prediction_head_name (str) – Classname (exact string!) of prediction head we want to create

  • layer_dims (List[Int]) – describing the feed forward block structure, e.g. [768,2]

  • class_weights (list[Float]) – The loss weighting to be assigned to certain label classes during training. Used to correct cases where there is a strong class imbalance.

Returns

Prediction Head of class prediction_head_name
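
A short sketch of creating a head by class name (the name must match the class exactly; the dimensions are illustrative):

from farm.modeling.prediction_head import PredictionHead

# create a TextClassificationHead with a 768 -> 2 feed forward block
head = PredictionHead.create(
    prediction_head_name="TextClassificationHead",
    layer_dims=[768, 2],
    class_weights=None,
)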

save_config(save_dir, head_num=0)[source]

Saves the config as a json file.

Parameters
  • save_dir (str or Path) – Path to save config to

  • head_num (int) – Which head to save

save(save_dir, head_num=0)[source]

Saves the prediction head state dict.

Parameters
  • save_dir (str or Path) – path to save prediction head to

  • head_num (int) – which head to save

generate_config()[source]

Generates config file from Class parameters (only for sensible config parameters).

classmethod load(config_file, strict=True, load_weights=True)[source]

Loads a Prediction Head. Infers the class of prediction head from config_file.

Parameters
  • config_file (str) – location where corresponding config is stored

  • strict (bool) – whether to strictly enforce that the keys loaded from saved model match the ones in the PredictionHead (see torch.nn.module.load_state_dict()). Set to False for backwards compatibility with PHs saved with older version of FARM.

Returns

PredictionHead

Return type

PredictionHead[T]

logits_to_loss(logits, labels)[source]

Implement this function in your special Prediction Head. Should combine logits and labels with a loss fct to a per sample loss.

Parameters
  • logits (object) – logits, can vary in shape and type, depending on task

  • labels (object) – labels, can vary in shape and type, depending on task

Returns

per sample loss as a torch.tensor of shape [batch_size]

logits_to_preds(logits)[source]

Implement this function in your special Prediction Head. Should turn logits into predictions.

Parameters

logits (object) – logits, can vary in shape and type, depending on task

Returns

predictions as a torch.tensor of shape [batch_size]

prepare_labels(**kwargs)[source]

Some prediction heads need additional label conversion. E.g. NER needs word level labels turned into subword token level labels.

Parameters

kwargs (object) – placeholder for passing generic parameters

Returns

labels in the right format

Return type

object

resize_input(input_dim)[source]

This function compares the output dimensionality of the language model against the input dimensionality of the prediction head. If there is a mismatch, the prediction head will be resized to fit.

class farm.modeling.prediction_head.RegressionHead(layer_dims=[768, 1], task_name='regression', **kwargs)[source]

Bases: farm.modeling.prediction_head.PredictionHead

__init__(layer_dims=[768, 1], task_name='regression', **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

logits_to_loss(logits, **kwargs)[source]

Implement this function in your special Prediction Head. Should combine logits and labels with a loss fct to a per sample loss.

Parameters
  • logits (object) – logits, can vary in shape and type, depending on task

  • labels (object) – labels, can vary in shape and type, depending on task

Returns

per sample loss as a torch.tensor of shape [batch_size]

logits_to_preds(logits, **kwargs)[source]

Implement this function in your special Prediction Head. Should turn logits into predictions.

Parameters

logits (object) – logits, can vary in shape and type, depending on task

Returns

predictions as a torch.tensor of shape [batch_size]

prepare_labels(**kwargs)[source]

Some prediction heads need additional label conversion. E.g. NER needs word level labels turned into subword token level labels.

Parameters

kwargs (object) – placeholder for passing generic parameters

Returns

labels in the right format

Return type

object

formatted_preds(logits, samples, **kwargs)[source]
class farm.modeling.prediction_head.TextClassificationHead(layer_dims=None, num_labels=None, class_weights=None, loss_ignore_index=-100, loss_reduction='none', task_name='text_classification', **kwargs)[source]

Bases: farm.modeling.prediction_head.PredictionHead

__init__(layer_dims=None, num_labels=None, class_weights=None, loss_ignore_index=-100, loss_reduction='none', task_name='text_classification', **kwargs)[source]
Parameters
  • layer_dims (list) – The size of the layers in the feed forward component. The feed forward will have as many layers as there are ints in this list. This param will be deprecated in the future.

  • num_labels (int) – The number of labels. Used to set the size of the final layer in the feed forward component. It is recommended to only set num_labels or layer_dims, not both.

  • class_weights

  • loss_ignore_index

  • loss_reduction

  • task_name

  • kwargs

classmethod load(pretrained_model_name_or_path)[source]

Load a prediction head from a saved FARM or transformers model. pretrained_model_name_or_path can be one of the following: a) Local path to a FARM prediction head config (e.g. my-bert/prediction_head_0_config.json) b) Local path to a Transformers model (e.g. my-bert) c) Name of a public model from https://huggingface.co/models (e.g. distilbert-base-uncased-distilled-squad)

Parameters

pretrained_model_name_or_path

local path of a saved model or name of a publicly available model. Exemplary public name: deepset/bert-base-german-cased-hatespeech-GermEval18Coarse

See https://huggingface.co/models for full list
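
For example, loading the head of a public hate speech classifier (name taken from the exemplary list above):

from farm.modeling.prediction_head import TextClassificationHead

head = TextClassificationHead.load("deepset/bert-base-german-cased-hatespeech-GermEval18Coarse")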

forward(X)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

logits_to_loss(logits, **kwargs)[source]

Implement this function in your special Prediction Head. Should combine logits and labels with a loss fct to a per sample loss.

Parameters
  • logits (object) – logits, can vary in shape and type, depending on task

  • labels (object) – labels, can vary in shape and type, depending on task

Returns

per sample loss as a torch.tensor of shape [batch_size]

logits_to_probs(logits, return_class_probs, **kwargs)[source]
logits_to_preds(logits, **kwargs)[source]

Implement this function in your special Prediction Head. Should turn logits into predictions.

Parameters

logits (object) – logits, can vary in shape and type, depending on task

Returns

predictions as a torch.tensor of shape [batch_size]

prepare_labels(**kwargs)[source]

Some prediction heads need additional label conversion. E.g. NER needs word level labels turned into subword token level labels.

Parameters

kwargs (object) – placeholder for passing generic parameters

Returns

labels in the right format

Return type

object

formatted_preds(logits, samples, return_class_probs=False, **kwargs)[source]
class farm.modeling.prediction_head.MultiLabelTextClassificationHead(layer_dims=None, num_labels=None, class_weights=None, loss_reduction='none', task_name='text_classification', pred_threshold=0.5, **kwargs)[source]

Bases: farm.modeling.prediction_head.PredictionHead

__init__(layer_dims=None, num_labels=None, class_weights=None, loss_reduction='none', task_name='text_classification', pred_threshold=0.5, **kwargs)[source]
Parameters
  • layer_dims (list) – The size of the layers in the feed forward component. The feed forward will have as many layers as there are ints in this list. This param will be deprecated in the future.

  • num_labels (int) – The number of labels. Used to set the size of the final layer in the feed forward component. It is recommended to only set num_labels or layer_dims, not both.

  • class_weights

  • loss_reduction

  • task_name

  • pred_threshold

  • kwargs

forward(X)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

logits_to_loss(logits, **kwargs)[source]

Implement this function in your special Prediction Head. Should combine logits and labels with a loss fct to a per sample loss.

Parameters
  • logits (object) – logits, can vary in shape and type, depending on task

  • labels (object) – labels, can vary in shape and type, depending on task

Returns

per sample loss as a torch.tensor of shape [batch_size]

logits_to_probs(logits, **kwargs)[source]
logits_to_preds(logits, **kwargs)[source]

Implement this function in your special Prediction Head. Should turn logits into predictions.

Parameters

logits (object) – logits, can vary in shape and type, depending on task

Returns

predictions as a torch.tensor of shape [batch_size]

prepare_labels(**kwargs)[source]

Some prediction heads need additional label conversion. E.g. NER needs word level labels turned into subword token level labels.

Parameters

kwargs (object) – placeholder for passing generic parameters

Returns

labels in the right format

Return type

object

formatted_preds(logits, samples, **kwargs)[source]
class farm.modeling.prediction_head.TokenClassificationHead(layer_dims=None, num_labels=None, task_name='ner', **kwargs)[source]

Bases: farm.modeling.prediction_head.PredictionHead

__init__(layer_dims=None, num_labels=None, task_name='ner', **kwargs)[source]
Parameters
  • layer_dims (list) – The size of the layers in the feed forward component. The feed forward will have as many layers as there are ints in this list. This param will be deprecated in the future.

  • num_labels (int) – The number of labels. Used to set the size of the final layer in the feed forward component. It is recommended to only set num_labels or layer_dims, not both.

  • task_name

  • kwargs

classmethod load(pretrained_model_name_or_path)[source]

Load a prediction head from a saved FARM or transformers model. pretrained_model_name_or_path can be one of the following: a) Local path to a FARM prediction head config (e.g. my-bert/prediction_head_0_config.json) b) Local path to a Transformers model (e.g. my-bert) c) Name of a public model from https://huggingface.co/models (e.g. bert-base-cased-finetuned-conll03-english)

Parameters

pretrained_model_name_or_path

local path of a saved model or name of a publicly available model. Exemplary public name: bert-base-cased-finetuned-conll03-english

See https://huggingface.co/models for full list

forward(X)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

logits_to_loss(logits, initial_mask, padding_mask=None, **kwargs)[source]

Implement this function in your special Prediction Head. Should combine logits and labels with a loss fct to a per sample loss.

Parameters
  • logits (object) – logits, can vary in shape and type, depending on task

  • labels (object) – labels, can vary in shape and type, depending on task

Returns

per sample loss as a torch.tensor of shape [batch_size]

logits_to_preds(logits, initial_mask, **kwargs)[source]

Implement this function in your special Prediction Head. Should turn logits into predictions.

Parameters

logits (object) – logits, can vary in shape and type, depending on task

Returns

predictions as a torch.tensor of shape [batch_size]

logits_to_probs(logits, initial_mask, return_class_probs, **kwargs)[source]
prepare_labels(initial_mask, **kwargs)[source]

Some prediction heads need additional label conversion. E.g. NER needs word level labels turned into subword token level labels.

Parameters

kwargs (object) – placeholder for passing generic parameters

Returns

labels in the right format

Return type

object

static initial_token_only(seq, initial_mask)[source]
formatted_preds(logits, initial_mask, samples, return_class_probs=False, **kwargs)[source]
class farm.modeling.prediction_head.BertLMHead(hidden_size, vocab_size, hidden_act='gelu', task_name='lm', **kwargs)[source]

Bases: farm.modeling.prediction_head.PredictionHead

__init__(hidden_size, vocab_size, hidden_act='gelu', task_name='lm', **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

classmethod load(pretrained_model_name_or_path, n_added_tokens=0)[source]

Load a prediction head from a saved FARM or transformers model. pretrained_model_name_or_path can be one of the following: a) Local path to a FARM prediction head config (e.g. my-bert/prediction_head_0_config.json) b) Local path to a Transformers model (e.g. my-bert) c) Name of a public model from https://huggingface.co/models (e.g. bert-base-cased)

Parameters

pretrained_model_name_or_path

local path of a saved model or name of a publicly available model. Exemplary public name: bert-base-cased

See https://huggingface.co/models for full list

set_shared_weights(shared_embedding_weights)[source]
forward(hidden_states)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

logits_to_loss(logits, **kwargs)[source]

Implement this function in your special Prediction Head. Should combine logits and labels with a loss fct to a per sample loss.

Parameters
  • logits (object) – logits, can vary in shape and type, depending on task

  • labels (object) – labels, can vary in shape and type, depending on task

Returns

per sample loss as a torch.tensor of shape [batch_size]

logits_to_preds(logits, **kwargs)[source]

Implement this function in your special Prediction Head. Should turn logits into predictions.

Parameters

logits (object) – logits, can vary in shape and type, depending on task

Returns

predictions as a torch.tensor of shape [batch_size]

prepare_labels(**kwargs)[source]

Some prediction heads need additional label conversion. E.g. NER needs word level labels turned into subword token level labels.

Parameters

kwargs (object) – placeholder for passing generic parameters

Returns

labels in the right format

Return type

object

class farm.modeling.prediction_head.NextSentenceHead(layer_dims=None, num_labels=None, class_weights=None, loss_ignore_index=-100, loss_reduction='none', task_name='text_classification', **kwargs)[source]

Bases: farm.modeling.prediction_head.TextClassificationHead

Almost identical to a TextClassificationHead. Only difference: we can load the weights from a pretrained language model that was saved in the Transformers style (all in one model).

classmethod load(pretrained_model_name_or_path)[source]

Load a prediction head from a saved FARM or transformers model. pretrained_model_name_or_path can be one of the following: a) Local path to a FARM prediction head config (e.g. my-bert/prediction_head_0_config.json) b) Local path to a Transformers model (e.g. my-bert) c) Name of a public model from https://huggingface.co/models (e.g. bert-base-cased)

Parameters

pretrained_model_name_or_path

local path of a saved model or name of a publicly available model. Exemplary public name: bert-base-cased

See https://huggingface.co/models for full list

class farm.modeling.prediction_head.FeedForwardBlock(layer_dims, **kwargs)[source]

Bases: torch.nn.modules.module.Module

A feed forward neural network of variable depth and width.

__init__(layer_dims, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

forward(X)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class farm.modeling.prediction_head.QuestionAnsweringHead(layer_dims=[768, 2], task_name='question_answering', no_ans_boost=0.0, context_window_size=100, n_best=5, n_best_per_sample=1, **kwargs)[source]

Bases: farm.modeling.prediction_head.PredictionHead

A question answering head predicts the start and end of the answer on token level.

__init__(layer_dims=[768, 2], task_name='question_answering', no_ans_boost=0.0, context_window_size=100, n_best=5, n_best_per_sample=1, **kwargs)[source]
Parameters
  • layer_dims (List[Int]) – dimensions of the feed forward block, e.g. [768,2], for adjusting to the BERT embedding size. The output dimension should always be 2.

  • kwargs (object) – placeholder for passing generic parameters

  • no_ans_boost (float) – How much the no_answer logit is boosted/increased. The higher the value, the more likely a “no answer possible given the input text” is returned by the model

  • context_window_size (int) – The size, in characters, of the window around the answer span that is used when displaying the context around the answer.

  • n_best (int) – The number of positive answer spans for each document.

  • n_best_per_sample (int) – The number of candidate answer spans to consider from each passage. Each passage also returns “no answer” information. This is decoupled from n_best on the document level, since predictions on the passage level are very similar. It should have a low value.
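
An illustrative construction with non-default decoding settings (the values are examples only, not recommendations):

from farm.modeling.prediction_head import QuestionAnsweringHead

qa_head = QuestionAnsweringHead(
    layer_dims=[768, 2],        # output dimension must stay 2 (start/end logits)
    no_ans_boost=0.5,           # slightly favour "no answer"
    context_window_size=150,    # characters of context returned around the answer
    n_best=10,                  # number of positive answer spans per document
)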

classmethod load(pretrained_model_name_or_path)[source]

Load a prediction head from a saved FARM or transformers model. pretrained_model_name_or_path can be one of the following: a) Local path to a FARM prediction head config (e.g. my-bert/prediction_head_0_config.json) b) Local path to a Transformers model (e.g. my-bert) c) Name of a public model from https://huggingface.co/models (e.g. distilbert-base-uncased-distilled-squad)

Parameters

pretrained_model_name_or_path

local path of a saved model or name of a publicly available model. Exemplary public names: distilbert-base-uncased-distilled-squad, bert-large-uncased-whole-word-masking-finetuned-squad

See https://huggingface.co/models for full list

forward(X)[source]

One forward pass through the prediction head model, starting with language model output on token level

logits_to_loss(logits, labels, **kwargs)[source]

Combine predictions and labels to a per sample loss.

logits_to_preds(logits, padding_mask, start_of_word, seq_2_start_t, max_answer_length=1000, **kwargs)[source]

Get the predicted index of start and end token of the answer. Note that the output is at token level and not word level. Note also that these logits correspond to the tokens of a sample (i.e. special tokens, question tokens, passage_tokens)

get_top_candidates(sorted_candidates, start_end_matrix, n_non_padding, max_answer_length, seq_2_start_t)[source]

Returns top candidate answers. Operates on a matrix of summed start and end logits. This matrix corresponds to a single sample (includes special tokens, question tokens, passage tokens). This method always returns a list of length n_best + 1 (comprising the n_best positive answers along with the one no_answer).

static valid_answer_idxs(start_idx, end_idx, n_non_padding, max_answer_length, seq_2_start_t)[source]

Returns True if the supplied index span is a valid prediction. The indices being provided should be on the sample/passage level (special tokens + question_tokens + passage_tokens) and not on the document level.

formatted_preds(logits, preds_p, baskets, rest_api_schema=False)[source]

Takes a list of predictions, each corresponding to one sample, and converts them into document level predictions. Leverages information in the SampleBaskets. Assumes that we are being passed predictions from ALL samples in the one SampleBasket i.e. all passages of a document. Logits should be None, because we have already converted the logits to predictions before calling formatted_preds.

stringify(top_preds, baskets)[source]

Turn prediction spans into strings

to_rest_api_schema(formatted_preds, no_ans_gaps, baskets)[source]
answer_for_api(top_preds, basket)[source]
create_context(ans_start_ch, ans_end_ch, clear_text)[source]
static span_to_string(start_t, end_t, token_offsets, clear_text)[source]
has_no_answer_idxs(sample_top_n)[source]
aggregate_preds(preds, passage_start_t, ids, seq_2_start_t=None, labels=None)[source]

Aggregate passage level predictions to create document level predictions. This method assumes that all passages of each document are contained in preds i.e. that there are no incomplete documents. The output of this step are prediction spans. No answer is represented by a (-1, -1) span on the document level

static reduce_labels(labels)[source]

Removes repeated answers. Represents a no answer label as (-1, -1).

reduce_preds(preds)[source]

This function contains the logic for choosing the best answers from each passage. In the end, it returns the n_best predictions on the document level.

static deduplicate(flat_pos_answers)[source]
static get_no_answer_score(preds)[source]
static pred_to_doc_idxs(pred, passage_start_t)[source]

Converts the passage level predictions to document level predictions. Note that on the doc level we don’t have special tokens or question tokens. This means that a no answer cannot be represented by a (0,0) span but will instead be represented by (-1, -1)

static label_to_doc_idxs(label, passage_start_t)[source]

Converts the passage level labels to document level labels. Note that on the doc level we don’t have special tokens or question tokens. This means that a no answer cannot be represented by a (0,0) span but will instead be represented by (-1, -1)

prepare_labels(labels, start_of_word, **kwargs)[source]

Some prediction heads need additional label conversion. E.g. NER needs word level labels turned into subword token level labels.

Parameters

kwargs (object) – placeholder for passing generic parameters

Returns

labels in the right format

Return type

object

Optimization

class farm.modeling.optimization.WrappedDataParallel(module, device_ids=None, output_device=None, dim=0)[source]

Bases: torch.nn.parallel.data_parallel.DataParallel

A way of adapting attributes of underlying class to parallel mode. See: https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html#dataparallel

Accessing attributes of the wrapped module can run into recursion errors. For a workaround see: https://discuss.pytorch.org/t/access-att-of-model-wrapped-within-torch-nn-dataparallel-maximum-recursion-depth-exceeded/46975

class farm.modeling.optimization.WrappedDDP(module, device_ids=None, output_device=None, dim=0, broadcast_buffers=True, process_group=None, bucket_cap_mb=25, find_unused_parameters=False, check_reduction=False)[source]

Bases: torch.nn.parallel.distributed.DistributedDataParallel

A way of adapting attributes of underlying class to distributed mode. Same as in WrappedDataParallel above. Even when using distributed training on a single machine with multiple GPUs, apex can speed up training significantly. Distributed code must be launched with “python -m torch.distributed.launch --nproc_per_node=1 run_script.py”

farm.modeling.optimization.initialize_optimizer(model, n_batches, n_epochs, device, learning_rate, optimizer_opts=None, schedule_opts=None, distributed=False, grad_acc_steps=1, local_rank=-1, use_amp=None)[source]

Initializes an optimizer, a learning rate scheduler and converts the model if needed (e.g. for mixed precision). By default, we use transformers’ AdamW and a linear warmup schedule with warmup ratio 0.1. You can easily switch optimizer and schedule via optimizer_opts and schedule_opts.

Parameters
  • model (AdaptiveModel) – model to optimize (e.g. trimming weights to fp16 / mixed precision)

  • n_batches (int) – number of batches for training

  • n_epochs – number of epochs for training

  • device

  • learning_rate (float) – Learning rate

  • optimizer_opts – Dict to customize the optimizer. Choose any optimizer available from torch.optim, apex.optimizers or transformers.optimization by supplying the class name and the parameters for the constructor. Examples: 1) AdamW from Transformers (Default): {“name”: “TransformersAdamW”, “correct_bias”: False, “weight_decay”: 0.01} 2) SGD from pytorch: {“name”: “SGD”, “momentum”: 0.0} 3) FusedLAMB from apex: {“name”: “FusedLAMB”, “bias_correction”: True}

  • schedule_opts – Dict to customize the learning rate schedule. Choose any Schedule from Pytorch or Huggingface’s Transformers by supplying the class name and the parameters needed by the constructor. Examples: 1) Linear Warmup (Default): {“name”: “LinearWarmup”, “num_warmup_steps”: 0.1 * num_training_steps, “num_training_steps”: num_training_steps} 2) CosineWarmup: {“name”: “CosineWarmup”, “num_warmup_steps”: 0.1 * num_training_steps, “num_training_steps”: num_training_steps} 3) CyclicLR from pytorch: {“name”: “CyclicLR”, “base_lr”: 1e-5, “max_lr”:1e-4, “step_size_up”: 100}

  • distributed – Whether training on distributed machines

  • grad_acc_steps – Number of steps to accumulate gradients for. Helpful to mimic large batch_sizes on small machines.

  • local_rank – rank of the machine in a distributed setting

  • use_amp – Optimization level of nvidia’s automatic mixed precision (AMP). The higher the level, the faster the model. Options: “O0” (Normal FP32 training) “O1” (Mixed Precision => Recommended) “O2” (Almost FP16) “O3” (Pure FP16). See details on: https://nvidia.github.io/apex/amp.html

Returns

model, optimizer, scheduler
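
A typical call, assuming model is an AdaptiveModel and device is already defined (values are illustrative; omitting optimizer_opts/schedule_opts keeps the AdamW + linear warmup defaults):

from farm.modeling.optimization import initialize_optimizer

model, optimizer, lr_schedule = initialize_optimizer(
    model=model,
    learning_rate=3e-5,
    device=device,            # e.g. torch.device("cuda")
    n_batches=500,            # illustrative, e.g. len(data_silo.loaders["train"])
    n_epochs=2,
)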

farm.modeling.optimization.get_scheduler(optimizer, opts)[source]

Get the scheduler based on dictionary with options. Options are passed to the scheduler constructor.

Parameters
  • optimizer – optimizer whose learning rate to control

  • opts – dictionary of args to be passed to constructor of schedule

Returns

created scheduler

farm.modeling.optimization.calculate_optimization_steps(n_batches, grad_acc_steps, n_epochs, local_rank)[source]

Tokenization

Tokenization classes.

class farm.modeling.tokenization.Tokenizer[source]

Bases: object

Simple Wrapper for Tokenizers from the transformers package. Enables loading of different Tokenizer classes with a uniform interface.

classmethod load(pretrained_model_name_or_path, tokenizer_class=None, **kwargs)[source]

Enables loading of different Tokenizer classes with a uniform interface. Either infer the class from pretrained_model_name_or_path or define it manually via tokenizer_class.

Parameters
  • pretrained_model_name_or_path (str) – The path of the saved pretrained model or its name (e.g. bert-base-uncased)

  • tokenizer_class (str) – (Optional) Name of the tokenizer class to load (e.g. BertTokenizer)

  • kwargs

Returns

Tokenizer
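
For example (extra kwargs such as do_lower_case are passed through to the underlying transformers tokenizer; the local path is illustrative):

from farm.modeling.tokenization import Tokenizer

# infer the tokenizer class from the model name ...
tokenizer = Tokenizer.load(pretrained_model_name_or_path="bert-base-cased", do_lower_case=False)

# ... or set it explicitly for a local FARM model directory
tokenizer = Tokenizer.load("some_dir/farm_model", tokenizer_class="BertTokenizer")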

farm.modeling.tokenization.tokenize_with_metadata(text, tokenizer)[source]

Performs tokenization while storing some important metadata for each token:

  • offsets: (int) Character index where the token begins in the original text

  • start_of_word: (bool) If the token is the start of a word. Particularly helpful for NER and QA tasks.

We do this by first doing whitespace tokenization and then applying the model specific tokenizer to each “word”.

Note

We don’t preserve exact whitespace in the tokens! This means: tabs, new lines, multiple whitespaces etc. will all resolve to a single space (” “). This doesn’t make a difference for BERT or XLNet, but it does for RoBERTa. For RoBERTa it has the positive effect of a shorter sequence length, but some information about the whitespace type is lost, which might be helpful for certain NLP tasks (e.g. tab for tables).

Parameters
  • text (str) – Text to tokenize

  • tokenizer – Tokenizer (e.g. from Tokenizer.load())

Returns

Dictionary with “tokens”, “offsets” and “start_of_word”

Return type

dict
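
A short usage sketch (the input text is arbitrary):

from farm.modeling.tokenization import Tokenizer, tokenize_with_metadata

tokenizer = Tokenizer.load("bert-base-cased")
tokenized = tokenize_with_metadata("FARM is built on transformers.", tokenizer)

print(tokenized["tokens"])         # model-specific (sub)word tokens
print(tokenized["offsets"])        # character index where each token starts in the original text
print(tokenized["start_of_word"])  # True for tokens that begin a new word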

farm.modeling.tokenization.truncate_sequences(seq_a, seq_b, tokenizer, max_seq_len, truncation_strategy='longest_first', with_special_tokens=True, stride=0)[source]

Reduces a single sequence or a pair of sequences to a maximum sequence length. The sequences can contain tokens or any other elements (offsets, masks …). If with_special_tokens is enabled, it’ll remove some additional tokens to have exactly enough space for later adding special tokens (CLS, SEP etc.)

Supported truncation strategies:

  • longest_first: (default) Iteratively reduce the inputs until the total length is under max_seq_len, removing a token from the longest sequence at each step (when there is a pair of input sequences). Overflowing tokens only contain overflow from the first sequence.

  • only_first: Only truncate the first sequence. Raises an error if the first sequence is shorter than or equal to num_tokens_to_remove.

  • only_second: Only truncate the second sequence

  • do_not_truncate: Does not truncate (raises an error if the input sequence is longer than max_seq_len)

Parameters
  • seq_a (list) – First sequence of tokens/offsets/…

  • seq_b (None or list) – Optional second sequence of tokens/offsets/…

  • tokenizer – Tokenizer (e.g. from Tokenizer.load())

  • max_seq_len (int) – the maximum sequence length to truncate to

  • truncation_strategy (str) – how the sequence(s) should be truncated down. Default: “longest_first” (see above for other options).

  • with_special_tokens (bool) – If true, it’ll remove some additional tokens to have exactly enough space for later adding special tokens (CLS, SEP etc.)

  • stride (int) – optional stride of the window during truncation

Returns

truncated seq_a, truncated seq_b, overflowing tokens
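
A sketch for a question/passage pair (texts and max_seq_len are illustrative); with_special_tokens=True (the default) reserves room for [CLS]/[SEP]:

from farm.modeling.tokenization import Tokenizer, tokenize_with_metadata, truncate_sequences

tokenizer = Tokenizer.load("bert-base-cased")
question = tokenize_with_metadata("Who maintains FARM?", tokenizer)["tokens"]
passage = tokenize_with_metadata("FARM is an open source framework maintained by deepset.", tokenizer)["tokens"]

trunc_q, trunc_p, overflow = truncate_sequences(
    seq_a=question,
    seq_b=passage,
    tokenizer=tokenizer,
    max_seq_len=32,
)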

farm.modeling.tokenization.insert_at_special_tokens_pos(seq, special_tokens_mask, insert_element)[source]

Adds elements to a sequence at the positions that align with special tokens. This is useful for expanding label ids or masks, so that they align with corresponding tokens (incl. the special tokens)

Example:

# Tokens:  ["CLS", "some", "words","SEP"]
>>> special_tokens_mask =  [1,0,0,1]
>>> lm_label_ids =  [12,200]
>>> insert_at_special_tokens_pos(lm_label_ids, special_tokens_mask, insert_element=-1)
[-1, 12, 200, -1]
Parameters
  • seq (list) – List where you want to insert new elements

  • special_tokens_mask (list) – list with “1” at the positions of special tokens

  • insert_element – the value you want to insert

Returns

list