

class farm.modeling.optimization.WrappedDataParallel(module, device_ids=None, output_device=None, dim=0)[source]

Bases: torch.nn.parallel.data_parallel.DataParallel

A way of adapting attributes of the underlying class to parallel mode. Plain DataParallel hides the wrapped module's attributes, and naive attribute forwarding gets into recursion errors; this class applies the usual workaround of only delegating lookups that fail on the wrapper itself.
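The forwarding pattern can be illustrated in pure Python (a sketch only: the real classes subclass DataParallel, whose nn.Module base has its own __getattr__ to chain to; `Inner` and `some_attribute` are made-up stand-ins):

```python
class Inner:
    """Stands in for the wrapped model; some_attribute is a made-up example."""
    def __init__(self):
        self.some_attribute = 42

class Wrapper:
    """Attribute-forwarding workaround: __getattr__ is only invoked when normal
    lookup fails, so the wrapper's own attributes (like `module`) are found
    normally and everything else falls through to the wrapped object. This
    avoids the infinite recursion a naive __getattribute__ override causes."""
    def __init__(self, module):
        self.module = module

    def __getattr__(self, name):
        return getattr(self.module, name)

w = Wrapper(Inner())
print(w.some_attribute)  # 42, forwarded from the wrapped object
```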

class farm.modeling.optimization.WrappedDDP(module, device_ids=None, output_device=None, dim=0, broadcast_buffers=True, process_group=None, bucket_cap_mb=25, find_unused_parameters=False, check_reduction=False)[source]

Bases: torch.nn.parallel.distributed.DistributedDataParallel

A way of adapting attributes of the underlying class to distributed mode. Same as in WrappedDataParallel above. Even when using distributed training on a single machine with multiple GPUs, apex can speed up training significantly. Distributed code must be launched with "python -m torch.distributed.launch --nproc_per_node=1".

farm.modeling.optimization.initialize_optimizer(model, n_batches, n_epochs, device, learning_rate, optimizer_opts=None, schedule_opts=None, distributed=False, grad_acc_steps=1, local_rank=-1, use_amp=None)[source]

Initializes an optimizer and a learning rate scheduler, and converts the model if needed (e.g. for mixed precision). By default, we use transformers' AdamW and a linear warmup schedule with warmup ratio 0.1. You can easily switch optimizer and schedule via optimizer_opts and schedule_opts.

Parameters

  • model (AdaptiveModel) – model to optimize (e.g. trimming weights to fp16 / mixed precision)

  • n_batches (int) – number of batches for training

  • n_epochs – number of epochs for training

  • device

  • learning_rate (float) – Learning rate

  • optimizer_opts – Dict to customize the optimizer. Choose any optimizer available from torch.optim, apex.optimizers or transformers.optimization by supplying the class name and the parameters for the constructor. Examples: 1) AdamW from transformers (default): {"name": "TransformersAdamW", "correct_bias": False, "weight_decay": 0.01} 2) SGD from PyTorch: {"name": "SGD", "momentum": 0.0} 3) FusedLAMB from apex: {"name": "FusedLAMB", "bias_correction": True}

  • schedule_opts – Dict to customize the learning rate schedule. Choose any schedule from PyTorch or Hugging Face's Transformers by supplying the class name and the parameters needed by the constructor. Examples: 1) Linear warmup (default): {"name": "LinearWarmup", "num_warmup_steps": 0.1 * num_training_steps, "num_training_steps": num_training_steps} 2) Cosine warmup: {"name": "CosineWarmup", "num_warmup_steps": 0.1 * num_training_steps, "num_training_steps": num_training_steps} 3) CyclicLR from PyTorch: {"name": "CyclicLR", "base_lr": 1e-5, "max_lr": 1e-4, "step_size_up": 100}

  • distributed – Whether training on distributed machines

  • grad_acc_steps – Number of steps to accumulate gradients for. Helpful to mimic large batch_sizes on small machines.

  • local_rank – rank of the machine in a distributed setting

  • use_amp – Optimization level of NVIDIA's automatic mixed precision (AMP). The higher the level, the faster the model. Options: "O0" (normal FP32 training), "O1" (mixed precision, recommended), "O2" (almost FP16), "O3" (pure FP16).


Returns: model, optimizer, scheduler
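As a small, self-contained illustration, the two option dicts can be built exactly as the examples above describe (the numeric values here are placeholders, not tuned recommendations):

```python
# Illustrative optimizer_opts / schedule_opts dicts for initialize_optimizer.
n_batches, n_epochs, grad_acc_steps = 500, 2, 1
num_training_steps = (n_batches // grad_acc_steps) * n_epochs

optimizer_opts = {"name": "TransformersAdamW", "correct_bias": False, "weight_decay": 0.01}
schedule_opts = {
    "name": "LinearWarmup",
    "num_warmup_steps": 0.1 * num_training_steps,  # default warmup ratio 0.1
    "num_training_steps": num_training_steps,
}
print(num_training_steps, schedule_opts["num_warmup_steps"])  # 1000 100.0
```

These dicts would then be passed as the optimizer_opts and schedule_opts arguments of initialize_optimizer, alongside the model, n_batches, n_epochs, device and learning_rate.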

farm.modeling.optimization.get_scheduler(optimizer, opts)[source]

Get the scheduler based on dictionary with options. Options are passed to the scheduler constructor.

Parameters

  • optimizer – optimizer whose learning rate to control

  • opts – dictionary of args to be passed to constructor of schedule


Returns: the created scheduler

farm.modeling.optimization.calculate_optimization_steps(n_batches, grad_acc_steps, n_epochs, local_rank)[source]
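The calculation can be sketched roughly as follows (an assumption about the logic, not the verbatim implementation; `world_size` is a stand-in for what torch.distributed derives from local_rank in distributed mode):

```python
def estimate_optimization_steps(n_batches, grad_acc_steps, n_epochs, world_size=1):
    """Back-of-the-envelope sketch: one optimizer step happens every
    grad_acc_steps batches, repeated for n_epochs; in distributed mode each
    of the world_size workers processes its own shard of the batches, so the
    per-worker step count shrinks accordingly."""
    return int(n_batches / grad_acc_steps) * n_epochs // world_size

print(estimate_optimization_steps(n_batches=1000, grad_acc_steps=4, n_epochs=2))  # 500
```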


Tokenization classes.

class farm.modeling.tokenization.Tokenizer[source]

Bases: object

Simple Wrapper for Tokenizers from the transformers package. Enables loading of different Tokenizer classes with a uniform interface.

classmethod load(pretrained_model_name_or_path, tokenizer_class=None, **kwargs)[source]

Enables loading of different Tokenizer classes with a uniform interface. Either infer the class from pretrained_model_name_or_path or define it manually via tokenizer_class.

Parameters

  • pretrained_model_name_or_path (str) – The path of the saved pretrained model or its name (e.g. bert-base-uncased)

  • tokenizer_class (str) – (Optional) Name of the tokenizer class to load (e.g. BertTokenizer)

  • kwargs



farm.modeling.tokenization.tokenize_with_metadata(text, tokenizer)[source]

Performs tokenization while storing some important metadata for each token:

  • offsets: (int) Character index where the token begins in the original text

  • start_of_word: (bool) If the token is the start of a word. Particularly helpful for NER and QA tasks.

We do this by first doing whitespace tokenization and then applying the model specific tokenizer to each “word”.


Note: We don't assume that exact whitespace is preserved in the tokens. Tabs, newlines, multiple spaces etc. all resolve to a single " ". This makes no difference for BERT and XLNet, but it does for RoBERTa: it has the positive effect of a shorter sequence length, but some information about the whitespace type is lost, which might be helpful for certain NLP tasks (e.g. tabs in tables).

Parameters

  • text (str) – Text to tokenize

  • tokenizer – Tokenizer (e.g. from Tokenizer.load())


Returns: Dictionary with "tokens", "offsets" and "start_of_word"

Return type: dict


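The whitespace-first scheme can be sketched without any model dependency (`toy_subwords` is a made-up stand-in for a real tokenizer's tokenize() method; the real function handles more edge cases):

```python
import re

def tokenize_with_metadata_sketch(text, subword_tokenize):
    """Illustrative sketch: whitespace-split first, then apply a
    model-specific tokenizer to each word, tracking per-token metadata."""
    tokens, offsets, start_of_word = [], [], []
    for match in re.finditer(r"\S+", text):
        cursor = match.start()
        for i, tok in enumerate(subword_tokenize(match.group())):
            tokens.append(tok)
            offsets.append(cursor)           # character index in the original text
            start_of_word.append(i == 0)     # only the first subword starts a word
            cursor += len(tok.lstrip("#"))   # '##' continuation marker covers no characters
    return {"tokens": tokens, "offsets": offsets, "start_of_word": start_of_word}

def toy_subwords(word):
    # Toy WordPiece-style splitter: 4-character pieces, '##' marks continuations.
    pieces = [word[i:i + 4] for i in range(0, len(word), 4)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]

result = tokenize_with_metadata_sketch("Hello   world", toy_subwords)
print(result["tokens"])         # ['Hell', '##o', 'worl', '##d']
print(result["offsets"])        # [0, 4, 8, 12]
print(result["start_of_word"])  # [True, False, True, False]
```

Note how the three spaces between the words collapse: the offsets still point into the original string, but the tokens themselves carry no record of the whitespace type, matching the note above.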
farm.modeling.tokenization.truncate_sequences(seq_a, seq_b, tokenizer, max_seq_len, truncation_strategy='longest_first', with_special_tokens=True, stride=0)[source]

Reduces a single sequence or a pair of sequences to a maximum sequence length. The sequences can contain tokens or any other elements (offsets, masks …). If with_special_tokens is enabled, it’ll remove some additional tokens to have exactly enough space for later adding special tokens (CLS, SEP etc.)

Supported truncation strategies:

  • longest_first: (default) Iteratively remove one token at a time from whichever sequence is currently longer (when there is a pair of input sequences) until the input fits max_seq_len. Overflowing tokens only contain overflow from the first sequence.

  • only_first: Only truncate the first sequence. Raises an error if the first sequence is shorter than or equal to the number of tokens that need to be removed.

  • only_second: Only truncate the second sequence

  • do_not_truncate: Does not truncate (raises an error if the input sequence is longer than max_seq_len)

Parameters

  • seq_a (list) – First sequence of tokens/offsets/…

  • seq_b (None or list) – Optional second sequence of tokens/offsets/…

  • tokenizer – Tokenizer (e.g. from Tokenizer.load())

  • max_seq_len (int) – Maximum length of the resulting sequence(s) after truncation

  • truncation_strategy (str) – how the sequence(s) should be truncated down. Default: “longest_first” (see above for other options).

  • with_special_tokens (bool) – If true, it’ll remove some additional tokens to have exactly enough space for later adding special tokens (CLS, SEP etc.)

  • stride (int) – optional stride of the window during truncation


Returns: truncated seq_a, truncated seq_b, overflowing tokens
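The default longest_first strategy can be sketched in a few lines (special-token accounting and stride are omitted for brevity, so this is an illustration of the strategy, not the full function):

```python
def truncate_longest_first(seq_a, seq_b, max_len):
    """Repeatedly drop the last element of whichever sequence is currently
    longer until the pair fits max_len, collecting overflow from the first
    sequence only (as described above)."""
    seq_a, seq_b = list(seq_a), list(seq_b or [])
    overflow = []
    while len(seq_a) + len(seq_b) > max_len:
        if len(seq_a) >= len(seq_b):
            overflow.insert(0, seq_a.pop())  # cut from seq_a -> goes to overflow
        else:
            seq_b.pop()                      # cut from seq_b -> discarded
    return seq_a, seq_b, overflow

a, b, over = truncate_longest_first(list("abcdef"), list("wxyz"), max_len=6)
print(a, b, over)  # ['a', 'b', 'c'] ['w', 'x', 'y'] ['d', 'e', 'f']
```

Because the sequences may hold offsets or masks rather than tokens, the sketch never inspects the elements themselves, only the lengths.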

farm.modeling.tokenization.insert_at_special_tokens_pos(seq, special_tokens_mask, insert_element)[source]

Adds elements to a sequence at the positions that align with special tokens. This is useful for expanding label ids or masks, so that they align with corresponding tokens (incl. the special tokens)


>>> # Tokens: ["CLS", "some", "words", "SEP"]
>>> special_tokens_mask = [1, 0, 0, 1]
>>> lm_label_ids = [12, 200]
>>> insert_at_special_tokens_pos(lm_label_ids, special_tokens_mask, insert_element=-1)
[-1, 12, 200, -1]

Parameters

  • seq (list) – List where you want to insert new elements

  • special_tokens_mask (list) – list with "1" at the positions of special tokens

  • insert_element – the value you want to insert
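The behavior is fully pinned down by the example above and can be re-implemented in one pass (a minimal sketch consistent with that example, not necessarily FARM's exact implementation):

```python
def insert_at_special_tokens_pos(seq, special_tokens_mask, insert_element):
    """Walk the mask: emit insert_element wherever a special token sits,
    otherwise consume the next element of seq in order."""
    it = iter(seq)
    return [insert_element if is_special else next(it)
            for is_special in special_tokens_mask]

print(insert_at_special_tokens_pos([12, 200], [1, 0, 0, 1], insert_element=-1))
# [-1, 12, 200, -1]
```

This assumes the number of zeros in the mask equals len(seq), which holds by construction when the mask comes from the same tokenizer call that produced the sequence.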