Data Handling¶
Processor¶
class farm.data_handler.processor.Processor(tokenizer, max_seq_len, train_filename, dev_filename, test_filename, dev_split, data_dir, tasks={}, proxies=None)[source]¶
Bases: abc.ABC
Used to generate PyTorch Datasets from input data. An implementation of this abstract class should be created for each new data source. Implement the abstract methods file_to_dicts(), _dict_to_samples() and _sample_to_features() to make it compatible with your data format.
__init__(tokenizer, max_seq_len, train_filename, dev_filename, test_filename, dev_split, data_dir, tasks={}, proxies=None)[source]¶
- Parameters
tokenizer – Used to split a sentence (str) into tokens.
max_seq_len (int) – Samples are truncated after this many tokens.
train_filename (str) – The name of the file containing training data.
dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.
test_filename (str) – The name of the file containing test data.
dev_split (float) – The proportion of the train set that will be sliced. Only works if dev_filename is set to None.
data_dir (str) – The directory in which the train, test and perhaps dev files can be found.
classmethod load(processor_name, data_dir, tokenizer, max_seq_len, train_filename, dev_filename, test_filename, dev_split, **kwargs)[source]¶
Loads the class of processor specified by processor_name.
- Parameters
processor_name (str) – The class of processor to be loaded.
data_dir (str) – Directory where data files are located.
tokenizer – A tokenizer object
max_seq_len (int) – Sequences longer than this will be truncated.
train_filename (str) – The name of the file containing training data.
dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.
test_filename (str) – The name of the file containing test data.
dev_split (float) – The proportion of the train set that will be sliced. Only works if dev_filename is set to None.
kwargs (object) – placeholder for passing generic parameters
- Returns
An instance of the specified processor.
classmethod load_from_dir(load_dir)[source]¶
Infers the specific type of Processor from a config file (e.g. GNADProcessor) and loads an instance of it.
- Parameters
load_dir (str) – Directory that contains a ‘processor_config.json’
- Returns
An instance of a Processor Subclass (e.g. GNADProcessor)
save(save_dir)[source]¶
Saves the vocabulary to file and also creates a JSON file containing all the information needed to load the same processor.
- Parameters
save_dir (str) – Directory where the files are to be saved
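As a quick illustration of how save() and load_from_dir() complement each other, here is a minimal sketch; the processor instance and the directory name are placeholders for illustration, not part of this reference.

from farm.data_handler.processor import Processor

# `processor` stands for any configured Processor subclass instance,
# e.g. a TextClassificationProcessor (see below).
processor.save(save_dir="saved_processor")   # writes vocab + processor_config.json

# Later, restore the same processor without re-specifying its arguments:
processor = Processor.load_from_dir(load_dir="saved_processor")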
generate_config()[source]¶
Generates a config file from class and instance attributes (only for sensible config parameters).
add_task(name, metric, label_list, label_column_name=None, label_name=None, task_type=None)[source]¶
dataset_from_dicts(dicts, index=0, rest_api_schema=False, return_baskets=False)[source]¶
Contains all the functionality to turn a list of dict objects into a PyTorch Dataset and a list of tensor names. This can be used for inference mode.
- Parameters
dicts (list of dicts) – List of dictionaries where each contains the data of one input sample.
- Returns
A PyTorch Dataset and a list of tensor names.
class farm.data_handler.processor.TextClassificationProcessor(tokenizer, max_seq_len, data_dir, label_list=None, metric=None, train_filename='train.tsv', dev_filename=None, test_filename='test.tsv', dev_split=0.1, delimiter='\t', quote_char="'", skiprows=None, label_column_name='label', multilabel=False, header=0, proxies=None, **kwargs)[source]¶
Bases: farm.data_handler.processor.Processor
Used to handle text classification datasets that come in tabular format (CSV, TSV, etc.).
__init__(tokenizer, max_seq_len, data_dir, label_list=None, metric=None, train_filename='train.tsv', dev_filename=None, test_filename='test.tsv', dev_split=0.1, delimiter='\t', quote_char="'", skiprows=None, label_column_name='label', multilabel=False, header=0, proxies=None, **kwargs)[source]¶
- Parameters
tokenizer – Used to split a sentence (str) into tokens.
max_seq_len (int) – Samples are truncated after this many tokens.
train_filename (str) – The name of the file containing training data.
dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.
test_filename (str) – The name of the file containing test data.
dev_split (float) – The proportion of the train set that will be sliced. Only works if dev_filename is set to None.
data_dir (str) – The directory in which the train, test and perhaps dev files can be found.
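A minimal construction sketch follows; the Tokenizer helper, file locations, labels and metric name are assumptions for illustration and are not defined in this section.

from farm.modeling.tokenization import Tokenizer  # assumed FARM helper
from farm.data_handler.processor import TextClassificationProcessor

tokenizer = Tokenizer.load(pretrained_model_name_or_path="bert-base-cased")
processor = TextClassificationProcessor(
    tokenizer=tokenizer,
    max_seq_len=128,
    data_dir="data/my_corpus",            # contains train.tsv and test.tsv
    label_list=["negative", "positive"],
    metric="acc",                         # assumed metric name
    label_column_name="label",
    dev_split=0.1,                        # dev set sliced from the train set
)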
class farm.data_handler.processor.InferenceProcessor(tokenizer, max_seq_len, **kwargs)[source]¶
Bases: farm.data_handler.processor.Processor
Generic processor used at inference time:
- fast
- no labels
- pure encoding of text into a PyTorch dataset
- doesn’t read from file, but only consumes dictionaries (e.g. coming from API requests)
__init__(tokenizer, max_seq_len, **kwargs)[source]¶
- Parameters
tokenizer – Used to split a sentence (str) into tokens.
max_seq_len (int) – Samples are truncated after this many tokens.
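A sketch of the pure-encoding path; the “text” key in the input dict follows FARM’s usual inference schema but is an assumption here, and the tokenizer is reused from the example above.

from farm.data_handler.processor import InferenceProcessor

processor = InferenceProcessor(tokenizer=tokenizer, max_seq_len=128)
dicts = [{"text": "FARM turns raw text into PyTorch datasets."}]
dataset, tensor_names = processor.dataset_from_dicts(dicts)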
class farm.data_handler.processor.NERProcessor(tokenizer, max_seq_len, data_dir, label_list=None, metric=None, train_filename='train.txt', dev_filename='dev.txt', test_filename='test.txt', dev_split=0.0, delimiter='\t', proxies=None, **kwargs)[source]¶
Bases: farm.data_handler.processor.Processor
Used to handle most NER datasets, like CoNLL or GermEval 2014.
__init__(tokenizer, max_seq_len, data_dir, label_list=None, metric=None, train_filename='train.txt', dev_filename='dev.txt', test_filename='test.txt', dev_split=0.0, delimiter='\t', proxies=None, **kwargs)[source]¶
- Parameters
tokenizer – Used to split a sentence (str) into tokens.
max_seq_len (int) – Samples are truncated after this many tokens.
train_filename (str) – The name of the file containing training data.
dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.
test_filename (str) – The name of the file containing test data.
dev_split (float) – The proportion of the train set that will be sliced. Only works if dev_filename is set to None.
data_dir (str) – The directory in which the train, test and perhaps dev files can be found.
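A construction sketch for a CoNLL-style corpus; the label list and the metric name (“seq_f1”) are illustrative assumptions.

from farm.data_handler.processor import NERProcessor

ner_labels = ["[PAD]", "X", "O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
processor = NERProcessor(
    tokenizer=tokenizer,
    max_seq_len=128,
    data_dir="data/conll03-en",   # expects train.txt, dev.txt and test.txt
    label_list=ner_labels,
    metric="seq_f1",              # assumed metric name
)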
class farm.data_handler.processor.BertStyleLMProcessor(tokenizer, max_seq_len, data_dir, train_filename='train.txt', dev_filename='dev.txt', test_filename='test.txt', dev_split=0.0, next_sent_pred=True, max_docs=None, proxies=None, **kwargs)[source]¶
Bases: farm.data_handler.processor.Processor
Prepares data for masked language model training and next sentence prediction in the style of BERT.
__init__(tokenizer, max_seq_len, data_dir, train_filename='train.txt', dev_filename='dev.txt', test_filename='test.txt', dev_split=0.0, next_sent_pred=True, max_docs=None, proxies=None, **kwargs)[source]¶
- Parameters
tokenizer – Used to split a sentence (str) into tokens.
max_seq_len (int) – Samples are truncated after this many tokens.
train_filename (str) – The name of the file containing training data.
dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.
test_filename (str) – The name of the file containing test data.
dev_split (float) – The proportion of the train set that will be sliced. Only works if dev_filename is set to None.
data_dir (str) – The directory in which the train, test and perhaps dev files can be found.
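A sketch for language-model pretraining data; turning off next_sent_pred and capping max_docs are illustrative choices, not defaults.

from farm.data_handler.processor import BertStyleLMProcessor

processor = BertStyleLMProcessor(
    tokenizer=tokenizer,
    max_seq_len=128,
    data_dir="data/lm_corpus",   # expects train.txt, dev.txt and test.txt
    next_sent_pred=False,        # masked LM only, skip the next-sentence task
    max_docs=1000,               # read at most 1000 documents
)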
class farm.data_handler.processor.SquadProcessor(tokenizer, max_seq_len, data_dir, label_list=None, metric='squad', train_filename='train-v2.0.json', dev_filename='dev-v2.0.json', test_filename=None, dev_split=0, doc_stride=128, max_query_length=64, proxies=None, **kwargs)[source]¶
Bases: farm.data_handler.processor.Processor
Used to handle the SQuAD dataset.
__init__(tokenizer, max_seq_len, data_dir, label_list=None, metric='squad', train_filename='train-v2.0.json', dev_filename='dev-v2.0.json', test_filename=None, dev_split=0, doc_stride=128, max_query_length=64, proxies=None, **kwargs)[source]¶
- Parameters
tokenizer – Used to split a sentence (str) into tokens.
max_seq_len (int) – Samples are truncated after this many tokens.
data_dir (str) – The directory in which the train and dev files can be found. SQuAD has a private test file.
label_list (list) – list of labels to predict (strings). For most cases this should be: [“start_token”, “end_token”]
metric (str) – name of metric that shall be used for evaluation, can be “squad” or “squad_top_recall”
train_filename (str) – The name of the file containing training data.
dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.
test_filename (str or None) – None by default, since SQuAD has a private test file.
dev_split (float) – The proportion of the train set that will be sliced. Only works if dev_filename is set to None.
doc_stride (int) – When the document containing the answer is too long, it gets split into parts, strided by doc_stride.
max_query_length (int) – Maximum length of the question (in number of subword tokens)
kwargs (object) – placeholder for passing generic parameters
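A construction sketch using the documented defaults; max_seq_len=384 is a common choice for SQuAD but an assumption here.

from farm.data_handler.processor import SquadProcessor

processor = SquadProcessor(
    tokenizer=tokenizer,
    max_seq_len=384,
    data_dir="data/squad20",     # contains train-v2.0.json and dev-v2.0.json
    label_list=["start_token", "end_token"],
    metric="squad",
    doc_stride=128,              # sliding-window stride for long documents
    max_query_length=64,
)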
dataset_from_dicts(dicts, index=None, rest_api_schema=False, return_baskets=False)[source]¶
Overwrites the method from the base class since question answering processing is quite different. This method allows for documents and questions to be tokenized earlier. Then SampleBaskets are initialized with one document and one question.
class farm.data_handler.processor.RegressionProcessor(tokenizer, max_seq_len, data_dir, train_filename='train.tsv', dev_filename=None, test_filename='test.tsv', dev_split=0.1, delimiter='\t', quote_char="'", skiprows=None, label_column_name='label', label_name='regression_label', scaler_mean=None, scaler_scale=None, proxies=None, **kwargs)[source]¶
Bases: farm.data_handler.processor.Processor
Used to handle a regression dataset in tab-separated text + label format.
__init__(tokenizer, max_seq_len, data_dir, train_filename='train.tsv', dev_filename=None, test_filename='test.tsv', dev_split=0.1, delimiter='\t', quote_char="'", skiprows=None, label_column_name='label', label_name='regression_label', scaler_mean=None, scaler_scale=None, proxies=None, **kwargs)[source]¶
- Parameters
tokenizer – Used to split a sentence (str) into tokens.
max_seq_len (int) – Samples are truncated after this many tokens.
train_filename (str) – The name of the file containing training data.
dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.
test_filename (str) – The name of the file containing test data.
dev_split (float) – The proportion of the train set that will be sliced. Only works if dev_filename is set to None.
data_dir (str) – The directory in which the train, test and perhaps dev files can be found.
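A sketch for a tab-separated regression dataset; the column name is an assumption, and when scaler_mean and scaler_scale are left as None they can presumably be fitted from the training labels.

from farm.data_handler.processor import RegressionProcessor

processor = RegressionProcessor(
    tokenizer=tokenizer,
    max_seq_len=128,
    data_dir="data/similarity_scores",   # contains train.tsv and test.tsv
    label_column_name="score",           # assumed column holding float labels
)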
Data Silo¶
class farm.data_handler.data_silo.DataSilo(processor, batch_size, distributed=False, automatic_loading=True)[source]¶
Bases: object
Generates and stores PyTorch DataLoader objects for the train, dev and test datasets. Relies upon functionality in the processor to do the conversion of the data. Will also calculate and display some statistics.
__init__(processor, batch_size, distributed=False, automatic_loading=True)[source]¶
- Parameters
processor (Processor) – A dataset-specific Processor object which will turn input (file or dict) into a PyTorch Dataset.
batch_size (int) – The size of batch that should be returned by the DataLoaders.
distributed (bool) – Set to True if the program is running in a distributed setting.
automatic_loading (bool) – Set to False, if you don’t want to automatically load data at initialization.
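Wiring a processor into a DataSilo is a one-liner; the train, dev and test DataLoaders are then built internally from the processor’s files.

from farm.data_handler.data_silo import DataSilo

# `processor` is any configured Processor from the section above
data_silo = DataSilo(processor=processor, batch_size=32)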
random_split_ConcatDataset(ds, lengths)[source]¶
Roughly splits a ConcatDataset into non-overlapping new datasets of the given lengths. Samples inside the ConcatDataset should already be shuffled.
- Parameters
ds (Dataset) – Dataset to be split
lengths (list) – lengths of splits to be produced
Dataset¶
farm.data_handler.dataset.convert_features_to_dataset(features)[source]¶
Converts a list of feature dictionaries (one for each sample) into a PyTorch Dataset.
- Parameters
features – A list of dictionaries. Each dictionary corresponds to one sample. Its keys are the names of the feature types and its values are the features themselves.
- Returns
A PyTorch Dataset and a list of tensor names.
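A toy sketch with hand-written feature dicts; in practice these dicts are produced by a Processor’s _sample_to_features step, and the token IDs below are meaningless placeholders.

from farm.data_handler.dataset import convert_features_to_dataset

features = [
    {"input_ids": [101, 7592, 102, 0], "padding_mask": [1, 1, 1, 0], "segment_ids": [0, 0, 0, 0]},
    {"input_ids": [101, 2088, 999, 102], "padding_mask": [1, 1, 1, 1], "segment_ids": [0, 0, 0, 0]},
]
dataset, tensor_names = convert_features_to_dataset(features)
# tensor_names lists the feature names, e.g. ["input_ids", "padding_mask", "segment_ids"]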
DataLoader¶
class farm.data_handler.dataloader.NamedDataLoader(dataset, sampler, batch_size, tensor_names)[source]¶
Bases: torch.utils.data.dataloader.DataLoader
A modified version of the PyTorch DataLoader that returns a dictionary where the key is the name of the tensor and the value is the tensor itself.
__init__(dataset, sampler, batch_size, tensor_names)[source]¶
- Parameters
dataset (Dataset) – The dataset that will be wrapped by this NamedDataLoader
sampler (Sampler) – The sampler used by the NamedDataLoader to choose which samples to include in the batch
batch_size (int) – The size of the batch to be returned by the NamedDataLoader
tensor_names (list) – The names of the tensors, in the order that the dataset returns them.
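A sketch that wraps the dataset from the convert_features_to_dataset example above; SequentialSampler is standard PyTorch.

from torch.utils.data import SequentialSampler
from farm.data_handler.dataloader import NamedDataLoader

loader = NamedDataLoader(
    dataset=dataset,
    sampler=SequentialSampler(dataset),
    batch_size=2,
    tensor_names=tensor_names,
)
for batch in loader:
    print(batch["input_ids"].shape)   # each batch is a dict of named tensors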
farm.data_handler.dataloader.covert_dataset_to_dataloader(dataset, sampler, batch_size)[source]¶
Wraps a PyTorch Dataset with a DataLoader.
- Parameters
dataset (Dataset) – Dataset to be wrapped.
sampler (Sampler) – PyTorch sampler used to pick samples in a batch.
batch_size (int) – Number of samples in the batch.
- Returns
A DataLoader that wraps the input Dataset.
Samples¶
class farm.data_handler.samples.SampleBasket(id: str, raw: dict, samples=None)[source]¶
Bases: object
An object that contains one source text and the one or more samples that will be processed. This is needed for tasks like question answering where the source text can generate multiple input-label pairs.
class farm.data_handler.samples.Sample(id, clear_text, tokenized=None, features=None)[source]¶
Bases: object
A single training/test sample. This should contain the input and the label. It is initialized with the human readable clear_text. Over the course of data preprocessing, this object is populated with tokenized and featurized versions of the data.
__init__(id, clear_text, tokenized=None, features=None)[source]¶
- Parameters
id (str) – The unique id of the sample
clear_text (dict) – A dictionary containing various human readable fields (e.g. text, label).
tokenized (dict) – A dictionary containing the tokenized version of the clear text plus helpful metadata: offsets (start position of each token in the original text) and start_of_word (a boolean indicating whether a token is the first one of a word).
features (dict) – A dictionary containing features in a vectorized format needed by the model to process this sample.
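A sketch of a freshly created Sample before any preprocessing; the field names inside clear_text are illustrative.

from farm.data_handler.samples import Sample

sample = Sample(
    id="train-42",
    clear_text={"text": "FARM makes transfer learning simple.", "label": "positive"},
)
# the processor later populates sample.tokenized and sample.features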
class farm.data_handler.samples.Squad_cleartext(qas_id, question_text, doc_tokens, orig_answer_text, start_position, end_position, is_impossible)[source]¶
Bases: object
farm.data_handler.samples.create_sample_one_label_one_text(raw_data, text_index, label_index, basket_id)[source]¶
farm.data_handler.samples.create_samples_squad(dictionary, max_query_len, max_seq_len, doc_stride, n_special_tokens)[source]¶
This method will split question-document pairs from the SampleBasket into question-passage pairs which will each form one sample. The “t” and “c” in variable names stand for token and character respectively.
farm.data_handler.samples.process_answers(answers, doc_offsets, passage_start_c, passage_start_t)[source]¶
Processes the potentially multiple answers per question (cf. the SQuAD dev set) and returns their start and end indices relative to the passage (not the document).
- Parameters
answers – The list of answer annotations for one question.
doc_offsets – The start character index of each token in the document text.
passage_start_c – The start character index of the passage.
passage_start_t – The start token index of the passage.
- Returns
The answers with start and end indices relative to the passage.
farm.data_handler.samples.chunk_into_passages(doc_offsets, doc_stride, passage_len_t, doc_text)[source]¶
Returns a list of dictionaries which each describe the start, end and id of a passage that is formed when chunking a document using a sliding window approach.
Input Features¶
Contains functions that turn readable clear text input into dictionaries of features
farm.data_handler.input_features.sample_to_features_text(sample, tasks, max_seq_len, tokenizer)[source]¶
Generates a dictionary of features for a given input sample that is to be consumed by a text classification model.
- Parameters
sample (Sample) – Sample object that contains human readable text and label fields from a single text classification data sample
tasks (dict) – A dictionary where the keys are the names of the tasks and the values are the details of the task (e.g. label_list, metric, tensor name)
max_seq_len (int) – Sequences are truncated after this many tokens
tokenizer – A tokenizer object that can turn string sentences into a list of tokens
- Returns
A list with one dictionary containing the keys “input_ids”, “padding_mask” and “segment_ids” (also “label_ids” if not in inference mode). The values are lists containing those features.
- Return type
list
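A call sketch reusing names from the examples above; that the processor exposes its registered tasks as processor.tasks is an assumption.

from farm.data_handler.input_features import sample_to_features_text

feature_dicts = sample_to_features_text(
    sample=sample,              # a tokenized Sample
    tasks=processor.tasks,      # assumed attribute holding the registered tasks
    max_seq_len=128,
    tokenizer=tokenizer,
)
# returns a list with one dict:
# [{"input_ids": [...], "padding_mask": [...], "segment_ids": [...], "label_ids": [...]}]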
farm.data_handler.input_features.samples_to_features_ner(sample, tasks, max_seq_len, tokenizer, non_initial_token='X', **kwargs)[source]¶
Generates a dictionary of features for a given input sample that is to be consumed by an NER model.
- Parameters
sample (Sample) – Sample object that contains human readable text and label fields from a single NER data sample
tasks (dict) – A dictionary where the keys are the names of the tasks and the values are the details of the task (e.g. label_list, metric, tensor name)
max_seq_len (int) – Sequences are truncated after this many tokens
tokenizer – A tokenizer object that can turn string sentences into a list of tokens
non_initial_token – Token that is inserted into the label sequence in positions where there is a non-word-initial token. This is done since by default NER performs prediction only on word-initial tokens.
- Returns
A list with one dictionary containing the keys “input_ids”, “padding_mask”, “segment_ids”, “initial_mask” (also “label_ids” if not in inference mode). The values are lists containing those features.
- Return type
list
farm.data_handler.input_features.samples_to_features_bert_lm(sample, max_seq_len, tokenizer, next_sent_pred=True)[source]¶
Converts a raw sample (pair of sentences as tokenized strings) into a proper training sample with IDs, LM labels, padding_mask, CLS and SEP tokens, etc.
- Parameters
sample (Sample) – Sample, containing sentence input as strings and is_next label
max_seq_len (int) – Maximum length of sequence.
tokenizer – Tokenizer
- Returns
InputFeatures, containing all inputs and labels of one sample as IDs (as used for model training)
farm.data_handler.input_features.sample_to_features_squad(sample, tokenizer, max_seq_len, max_answers=6)[source]¶
Prepares data for processing by the model. Supports cases where there are multiple answers for one question/document pair. max_answers defaults to 6 since that is the largest number of answers in the SQuAD 2.0 dev set.
farm.data_handler.input_features.generate_labels(answers, passage_len_t, question_len_t, tokenizer, max_answers)[source]¶
Creates a QA label for each answer in answers. The labels are the indices of the start and end tokens relative to the passage. They are contained in an array of size (max_answers, 2). -1 is used to fill the array since the number of answers is often less than max_answers. The index values take into consideration the question tokens, and also special tokens such as [CLS]. When the answer is not fully contained in the passage, or the question is impossible to answer, the start_idx and end_idx are 0, i.e. start and end are on the very first token (in most models, this is the [CLS] token).
farm.data_handler.input_features.combine_vecs(question_vec, passage_vec, tokenizer, spec_tok_val=-1)[source]¶
Combines a question_vec and passage_vec in a style that is appropriate to the model. Will add slots in the returned vector for special tokens like [CLS], where the value is determined by spec_tok_val.