Data Handling

Processor

class farm.data_handler.processor.Processor(tokenizer, max_seq_len, train_filename, dev_filename, test_filename, dev_split, data_dir, tasks={}, proxies=None)[source]

Bases: abc.ABC

Used to generate PyTorch Datasets from input data. An implementation of this abstract class should be created for each new data source. Implement the abstract methods file_to_dicts(), _dict_to_samples() and _sample_to_features() to make it compatible with your data format.

__init__(tokenizer, max_seq_len, train_filename, dev_filename, test_filename, dev_split, data_dir, tasks={}, proxies=None)[source]
Parameters
  • tokenizer – Used to split a sentence (str) into tokens.

  • max_seq_len (int) – Samples are truncated after this many tokens.

  • train_filename (str) – The name of the file containing training data.

  • dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.

  • test_filename (str) – The name of the file containing test data.

  • dev_split (float) – The proportion of the train set that will be sliced off to form the dev set. Only works if dev_filename is set to None.

  • data_dir (str) – The directory in which the train, test and perhaps dev files can be found.

classmethod load(processor_name, data_dir, tokenizer, max_seq_len, train_filename, dev_filename, test_filename, dev_split, **kwargs)[source]

Loads the class of processor specified by processor name.

Parameters
  • processor_name (str) – The class of processor to be loaded.

  • data_dir (str) – Directory where data files are located.

  • tokenizer – A tokenizer object

  • max_seq_len (int) – Sequences longer than this will be truncated.

  • train_filename (str) – The name of the file containing training data.

  • dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.

  • test_filename (str) – The name of the file containing test data.

  • dev_split (float) – The proportion of the train set that will be sliced off to form the dev set. Only works if dev_filename is set to None.

  • kwargs (object) – placeholder for passing generic parameters

Returns

An instance of the specified processor.

classmethod load_from_dir(load_dir)[source]

Infers the specific type of Processor (e.g. GNADProcessor) from a config file and loads an instance of it.

Parameters

load_dir (str) – Directory that contains a ‘processor_config.json’

Returns

An instance of a Processor Subclass (e.g. GNADProcessor)

save(save_dir)[source]

Saves the vocabulary to file and also creates a json file containing all the information needed to load the same processor.

Parameters

save_dir (str) – Directory where the files are to be saved
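
Example (a minimal sketch of how load(), save() and load_from_dir() fit together; the tokenizer import, model name, paths and label list are illustrative assumptions, and it is assumed that processor subclasses are registered under their class names):

    from farm.data_handler.processor import Processor
    from farm.modeling.tokenization import Tokenizer  # assumed helper outside this module

    # Illustrative tokenizer; any tokenizer that splits a sentence into tokens works.
    tokenizer = Tokenizer.load(pretrained_model_name_or_path="bert-base-cased")

    # Load a concrete processor by name; label_list and metric are forwarded
    # to the subclass constructor via **kwargs.
    processor = Processor.load(
        processor_name="TextClassificationProcessor",
        data_dir="data/my_corpus",
        tokenizer=tokenizer,
        max_seq_len=128,
        train_filename="train.tsv",
        dev_filename=None,
        test_filename="test.tsv",
        dev_split=0.1,
        label_list=["negative", "positive"],
        metric="acc",
    )

    # Persist the vocabulary plus a processor_config.json ...
    processor.save("saved_models/my_processor")

    # ... and restore the very same processor later.
    processor = Processor.load_from_dir("saved_models/my_processor")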

generate_config()[source]

Generates a config file from class and instance attributes (only for sensible config parameters).

add_task(name, metric, label_list, label_column_name=None, label_name=None, task_type=None)[source]
abstract file_to_dicts(file: str) → [dict][source]
dataset_from_dicts(dicts, index=0, rest_api_schema=False, return_baskets=False)[source]

Contains all the functionality to turn a list of dict objects into a PyTorch Dataset and a list of tensor names. This can be used for inference mode.

Parameters

dicts (list of dicts) – List of dictionaries where each contains the data of one input sample.

Returns

A PyTorch dataset and a list of tensor names.
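
Example (a rough sketch of inference-mode usage; processor is any concrete Processor, e.g. the one loaded above, and the dictionary keys are illustrative since they depend on the concrete subclass):

    # Dictionaries, e.g. parsed from an API request body.
    dicts = [
        {"text": "FARM turns transfer learning into farming."},
        {"text": "The dev split was carved out of the train set."},
    ]

    dataset, tensor_names = processor.dataset_from_dicts(dicts)
    # `dataset` is a PyTorch Dataset with one entry per input dict;
    # `tensor_names` lists names such as "input_ids" or "padding_mask".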

class farm.data_handler.processor.TextClassificationProcessor(tokenizer, max_seq_len, data_dir, label_list=None, metric=None, train_filename='train.tsv', dev_filename=None, test_filename='test.tsv', dev_split=0.1, delimiter='\t', quote_char="'", skiprows=None, label_column_name='label', multilabel=False, header=0, proxies=None, **kwargs)[source]

Bases: farm.data_handler.processor.Processor

Used to handle text classification datasets that come in tabular format (CSV, TSV, etc.).

__init__(tokenizer, max_seq_len, data_dir, label_list=None, metric=None, train_filename='train.tsv', dev_filename=None, test_filename='test.tsv', dev_split=0.1, delimiter='\t', quote_char="'", skiprows=None, label_column_name='label', multilabel=False, header=0, proxies=None, **kwargs)[source]
Parameters
  • tokenizer – Used to split a sentence (str) into tokens.

  • max_seq_len (int) – Samples are truncated after this many tokens.

  • train_filename (str) – The name of the file containing training data.

  • dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.

  • test_filename (str) – The name of the file containing test data.

  • dev_split (float) – The proportion of the train set that will be sliced off to form the dev set. Only works if dev_filename is set to None.

  • data_dir (str) – The directory in which the train, test and perhaps dev files can be found.

file_to_dicts(file: str) → [dict][source]
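
Example (a minimal construction sketch; the tokenizer is assumed to be created as in the earlier Processor.load example, and the directory, label list and column name are illustrative):

    from farm.data_handler.processor import TextClassificationProcessor

    processor = TextClassificationProcessor(
        tokenizer=tokenizer,
        max_seq_len=128,
        data_dir="data/sentiment",
        label_list=["negative", "positive"],
        metric="acc",
        label_column_name="sentiment",  # column in the TSV holding the labels
        delimiter="\t",
    )
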
class farm.data_handler.processor.InferenceProcessor(tokenizer, max_seq_len, **kwargs)[source]

Bases: farm.data_handler.processor.Processor

Generic processor used at inference time:

  • fast

  • no labels

  • pure encoding of text into a PyTorch dataset

  • doesn’t read from file, but only consumes dictionaries (e.g. coming from API requests)

__init__(tokenizer, max_seq_len, **kwargs)[source]
Parameters
  • tokenizer – Used to split a sentence (str) into tokens.

  • max_seq_len (int) – Samples are truncated after this many tokens.

  • train_filename (str) – The name of the file containing training data.

  • dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.

  • test_filename (str) – The name of the file containing test data.

  • dev_split (float) – The proportion of the train set that will be sliced off to form the dev set. Only works if dev_filename is set to None.

  • data_dir (str) – The directory in which the train, test and perhaps dev files can be found.

classmethod load_from_dir(load_dir)[source]

Overrides the method from the parent class to always load an InferenceProcessor instead of the specific class stored in the config.

Parameters

load_dir (str) – Directory that contains a ‘processor_config.json’

Returns

An instance of an InferenceProcessor

file_to_dicts(file: str) → [dict][source]
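
Example (a minimal sketch; because the InferenceProcessor never reads labels or files, only a tokenizer, assumed to exist as above, and a maximum sequence length are required):

    from farm.data_handler.processor import InferenceProcessor

    processor = InferenceProcessor(tokenizer=tokenizer, max_seq_len=128)

    # Dictionaries (e.g. from an API request) go straight into a PyTorch Dataset.
    dataset, tensor_names = processor.dataset_from_dicts(
        [{"text": "This text only arrives at inference time."}]
    )
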
class farm.data_handler.processor.NERProcessor(tokenizer, max_seq_len, data_dir, label_list=None, metric=None, train_filename='train.txt', dev_filename='dev.txt', test_filename='test.txt', dev_split=0.0, delimiter='\t', proxies=None, **kwargs)[source]

Bases: farm.data_handler.processor.Processor

Used to handle most NER datasets, like CoNLL or GermEval 2014

__init__(tokenizer, max_seq_len, data_dir, label_list=None, metric=None, train_filename='train.txt', dev_filename='dev.txt', test_filename='test.txt', dev_split=0.0, delimiter='\t', proxies=None, **kwargs)[source]
Parameters
  • tokenizer – Used to split a sentence (str) into tokens.

  • max_seq_len (int) – Samples are truncated after this many tokens.

  • train_filename (str) – The name of the file containing training data.

  • dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.

  • test_filename (str) – The name of the file containing test data.

  • dev_split (float) – The proportion of the train set that will be sliced off to form the dev set. Only works if dev_filename is set to None.

  • data_dir (str) – The directory in which the train, test and perhaps dev files can be found.

file_to_dicts(file: str) → [dict][source]
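
Example (a minimal sketch for a CoNLL-style corpus; the BIO label list and the metric name are illustrative assumptions, and the tokenizer is assumed to be created as above):

    from farm.data_handler.processor import NERProcessor

    ner_labels = ["[PAD]", "X", "O",
                  "B-PER", "I-PER", "B-ORG", "I-ORG",
                  "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

    processor = NERProcessor(
        tokenizer=tokenizer,
        max_seq_len=128,
        data_dir="data/conll03",
        label_list=ner_labels,
        metric="seq_f1",
    )
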
class farm.data_handler.processor.BertStyleLMProcessor(tokenizer, max_seq_len, data_dir, train_filename='train.txt', dev_filename='dev.txt', test_filename='test.txt', dev_split=0.0, next_sent_pred=True, max_docs=None, proxies=None, **kwargs)[source]

Bases: farm.data_handler.processor.Processor

Prepares data for masked language model training and next sentence prediction in the style of BERT

__init__(tokenizer, max_seq_len, data_dir, train_filename='train.txt', dev_filename='dev.txt', test_filename='test.txt', dev_split=0.0, next_sent_pred=True, max_docs=None, proxies=None, **kwargs)[source]
Parameters
  • tokenizer – Used to split a sentence (str) into tokens.

  • max_seq_len (int) – Samples are truncated after this many tokens.

  • train_filename (str) – The name of the file containing training data.

  • dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.

  • test_filename (str) – The name of the file containing test data.

  • dev_split (float) – The proportion of the train set that will be sliced off to form the dev set. Only works if dev_filename is set to None.

  • data_dir (str) – The directory in which the train, test and perhaps dev files can be found.

file_to_dicts(file: str) → list[source]
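
Example (a minimal sketch; the corpus layout of one sentence per line with documents separated by blank lines is the usual BERT-style convention and is assumed here rather than documented above):

    from farm.data_handler.processor import BertStyleLMProcessor

    processor = BertStyleLMProcessor(
        tokenizer=tokenizer,          # created as in the earlier sketches
        max_seq_len=128,
        data_dir="data/lm_corpus",
        train_filename="train.txt",
        dev_filename="dev.txt",
        next_sent_pred=True,          # also build next-sentence-prediction labels
    )
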
class farm.data_handler.processor.SquadProcessor(tokenizer, max_seq_len, data_dir, label_list=None, metric='squad', train_filename='train-v2.0.json', dev_filename='dev-v2.0.json', test_filename=None, dev_split=0, doc_stride=128, max_query_length=64, proxies=None, **kwargs)[source]

Bases: farm.data_handler.processor.Processor

Used to handle the SQuAD dataset

__init__(tokenizer, max_seq_len, data_dir, label_list=None, metric='squad', train_filename='train-v2.0.json', dev_filename='dev-v2.0.json', test_filename=None, dev_split=0, doc_stride=128, max_query_length=64, proxies=None, **kwargs)[source]
Parameters
  • tokenizer – Used to split a sentence (str) into tokens.

  • max_seq_len (int) – Samples are truncated after this many tokens.

  • data_dir (str) – The directory in which the train and dev files can be found. SQuAD has a private test file.

  • label_list (list) – list of labels to predict (strings). For most cases this should be: [“start_token”, “end_token”]

  • metric (str) – name of metric that shall be used for evaluation, can be “squad” or “squad_top_recall”

  • train_filename (str) – The name of the file containing training data.

  • dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.

  • test_filename (str or None) – None by default, since the SQuAD test set is not public.

  • dev_split (float) – The proportion of the train set that will be sliced off to form the dev set. Only works if dev_filename is set to None.

  • doc_stride (int) – When the document containing the answer is too long, it gets split into parts, strided by doc_stride

  • max_query_length (int) – Maximum length of the question (in number of subword tokens)

  • kwargs (object) – placeholder for passing generic parameters

dataset_from_dicts(dicts, index=None, rest_api_schema=False, return_baskets=False)[source]

Overrides the method from the base class since Question Answering processing is quite different. This method allows for documents and questions to be tokenized earlier. Then SampleBaskets are initialized with one document and one question.

apply_tokenization(dictionary)[source]

This performs tokenization on all documents and questions. The result is a flat (unnested) list where each entry is a dictionary for one document-question pair (with potentially multiple answers).

file_to_dicts(file: str) → [dict][source]
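
Example (a minimal sketch using the documented defaults for the SQuAD 2.0 file names; the data directory and max_seq_len are illustrative, and the tokenizer is assumed to be created as above):

    from farm.data_handler.processor import SquadProcessor

    processor = SquadProcessor(
        tokenizer=tokenizer,
        max_seq_len=384,
        data_dir="data/squad20",
        label_list=["start_token", "end_token"],
        metric="squad",
        doc_stride=128,
        max_query_length=64,
    )
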
class farm.data_handler.processor.RegressionProcessor(tokenizer, max_seq_len, data_dir, train_filename='train.tsv', dev_filename=None, test_filename='test.tsv', dev_split=0.1, delimiter='\t', quote_char="'", skiprows=None, label_column_name='label', label_name='regression_label', scaler_mean=None, scaler_scale=None, proxies=None, **kwargs)[source]

Bases: farm.data_handler.processor.Processor

Used to handle a regression dataset in tab-separated format (text + label).

__init__(tokenizer, max_seq_len, data_dir, train_filename='train.tsv', dev_filename=None, test_filename='test.tsv', dev_split=0.1, delimiter='\t', quote_char="'", skiprows=None, label_column_name='label', label_name='regression_label', scaler_mean=None, scaler_scale=None, proxies=None, **kwargs)[source]
Parameters
  • tokenizer – Used to split a sentence (str) into tokens.

  • max_seq_len (int) – Samples are truncated after this many tokens.

  • train_filename (str) – The name of the file containing training data.

  • dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.

  • test_filename (str) – The name of the file containing test data.

  • dev_split (float) – The proportion of the train set that will be sliced off to form the dev set. Only works if dev_filename is set to None.

  • data_dir (str) – The directory in which the train, test and perhaps dev files can be found.

file_to_dicts(file: str) → [dict][source]
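
Example (a minimal sketch for a tab-separated regression dataset; the directory and label column are illustrative, and scaler_mean/scaler_scale are left at their defaults):

    from farm.data_handler.processor import RegressionProcessor

    processor = RegressionProcessor(
        tokenizer=tokenizer,           # created as in the earlier sketches
        max_seq_len=128,
        data_dir="data/sts_b",
        label_column_name="score",     # column in the TSV holding the numeric target
    )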

Data Silo

class farm.data_handler.data_silo.DataSilo(processor, batch_size, distributed=False, automatic_loading=True)[source]

Bases: object

Generates and stores PyTorch DataLoader objects for the train, dev and test datasets. Relies upon functionality in the processor to do the conversion of the data. Will also calculate and display some statistics.

__init__(processor, batch_size, distributed=False, automatic_loading=True)[source]
Parameters
  • processor (Processor) – A dataset specific Processor object which will turn input (file or dict) into a Pytorch Dataset.

  • batch_size (int) – The size of batch that should be returned by the DataLoaders.

  • distributed (bool) – Set to True if the program is running in a distributed setting.

  • automatic_loading (bool) – Set to False, if you don’t want to automatically load data at initialization.
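
Example (a minimal sketch; `processor` is any of the processors constructed in the sketches above, and the string keys passed to get_data_loader(), listed below, follow the train/dev/test naming used throughout this module):

    from farm.data_handler.data_silo import DataSilo

    # Builds train/dev/test Datasets via the processor and wraps them in DataLoaders.
    data_silo = DataSilo(processor=processor, batch_size=32)

    train_loader = data_silo.get_data_loader("train")
    dev_loader = data_silo.get_data_loader("dev")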

random_split_ConcatDataset(ds, lengths)[source]

Roughly split a ConcatDataset into non-overlapping new datasets of the given lengths. Samples inside the ConcatDataset should already be shuffled.

Parameters
  • ds (Dataset) – Dataset to be split

  • lengths (list) – lengths of splits to be produced

calculate_class_weights(task_name, source='train')[source]

For imbalanced datasets, we can calculate class weights that can be used later in the loss function of the prediction head to up-weight the loss of minority classes.

Parameters

task_name (str) – name of the task as used in the processor
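
Example (a minimal sketch; the task name must match the one registered on the processor, and "text_classification" is only an assumption here for a TextClassificationProcessor):

    # Class weights computed on the training split of an imbalanced dataset.
    class_weights = data_silo.calculate_class_weights(task_name="text_classification")

    # These weights can then be passed to the loss of the corresponding
    # prediction head to up-weight minority classes.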

get_data_loader(dataset)[source]
n_samples(dataset)[source]

Returns the number of samples in a given dataset.

Parameters

dataset – Choose from train, dev or test

Dataset

farm.data_handler.dataset.convert_features_to_dataset(features)[source]

Converts a list of feature dictionaries (one for each sample) into a PyTorch Dataset.

Parameters

features – A list of dictionaries. Each dictionary corresponds to one sample. Its keys are the names of the feature types and its values are the features themselves.

Returns

A PyTorch dataset and a list of tensor names.
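
Example (a minimal hand-rolled sketch; the feature names and values are illustrative, and every dictionary must provide the same keys):

    from farm.data_handler.dataset import convert_features_to_dataset

    features = [
        {"input_ids": [101, 2023, 102, 0], "padding_mask": [1, 1, 1, 0], "segment_ids": [0, 0, 0, 0]},
        {"input_ids": [101, 2008, 102, 0], "padding_mask": [1, 1, 1, 0], "segment_ids": [0, 0, 0, 0]},
    ]

    dataset, tensor_names = convert_features_to_dataset(features)
    # tensor_names now lists the feature names, e.g. ["input_ids", "padding_mask", "segment_ids"]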

DataLoader

class farm.data_handler.dataloader.NamedDataLoader(dataset, sampler, batch_size, tensor_names)[source]

Bases: torch.utils.data.dataloader.DataLoader

A modified version of the PyTorch DataLoader that returns a dictionary where the key is the name of the tensor and the value is the tensor itself.

__init__(dataset, sampler, batch_size, tensor_names)[source]
Parameters
  • dataset (Dataset) – The dataset that will be wrapped by this NamedDataLoader

  • sampler (Sampler) – The sampler used by the NamedDataLoader to choose which samples to include in the batch

  • batch_size (int) – The size of the batch to be returned by the NamedDataLoader

  • tensor_names (list) – The names of the tensors, in the order in which the dataset returns them.
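
Example (a minimal sketch, reusing `dataset` and `tensor_names` from the convert_features_to_dataset example above; the choice of sampler is illustrative):

    from torch.utils.data import SequentialSampler
    from farm.data_handler.dataloader import NamedDataLoader

    loader = NamedDataLoader(
        dataset=dataset,
        sampler=SequentialSampler(dataset),
        batch_size=2,
        tensor_names=tensor_names,
    )

    for batch in loader:
        # Each batch is a dict mapping tensor name -> tensor.
        print(batch["input_ids"].shape)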

farm.data_handler.dataloader.covert_dataset_to_dataloader(dataset, sampler, batch_size)[source]

Wraps a PyTorch Dataset with a DataLoader.

Parameters
  • dataset (Dataset) – Dataset to be wrapped.

  • sampler (Sampler) – PyTorch sampler used to pick samples in a batch.

  • batch_size – Number of samples in the batch.

Returns

A DataLoader that wraps the input Dataset.

Samples

class farm.data_handler.samples.SampleBasket(id: str, raw: dict, samples=None)[source]

Bases: object

An object that contains one source text and the one or more samples that will be processed. This is needed for tasks like question answering where the source text can generate multiple input-label pairs.

__init__(id: str, raw: dict, samples=None)[source]
Parameters
  • id (str) – A unique identifying id.

  • raw (dict) – Contains the various data needed to form a sample. It is ideally in human readable form.

  • samples – An optional list of Samples used to populate the basket at initialization.

class farm.data_handler.samples.Sample(id, clear_text, tokenized=None, features=None)[source]

Bases: object

A single training/test sample. This should contain the input and the label. Is initialized with the human readable clear_text. Over the course of data preprocessing, this object is populated with tokenized and featurized versions of the data.

__init__(id, clear_text, tokenized=None, features=None)[source]
Parameters
  • id (str) – The unique id of the sample

  • clear_text (dict) – A dictionary containing various human readable fields (e.g. text, label).

  • tokenized (dict) – A dictionary containing the tokenized version of clear text plus helpful meta data: offsets (start position of each token in the original text) and start_of_word (boolean if a token is the first one of a word).

  • features (dict) – A dictionary containing features in a vectorized format needed by the model to process this sample.
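
Example (a minimal sketch of the two container objects; the field names inside clear_text and raw are illustrative and depend on the task):

    from farm.data_handler.samples import Sample, SampleBasket

    sample = Sample(
        id="42-0",
        clear_text={"text": "FARM turns transfer learning into farming.", "label": "positive"},
    )

    basket = SampleBasket(
        id="42",
        raw={"text": "FARM turns transfer learning into farming.", "label": "positive"},
        samples=[sample],
    )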

class farm.data_handler.samples.Squad_cleartext(qas_id, question_text, doc_tokens, orig_answer_text, start_position, end_position, is_impossible)[source]

Bases: object

__init__(qas_id, question_text, doc_tokens, orig_answer_text, start_position, end_position, is_impossible)[source]

Initialize self. See help(type(self)) for accurate signature.

farm.data_handler.samples.create_sample_one_label_one_text(raw_data, text_index, label_index, basket_id)[source]
farm.data_handler.samples.create_sample_ner(split_text, label, basket_id)[source]
farm.data_handler.samples.create_samples_squad(dictionary, max_query_len, max_seq_len, doc_stride, n_special_tokens)[source]

This method will split question-document pairs from the SampleBasket into question-passage pairs which will each form one sample. The “t” and “c” in variables stand for token and character respectively.

farm.data_handler.samples.process_answers(answers, doc_offsets, passage_start_c, passage_start_t)[source]

This processes the potentially multiple answers (cf. the SQuAD dev set) and returns their start and end indices relative to the passage (not the document).

Parameters
  • answers

  • doc_offsets

  • passage_start_c

  • passage_start_t

Returns

farm.data_handler.samples.chunk_into_passages(doc_offsets, doc_stride, passage_len_t, doc_text)[source]

Returns a list of dictionaries which each describe the start, end and id of a passage that is formed when chunking a document using a sliding window approach.

farm.data_handler.samples.offset_to_token_idx(token_offsets, ch_idx)[source]

Returns the idx of the token at the given character idx

farm.data_handler.samples.check_if_training(dictionary)[source]

Input Features

Contains functions that turn readable clear text input into dictionaries of features

farm.data_handler.input_features.sample_to_features_text(sample, tasks, max_seq_len, tokenizer)[source]

Generates a dictionary of features for a given input sample that is to be consumed by a text classification model.

Parameters
  • sample (Sample) – Sample object that contains human readable text and label fields from a single text classification data sample

  • tasks (dict) – A dictionary where the keys are the names of the tasks and the values are the details of the task (e.g. label_list, metric, tensor name)

  • max_seq_len (int) – Sequences are truncated after this many tokens

  • tokenizer – A tokenizer object that can turn string sentences into a list of tokens

Returns

A list with one dictionary containing the keys “input_ids”, “padding_mask” and “segment_ids” (also “label_ids” if not in inference mode). The values are lists containing those features.

Return type

list

farm.data_handler.input_features.samples_to_features_ner(sample, tasks, max_seq_len, tokenizer, non_initial_token='X', **kwargs)[source]

Generates a dictionary of features for a given input sample that is to be consumed by an NER model.

Parameters
  • sample (Sample) – Sample object that contains human readable text and label fields from a single NER data sample

  • tasks (dict) – A dictionary where the keys are the names of the tasks and the values are the details of the task (e.g. label_list, metric, tensor name)

  • max_seq_len (int) – Sequences are truncated after this many tokens

  • tokenizer – A tokenizer object that can turn string sentences into a list of tokens

  • non_initial_token – Token that is inserted into the label sequence in positions where there is a non-word-initial token. This is done since the default NER performs prediction only on word initial tokens

Returns

A list with one dictionary containing the keys “input_ids”, “padding_mask”, “segment_ids”, “initial_mask” (also “label_ids” if not in inference mode). The values are lists containing those features.

Return type

list

farm.data_handler.input_features.samples_to_features_bert_lm(sample, max_seq_len, tokenizer, next_sent_pred=True)[source]

Convert a raw sample (pair of sentences as tokenized strings) into a proper training sample with IDs, LM labels, padding_mask, CLS and SEP tokens etc.

Parameters
  • sample (Sample) – Sample, containing sentence input as strings and is_next label

  • max_seq_len (int) – Maximum length of sequence.

  • tokenizer – Tokenizer

Returns

InputFeatures, containing all inputs and labels of one sample as IDs (as used for model training)

farm.data_handler.input_features.sample_to_features_squad(sample, tokenizer, max_seq_len, max_answers=6)[source]

Prepares data for processing by the model. Supports cases where there are multiple answers for the one question/document pair. max_answers is set to 6 by default since that is the maximum number of answers per question in the SQuAD 2.0 dev set.

farm.data_handler.input_features.generate_labels(answers, passage_len_t, question_len_t, tokenizer, max_answers)[source]

Creates QA labels for each answer in answers. The labels are the indices of the start and end token relative to the passage. They are contained in an array of size (max_answers, 2); -1 is used to fill the array since the number of answers is often less than max_answers. The index values take into account the question tokens and also special tokens such as [CLS]. When the answer is not fully contained in the passage, or the question is impossible to answer, start_idx and end_idx are 0, i.e. start and end point at the very first token (in most models, this is the [CLS] token).
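
An illustration (not the implementation) of the label layout described above, for max_answers=6 and a single answer spanning token indices 14 to 17 of the combined question-plus-passage sequence:

    import numpy as np

    labels = np.full((6, 2), fill_value=-1)   # shape (max_answers, 2), filled with -1
    labels[0] = [14, 17]                      # start/end token index of the only answer

    # An impossible question, or an answer lying outside the passage, would
    # instead yield labels[0] = [0, 0], i.e. both indices point at the first
    # token (in most models the [CLS] token).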

farm.data_handler.input_features.combine_vecs(question_vec, passage_vec, tokenizer, spec_tok_val=-1)[source]

Combine a question_vec and passage_vec in a style that is appropriate to the model. Will add slots in the returned vector for special tokens like [CLS] where the value is determined by spec_tok_val.

farm.data_handler.input_features.answer_in_passage(start_idx, end_idx, passage_len)[source]
farm.data_handler.input_features.sample_to_features_squadOLD(sample, tokenizer, max_seq_len, doc_stride, max_query_length, tasks)[source]