Data Handling¶
Processor¶
class farm.data_handler.processor.Processor(tokenizer, max_seq_len, train_filename, dev_filename, test_filename, dev_split, data_dir, tasks={}, proxies=None)[source]¶
Bases: abc.ABC
Used to generate PyTorch Datasets from input data. An implementation of this abstract class should be created for each new data source. Implement the abstract methods file_to_dicts(), _dict_to_samples() and _sample_to_features() to make it compatible with your data format.
__init__(tokenizer, max_seq_len, train_filename, dev_filename, test_filename, dev_split, data_dir, tasks={}, proxies=None)[source]¶
- Parameters
tokenizer – Used to split a sentence (str) into tokens.
max_seq_len (int) – Samples are truncated after this many tokens.
train_filename (str) – The name of the file containing training data.
dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.
test_filename (str) – The name of the file containing test data.
dev_split (float) – The proportion of the train set that will be sliced. Only works if dev_filename is set to None.
data_dir (str) – The directory in which the train, test and perhaps dev files can be found.
classmethod load(processor_name, data_dir, tokenizer, max_seq_len, train_filename, dev_filename, test_filename, dev_split, **kwargs)[source]¶
Loads the class of processor specified by processor_name.
- Parameters
processor_name (str) – The class of processor to be loaded.
data_dir (str) – Directory where data files are located.
tokenizer – A tokenizer object
max_seq_len (int) – Sequences longer than this will be truncated.
train_filename (str) – The name of the file containing training data.
dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.
test_filename (str) – The name of the file containing test data.
dev_split (float) – The proportion of the train set that will be sliced. Only works if dev_filename is set to None.
kwargs (object) – placeholder for passing generic parameters
- Returns
An instance of the specified processor.
classmethod load_from_dir(load_dir)[source]¶
Infers the specific type of Processor from a config file (e.g. GNADProcessor) and loads an instance of it.
- Parameters
load_dir (str) – Directory that contains a ‘processor_config.json’
- Returns
An instance of a Processor Subclass (e.g. GNADProcessor)
save(save_dir)[source]¶
Saves the vocabulary to file and also creates a JSON file containing all the information needed to load the same processor.
- Parameters
save_dir (str) – Directory where the files are to be saved
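As a quick illustration of how save() and load_from_dir() complement each other, here is a minimal sketch; the processor instance and the directory name are placeholders for illustration, not part of this reference.

from farm.data_handler.processor import Processor

# `processor` stands for any configured Processor subclass instance,
# e.g. a TextClassificationProcessor (see below).
processor.save(save_dir="saved_processor")   # writes vocab + processor_config.json

# Later, restore the same processor without re-specifying its arguments:
processor = Processor.load_from_dir(load_dir="saved_processor")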
generate_config()[source]¶
Generates a config file from class and instance attributes (only for sensible config parameters).
add_task(name, metric, label_list, label_column_name=None, label_name=None, task_type=None)[source]¶
dataset_from_dicts(dicts, index=0, rest_api_schema=False, return_baskets=False)[source]¶
Contains all the functionality to turn a list of dict objects into a PyTorch Dataset and a list of tensor names. This can be used for inference mode.
- Parameters
dicts (list of dicts) – List of dictionaries where each contains the data of one input sample.
- Returns
A PyTorch Dataset and a list of tensor names.
class farm.data_handler.processor.TextClassificationProcessor(tokenizer, max_seq_len, data_dir, label_list=None, metric=None, train_filename='train.tsv', dev_filename=None, test_filename='test.tsv', dev_split=0.1, delimiter='\t', quote_char="'", skiprows=None, label_column_name='label', multilabel=False, header=0, proxies=None, **kwargs)[source]¶
Bases: farm.data_handler.processor.Processor
Used to handle text classification datasets that come in tabular format (CSV, TSV, etc.).
__init__(tokenizer, max_seq_len, data_dir, label_list=None, metric=None, train_filename='train.tsv', dev_filename=None, test_filename='test.tsv', dev_split=0.1, delimiter='\t', quote_char="'", skiprows=None, label_column_name='label', multilabel=False, header=0, proxies=None, **kwargs)[source]¶
- Parameters
tokenizer – Used to split a sentence (str) into tokens.
max_seq_len (int) – Samples are truncated after this many tokens.
train_filename (str) – The name of the file containing training data.
dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.
test_filename (str) – The name of the file containing test data.
dev_split (float) – The proportion of the train set that will be sliced. Only works if dev_filename is set to None.
data_dir (str) – The directory in which the train, test and perhaps dev files can be found.
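A minimal construction sketch follows; the Tokenizer helper, file locations, labels and metric name are assumptions for illustration and are not defined in this section.

from farm.modeling.tokenization import Tokenizer  # assumed FARM helper
from farm.data_handler.processor import TextClassificationProcessor

tokenizer = Tokenizer.load(pretrained_model_name_or_path="bert-base-cased")
processor = TextClassificationProcessor(
    tokenizer=tokenizer,
    max_seq_len=128,
    data_dir="data/my_corpus",            # contains train.tsv and test.tsv
    label_list=["negative", "positive"],
    metric="acc",                         # assumed metric name
    label_column_name="label",
    dev_split=0.1,                        # dev set sliced from the train set
)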
class farm.data_handler.processor.InferenceProcessor(tokenizer, max_seq_len, **kwargs)[source]¶
Bases: farm.data_handler.processor.Processor
Generic processor used at inference time:
- fast
- no labels
- pure encoding of text into a PyTorch dataset
- doesn’t read from file, but only consumes dictionaries (e.g. coming from API requests)
__init__(tokenizer, max_seq_len, **kwargs)[source]¶
- Parameters
tokenizer – Used to split a sentence (str) into tokens.
max_seq_len (int) – Samples are truncated after this many tokens.
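A sketch of the pure-encoding path; the “text” key in the input dict follows FARM’s usual inference schema but is an assumption here, and the tokenizer is reused from the example above.

from farm.data_handler.processor import InferenceProcessor

processor = InferenceProcessor(tokenizer=tokenizer, max_seq_len=128)
dicts = [{"text": "FARM turns raw text into PyTorch datasets."}]
dataset, tensor_names = processor.dataset_from_dicts(dicts)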
class farm.data_handler.processor.NERProcessor(tokenizer, max_seq_len, data_dir, label_list=None, metric=None, train_filename='train.txt', dev_filename='dev.txt', test_filename='test.txt', dev_split=0.0, delimiter='\t', proxies=None, **kwargs)[source]¶
Bases: farm.data_handler.processor.Processor
Used to handle most NER datasets, like CoNLL or GermEval 2014.
__init__(tokenizer, max_seq_len, data_dir, label_list=None, metric=None, train_filename='train.txt', dev_filename='dev.txt', test_filename='test.txt', dev_split=0.0, delimiter='\t', proxies=None, **kwargs)[source]¶
- Parameters
tokenizer – Used to split a sentence (str) into tokens.
max_seq_len (int) – Samples are truncated after this many tokens.
train_filename (str) – The name of the file containing training data.
dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.
test_filename (str) – The name of the file containing test data.
dev_split (float) – The proportion of the train set that will be sliced. Only works if dev_filename is set to None.
data_dir (str) – The directory in which the train, test and perhaps dev files can be found.
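A construction sketch for a CoNLL-style corpus; the label list and the metric name (“seq_f1”) are illustrative assumptions.

from farm.data_handler.processor import NERProcessor

ner_labels = ["[PAD]", "X", "O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
processor = NERProcessor(
    tokenizer=tokenizer,
    max_seq_len=128,
    data_dir="data/conll03-en",   # expects train.txt, dev.txt and test.txt
    label_list=ner_labels,
    metric="seq_f1",              # assumed metric name
)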
class farm.data_handler.processor.BertStyleLMProcessor(tokenizer, max_seq_len, data_dir, train_filename='train.txt', dev_filename='dev.txt', test_filename='test.txt', dev_split=0.0, next_sent_pred=True, max_docs=None, proxies=None, **kwargs)[source]¶
Bases: farm.data_handler.processor.Processor
Prepares data for masked language model training and next sentence prediction in the style of BERT.
__init__(tokenizer, max_seq_len, data_dir, train_filename='train.txt', dev_filename='dev.txt', test_filename='test.txt', dev_split=0.0, next_sent_pred=True, max_docs=None, proxies=None, **kwargs)[source]¶
- Parameters
tokenizer – Used to split a sentence (str) into tokens.
max_seq_len (int) – Samples are truncated after this many tokens.
train_filename (str) – The name of the file containing training data.
dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.
test_filename (str) – The name of the file containing test data.
dev_split (float) – The proportion of the train set that will be sliced. Only works if dev_filename is set to None.
data_dir (str) – The directory in which the train, test and perhaps dev files can be found.
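A sketch for language-model pretraining data; turning off next_sent_pred and capping max_docs are illustrative choices, not defaults.

from farm.data_handler.processor import BertStyleLMProcessor

processor = BertStyleLMProcessor(
    tokenizer=tokenizer,
    max_seq_len=128,
    data_dir="data/lm_corpus",   # expects train.txt, dev.txt and test.txt
    next_sent_pred=False,        # masked LM only, skip the next-sentence task
    max_docs=1000,               # read at most 1000 documents
)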
class farm.data_handler.processor.SquadProcessor(tokenizer, max_seq_len, data_dir, label_list=None, metric='squad', train_filename='train-v2.0.json', dev_filename='dev-v2.0.json', test_filename=None, dev_split=0, doc_stride=128, max_query_length=64, proxies=None, **kwargs)[source]¶
Bases: farm.data_handler.processor.Processor
Used to handle the SQuAD dataset.
__init__(tokenizer, max_seq_len, data_dir, label_list=None, metric='squad', train_filename='train-v2.0.json', dev_filename='dev-v2.0.json', test_filename=None, dev_split=0, doc_stride=128, max_query_length=64, proxies=None, **kwargs)[source]¶
- Parameters
tokenizer – Used to split a sentence (str) into tokens.
max_seq_len (int) – Samples are truncated after this many tokens.
data_dir (str) – The directory in which the train and dev files can be found. SQuAD has a private test file.
label_list (list) – list of labels to predict (strings). For most cases this should be: [“start_token”, “end_token”]
metric (str) – name of metric that shall be used for evaluation, can be “squad” or “squad_top_recall”
train_filename (str) – The name of the file containing training data.
dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.
test_filename (str or None) – None by default, since SQuAD has a private test file.
dev_split (float) – The proportion of the train set that will be sliced. Only works if dev_filename is set to None.
doc_stride (int) – When the document containing the answer is too long, it gets split into parts, strided by doc_stride.
max_query_length (int) – Maximum length of the question (in number of subword tokens)
kwargs (object) – placeholder for passing generic parameters
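A construction sketch using the documented defaults; max_seq_len=384 is a common choice for SQuAD but an assumption here.

from farm.data_handler.processor import SquadProcessor

processor = SquadProcessor(
    tokenizer=tokenizer,
    max_seq_len=384,
    data_dir="data/squad20",     # contains train-v2.0.json and dev-v2.0.json
    label_list=["start_token", "end_token"],
    metric="squad",
    doc_stride=128,              # sliding-window stride for long documents
    max_query_length=64,
)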
dataset_from_dicts(dicts, index=None, rest_api_schema=False, return_baskets=False)[source]¶
Overwrites the method from the base class since question answering processing is quite different. This method allows for documents and questions to be tokenized earlier. Then SampleBaskets are initialized with one document and one question.
class farm.data_handler.processor.RegressionProcessor(tokenizer, max_seq_len, data_dir, train_filename='train.tsv', dev_filename=None, test_filename='test.tsv', dev_split=0.1, delimiter='\t', quote_char="'", skiprows=None, label_column_name='label', label_name='regression_label', scaler_mean=None, scaler_scale=None, proxies=None, **kwargs)[source]¶
Bases: farm.data_handler.processor.Processor
Used to handle a regression dataset in tab-separated text + label format.
__init__(tokenizer, max_seq_len, data_dir, train_filename='train.tsv', dev_filename=None, test_filename='test.tsv', dev_split=0.1, delimiter='\t', quote_char="'", skiprows=None, label_column_name='label', label_name='regression_label', scaler_mean=None, scaler_scale=None, proxies=None, **kwargs)[source]¶
- Parameters
tokenizer – Used to split a sentence (str) into tokens.
max_seq_len (int) – Samples are truncated after this many tokens.
train_filename (str) – The name of the file containing training data.
dev_filename (str or None) – The name of the file containing the dev data. If None and 0.0 < dev_split < 1.0 the dev set will be a slice of the train set.
test_filename (str) – The name of the file containing test data.
dev_split (float) – The proportion of the train set that will be sliced. Only works if dev_filename is set to None.
data_dir (str) – The directory in which the train, test and perhaps dev files can be found.
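A sketch for a tab-separated regression dataset; the column name is an assumption, and when scaler_mean and scaler_scale are left as None they can presumably be fitted from the training labels.

from farm.data_handler.processor import RegressionProcessor

processor = RegressionProcessor(
    tokenizer=tokenizer,
    max_seq_len=128,
    data_dir="data/similarity_scores",   # contains train.tsv and test.tsv
    label_column_name="score",           # assumed column holding float labels
)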
Data Silo¶
class farm.data_handler.data_silo.DataSilo(processor, batch_size, distributed=False, automatic_loading=True)[source]¶
Bases: object
Generates and stores PyTorch DataLoader objects for the train, dev and test datasets. Relies upon functionality in the processor to do the conversion of the data. Will also calculate and display some statistics.
__init__(processor, batch_size, distributed=False, automatic_loading=True)[source]¶
- Parameters
processor (Processor) – A dataset-specific Processor object which will turn input (file or dict) into a PyTorch Dataset.
batch_size (int) – The size of batch that should be returned by the DataLoaders.
distributed (bool) – Set to True if the program is running in a distributed setting.
automatic_loading (bool) – Set to False, if you don’t want to automatically load data at initialization.
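Wiring a processor into a DataSilo is a one-liner; the train, dev and test DataLoaders are then built internally from the processor’s files.

from farm.data_handler.data_silo import DataSilo

# `processor` is any configured Processor from the section above
data_silo = DataSilo(processor=processor, batch_size=32)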
random_split_ConcatDataset(ds, lengths)[source]¶
Roughly splits a ConcatDataset into non-overlapping new datasets of the given lengths. Samples inside the ConcatDataset should already be shuffled.
- Parameters
ds (Dataset) – Dataset to be split
lengths (list) – lengths of splits to be produced
Dataset¶
farm.data_handler.dataset.convert_features_to_dataset(features)[source]¶
Converts a list of feature dictionaries (one for each sample) into a PyTorch Dataset.
- Parameters
features – A list of dictionaries. Each dictionary corresponds to one sample. Its keys are the names of the feature types and its values are the features themselves.
- Returns
A PyTorch Dataset and a list of tensor names.
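A toy sketch with hand-written feature dicts; in practice these dicts are produced by a Processor’s _sample_to_features step, and the token IDs below are meaningless placeholders.

from farm.data_handler.dataset import convert_features_to_dataset

features = [
    {"input_ids": [101, 7592, 102, 0], "padding_mask": [1, 1, 1, 0], "segment_ids": [0, 0, 0, 0]},
    {"input_ids": [101, 2088, 999, 102], "padding_mask": [1, 1, 1, 1], "segment_ids": [0, 0, 0, 0]},
]
dataset, tensor_names = convert_features_to_dataset(features)
# tensor_names lists the feature names, e.g. ["input_ids", "padding_mask", "segment_ids"]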
DataLoader¶
class farm.data_handler.dataloader.NamedDataLoader(dataset, sampler, batch_size, tensor_names)[source]¶
Bases: torch.utils.data.dataloader.DataLoader
A modified version of the PyTorch DataLoader that returns a dictionary where the key is the name of the tensor and the value is the tensor itself.
__init__(dataset, sampler, batch_size, tensor_names)[source]¶
- Parameters
dataset (Dataset) – The dataset that will be wrapped by this NamedDataLoader
sampler (Sampler) – The sampler used by the NamedDataLoader to choose which samples to include in the batch
batch_size (int) – The size of the batch to be returned by the NamedDataLoader
tensor_names (list) – The names of the tensors, in the order that the dataset returns them.
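A sketch that wraps the dataset from the convert_features_to_dataset example above; SequentialSampler is standard PyTorch.

from torch.utils.data import SequentialSampler
from farm.data_handler.dataloader import NamedDataLoader

loader = NamedDataLoader(
    dataset=dataset,
    sampler=SequentialSampler(dataset),
    batch_size=2,
    tensor_names=tensor_names,
)
for batch in loader:
    print(batch["input_ids"].shape)   # each batch is a dict of named tensors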
farm.data_handler.dataloader.covert_dataset_to_dataloader(dataset, sampler, batch_size)[source]¶
Wraps a PyTorch Dataset with a DataLoader.
- Parameters
dataset (Dataset) – Dataset to be wrapped.
sampler (Sampler) – PyTorch sampler used to pick samples in a batch.
batch_size (int) – Number of samples in the batch.
- Returns
A DataLoader that wraps the input Dataset.
Samples¶
class farm.data_handler.samples.SampleBasket(id: str, raw: dict, samples=None)[source]¶
Bases: object
An object that contains one source text and the one or more samples that will be processed. This is needed for tasks like question answering where the source text can generate multiple input-label pairs.
class farm.data_handler.samples.Sample(id, clear_text, tokenized=None, features=None)[source]¶
Bases: object
A single training/test sample. This should contain the input and the label. It is initialized with the human readable clear_text. Over the course of data preprocessing, this object is populated with tokenized and featurized versions of the data.
__init__(id, clear_text, tokenized=None, features=None)[source]¶
- Parameters
id (str) – The unique id of the sample
clear_text (dict) – A dictionary containing various human readable fields (e.g. text, label).
tokenized (dict) – A dictionary containing the tokenized version of the clear text plus helpful metadata: offsets (start position of each token in the original text) and start_of_word (a boolean indicating whether a token is the first one of a word).
features (dict) – A dictionary containing features in a vectorized format needed by the model to process this sample.
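A sketch of a freshly created Sample before any preprocessing; the field names inside clear_text are illustrative.

from farm.data_handler.samples import Sample

sample = Sample(
    id="train-42",
    clear_text={"text": "FARM makes transfer learning simple.", "label": "positive"},
)
# the processor later populates sample.tokenized and sample.features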
class farm.data_handler.samples.Squad_cleartext(qas_id, question_text, doc_tokens, orig_answer_text, start_position, end_position, is_impossible)[source]¶
Bases: object
farm.data_handler.samples.create_sample_one_label_one_text(raw_data, text_index, label_index, basket_id)[source]¶
farm.data_handler.samples.create_samples_squad(dictionary, max_query_len, max_seq_len, doc_stride, n_special_tokens)[source]¶
This method will split question-document pairs from the SampleBasket into question-passage pairs which will each form one sample. The “t” and “c” in variable names stand for token and character respectively.
farm.data_handler.samples.process_answers(answers, doc_offsets, passage_start_c, passage_start_t)[source]¶
Processes the potentially multiple answers per question (cf. the SQuAD dev set) and returns their start and end indices relative to the passage (not the document).
- Parameters
answers – The list of answer annotations for one question.
doc_offsets – The start character index of each token in the document text.
passage_start_c – The start character index of the passage.
passage_start_t – The start token index of the passage.
- Returns
The answers with start and end indices relative to the passage.
farm.data_handler.samples.chunk_into_passages(doc_offsets, doc_stride, passage_len_t, doc_text)[source]¶
Returns a list of dictionaries which each describe the start, end and id of a passage that is formed when chunking a document using a sliding window approach.
Input Features¶
Contains functions that turn readable clear text input into dictionaries of features
farm.data_handler.input_features.sample_to_features_text(sample, tasks, max_seq_len, tokenizer)[source]¶
Generates a dictionary of features for a given input sample that is to be consumed by a text classification model.
- Parameters
sample (Sample) – Sample object that contains human readable text and label fields from a single text classification data sample
tasks (dict) – A dictionary where the keys are the names of the tasks and the values are the details of the task (e.g. label_list, metric, tensor name)
max_seq_len (int) – Sequences are truncated after this many tokens
tokenizer – A tokenizer object that can turn string sentences into a list of tokens
- Returns
A list with one dictionary containing the keys “input_ids”, “padding_mask” and “segment_ids” (also “label_ids” if not in inference mode). The values are lists containing those features.
- Return type
list
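A call sketch reusing names from the examples above; that the processor exposes its registered tasks as processor.tasks is an assumption.

from farm.data_handler.input_features import sample_to_features_text

feature_dicts = sample_to_features_text(
    sample=sample,              # a tokenized Sample
    tasks=processor.tasks,      # assumed attribute holding the registered tasks
    max_seq_len=128,
    tokenizer=tokenizer,
)
# returns a list with one dict:
# [{"input_ids": [...], "padding_mask": [...], "segment_ids": [...], "label_ids": [...]}]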
farm.data_handler.input_features.samples_to_features_ner(sample, tasks, max_seq_len, tokenizer, non_initial_token='X', **kwargs)[source]¶
Generates a dictionary of features for a given input sample that is to be consumed by an NER model.
- Parameters
sample (Sample) – Sample object that contains human readable text and label fields from a single NER data sample
tasks (dict) – A dictionary where the keys are the names of the tasks and the values are the details of the task (e.g. label_list, metric, tensor name)
max_seq_len (int) – Sequences are truncated after this many tokens
tokenizer – A tokenizer object that can turn string sentences into a list of tokens
non_initial_token – Token that is inserted into the label sequence in positions where there is a non-word-initial token. This is done since by default NER performs prediction only on word-initial tokens.
- Returns
A list with one dictionary containing the keys “input_ids”, “padding_mask”, “segment_ids”, “initial_mask” (also “label_ids” if not in inference mode). The values are lists containing those features.
- Return type
list
farm.data_handler.input_features.samples_to_features_bert_lm(sample, max_seq_len, tokenizer, next_sent_pred=True)[source]¶
Converts a raw sample (pair of sentences as tokenized strings) into a proper training sample with IDs, LM labels, padding_mask, CLS and SEP tokens, etc.
- Parameters
sample (Sample) – Sample, containing sentence input as strings and is_next label
max_seq_len (int) – Maximum length of sequence.
tokenizer – Tokenizer
- Returns
InputFeatures, containing all inputs and labels of one sample as IDs (as used for model training)
farm.data_handler.input_features.sample_to_features_squad(sample, tokenizer, max_seq_len, max_answers=6)[source]¶
Prepares data for processing by the model. Supports cases where there are multiple answers for one question/document pair. max_answers defaults to 6 since that is the largest number of answers in the SQuAD 2.0 dev set.
farm.data_handler.input_features.generate_labels(answers, passage_len_t, question_len_t, tokenizer, max_answers)[source]¶
Creates a QA label for each answer in answers. The labels are the indices of the start and end tokens relative to the passage. They are contained in an array of size (max_answers, 2). -1 is used to fill the array since the number of answers is often less than max_answers. The index values take into consideration the question tokens, and also special tokens such as [CLS]. When the answer is not fully contained in the passage, or the question is impossible to answer, the start_idx and end_idx are 0, i.e. start and end are on the very first token (in most models, this is the [CLS] token).
farm.data_handler.input_features.combine_vecs(question_vec, passage_vec, tokenizer, spec_tok_val=-1)[source]¶
Combines a question_vec and passage_vec in a style that is appropriate to the model. Will add slots in the returned vector for special tokens like [CLS], where the value is determined by spec_tok_val.