dataset_iterators¶
Concrete DatasetIterator classes.
-
class
deeppavlov.dataset_iterators.basic_classification_iterator.BasicClassificationDatasetIterator(data: dict, fields_to_merge: Optional[List[str]] = None, merged_field: Optional[str] = None, field_to_split: Optional[str] = None, split_fields: Optional[List[str]] = None, split_proportions: Optional[List[float]] = None, seed: Optional[int] = None, shuffle: bool = True, split_seed: Optional[int] = None, stratify: Optional[bool] = None, *args, **kwargs)[source]¶ Takes the data dictionary from a DatasetReader instance, merges fields if necessary, and splits a field if necessary.
- Parameters
data – dictionary of data with fields “train”, “valid” and “test” (or some of them)
fields_to_merge – list of fields (out of "train", "valid", "test") to merge
merged_field – name of the field (out of "train", "valid", "test") in which to save the merged fields
field_to_split – name of the field (out of "train", "valid", "test") to split
split_fields – list of fields (out of "train", "valid", "test") in which to save the parts of the split field
split_proportions – list of corresponding proportions for splitting
seed – random seed for iterating
shuffle – whether to shuffle examples in batches
split_seed – random seed for splitting the dataset; if split_seed is None, the division is based on seed
stratify – whether to use a stratified split
*args – additional positional arguments
**kwargs – additional keyword arguments
-
data¶ dictionary of data with fields “train”, “valid” and “test” (or some of them)
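The merge and split parameters above can be illustrated with a minimal plain-Python sketch. It does not use DeepPavlov and is not the library's implementation; the helper name `merge_and_split`, the rounding rule, and the choice to give the remainder to the last field are assumptions made for illustration only.

```python
import random
from typing import Dict, List, Optional, Tuple

Sample = Tuple[str, str]


def merge_and_split(data: Dict[str, List[Sample]],
                    fields_to_merge: Optional[List[str]] = None,
                    merged_field: Optional[str] = None,
                    field_to_split: Optional[str] = None,
                    split_fields: Optional[List[str]] = None,
                    split_proportions: Optional[List[float]] = None,
                    split_seed: Optional[int] = None) -> Dict[str, List[Sample]]:
    """Hypothetical helper: merge some fields, then split one field by proportions."""
    result = {name: list(samples) for name, samples in data.items()}

    if fields_to_merge and merged_field:
        # Concatenate the listed fields and store them under merged_field.
        result[merged_field] = [s for name in fields_to_merge
                                for s in result.get(name, [])]

    if field_to_split and split_fields and split_proportions:
        samples = list(result[field_to_split])
        random.Random(split_seed).shuffle(samples)
        total, start = len(samples), 0
        for i, (name, share) in enumerate(zip(split_fields, split_proportions)):
            # Assumption: the last target field takes the remainder.
            last = i == len(split_fields) - 1
            end = total if last else start + round(total * share)
            result[name] = samples[start:end]
            start = end
    return result


data = {"train": [(f"text {i}", "pos" if i % 2 else "neg") for i in range(10)]}
out = merge_and_split(data, field_to_split="train",
                      split_fields=["train", "valid"],
                      split_proportions=[0.8, 0.2], split_seed=42)
print(len(out["train"]), len(out["valid"]))  # 8 2
```

Here a single "train" field is divided into "train" and "valid" in an 80/20 ratio; passing the same `split_seed` reproduces the same division.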
-
class
deeppavlov.dataset_iterators.dialog_iterator.DialogDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]¶ Iterates over dialog data, generates batches where one sample is one dialog.
A subclass of DataLearningIterator.
-
train¶ list of training dialogs (tuples (context, response))
-
valid¶ list of validation dialogs (tuples (context, response))
-
test¶ list of dialogs used for testing (tuples (context, response))
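As a rough illustration of "one sample is one dialog", the sketch below groups a flat list of (context, response) turns into dialogs. It is plain Python and not the library's code; the `episode_done` flag used to mark the first turn of a new dialog and the helper name `group_into_dialogs` are assumptions for this example.

```python
from typing import Any, Dict, List, Tuple

Turn = Tuple[Dict[str, Any], Dict[str, Any]]
Dialog = Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]


def group_into_dialogs(turns: List[Turn]) -> List[Dialog]:
    """Hypothetical helper: collect consecutive turns into per-dialog samples."""
    dialogs: List[Dialog] = []
    contexts: List[Dict[str, Any]] = []
    responses: List[Dict[str, Any]] = []
    for context, response in turns:
        # Assumption: "episode_done" marks the first turn of a new dialog.
        if context.get("episode_done") and contexts:
            dialogs.append((contexts, responses))
            contexts, responses = [], []
        contexts.append(context)
        responses.append(response)
    if contexts:
        dialogs.append((contexts, responses))
    return dialogs


turns = [
    ({"text": "hi", "episode_done": True}, {"text": "hello"}),
    ({"text": "cheap food"}, {"text": "which area?"}),
    ({"text": "hello", "episode_done": True}, {"text": "how can I help?"}),
]
dialogs = group_into_dialogs(turns)
print(len(dialogs))  # 2
```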
-
-
class
deeppavlov.dataset_iterators.dialog_iterator.DialogDatasetIndexingIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]¶ Iterates over dialog data, generates batches where one sample is one dialog. Assigns unique index value to each turn item of each dialog.
A subclass of DataLearningIterator.
-
train¶ list of training dialogs (tuples (context, response))
-
valid¶ list of validation dialogs (tuples (context, response))
-
test¶ list of dialogs used for testing (tuples (context, response))
-
-
class
deeppavlov.dataset_iterators.dialog_iterator.DialogDBResultDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]¶ Iterates over dialog data, outputs list of all 'db_result' fields (if present).
The class helps to build a list of all 'db_result' values present in a dataset.
Inherits key methods and attributes from DataLearningIterator.
-
train¶ list of tuples (db_result dictionary, '') from “train” data
-
valid¶ list of tuples (db_result dictionary, '') from “valid” data
-
test¶ list of tuples (db_result dictionary, '') from “test” data
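The shape of these attributes can be sketched in plain Python as follows. This is an illustration, not the library's code: the helper name `collect_db_results` is hypothetical, and scanning both the context and the response dictionary for a 'db_result' key is an assumption.

```python
from typing import Any, Dict, List, Tuple

Turn = Tuple[Dict[str, Any], Dict[str, Any]]


def collect_db_results(turns: List[Turn]) -> List[Tuple[Dict[str, Any], str]]:
    """Hypothetical helper: gather every 'db_result' value as (db_result, '')."""
    results = []
    for context, response in turns:
        # Assumption: 'db_result' may appear on either side of a turn.
        for part in (context, response):
            if "db_result" in part:
                results.append((part["db_result"], ""))
    return results


turns = [
    ({"text": "cheap food"}, {"text": "api_call", "db_result": {"name": "cafe"}}),
    ({"text": "thanks"}, {"text": "bye"}),
]
db_results = collect_db_results(turns)
print(db_results)  # [({'name': 'cafe'}, '')]
```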
-
-
class
deeppavlov.dataset_iterators.dstc2_intents_iterator.Dstc2IntentsDatasetIterator(data: dict, fields_to_merge: Optional[List[str]] = None, merged_field: Optional[str] = None, field_to_split: Optional[str] = None, split_fields: Optional[List[str]] = None, split_proportions: Optional[List[float]] = None, seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]¶ Takes the data dictionary from a DSTC2DatasetReader instance, constructs intents from acts and slots, merges fields if necessary, and splits a field if necessary.
- Parameters
data – dictionary of data with fields “train”, “valid” and “test” (or some of them)
fields_to_merge – list of fields (out of "train", "valid", "test") to merge
merged_field – name of the field (out of "train", "valid", "test") in which to save the merged fields
field_to_split – name of the field (out of "train", "valid", "test") to split
split_fields – list of fields (out of "train", "valid", "test") in which to save the parts of the split field
split_proportions – list of corresponding proportions for splitting
seed – random seed
shuffle – whether to shuffle examples in batches
*args – additional positional arguments
**kwargs – additional keyword arguments
-
data¶ dictionary of data with fields “train”, “valid” and “test” (or some of them)
-
class
deeppavlov.dataset_iterators.dstc2_ner_iterator.Dstc2NerDatasetIterator(data: Dict[str, List[Tuple]], slot_values_path: str, seed: Optional[int] = None, shuffle: bool = False)[source]¶ Iterates over data for DSTC2 NER task. Dataset takes a dict with fields ‘train’, ‘test’, ‘valid’. A list of samples (pairs x, y) is stored in each field.
- Parameters
data – list of (x, y) pairs, samples from the dataset: x as well as y can be a tuple of different input features.
dataset_path – path to dataset
seed – value for random seed
shuffle – whether to shuffle the data
-
class
deeppavlov.dataset_iterators.file_paths_iterator.FilePathsIterator(data: Dict[str, List[Union[str, pathlib.Path]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]¶ Dataset iterator for datasets like 1 Billion Word Benchmark. It gets lists of file paths from the data dictionary and returns lines from each file.
- Parameters
data – dict with keys 'train', 'valid' and 'test' and values
seed – random seed for data shuffling
shuffle – whether to shuffle data during batching
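The idea of reading lines from a list of file paths can be sketched with the standard library alone. This is not the library's implementation: the helper name `iter_lines`, shuffling at the level of file paths, and skipping empty lines are assumptions for illustration.

```python
import random
import tempfile
from pathlib import Path
from typing import Iterator, List, Optional, Union


def iter_lines(paths: List[Union[str, Path]], shuffle: bool = True,
               seed: Optional[int] = None) -> Iterator[str]:
    """Hypothetical helper: yield non-empty lines from each file in turn."""
    paths = list(paths)
    if shuffle:
        # Assumption: shuffling reorders the files, not the lines inside them.
        random.Random(seed).shuffle(paths)
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield line


with tempfile.TemporaryDirectory() as tmp:
    files = []
    for i in range(3):
        p = Path(tmp) / f"shard_{i}.txt"
        p.write_text(f"line a{i}\nline b{i}\n", encoding="utf-8")
        files.append(p)
    data = {"train": files, "valid": [], "test": []}
    lines = list(iter_lines(data["train"], shuffle=True, seed=0))

print(len(lines))  # 6
```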
-
class
deeppavlov.dataset_iterators.kvret_dialog_iterator.KvretDialogDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]¶ Inputs data from DSTC2DatasetReader, constructs dialog history for each turn, generates batches (one sample is a turn).
Inherits key methods and attributes from DataLearningIterator.
-
train¶ list of “train” (context, response) tuples
-
valid¶ list of “valid” (context, response) tuples
-
test¶ list of “test” (context, response) tuples
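Constructing a dialog history for each turn can be pictured with the following plain-Python sketch. It is not the library's code; the helper name `add_history`, the "history" key, and the flat list of prior utterances are assumptions for illustration.

```python
from typing import Any, Dict, List, Tuple


def add_history(dialog: List[Tuple[str, str]]) -> List[Tuple[Dict[str, Any], str]]:
    """Hypothetical helper: attach all previous utterances to each turn."""
    history: List[str] = []
    samples = []
    for user_utterance, system_response in dialog:
        # Copy the history so later turns do not mutate earlier samples.
        context = {"text": user_utterance, "history": list(history)}
        samples.append((context, system_response))
        history.extend([user_utterance, system_response])
    return samples


dialog = [("where is the nearest gas station?", "there is one 2 miles away"),
          ("what is the address?", "200 Alester Ave")]
samples = add_history(dialog)
print(len(samples[1][0]["history"]))  # 2
```

Each resulting sample is a single turn, which matches the "one sample is a turn" batching described above.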
-
-
deeppavlov.dataset_iterators.morphotagger_iterator.preprocess_data(data: List[Tuple[List[str], List[str]]], to_lower: bool = True, append_case: str = 'first') → List[Tuple[List[Tuple[str]], List[str]]][source]¶ Processes all words in data using process_word().
- Parameters
data – a list of pairs (words, tags), each pair corresponds to a single sentence
to_lower – whether to lowercase
append_case – whether to add case mark
- Returns
a list of preprocessed sentences
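The return type suggests that each word becomes a tuple of strings. A hedged sketch of such a preprocessing step is shown below; it does not call the library's process_word(), and the marker strings, the meaning of the 'first' and 'last' values, and the helper names are all assumptions.

```python
from typing import List, Optional, Tuple


def sketch_process_word(word: str, to_lower: bool = True,
                        append_case: Optional[str] = "first") -> Tuple[str, ...]:
    """Hypothetical stand-in for process_word(): characters plus a case mark."""
    mark = None
    if word.isupper() and len(word) > 1:
        mark = "<ALL_UPPER>"      # assumed marker name
    elif word[:1].isupper():
        mark = "<FIRST_UPPER>"    # assumed marker name
    chars = tuple(word.lower() if to_lower else word)
    if mark is None or append_case not in ("first", "last"):
        return chars
    return (mark,) + chars if append_case == "first" else chars + (mark,)


def sketch_preprocess(data: List[Tuple[List[str], List[str]]], to_lower: bool = True,
                      append_case: Optional[str] = "first"):
    """Apply the word-level sketch to every (words, tags) sentence pair."""
    return [([sketch_process_word(w, to_lower, append_case) for w in words], tags)
            for words, tags in data]


sentences = [(["Moscow", "is", "OK"], ["PROPN", "AUX", "ADJ"])]
processed = sketch_preprocess(sentences)
print(processed[0][0][0])  # ('<FIRST_UPPER>', 'm', 'o', 's', 'c', 'o', 'w')
```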
-
class
deeppavlov.dataset_iterators.morphotagger_iterator.MorphoTaggerDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, min_train_fraction: float = 0.0, validation_split: float = 0.2)[source]¶ Iterates over data for Morphological Tagging. A subclass of DataLearningIterator.
- Parameters
seed – random seed for data shuffling
shuffle – whether to shuffle data during batching
validation_split – the fraction of validation data (is used only if there is no valid subset in data)
min_train_fraction – minimal fraction of train data in the train+dev dataset. For fair comparison with UDPipe it is set to 0.9 for UD experiments. It is actually used only for Turkish data.
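The role of validation_split can be sketched as follows: when the data has no "valid" subset, a fraction of "train" is set aside for validation. This is a plain-Python illustration, not the library's code; the helper name `ensure_valid` and shuffling before the split are assumptions, and min_train_fraction is deliberately left out.

```python
import random
from typing import Any, Dict, List, Optional, Tuple

Sample = Tuple[Any, Any]


def ensure_valid(data: Dict[str, List[Sample]], validation_split: float = 0.2,
                 seed: Optional[int] = None) -> Dict[str, List[Sample]]:
    """Hypothetical helper: carve a validation set out of train if none exists."""
    if data.get("valid"):
        return data  # an existing valid subset is used as is
    train = list(data["train"])
    random.Random(seed).shuffle(train)  # assumption: shuffle before splitting
    n_valid = int(len(train) * validation_split)
    return {**data, "train": train[n_valid:], "valid": train[:n_valid]}


data = {"train": [([f"w{i}"], ["NOUN"]) for i in range(10)], "test": []}
split = ensure_valid(data, validation_split=0.2, seed=1)
print(len(split["train"]), len(split["valid"]))  # 8 2
```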
-
class
deeppavlov.dataset_iterators.siamese_iterator.SiameseIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]¶ The class contains methods for iterating over a dataset for ranking in training, validation and test mode.
-
class
deeppavlov.dataset_iterators.sqlite_iterator.SQLiteDataIterator(load_path: Union[str, pathlib.Path], batch_size: Optional[int] = None, shuffle: Optional[bool] = None, seed: Optional[int] = None, **kwargs)[source]¶ Iterates over a SQLite database. Generates batches from SQLite data. Gets document ids and documents.
- Parameters
load_path – a path to local DB file
batch_size – a number of samples in a single batch
shuffle – whether to shuffle data during batching
seed – random seed for data shuffling
-
connect¶ a DB connection
-
db_name¶ a DB name
-
doc_ids¶ DB document ids
-
doc2index¶ a dictionary of document indices and their titles
-
batch_size¶ a number of samples in a single batch
-
shuffle¶ whether to shuffle data during batching
-
random¶ an instance of Random class.
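Batching over a SQLite table can be sketched with the standard sqlite3 module. This is an illustration under assumptions, not the library's implementation: the table name `documents`, its `id` and `text` columns, and the helper name `gen_batches` are hypothetical, and an in-memory database stands in for the file at load_path.

```python
import random
import sqlite3
from typing import Iterator, List, Optional, Tuple

# Assumption: a table "documents" with columns "id" and "text".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, text TEXT)")
conn.executemany("INSERT INTO documents VALUES (?, ?)",
                 [(f"doc{i}", f"content {i}") for i in range(5)])


def gen_batches(connection: sqlite3.Connection, batch_size: Optional[int] = None,
                shuffle: bool = False,
                seed: Optional[int] = None) -> Iterator[Tuple[List[str], List[str]]]:
    """Hypothetical helper: yield (documents, document ids) batches."""
    doc_ids = [row[0] for row in
               connection.execute("SELECT id FROM documents ORDER BY id")]
    if shuffle:
        random.Random(seed).shuffle(doc_ids)
    # With no batch_size, everything goes into a single batch.
    size = batch_size or len(doc_ids) or 1
    for start in range(0, len(doc_ids), size):
        ids = doc_ids[start:start + size]
        texts = [connection.execute("SELECT text FROM documents WHERE id = ?",
                                    (doc_id,)).fetchone()[0] for doc_id in ids]
        yield texts, ids


batches = list(gen_batches(conn, batch_size=2))
print(len(batches))  # 3
```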
-
class
deeppavlov.dataset_iterators.squad_iterator.SquadIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]¶ SquadIterator iterates over examples in SQuAD-like datasets. SquadIterator is used to train SquadModel.
It extracts context, question, answer_text and answer_start position from the dataset. An example from a dataset is a tuple of (context, question) and (answer_text, answer_start).
-
train¶ train examples
-
valid¶ validation examples
-
test¶ test examples
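The example structure described above can be sketched for a SQuAD-style JSON object. This is plain Python, not the library's code; the helper name `extract_examples` is hypothetical, and returning answer_text and answer_start as lists (one entry per reference answer) is an assumption.

```python
from typing import Any, Dict, List, Tuple

Example = Tuple[Tuple[str, str], Tuple[List[str], List[int]]]


def extract_examples(dataset: Dict[str, Any]) -> List[Example]:
    """Hypothetical helper: flatten SQuAD-style JSON into example tuples."""
    examples = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answer_text = [a["text"] for a in qa["answers"]]
                answer_start = [a["answer_start"] for a in qa["answers"]]
                examples.append(((context, qa["question"]),
                                 (answer_text, answer_start)))
    return examples


context = "DeepPavlov is an open-source library."
answer = "an open-source library"
dataset = {"data": [{"paragraphs": [{
    "context": context,
    "qas": [{"question": "What is DeepPavlov?",
             "answers": [{"text": answer,
                          "answer_start": context.index(answer)}]}],
}]}]}

examples = extract_examples(dataset)
(ctx, question), (texts, starts) = examples[0]
print(ctx[starts[0]:starts[0] + len(texts[0])])  # an open-source library
```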
-
-
class
deeppavlov.dataset_iterators.typos_iterator.TyposDatasetIterator(data: Dict[str, List[Tuple[Any, Any]]], seed: Optional[int] = None, shuffle: bool = True, *args, **kwargs)[source]¶ Implementation of DataLearningIterator used for training ErrorModel.