dataset_readers¶
Concrete DatasetReader classes.
-
class
deeppavlov.dataset_readers.basic_classification_reader.BasicClassificationDatasetReader[source]¶ Class that reads a dataset in .csv format
-
read(data_path: str, url: Optional[str] = None, format: str = 'csv', class_sep: Optional[str] = None, *args, **kwargs) → dict[source]¶ Read dataset from the data_path directory. The files read are all data_types + extension (e.g. for data_types=[“train”, “valid”], the files “train.csv” and “valid.csv” from data_path will be read).
- Parameters
data_path – directory with files
url – URL to download data files from if data_path does not exist or is empty
format – extension of files. Set of values: "csv", "json"
class_sep – string separator of labels in column with labels
sep (str) – delimiter for "csv" files. Default: None -> only one class per sample
header (int) – row number to use as the column names
names (array) – list of column names to use
orient (str) – indication of expected JSON string format
lines (boolean) – read the file as a json object per line. Default: False
- Returns
dictionary with types from data_types. Each field of dictionary is a list of tuples (x_i, y_i)
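As a rough illustration of the behaviour described above, the sketch below reads `train.csv` and `valid.csv` from a directory using only the standard library and returns a dictionary of `(x_i, y_i)` tuples. It is not DeepPavlov's implementation: the function name `read_csv_splits` and the column names `text` and `labels` are assumptions made for this example.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List, Optional, Tuple


def read_csv_splits(data_path: str,
                    data_types: List[str],
                    class_sep: Optional[str] = None,
                    x: str = "text",
                    y: str = "labels") -> Dict[str, List[Tuple[str, List[str]]]]:
    """Read <data_type>.csv files from data_path into lists of (x_i, y_i)."""
    data = {}
    for data_type in data_types:
        file_path = Path(data_path) / f"{data_type}.csv"
        samples = []
        with open(file_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                if class_sep is None:
                    # Without a separator every sample carries exactly one class.
                    labels = [row[y]]
                else:
                    labels = row[y].split(class_sep)
                samples.append((row[x], labels))
        data[data_type] = samples
    return data


# Small demo on temporary files (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.csv").write_text(
        "text,labels\nhello,greet\nhi there,greet|smalltalk\n", encoding="utf-8")
    (Path(tmp) / "valid.csv").write_text(
        "text,labels\nbye,farewell\n", encoding="utf-8")
    demo = read_csv_splits(tmp, ["train", "valid"], class_sep="|")
```

Each key of `demo` corresponds to one entry of `data_types`, and each label field is split on `class_sep` when one is given.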
-
-
class
deeppavlov.dataset_readers.conll2003_reader.Conll2003DatasetReader[source]¶ Class to read training datasets in CoNLL-2003 format
-
class
deeppavlov.dataset_readers.dstc2_reader.DSTC2DatasetReader[source]¶ Contains labelled dialogs from Dialog State Tracking Challenge 2 (http://camdial.org/~mh521/dstc/).
The following modifications were made to the original dataset:
added api calls to restaurant database
example:
{"text": "api_call area="south" food="dontcare" pricerange="cheap"", "dialog_acts": ["api_call"]}.
new actions
bot dialog actions were concatenated into one action (example: {"dialog_acts": ["ask", "request"]} -> {"dialog_acts": ["ask_request"]})
if a slot key was associated with the dialog action, the new act was a concatenation of an act and a slot key (example: {"dialog_acts": ["ask"], "slot_vals": ["area"]} -> {"dialog_acts": ["ask_area"]})
new train/dev/test split
the original DSTC2 consisted of three different MDP policies; the original train and dev datasets (consisting of two policies) were merged and randomly split into train/dev/test
minor fixes
fixed several dialogs where actions were wrongly annotated
uppercased first letter of bot responses
unified punctuation for bot responses
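The "new actions" rule above can be sketched in a few lines. This is only an illustration that reproduces the two examples given in the list; the function name `merge_dialog_acts` is hypothetical, and appending slot keys after the acts is an assumption, since the text does not say how turns with several acts and several slot keys were handled.

```python
from typing import Dict, List


def merge_dialog_acts(turn: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Collapse acts (and slot keys, if any) into one underscore-joined act."""
    # Assumption: slot keys are appended after the dialog acts.
    parts = list(turn.get("dialog_acts", [])) + list(turn.get("slot_vals", []))
    return {"dialog_acts": ["_".join(parts)]}


merged_acts = merge_dialog_acts({"dialog_acts": ["ask", "request"]})
merged_slot = merge_dialog_acts({"dialog_acts": ["ask"], "slot_vals": ["area"]})
```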
-
classmethod
read(data_path: str, dialogs: bool = False) → Dict[str, List][source]¶ Downloads 'dstc2_v2.tar.gz' archive from ipavlov internal server, decompresses and saves files to data_path.
- Parameters
data_path – path to save DSTC2 dataset
dialogs – flag which indicates whether to output list of turns or list of dialogs
- Returns
dictionary that contains 'train' field with dialogs from 'dstc2-trn.jsonlist', 'valid' field with dialogs from 'dstc2-val.jsonlist' and 'test' field with dialogs from 'dstc2-tst.jsonlist'. Each field is a list of tuples (x_i, y_i).
-
class
deeppavlov.dataset_readers.dstc2_reader.SimpleDSTC2DatasetReader[source]¶ Contains labelled dialogs from Dialog State Tracking Challenge 2 (http://camdial.org/~mh521/dstc/).
The following modifications were made to the original dataset:
added api calls to restaurant database
example:
{"text": "api_call area="south" food="dontcare" pricerange="cheap"", "dialog_acts": ["api_call"]}.
new actions
bot dialog actions were concatenated into one action (example: {"dialog_acts": ["ask", "request"]} -> {"dialog_acts": ["ask_request"]})
if a slot key was associated with the dialog action, the new act was a concatenation of an act and a slot key (example: {"dialog_acts": ["ask"], "slot_vals": ["area"]} -> {"dialog_acts": ["ask_area"]})
new train/dev/test split
the original DSTC2 consisted of three different MDP policies; the original train and dev datasets (consisting of two policies) were merged and randomly split into train/dev/test
minor fixes
fixed several dialogs where actions were wrongly annotated
uppercased first letter of bot responses
unified punctuation for bot responses
-
classmethod
read(data_path: str, dialogs: bool = False, encoding='utf-8') → Dict[str, List][source]¶ Downloads 'simple_dstc2.tar.gz' archive from internet, decompresses and saves files to data_path.
- Parameters
data_path – path to save DSTC2 dataset
dialogs – flag which indicates whether to output list of turns or list of dialogs
- Returns
dictionary that contains 'train' field with dialogs from 'simple-dstc2-trn.json', 'valid' field with dialogs from 'simple-dstc2-val.json' and 'test' field with dialogs from 'simple-dstc2-tst.json'. Each field is a list of tuples (user turn, system turn).
-
class
deeppavlov.dataset_readers.md_yaml_dialogs_reader.DomainKnowledge(domain_knowledge_di: Dict)[source]¶ the DTO-like class to store the domain knowledge from the domain yaml config.
-
classmethod
from_yaml(domain_yml_fpath: Union[str, pathlib.Path] = 'domain.yml')[source]¶ Parses the domain.yml domain config file into a DomainKnowledge object.
- Parameters
domain_yml_fpath – path to the domain config file, defaults to domain.yml
- Returns
the loaded DomainKnowledge object
-
-
class
deeppavlov.dataset_readers.md_yaml_dialogs_reader.MD_YAML_DialogsDatasetReader[source]¶ Reads dialogs from a dataset composed of stories.md, nlu.md and domain.yml.
stories.md provides the dialogue dataset for the model to train on. The dialogues are represented as labels of user messages and labels of system responses (not texts, just action labels). This separates the NLU and NLG tasks from the dialogue storytelling itself: one should be able to describe just the scripts of dialogues to the system.
nlu.md, by contrast, provides the NLU training set irrespective of the dialogue scripts.
domain.yml describes the task-specific domain and serves two purposes: it provides the NLG templates and some specific configuration of the NLU.
-
classmethod
augment_form(form_name: str, domain_knowledge: deeppavlov.dataset_readers.md_yaml_dialogs_reader.DomainKnowledge, intent2slots2text: Dict) → List[str][source]¶ Replaces the form mention in stories.md with the actual turns relevant to the form.
- Parameters
form_name – the name of the form to generate turns for
domain_knowledge – the domain knowledge (see domain.yml in RASA) relevant to the processed config
intent2slots2text – the mapping of intents and particular slots onto text
- Returns
the story turns relevant to the passed form
-
classmethod
augment_slot(known_responses: List[str], known_intents: List[str], slot_name: str, form_name: str) → List[str][source]¶ Given the slot name, generates a sequence of a system turn asking for the slot and a user turn providing this slot
- Parameters
known_responses – responses known to the system from domain.yml
known_intents – intents known to the system from domain.yml
slot_name – the name of the slot to augment for
form_name – the name of the form for which the turn is augmented
- Returns
the list of stories.md-like turns
-
classmethod
augment_user_turn(intent2slots2text, line: str, slot_name2text2value) → List[Dict[str, Any]][source]¶ Given the turn information, generates all the possible stories representing it.
- Parameters
intent2slots2text – the mapping of intents and slots to natural language utterances known to the system
line – the line representing the user utterance in stories.md format
slot_name2text2value – the mapping of slot names to values known to the system
- Returns
the batch of all the possible dstc2 representations of the passed intent
-
classmethod
get_augmented_ask_intent_utter(known_intents: List[str], slot_name: str) → Optional[str][source]¶ If the system knows the inform_{slot} intent, returns this intent name; otherwise returns None.
- Parameters
known_intents – intents known to the system
slot_name – the slot to look up the inform intent for
- Returns
the slot informing intent or None
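The lookup described above amounts to a membership test on the known intents. A minimal sketch, assuming the intent name is formed exactly as `inform_{slot}`; the function name `find_inform_intent` is hypothetical and this is not the library's code.

```python
from typing import List, Optional


def find_inform_intent(known_intents: List[str], slot_name: str) -> Optional[str]:
    """Return the inform_{slot} intent name if it is known, otherwise None."""
    candidate = f"inform_{slot_name}"
    return candidate if candidate in known_intents else None


found_intent = find_inform_intent(["greet", "inform_area"], "area")
missing_intent = find_inform_intent(["greet"], "area")
```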
-
classmethod
get_augmented_ask_slot_utter(form_name: str, known_responses: List[str], slot_name: str)[source]¶ If the system knows the ask_{slot} action, returns this action name; otherwise returns None.
- Parameters
form_name – the name of the currently processed form
known_responses – actions known to the system
slot_name – the slot to look up the asking action for
- Returns
the slot asking action or None
-
classmethod
get_last_users_turn(curr_story_utters: List[Dict]) → Dict[source]¶ Given the dstc2 story, returns the last user utterance from it.
- Parameters
curr_story_utters – the dstc2-formatted story
- Returns
the last user utterance from the passed story
-
classmethod
parse_system_turn(domain_knowledge: deeppavlov.dataset_readers.md_yaml_dialogs_reader.DomainKnowledge, line: str) → Dict[source]¶ Given the RASA stories.md line, returns the dstc2-formatted json (dict) for this line.
- Parameters
domain_knowledge – the domain knowledge relevant to the processed stories config (from which line is taken)
line – the line from stories.md representing the story system step
- Returns
the dstc2-formatted passed turn
-
classmethod
read(data_path: str, dialogs: bool = False, ignore_slots: bool = False) → Dict[str, List][source]¶
- Parameters
data_path – path to read dataset from
dialogs – flag which indicates whether to output list of turns or list of dialogs
ignore_slots – whether to ignore slots information provided in stories.md or not
- Returns
dictionary that contains 'train' field with dialogs from 'stories-trn.md', 'valid' field with dialogs from 'stories-val.md' and 'test' field with dialogs from 'stories-tst.md'. Each field is a list of tuples (x_i, y_i).
-
-
class
deeppavlov.dataset_readers.faq_reader.FaqDatasetReader[source]¶ Reader for FAQ dataset
-
read(data_path: Optional[str] = None, data_url: Optional[str] = None, x_col_name: str = 'x', y_col_name: str = 'y') → Dict[source]¶ Read FAQ dataset from specified csv file or remote url
- Parameters
data_path – path to csv file of FAQ
data_url – url to csv file of FAQ
x_col_name – name of Question column in csv file
y_col_name – name of Answer column in csv file
- Returns
A dictionary containing training, validation and test parts of the dataset obtainable via train, valid and test keys.
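To make the role of `x_col_name` and `y_col_name` concrete, here is a standard-library sketch that reads question and answer columns from a local csv file. It is not the reader's actual code: the name `read_faq_csv` is hypothetical, and placing every pair under `train` while leaving `valid` and `test` empty is an assumption of this example.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List, Tuple


def read_faq_csv(data_path: str,
                 x_col_name: str = "x",
                 y_col_name: str = "y") -> Dict[str, List[Tuple[str, str]]]:
    """Read (question, answer) pairs from the named csv columns."""
    with open(data_path, newline="", encoding="utf-8") as f:
        pairs = [(row[x_col_name].strip(), row[y_col_name].strip())
                 for row in csv.DictReader(f)]
    # Assumption: all pairs are training data; the other parts stay empty.
    return {"train": pairs, "valid": [], "test": []}


# Demo on a temporary file with hypothetical column names.
with tempfile.TemporaryDirectory() as tmp:
    faq_file = Path(tmp) / "faq.csv"
    faq_file.write_text(
        "Question,Answer\nWhat are your hours?,We are open 9 to 5.\n",
        encoding="utf-8")
    faq = read_faq_csv(str(faq_file), x_col_name="Question", y_col_name="Answer")
```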
-
-
class
deeppavlov.dataset_readers.file_paths_reader.FilePathsReader[source]¶ Find all file paths by a data path glob
-
read(data_path: Union[str, pathlib.Path], train: Optional[str] = None, valid: Optional[str] = None, test: Optional[str] = None, *args, **kwargs) → Dict[source]¶ Find all file paths by a data path glob
- Parameters
data_path – directory with data
train – data path glob relative to data_path
valid – data path glob relative to data_path
test – data path glob relative to data_path
- Returns
A dictionary containing training, validation and test parts of the dataset obtainable via train, valid and test keys.
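The glob-based lookup can be sketched with `pathlib`. This is an illustration under assumptions rather than the reader's implementation: the function name `find_paths` is hypothetical, and returning an empty list for a part whose glob is not given is a choice made for this example.

```python
import tempfile
from pathlib import Path
from typing import Dict, List, Optional


def find_paths(data_path: str,
               train: Optional[str] = None,
               valid: Optional[str] = None,
               test: Optional[str] = None) -> Dict[str, List[Path]]:
    """Expand each glob relative to data_path into a sorted list of files."""
    root = Path(data_path)

    def expand(pattern: Optional[str]) -> List[Path]:
        if pattern is None:
            return []  # assumption: a missing glob yields an empty part
        return sorted(p for p in root.glob(pattern) if p.is_file())

    return {"train": expand(train), "valid": expand(valid), "test": expand(test)}


# Demo on a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("train/a.txt", "train/b.txt", "valid/c.txt"):
        target = Path(tmp) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("sample", encoding="utf-8")
    found = find_paths(tmp, train="train/*.txt", valid="valid/*.txt")
    found_names = {part: [p.name for p in paths] for part, paths in found.items()}
```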
-
-
class
deeppavlov.dataset_readers.kvret_reader.KvretDatasetReader[source]¶ A New Multi-Turn, Multi-Domain, Task-Oriented Dialogue Dataset.
Stanford NLP released a corpus of 3,031 multi-turn dialogues in three distinct domains appropriate for an in-car assistant: calendar scheduling, weather information retrieval, and point-of-interest navigation. The dialogues are grounded through knowledge bases ensuring that they are versatile in their natural language without being completely free form.
For details see https://nlp.stanford.edu/blog/a-new-multi-turn-multi-domain-task-oriented-dialogue-dataset/.
-
classmethod
read(data_path: str, dialogs: bool = False) → Dict[str, List][source]¶ Downloads 'kvrest_public.tar.gz', decompresses, saves files to data_path.
- Parameters
data_path – path to save data
dialogs – flag which indicates whether to output list of turns or list of dialogs
- Returns
dictionary with 'train' containing dialogs from 'kvret_train_public.json', 'valid' containing dialogs from 'kvret_valid_public.json', 'test' containing dialogs from 'kvret_test_public.json'. Each field is a list of tuples (x_i, y_i).
-
-
class
deeppavlov.dataset_readers.morphotagging_dataset_reader.MorphotaggerDatasetReader[source]¶ Class to read training datasets in UD format
-
read(data_path: Union[List, str], language: Optional[str] = None, data_types: Optional[List[str]] = None, **kwargs) → Dict[str, List][source]¶ Reads UD dataset from data_path.
- Parameters
data_path – can be either
1. a directory containing files. The file for data_type ‘mode’ is then data_path / {language}-ud-{mode}.conllu
2. a list of files, containing the same number of items as data_types
language – a language to detect filename when it is not given
data_types – which dataset parts among ‘train’, ‘dev’, ‘test’ are returned
- Returns
a dictionary containing dataset fragments (see read_infile) for given data types
-
-
deeppavlov.dataset_readers.morphotagging_dataset_reader.get_language(filepath: str) → str[source]¶ Extracts language from typical UD filename
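Given the `{language}-ud-{mode}.conllu` naming pattern mentioned for `data_path`, extracting the language can be sketched as taking the part of the file name before the first hyphen. This is an assumption about the behaviour, not the library's code, and the name `language_from_ud_filename` is hypothetical.

```python
from pathlib import Path


def language_from_ud_filename(filepath: str) -> str:
    """Return the prefix of the file name up to the first hyphen."""
    return Path(filepath).name.split("-")[0]


ud_language = language_from_ud_filename("ru_syntagrus-ud-train.conllu")
```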
-
deeppavlov.dataset_readers.morphotagging_dataset_reader.read_infile(infile: Union[pathlib.Path, str], *, from_words=False, word_column: int = 1, pos_column: int = 3, tag_column: int = 5, head_column: int = 6, dep_column: int = 7, max_sents: int = -1, read_only_words: bool = False, read_syntax: bool = False) → List[Tuple[List, Optional[List]]][source]¶ Reads input file in CONLL-U format
- Parameters
infile – a path to a file
word_column – column containing words (default=1)
pos_column – column containing part-of-speech labels (default=3)
tag_column – column containing fine-grained tags (default=5)
head_column – column containing syntactic head position (default=6)
dep_column – column containing syntactic dependency label (default=7)
max_sents – maximal number of sentences to read
read_only_words – whether to read only words
read_syntax – whether to return heads and deps alongside tags. Ignored if read_only_words is True
- Returns
a list of sentences. Each item contains a word sequence and an output sequence. The output sequence is None if read_only_words is True, a single list of word tags if read_syntax is False, and a list of the form [tags, heads, deps] if read_syntax is True.
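The column-based reading described above can be illustrated with a simplified parser that works on CoNLL-U text. It is a sketch under assumptions, not the `read_infile` implementation: the name `read_conllu` is hypothetical, only the word and part-of-speech columns are used, the part-of-speech label alone serves as the tag, and syntax (`heads`, `deps`) is left out.

```python
from typing import List, Optional, Tuple

# Hypothetical two-sentence sample in CoNLL-U format (tab-separated columns).
SAMPLE = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\tNumber=Plur\t0\troot\t_\t_\n"
    "\n"
    "1\tHi\thi\tINTJ\tUH\t_\t0\troot\t_\t_\n"
)


def read_conllu(text: str,
                word_column: int = 1,
                pos_column: int = 3,
                max_sents: int = -1,
                read_only_words: bool = False
                ) -> List[Tuple[List[str], Optional[List[str]]]]:
    """Return a list of (words, tags) pairs; tags is None if read_only_words."""
    sentences = []
    words, tags = [], []
    # The extra empty line flushes the last sentence.
    for line in text.splitlines() + [""]:
        line = line.strip()
        if line.startswith("#"):
            continue  # sentence-level comment
        if not line:
            if words:
                sentences.append((words, None if read_only_words else tags))
                words, tags = [], []
                if 0 < max_sents <= len(sentences):
                    break
            continue
        columns = line.split("\t")
        if not columns[0].isdigit():
            continue  # skip multiword token ranges (1-2) and empty nodes (1.1)
        words.append(columns[word_column])
        tags.append(columns[pos_column])
    return sentences


parsed = read_conllu(SAMPLE)
```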
-
class
deeppavlov.dataset_readers.paraphraser_reader.ParaphraserReader[source]¶ The class to read the paraphraser.ru dataset from files.
Please, see https://paraphraser.ru.
-
class
deeppavlov.dataset_readers.siamese_reader.SiameseReader[source]¶ The class to read dataset for ranking or paraphrase identification with Siamese networks.
-
class
deeppavlov.dataset_readers.squad_dataset_reader.SquadDatasetReader[source]¶ Downloads dataset files and prepares train/valid split.
SQuAD: Stanford Question Answering Dataset https://rajpurkar.github.io/SQuAD-explorer/
SberSQuAD: Dataset from SDSJ Task B https://www.sdsj.ru/ru/contest.html
MultiSQuAD: SQuAD dataset with additional contexts retrieved (by tfidf) from original Wikipedia article.
MultiSQuADRetr: SQuAD dataset with additional contexts retrieved by tfidf document ranker from full Wikipedia.
-
read(dir_path: str, dataset: Optional[str] = 'SQuAD', url: Optional[str] = None, *args, **kwargs) → Dict[str, Dict[str, Any]][source]¶
- Parameters
dir_path – path to save data
dataset – default dataset names: 'SQuAD', 'SberSQuAD' or 'MultiSQuAD'
url – link to archive with dataset, use url argument if non-default dataset is used
- Returns
dataset split on train/valid
- Raises
RuntimeError – if dataset is not one of these: 'SQuAD', 'SberSQuAD', 'MultiSQuAD'.
-
-
class
deeppavlov.dataset_readers.typos_reader.TyposCustom[source]¶ Base class for reading spelling corrections dataset files
-
static
build(data_path: str) → pathlib.Path[source]¶ Base method that interprets data_path argument.
- Parameters
data_path – path to the tsv-file containing erroneous and corrected words
- Returns
the same path as a Path object
-
-
class
deeppavlov.dataset_readers.typos_reader.TyposKartaslov[source]¶ Implementation of TyposCustom that works with a Russian misspellings dataset from kartaslov
-
static
build(data_path: str) → pathlib.Path[source]¶ Download misspellings list from github
- Parameters
data_path – target directory to download the data to
- Returns
path to the resulting csv-file
-
-
class
deeppavlov.dataset_readers.typos_reader.TyposWikipedia[source]¶ Implementation of TyposCustom that works with English Wikipedia’s list of common misspellings
-
static
build(data_path: str) → pathlib.Path[source]¶ Download and parse common misspellings list from Wikipedia
- Parameters
data_path – target directory to download the data to
- Returns
path to the resulting tsv-file
-
-
class
deeppavlov.dataset_readers.ubuntu_v2_reader.UbuntuV2Reader[source]¶ The class to read the Ubuntu V2 dataset from csv files.
Please, see https://github.com/rkadlec/ubuntu-ranking-dataset-creator.
-
read(data_path: str, positive_samples=False, *args, **kwargs) → Dict[str, List[Tuple[List[str], int]]][source]¶ Read the Ubuntu V2 dataset from csv files.
- Parameters
data_path – A path to a folder with dataset csv files.
positive_samples – if True, only positive context-response pairs will be taken for train
-
-
class
deeppavlov.dataset_readers.ubuntu_v2_mt_reader.UbuntuV2MTReader[source]¶ The class to read the Ubuntu V2 dataset from csv files taking into account multi-turn dialogue context.
Please, see https://github.com/rkadlec/ubuntu-ranking-dataset-creator.
- Parameters
data_path – A path to a folder with dataset csv files.
num_context_turns – A maximum number of dialogue context turns.
padding – “post” or “pre” context sentences padding
-
read(data_path: str, num_context_turns: int = 1, padding: str = 'post', *args, **kwargs) → Dict[str, List[Tuple[List[str], int]]][source]¶ Read the Ubuntu V2 dataset from csv files taking into account multi-turn dialogue context.
- Parameters
data_path – A path to a folder with dataset csv files.
num_context_turns – A maximum number of dialogue context turns.
padding – “post” or “pre” context sentences padding
- Returns
Dictionary with keys “train”, “valid”, “test” and parts of the dataset as their values
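One plausible reading of the `num_context_turns` and `padding` parameters is sketched below: a context is aligned to a fixed number of turns by adding empty sentences after ("post") or before ("pre") the real ones. The function name `pad_context`, the use of empty strings as filler, and keeping the most recent turns when the context is too long are all assumptions of this example, not confirmed details of the reader.

```python
from typing import List


def pad_context(context: List[str],
                num_context_turns: int,
                padding: str = "post") -> List[str]:
    """Align a dialogue context to exactly num_context_turns sentences."""
    if padding not in ("post", "pre"):
        raise ValueError("padding must be 'post' or 'pre'")
    # Assumption: when the context is too long, keep the most recent turns.
    turns = context[-num_context_turns:] if num_context_turns > 0 else []
    filler = [""] * (num_context_turns - len(turns))
    return turns + filler if padding == "post" else filler + turns


post_padded = pad_context(["hi", "how are you?"], 4, padding="post")
pre_padded = pad_context(["hi", "how are you?"], 4, padding="pre")
```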