As we explored the architecture of language in the previous chapter, we began to see that it is possible to model natural language in spite of its complexity and flexibility. And yet, the best language models are often highly constrained and application-specific. Why is it that models trained in a specific field or domain of the language would perform better than ones trained on general language? Consider that the term “bank” is very likely to refer to an institution that produces fiscal and monetary tools in an economic, financial, or political domain, whereas in an aviation or vehicular domain it is more likely to refer to a maneuver that changes the direction of an aircraft. By fitting models in a narrower context, the prediction space is smaller and more specific, and therefore better able to handle the flexible aspects of language.
The bulk of our work in the subsequent chapters will be in “feature extraction” and “knowledge engineering,” where we’ll be concerned with the identification of unique vocabulary words, sets of synonyms, interrelationships between entities, and semantic contexts. However, all of these techniques will revolve around a central text dataset: the corpus.
Corpora are collections of related documents that contain natural language. A corpus can be large or small, though generally they consist of hundreds of gigabytes of data spread across thousands of documents. For instance, considering that the average email inbox is 2GB, a moderately sized company of 200 employees would have around a half-terabyte email corpus. Documents contained in a corpus can also vary in size, from tweets to books. Corpora can be annotated, meaning that the text or documents are labeled with the correct responses for supervised learning algorithms, or unannotated, making them candidates for topic modeling and document clustering.
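The email-corpus figure above is a back-of-envelope estimate, and it is worth making the arithmetic explicit since it shapes storage and memory decisions later on. A minimal sketch, using the inbox size and headcount assumed in the text:

```python
# Back-of-envelope corpus size estimate, using the figures from the text:
# an average inbox of 2 GB and a company of 200 employees.
avg_inbox_gb = 2
employees = 200

corpus_gb = avg_inbox_gb * employees          # total gigabytes
corpus_tb = corpus_gb / 1024                  # convert to terabytes

print(f"{corpus_gb} GB (~{corpus_tb:.2f} TB)")  # 400 GB, roughly half a terabyte
```

The same estimate scales linearly: doubling either the headcount or the average inbox size doubles the corpus, which is why even modest organizations quickly exceed what fits in memory.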
Naturally, the next question should then be “how do we construct a dataset with which to build a language model?” In order to equip you for the rest of the book, this chapter will explore the preliminaries of constructing and organizing a domain-specific corpus. Working with text data is substantially different from working with purely numeric data, and there are a number of unique considerations that we will need to take into account. Whether it is done via scraping, RSS ingestion, or an API, ingesting a raw text corpus in a form that will support the construction of a data product is no trivial task. Moreover, when dealing with a text corpus, we must consider not only how the data is acquired, but also how it is organized on disk. Since these will be very large, often unpredictable datasets, we will need to anticipate potential performance problems and ensure memory safety through streaming data loading and multiprocessing. Finally, we must establish a systematic preprocessing method to transform our raw ingested text into a corpus that is ready for computation and modeling. By the end of this chapter, you should be able to organize your data and establish a reader that knows how to access the text on disk and present it in a standardized fashion for downstream analyses.
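To make the idea of streaming data loading concrete, here is a minimal sketch of a reader that yields one document at a time from a directory of plain-text files. The function name, directory layout, and `.txt` extension are illustrative assumptions, not the book's actual reader; the point is simply that a generator keeps memory usage flat regardless of corpus size, because only one document is in memory at once.

```python
import os

def stream_documents(corpus_root, extension=".txt", encoding="utf-8"):
    """Lazily yield the text of one document at a time.

    Hypothetical sketch of a streaming corpus reader: walking the
    directory tree with a generator means the full corpus is never
    loaded into memory, only the current document.
    """
    for dirpath, _, filenames in os.walk(corpus_root):
        for name in sorted(filenames):
            if name.endswith(extension):
                path = os.path.join(dirpath, name)
                with open(path, encoding=encoding) as f:
                    yield f.read()

# Usage: iterate lazily instead of reading everything up front.
# for doc in stream_documents("corpus/"):
#     process(doc)
```

Because the reader is a generator, it composes naturally with downstream preprocessing steps, which can likewise be written as generators and chained without ever materializing the whole corpus.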