This last observation is less surprising when we consider that text and record structures are the primary domains for the two subfields of computer science that focus on data management, namely text retrieval and databases.
A notable feature of linguistic data management is that it usually brings both data types together, and that it can draw on results and techniques from both fields.
Two sentences, read by all speakers, were designed to bring out dialect variation. The remaining sentences were chosen to be phonetically rich, involving all phones (sounds) and a comprehensive range of diphones (phone bigrams).
Additionally, the design strikes a balance between multiple speakers saying the same sentence in order to permit comparison across speakers, and having a large range of sentences covered by the corpus to get maximal coverage of diphones.
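To make the notion of diphone coverage concrete, the following is a minimal sketch that extracts phone bigrams from a few transcriptions and counts how often each one occurs. The phone sequences are invented for illustration and are not taken from the corpus, and the helper name `diphones` is likewise hypothetical.

```python
from collections import Counter

def diphones(phones):
    """Return the adjacent phone pairs (phone bigrams) of one utterance."""
    return list(zip(phones, phones[1:]))

# Hypothetical phone transcriptions, made up for this example.
utterances = [
    ["sh", "iy", "hh", "ae", "d"],
    ["hh", "ae", "d", "iy"],
]

# Count every diphone across all utterances; the number of distinct keys
# is a simple measure of how many diphones the sentence set covers.
coverage = Counter(dp for utt in utterances for dp in diphones(utt))

for pair, count in sorted(coverage.items()):
    print(pair, count)
```

Adding a new sentence to `utterances` raises coverage only if it contributes diphones not yet seen, which is one way to picture the trade-off between repeating sentences across speakers and widening the range of sentences.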
The inclusion of speaker demographics brings in many more independent variables that may help to account for variation in the data, and that facilitate later uses of the corpus for purposes not envisaged when the corpus was created, such as sociolinguistics.
The TIMIT corpus of read speech was the first annotated speech database to be widely distributed, and it has an especially clear organization.
Despite its complexity, the TIMIT corpus contains only two fundamental data types, namely lexicons and texts.
As we saw in 2., most lexical resources can be represented using a record structure. A lexical resource could be a conventional dictionary or comparative wordlist, as illustrated.
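The two data types can be sketched in a few lines of Python, assuming that a record is simply a key with one or more named fields and that a text is a sequence of tokens. The entries, field names (`pos`, `pron`), and pronunciations below are hypothetical and serve only to show the shape of the data.

```python
# A lexicon as a record structure: each key (a word) maps to named fields.
# All entries here are invented for illustration.
lexicon = {
    "she":  {"pos": "pronoun", "pron": ["sh", "iy"]},
    "had":  {"pos": "verb",    "pron": ["hh", "ae", "d"]},
    "wash": {"pos": "noun",    "pron": ["w", "ao", "sh"]},
}

# A text as a plain sequence of word tokens.
text = ["she", "had", "wash"]

# Bringing both data types together: look up each token of the text
# in the lexicon and collect its pronunciation field.
pronunciations = [lexicon[word]["pron"] for word in text]
print(pronunciations)
```

Keeping the lexicon keyed by word makes each lookup a single dictionary access, while the text keeps its original order.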
First, the corpus contains two layers of annotation, at the phonetic and orthographic levels.
In general, a text or speech corpus may be annotated at many different linguistic levels, including morphological, syntactic, and discourse levels.
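One way to picture several annotation layers over the same recording is as parallel lists of labeled spans. The sketch below assumes each layer stores `(start, end, label)` triples over a shared offset scale. That representation, the offsets, and the helper `phones_in` are all assumptions made for illustration, not a description of the corpus file format.

```python
# Orthographic layer: word spans (hypothetical offsets).
words = [(0, 30, "she"), (30, 60, "had")]

# Phonetic layer: phone spans over the same offset scale (hypothetical).
phones = [
    (0, 15, "sh"), (15, 30, "iy"),
    (30, 40, "hh"), (40, 52, "ae"), (52, 60, "d"),
]

def phones_in(word_span, phone_layer):
    """Return labels of phones whose span lies inside the given word span."""
    start, end, _ = word_span
    return [label for (s, e, label) in phone_layer if s >= start and e <= end]

# Relate the two layers by offset containment.
aligned = {word[2]: phones_in(word, phones) for word in words}
print(aligned)
```

Because every layer refers to the same offsets, further layers (morphological, syntactic, discourse) could be added as additional lists without changing the existing ones.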