LibGuides: Research Data Management (RDM): CodeBooks and Data Dictionaries

Codebooks

Codebooks are documents that explain the variables in your dataset. ICPSR suggests that these documents should note:

Variable name: The name or number assigned to each variable in the data collection. Some researchers prefer to use mnemonic abbreviations (e.g., EMPLOY1), while others use alphanumeric patterns (e.g., VAR001). For survey data, try to name variables after the question numbers (e.g., Q1, Q2b). [e.g., H40-SF12-2]

Variable label: A brief description to identify the variable for the user. Where possible, use the exact question or research wording. ["SF12 - ASSESSMENT OF R'S GENERAL HEALTH"]

Question text: Where applicable, the exact wording from survey questions. ["In general, would you say your health is . . ."]

Values: The actual coded values in the data for this variable. [1, 2, 3, 4, 5]

Value labels: The textual descriptions of the codes. [Excellent, Very Good, Good, Fair, Poor]

Summary statistics: Where appropriate and depending on the type of variable, provide unweighted summary statistics for quick reference. For categorical variables, for instance, report frequency counts (the number of times each value occurs) and the percentage of cases each value represents. For continuous variables, report the minimum, maximum, and median values.

Missing data: Where applicable, the values and labels of missing data. Missing data can bias an analysis, so it is important to describe it in the study documentation. Remember to describe all missing codes, including "system missing" and blank. [e.g., Refusal (-1)]

Universe, skip patterns: Where applicable, information about the population to which the variable refers, as well as the preceding and following variables. [e.g., Default Next Question: H00035.00]

Notes: Additional notes, remarks, or comments that contextualize the information conveyed in the variable or relay special instructions. For measures or questions from copyrighted instruments, the notes field is the appropriate location to cite the source.
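
The fields above can be kept in a machine-readable form alongside the data. The following is a minimal Python sketch, not an ICPSR tool: the class name CodebookEntry, the helper functions, and the list of responses are all hypothetical, and only the variable name, label, question text, value labels, and missing code come from the bracketed examples above.

```python
from collections import Counter
from dataclasses import dataclass, field
from statistics import median


@dataclass
class CodebookEntry:
    """One variable's codebook record (hypothetical structure)."""
    name: str
    label: str
    question_text: str
    value_labels: dict[int, str]
    missing_codes: dict[int, str] = field(default_factory=dict)
    universe: str = ""
    notes: str = ""


def categorical_summary(entry, responses):
    """Unweighted frequency count and percentage of cases for each code."""
    counts = Counter(responses)
    total = len(responses)
    # Missing codes are reported next to the substantive values.
    labels = {**entry.value_labels, **entry.missing_codes}
    return {
        labels.get(code, f"undocumented code {code}"): (n, round(100 * n / total, 1))
        for code, n in sorted(counts.items())
    }


def continuous_summary(values):
    """Minimum, maximum, and median for a continuous variable."""
    return {"min": min(values), "max": max(values), "median": median(values)}


health = CodebookEntry(
    name="H40-SF12-2",
    label="SF12 - ASSESSMENT OF R'S GENERAL HEALTH",
    question_text="In general, would you say your health is . . .",
    value_labels={1: "Excellent", 2: "Very Good", 3: "Good", 4: "Fair", 5: "Poor"},
    missing_codes={-1: "Refusal"},
)

# Made-up responses, for illustration only.
responses = [1, 2, 2, 3, 5, -1, 2, 4]
summary = categorical_summary(health, responses)

for value_label, (count, percent) in summary.items():
    print(f"{value_label}: {count} ({percent}%)")
```

Any code that appears in the data but not in the entry is reported as "undocumented code", which is a quick way to spot values the codebook has not yet described.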

Data Dictionaries

Data Dictionaries are very similar to (and arguably the same as) codebooks. DataQ has a great entry on data dictionaries written by Yasmeen Shorish:

This video from Kristin Briney is one of the most descriptive, yet concise, resources explaining data dictionaries: https://www.youtube.com/watch?v=Fe3i9qyqPjo . The video details what should go into the dictionary (variable or field names, units, relationships to other variables, data types, what people need to make sense of a researcher's work) and explains why a researcher might want one. It also shows examples of what a data dictionary looks like. The same author has a blog post on the topic, in case you prefer text to video: http://dataabinitio.com/?p=454
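
As an illustration of the elements listed above (variable names, units, data types, relationships to other variables), here is a minimal Python sketch that writes a data dictionary as CSV text and checks a dataset's columns against it. It is not taken from the video or the blog post: the column headings, the example variables (site_id, temp_c, depth_m), and both function names are assumptions made for this example.

```python
import csv
import io

# Hypothetical column headings for the data dictionary itself.
FIELDS = ["variable", "description", "units", "data_type", "related_to"]

# Hypothetical variables, for illustration only.
rows = [
    {"variable": "site_id", "description": "Identifier of the sampling site",
     "units": "", "data_type": "string", "related_to": ""},
    {"variable": "temp_c", "description": "Water temperature at the site",
     "units": "degrees Celsius", "data_type": "float", "related_to": "site_id"},
]


def write_data_dictionary(entries):
    """Return the data dictionary as CSV text, one row per variable."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def undocumented_columns(dataset_columns, entries):
    """List dataset columns that have no entry in the data dictionary."""
    documented = {entry["variable"] for entry in entries}
    return [column for column in dataset_columns if column not in documented]


print(write_data_dictionary(rows))
print(undocumented_columns(["site_id", "temp_c", "depth_m"], rows))
```

A plain CSV file is used here because it opens in any spreadsheet program and can be deposited with the dataset; a spreadsheet or text file with the same columns would serve equally well.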

For those looking at data dictionaries from a relational database perspective, this video tutorial provides stepwise instruction: https://www.youtube.com/embed/QRMUReSENjU
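
In a relational database, much of a data dictionary can be read from the schema itself. The sketch below is not the method shown in the video; it assumes SQLite (through Python's built-in sqlite3 module) and uses two hypothetical tables, site and reading, to show how field names, data types, and relationships can be pulled from the database's own metadata.

```python
import sqlite3


def describe_table(conn, table):
    """Build data dictionary rows for one table from SQLite's schema metadata.

    The table name is interpolated into the statement, so pass trusted names only.
    """
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Map each foreign-key column to the "table.column" it references.
    foreign_keys = {
        row[3]: f"{row[2]}.{row[4]}"
        for row in conn.execute(f"PRAGMA foreign_key_list({table})")
    }
    return [
        {
            "field": name,
            "data_type": col_type,
            "required": bool(notnull),
            "primary_key": bool(pk),
            "references": foreign_keys.get(name, ""),
        }
        for _cid, name, col_type, notnull, _default, pk in columns
    ]


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site (
    site_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE reading (
    reading_id INTEGER PRIMARY KEY,
    site_id INTEGER NOT NULL REFERENCES site(site_id),
    temp_c REAL
);
""")

dictionary = describe_table(conn, "reading")
for row in dictionary:
    print(row)
```

The schema supplies names, types, and relationships, but not meaning: descriptions and units still have to be written by the researcher and added to each row.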

A robust and technical definition of a data dictionary from a LIS encyclopedia may be useful for some researchers and librarians: "Data Dictionary (Metadata Dictionary): A subsystem of a database that records the definitions (semantics) for all the metadata elements used in a database. A data dictionary may also include detailed documentation about the relationships among metadata elements, as well as syntax and schema application rules. The term data dictionary comes from the relational database community and may be viewed as a type of metadata specification." Drake, M. A. (2003). Metadata in the World Wide Web. In Encyclopedia of library and information science (2nd ed.). New York: Marcel Dekker.

See this entry at DataQ

Credit

Grateful acknowledgement to the University of Pennsylvania Penn Libraries for their permission to use and modify their template: Data Management Resources