A large share of a methodologist’s time is spent preparing data (Glaser, 2009). This takes scientists and researchers away from the actual work of analysis and testing. Making sure data is presentable, verifiable, and accessible has become one of their chief goals, even though it is not an activity directly related to the research itself.
While data is one of the most important parts of the research process, arguably the most important, there is no need to dwell on data consolidation more than necessary. Handling research data efficiently is thus imperative to the continuity and veracity of research work.
To that end, research data management (RDM) is a discipline concerned with making data generated in the course of research as easy as possible for peers, contributors, and readers to access. This article outlines what RDM is, what it can do, and how to make an effective RDM plan.
Table of Contents
- Overview of RDM
- Benefits of RDM
- Data Management Planning
- Manage Your Research for the Future
Overview of RDM
In the scope of this article, we will refer to research data as simply “data,” which, more specifically, refers to digital forms of data unless otherwise specified. But what is data, really?
What Is Data?
In general, data is information that is collected and recorded for later reference or analysis. Note that you can generate data at any point in your research, but if you fail to document it properly, it will become useless. Spichtinger and Siren (2018) define research data, more specifically, as “recorded factual material commonly accepted in the scientific community as necessary to validate research findings including data sets used to support scholarly publications.”
They further subdivide research data into 4 types, as you can see in the graphic below.
Data can be recorded in a variety of ways. Most readers would automatically assume that data is natively digital, but in fact, digital is only one of the latest entrants to the data sphere. Scientists and researchers before the computer age recorded their research data in other formats, such as:
- paper (e.g. notebooks and journals)
- images (e.g. photographs, scans, and/or film)
- audio (tape or otherwise)
- lab equipment measurements
In addition, remember that “data” means different things in different disciplines. The fact remains, however, that recording data efficiently (that is, spending less time preparing it) and making it accessible to peers and readers are the best ways to draw value from research. Research data management, or RDM, is a body of knowledge that seeks to do so.
Research Data Management: A Definition
Research data management describes a way to organize and store the data that a research project has accumulated in the most efficient way possible. It manages data gathered during the entire lifetime of the research project by coming up with consistent conventions. It is also responsible for the sharing, access, preservation, and secure disposal of data.
Importance of RDM
There are several reasons RDM is important, apart from the obvious, which is to make data collection easy and efficient. Here are some:
- Data is a transient product and can be easily lost if not saved properly.
- Managing research data correctly saves time and money.
- Data that can be referenced, verified, and validated increases the accuracy and quality of the research. Sharing data can also often lead to developments and insights from its readers, even if they are outside the original research team.
- Managing research data helps spot errors, especially if data is accessible to your team. Anderson et al. (2007) report that most research teams encounter errors, but only about half have mechanisms in place to address them.
- Funding agencies increasingly turn to data and reproducible results to approve research grants.
Challenges in RDM
Like anything else, managing research data has several challenges. The following are the biggest:
- Improper storage of data. This can lead to data being disposed of carelessly or becoming unusable. This is one of the biggest issues in data handling, and it is directly caused by the research team’s negligence. Depending on the terms of the agreement with the funder and/or sponsor, unusable data may actually be a violation. Plus, without proper data handling, inconsistencies will be overlooked.
- Failure to document technical data. Related to the first, this is a grave issue that stems from the team deviating from the proper standards of data documentation. This will make the findings of the research irreproducible, as any work that seeks to replicate the research will be riddled with inconsistencies.
- The research institution gets no copy of the data. Though rare, this is a big problem if the original research team has left the institution without leaving it a copy of the data. If later publications request access to the data, either to validate or recheck the findings of the report, it will put the institution in an awkward position. Maintaining data elsewhere, such as on personal servers, also presents problems, including legal ones.
Sources of Research Data
In general, there are five main sources of research data, as explained below. Note that the source of research data influences how any team manages that data. For example, observational data should be recorded right away to prevent data loss, while reference data isn’t as time-sensitive.
- Observation. This is data collected by observing activity or behavior, usually through physical observation, surveys, or sophisticated equipment like motion sensors. As explained above, this data needs real-time documentation, as it is impossible to recapture if not recorded.
- Derivation. This uses existing data to arrive at another piece of data, through extrapolation, interpolation, transformation, or some other method. An example is using available data from observation (from the aforementioned motion sensors) to get to a conclusion, such as traffic volume. Unlike observational data, it can be redone, but it would cost a lot of time and resources to do so.
- Experimentation. This is what comes to most people’s minds when faced with the term “research data.” This is data that a researcher collects by changing variables and measuring the resulting differences to test a hypothesis. It is particularly useful for finding a cause-and-effect relationship and, through statistical projection, can be applied to a larger set. It is more reproducible than either observational or derived data, but still expensive.
- Simulation. Here, a test model is used to imitate a process or a system over time to find what would or could happen under several conditions. This model is often computer-generated, but researchers have also run simulations using good old-fashioned pen and paper. Test models run the gamut of real-world systems, such as weather, geological activity, financial markets, neural pathways, and chemical reactions. What sets simulation data apart is that the test model is usually more important than the test data. Depending on the type of model, simulation is a more affordable data source, though it is limited by the accuracy of the model, which is itself a conglomeration of data from other sources.
- Reference. Reference data, also called canonical data, is a type of secondary data source. This is a collection of smaller data sources, such as those above, or data that has already been published and reviewed and is open for access or later research. Peer-reviewed journals, gene sequence databanks, and open-source code are some examples of reference data.
Data Collection Methods
Whatever the source, researchers collect data using one of two methods: qualitative or quantitative. As you can surmise from the names, qualitative collection is descriptive, which is useful for things that can be observed but not measured. Quantitative collection, meanwhile, deals with numbers.
That said, the source or type of data often makes one method much better suited than the other. Language data for use in natural language processing (NLP), for example, cannot be measured, so it is more appropriate for a qualitative collection method (which brings us full circle, as NLP can also automate qualitative data analysis, as Crowston et al. (2011) pointed out in their paper).
The two methods are explained below.
Qualitative research is research that defines the associations of individuals and experience against a greater context, such as social realities or the world. It is more concerned with observing people and groups and how they live their lives in a particular setting. Therefore, qualitative research collects data that is more descriptive than empirical.
Denzin and Lincoln (1994) describe the many ways of collecting qualitative data using empirical means, such as interviews, observations, analysis, visual materials, and personal experience. In addition, qualitative data need not be constrained to text; photographs, video, and audio recordings can also be considered qualitative data. An anthropologist recording oral history is one example of qualitative data collection.
What you gain from qualitative research is an answer to how people experience their world and how they act in their social sphere. However, one should note that the person producing the data (e.g. researcher, participant, or annotator) is a critical part of the data, as the data may change depending on who produces it. Therefore, most qualitative data (if not all) is subjective and exists only in relation to the observer (McLeod, 2019).
This, however, is one of the strengths of qualitative research, as the researcher gets a closer look at the subject matter that is otherwise only afforded to an insider. This gives them unfiltered access to matters, such as nuance and other subtle cues, that quantitative researchers will often miss. It also gives the research team a rare view of contradictions and ambiguities in data, which often reflect real life (Denscombe, 2010).
On the other hand, the quantitative method is more objective because it relies on a conventional standard of reliability and validity: numbers. While certainly not all data can be measured this way, quantitative data has the advantage that it can be categorized and/or ranked for a variety of purposes, such as graphs, charts, or tables. This ability to be visualized helps the reader answer questions about the data, not just display it (Cleveland, 1993).
The main order of business for quantitative researchers is to establish a general framework that bounds different settings and purposes, usually through the use of experimentation. To limit extraneous variables, these experiments are often done in a controlled environment, such as a laboratory. However, this method often ties the resulting data to its context, such as the assumptions, limitations, and expertise of the investigator (Black, 1999; Jansson-Boyd, 2018).
The main strength of quantitative data collection, however, is that it can be verified and interpreted with mathematical analysis. This, especially since the investigator is detached from the research, makes it more scientific and objective (Carr, 1994; Denscombe, 2010). In addition, numerical data is much easier to replicate than qualitative data, and while processing large datasets used to be a monumental task, software can “crunch” numbers today faster than ever (Antonius, 2003).
Benefits of Research Data Management
Research data management offers a lot of benefits to researchers. Some of these are discussed below.
The most important benefit of RDM is that you can secure your data. By making an effective research data management plan, you minimize data loss and unauthorized access by adhering to data storage or organization standards. You also reduce the risk of losing the integrity of data either through accident or negligence.
The most common place to store your research data is your institution’s repository, such as its servers (for digital data). Your institution or organization may have advice on where to store your data. Note that many funders object to research data that they funded being stored in personal repositories or elsewhere, especially without authorization.
The second most important benefit of RDM is collaboration, especially in an age where research is more complex, with more moving parts. This complexity is an advantage, as studies with multiple authors appear to fare better than those with only one (Lamberts, 2013). Making data accessible to everyone in the group, and even to those outside the team but in the same discipline, can open up massive opportunities to further your own research.
Plus, good RDM routines also improve the efficiency of data access. An organized data directory structure, for example, can make contributing data or building upon the existing dataset much easier. Efficient data organization also makes keeping tabs on the progress of the project much more seamless and puts accountability front and center.
Reproducibility of Research
Should another team try to replicate your research using the data you generated, they should arrive at the same result. Good RDM practices improve your research integrity by allowing third parties to validate your processes and findings. Markowetz (2015) also cites five “selfish reasons” that make reproducibility important, among them avoiding disaster and facilitating peer review.
In addition, putting your research up for review increases the visibility of your research, which, in turn, grows your number of citations. As Piwowar and Vision (2013) pointed out, open data benefits robust citation by improving its value and impact even after the project or the research is completed. That said, proper attribution is key to uncovering the results of your research, which is why some data citation standards are being developed. One initiative is using digital object identifiers (DOIs) to make data easily traceable across the internet.
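As a small illustration of how a DOI makes a work traceable, the sketch below turns a DOI written in any of several common forms into a link through the public doi.org resolver. The function name `doi_to_url` and the list of accepted prefixes are assumptions made for this example; the DOI in the usage note is the one listed for Markowetz (2015) in the references.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"

# Prefixes people commonly paste along with the identifier (an assumed list).
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Return a resolver URL for a DOI given with or without a prefix."""
    cleaned = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    # A DOI name starts with "10." and has a slash between prefix and suffix.
    if not cleaned.startswith("10.") or "/" not in cleaned:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode unusual characters but keep the slash separator.
    return DOI_RESOLVER + quote(cleaned, safe="/")
```

Calling `doi_to_url("doi:10.1186/s13059-015-0850-7")` gives a link that can be placed in a data citation so readers can follow it to the source.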
Data Management Planning
Knowing why you should manage your research data is all well and good, but the question remains: how should you do it? The answer is that you start with a data management plan, or DMP, which covers how your files and datasets are stored, organized, and arranged in a database. There are several database formats that you can use for huge volumes of data, but if you only need to arrange your files in a way that makes sense on a computer, you can find a few tips below.
Before you begin, you need to make many decisions on how to manage your data. For example, many funders now require an outline of your data management plans even before you begin your research, along with how regularly you need to furnish them with this data, the hardware and other equipment needed, and other issues. This is an ideal starting point for making a DMP as a map to all your planned research data, whether your funder requires it or not.
Additionally, you’ll have to contend with other considerations. Some of them include:
- Funder’s policies and expectations
- Copyright, intellectual property, and privacy issues
- Data format
- Data quantity
- Data storage (hardware and supporting software)
- File naming and directory structure conventions
- Version control, if necessary
- Access and sharing permissions
- Team roles
We look at these considerations further below. This is especially useful for researchers who want to outline a more specific approach on how to organize and simplify their research data management.
Consistency and logic are the top two reasons researchers organize their data. They allow any member of the team to find and use files easily. You need not create a highly detailed flowchart for this, however, as it may simply entail thinking about a file naming convention and how to nest files in your directories for easy access. The ideal time to do this is before the project or the research begins.
Naming conventions also reduce the risk of overwriting files. File names may contain dates and other identifiers to help you track which files are yours and when they were modified. Metadata, which we’ll also cover below, is much more accurate for this task.
For reference, the Library of Congress has recommended formats for data and databases at this page: https://www.loc.gov/preservation/resources/rfs/data.html.
Structure and Hierarchy
As mentioned, structuring your datasets in files and folders is an easy way to start your data management plan. Here are some ideas to get you started:
- Place files in the appropriate folder. Much like in real life, you would want to place files pertaining to a specific subject or topic in one folder.
- Use hierarchy. Use a few folders at the top for broader subjects, then more subfolders as needed for more specific topics.
- Check for existing practices. If your team or institution already has a file structure and naming convention, see if you can adopt it so you don’t have to start from scratch.
- Stick to convention. In any case, stick to your file naming convention to prevent confusion, especially for newcomers to the team or the workspace.
- Archive completed work. To streamline your work further, make sure superseded data is archived, not replaced completely. It may be useful to look at older iterations of research, such as to check for anything you might have missed. The important thing is to separate ongoing data from everything else.
- Maintain a backup. Your data should be backed up, whether it is primarily saved on your local hard drive, on your intranet, or in the cloud. Your backups should have a backup too, so cloud storage that syncs automatically with files on your local machine is a great option.
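The hierarchy idea in the list above can be sketched in a few lines of Python. The folder names in `LAYOUT` and the function name `create_project_tree` are hypothetical and only illustrate one possible arrangement, with broad folders at the top, narrower ones beneath, and a separate archive folder for superseded data. If your institution already has a structure, adapt the sketch to it.

```python
from pathlib import Path

# Hypothetical layout: a few broad folders, each with more specific subfolders.
LAYOUT = {
    "data": ["raw", "processed", "archive"],
    "docs": ["metadata", "protocols"],
    "results": ["figures", "tables"],
}


def create_project_tree(root: Path) -> list[Path]:
    """Create the folder hierarchy under root and return the folders made."""
    created = []
    for top_level, subfolders in LAYOUT.items():
        for name in subfolders:
            folder = root / top_level / name
            # exist_ok keeps the call safe to repeat on an existing project.
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created
```

Running `create_project_tree(Path("my_project"))` before the research begins gives every team member the same starting structure.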
As for files, agree with your team on how to label them so you don’t confuse one another. It is a good idea to opt for a version control naming scheme, for example, a “v01” or “v02” appended to the file name. In many cases, the final version of the file with the data in question can be marked as “final.” The person who does this will likely be the supervisor, the principal investigator, or the approver of the research.
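A version suffix like the one described above is easy to apply consistently with a small helper. This is a minimal sketch, assuming an underscore before the version tag (for example `_v01`); the function name `next_version_name` and the sample file names are made up for illustration.

```python
import re
from pathlib import Path

# Matches a stem that ends in "_v" followed by digits, e.g. "survey_v01".
VERSION_PATTERN = re.compile(r"^(?P<stem>.+)_v(?P<num>\d+)$")


def next_version_name(filename: str) -> str:
    """Return the name with its _vNN suffix incremented, or _v01 added."""
    path = Path(filename)
    match = VERSION_PATTERN.match(path.stem)
    if match:
        stem, number = match["stem"], int(match["num"]) + 1
    else:
        stem, number = path.stem, 1
    # Zero-pad to two digits so names sort correctly up to v99.
    return f"{stem}_v{number:02d}{path.suffix}"
```

For example, `next_version_name("survey_2020-06-01_v01.csv")` produces the name for the second version, while an unversioned file such as `interviews.txt` receives a `_v01` tag.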
Metadata means data about data. This is information that tells you about the data contained in a file, which is helpful to find the exact file you are looking for (and for others too). At present, not only does metadata define data but it is also useful in bridging connections among tools and software, like an API (Sen, 2004).
Metadata contains information that is necessary to find, interpret, and use your file, folder, or data. Like your file naming and folder structure conventions, deciding on metadata should be done at the start of the project.
There are generally two ways to attach metadata to your files: embedded metadata and supporting metadata.
Embedded metadata means embedding information into the file itself. This is the easiest approach, both for the creator of the file and for those trying to find it, and there are various ways to do it. Some embed metadata using XML text, such as this:
<data camera="b" date="14-Jun-01" direction="left" filename="021b001.dv" session="021" start_frame="335" start_time=" 0:00:13.10" stop_frame="4914" stop_time=" 0:03:16.14" subject_id="001" xmlcreatedby="xmlwrite.py; Time Code for segments added" xmlcreatedon="Tue Mar 26 15:32:05 2002">
<comments> No comment </comments>
<segments automatic="no" checked="yes">
<fullview> <start frame="51" start_frame="1104"/> <stop frame="2771"/> </fullview>
<postbackground> <start frame="2772" start_frame="4867"/> <stop frame="2822"/> </postbackground>
<prebackground> <start frame="0" start_frame="335"/> <stop frame="50"/> </prebackground> </segments>
</data>
Some software also supports embedding metadata this way, such as Document Properties in Microsoft Office.
Microsoft Word, for example, allows you to change your document metadata right in the app itself.
Other ways to embed metadata include descriptions, such as comments in the code or labels within the file itself. Some users also embed metadata using headers or summaries.
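To show why XML metadata like the record above is useful to software as well as people, here is a minimal sketch that reads such a record with Python’s standard `xml.etree.ElementTree` module. The `RECORD` string is a trimmed copy of the example, written with straight quotes and a closing `</data>` tag so that it parses; the function name `read_metadata` is an assumption for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the article's example record (assumed to be well-formed).
RECORD = """
<data camera="b" date="14-Jun-01" filename="021b001.dv" session="021" subject_id="001">
  <comments> No comment </comments>
  <segments automatic="no" checked="yes">
    <fullview> <start frame="51"/> <stop frame="2771"/> </fullview>
    <prebackground> <start frame="0"/> <stop frame="50"/> </prebackground>
  </segments>
</data>
"""


def read_metadata(xml_text: str) -> dict:
    """Collect the root attributes, the comment, and each segment's frames."""
    root = ET.fromstring(xml_text.strip())
    info = dict(root.attrib)
    info["comments"] = (root.findtext("comments") or "").strip()
    segments = {}
    segments_node = root.find("segments")
    if segments_node is not None:
        for segment in segments_node:
            start = int(segment.find("start").get("frame"))
            stop = int(segment.find("stop").get("frame"))
            segments[segment.tag] = (start, stop)
    info["segments"] = segments
    return info
```

With the record parsed into a dictionary, a script can search a whole folder of files by subject, session, or date instead of relying on file names alone.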
Supporting metadata is kept separate from the main datasets and is often used alongside them. It is a set of documents that contain an explanation or context of the data they support (hence the name), much like an operating manual.
The main disadvantage of supporting metadata is that it runs the risk of being as voluminous as the main dataset it props up. In this case, best practices in structuring and naming, as explained above, also apply.
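One lightweight form of supporting metadata is a “sidecar” file stored next to the data file it describes. The sketch below writes such a file as JSON. The `.meta.json` suffix, the field names, and the function name `write_sidecar` are assumptions made for this example rather than an established standard, so adapt them to whatever schema your discipline or institution uses.

```python
import json
from pathlib import Path


def write_sidecar(data_file: Path, description: str, creator: str,
                  keywords: list[str]) -> Path:
    """Write a JSON metadata file next to data_file and return its path."""
    # Keep the full original name so the pairing is obvious in a listing.
    sidecar = data_file.with_name(data_file.name + ".meta.json")
    record = {
        "file": data_file.name,
        "size_bytes": data_file.stat().st_size,
        "description": description,
        "creator": creator,
        "keywords": keywords,
    }
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

Because the sidecar follows the same naming convention as the data file, it stays easy to find even when the supporting documentation grows.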
Data Sharing and Preservation
Data will outlive the project, so you should plan for ways to share and preserve your data for posterity. Data preservation is part of the research data lifecycle. Though there are slightly varying models of data lifecycles (Ball, 2012), the research data lifecycle involves the movement of data from creation to preservation and reuse, ad infinitum.
The basic processes in a typical data lifecycle.
Digital data has an advantage in the sense that it can be maintained for far longer than other types. However, the main drawback is that as technology progresses, the tools meant to access this data may change. Good RDM practices thus plan for this inevitability by ensuring all data can be understood and used even years down the line.
Preserving data, however, does not mean merely saving to backups. As mentioned before, you should future-proof your data using these practices.
- Migrate to newer storage media periodically, across a variety of formats.
- Have backups of backups (and migrate them too).
- Use metadata.
- Use file formats that can be accessed by as many programs as possible or can be imported easily across formats.
- Update firmware on your storage media, if possible.
- Have “hard” copies of data, if possible.
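When migrating files to newer storage media, as suggested in the list above, it helps to confirm that each copy is identical to the original. One common way to do that, offered here as a suggestion rather than something the practices above prescribe, is to compare checksums. The sketch uses SHA-256 from Python’s standard `hashlib` module; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Reading in chunks keeps memory use low for large datasets.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original: Path, migrated: Path) -> bool:
    """True when the migrated file has the same content as the original."""
    return sha256_of(original) == sha256_of(migrated)
```

Storing the digest in the file’s metadata also lets you recheck the data years later, even after the original medium is gone.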
Sharing and Licensing
Data should not be siloed, and research data even less so. There is no sense in hoarding data, after all. Sharing is not only a good source of feedback but also a way to increase funding interest, garner citations, and build reputation.
Researchers can share data using a variety of means. At its simplest, you can store it on a USB flash drive, which can be borrowed by colleagues. Otherwise, you can upload it via FTP to a server, such as your institution’s repository. Another way is cloud sharing, which is explained below.
As for licensing, investigators can simply make a request form that anyone who wants to use their data can fill out. Otherwise, if internet publication is preferred, Creative Commons licenses are ideal for research work. Though there are many types of CC licenses, the most appropriate for research data is the Attribution-NonCommercial (CC BY-NC) license, which states that anyone can use the data in their own work as long as they cite their sources and avoid using it for profit. Some states or territories, however, have conflicting assessments of the NC clause (Hagedorn et al., 2011), so check with the licensing authority first.
Better still, CC licenses do not require any paperwork to be filed; you just need to notify your readers or other interested parties that you are using a particular type of Creative Commons license. However, CC licenses are irrevocable. Use one only when you are certain that you will not want to revoke it in the future for any reason.
Data loss is the enemy of nearly every researcher, and of nearly everyone who has stored files on any kind of storage medium. This is why it is crucial to have backups of your data, and even to back up your backups if necessary.
Some institutions often use automatic backups to periodically save research work or any materials stored in their repositories. Ask your computer or network administrator for details of this automatic procedure, especially how often it happens, where it is stored, and how long the backups are kept. In any case, no matter how exemplary your institution’s backup process is, it is still prudent to back your data up on your own.
Cloud storage provides a relatively affordable but highly reliable means of backing up data. In addition, cloud providers offer competitive cost-to-space ratios. Most cloud storage services also sync automatically, so your remote backup is updated soon after your local files change.
Whatever the case, it’s a good idea to diversify your backup formats and locations so you can keep data as safe as possible.
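Diversifying backup locations can be automated with a short script. The sketch below copies one file to several destination folders and checks each copy byte for byte. The function name `back_up` is hypothetical, and the destinations are assumed to be ordinary folders, for example an external drive and a folder that a cloud client syncs.

```python
import filecmp
import shutil
from pathlib import Path


def back_up(source: Path, destinations: list[Path]) -> list[Path]:
    """Copy source into every destination folder and verify each copy."""
    copies = []
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        target = folder / source.name
        # copy2 also preserves timestamps where the platform allows it.
        shutil.copy2(source, target)
        # shallow=False compares file contents, not just size and time.
        if not filecmp.cmp(source, target, shallow=False):
            raise OSError(f"backup differs from source: {target}")
        copies.append(target)
    return copies
```

A script like this complements, rather than replaces, your institution’s automatic backups.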
Manage Your Research for the Future
It can be said that good data management is not the destination but the journey; it is how researchers arrive at discovery and innovation (Wilkinson et al., 2016). Data, freely shared, can lead to further insights long after the original project is done and the research team has moved on.
This is why it is important to have a logical data management system to index and store your research data, not only for your own use but for those who will come after. Citation is an essential part of the research environment, as it brings your findings to the experts who can build on your work. Using initiatives like the DOI and new technologies such as cloud storage can bring your research to more minds than ever before.
To do that, however, managing your data just as your predecessors did is still a good idea. Following conventions, practicing logical data structure, and citing wisely form the framework upon which the future of science is built.
References
- Glaser, D. (2009). When Interpretation Goes Awry: The Impact of Interim Testing. In David Streiner and Souraya Sidani (Eds.), When Research Goes Off the Rails: Why It Happens and What You Can Do About It (p. 327). Retrieved from https://books.google.com.ph/books?id=aljkTd9unTMC&pg
- Spichtinger, D., Siren, J. (2018). The Development of Research Data Management Policies in Horizon 2020. In Filip Kruse and Jesper Boserup Thestrup (Eds.), Research Data Management – A European Perspective (p. 13). Retrieved from https://doi.org/10.1515/9783110365634
- Anderson, N., et al. (2007). Issues in Biomedical Research Data Management and Analysis: Needs and Barriers. Journal of the American Medical Informatics Association, 14(4), pp. 478-488. Retrieved from https://doi.org/10.1197/jamia.M2114
- Crowston, K., et al. (2011). Using natural language processing technology for qualitative data analysis. International Journal of Social Research Methodology, 15(6), 2012, pp. 523-543. Retrieved from https://doi.org/10.1080/13645579.2011.625764
- Denzin, N., Lincoln, Y. The Discipline and Practice of Qualitative Research. In Norman Denzin and Yvonne Lincoln (Eds.), Handbook of Qualitative Research (p. 14). Retrieved from https://www.sagepub.com/sites/default/files/upm-binaries/17670_Chapter1.pdf
- McLeod, S. (2019). Qualitative vs. quantitative research. Simply Psychology. Retrieved from https://www.simplypsychology.org/qualitative-quantitative.html
- Denscombe, M. (2010). The Good Research Guide: For Small-scale Social Research Projects (p. 319). Retrieved from https://www.academia.edu/2240154/The_Good_Research_Guide_5th_edition_
- Cleveland, W. (1993). Visualizing Data. Retrieved from https://dl.acm.org/doi/book/10.5555/529269
- Jansson-Boyd, C. (2018). Quantitative Research: Its Place in Consumer Psychology. In Paul Hackett (Ed.), Quantitative Research Methods in Consumer Psychology. Retrieved from https://books.google.com.ph/books?id=vQB-DwAAQBAJ
- Carr, L. (1994). The strengths and weaknesses of quantitative and qualitative research: what method for nursing? Journal of Advanced Nursing, 20(4), p. 717. Retrieved from https://pdfs.semanticscholar.org/a87b/ce9f2d5fe771005a2890c92da2cff8a03b32.pdf
- Antonius, R. (2003). Interpreting Quantitative Data with SPSS. Retrieved from https://dx.doi.org/10.4135/9781849209328
- Markowetz, F. (2015). Five selfish reasons to work reproducibly. Genome Biology, 16, 274. Retrieved from https://doi.org/10.1186/s13059-015-0850-7
- Piwowar, H., Vision, T. (2013). Data reuse and the open data citation advantage. Retrieved from https://peerj.com/articles/175/
- Lamberts, J. (2013). Two Heads are Better than One: The Importance of Collaboration in Research. HuffPost. Retrieved from https://www.huffpost.com/entry/two-heads-are-better-than_1_b_3804769
- Sen, A. (2004). Metadata management: past, present and future. Decision Support Systems. Retrieved from https://doi.org/10.1016/S0167-9236(02)00208-7
- Ball, A. (2012). Review of Data Management Lifecycle Models. University of Bath, Bath, UK. Retrieved from https://researchportal.bath.ac.uk/en/publications/review-of-data-management-lifecycle-models(23c4ba4b-c694-4787-90e7-aa85ac6edf3a).html
- Hagedorn, G., et al. (2011). Creative Commons licenses and the non-commercial condition: Implications for the re-use of biodiversity information. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3234435/
- Wilkinson, M., et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Retrieved from https://www.nature.com/articles/sdata201618.pdf?origin=ppub