How OSF Meets ‘Desirable Characteristics for Data Repositories’

July 13th, 2022,

Both the National Institutes for Health (NIH) and National Science and Technology Council (NSTC) in the United States have recently issued important guidelines to help the researchers, investigators, and institutions they fund select a repository to share data. The NIH’s Supplemental Policy Information: Selecting a Repository for Data Resulting from NIH-Supported Research guidance comes from their goal to ensure data is managed and shared in a way that is Findable, Accessible, Interoperable, and Re-usable (FAIR). NSTC’s guidance on desirable characteristics for federally funded research comes as a result of the Memorandum on Increasing Access to the Results of Federally Funded Scientific Research issued by the White House Office of Science and Technology Policy in 2013, and the Request for Comments on Desirable Characteristics of Data Repositories in 2020. Sharing research and data accelerates knowledge, enhances the potential for collaboration, and maximizes data value for the research community. Sharing data also makes for a more open government, facilitates evidence-based decision making, and yields greater returns on Americans’ investments in Research and Development.

The proposed guidelines, agreed to by U.S. federal agencies, support consistency across agencies and will eventually coordinate data storage and management of Federally funded research. The aligned-agency characteristics further commit to FAIR research, also enabling equity and security.

The Open Science Framework (OSF) offers researchers an open-source, collaborative management solution for conducting, sharing, and reporting their research across the stages of the research lifecycle, including many integrations with adjacent tools to increase efficiency. The three areas the desirable characteristics focus on – Organizational Infrastructure, Digital Object Management, and Technology – are critical to continued development and maintenance of OSF.

NIH self assessment chart OSF Self-Assessment of NIH's desirable characteristics for selecting a data repository

The NIH Office of the Director also took the step to provide supplemental information on the Desirable Characteristics for Data Repositories to NIH-funded researchers, in anticipation of the new NIH policy for data management and sharing going into effect January 2023. Similar to the guidance from NSTC, they outline the importance of FAIR, security, metadata, access, and use guidance.

NSTC self assessment OSF Self-Assessment of NSTC's Desirable Characteristics of Data Repositories for Federally funded Research

OSF is joined by other generalist repositories (i.e., online location that accepts data regardless of type, format, content, or disciplinary focus) as part of the NIH’s Generalist Repositories Ecosystem Initiative (GREI) in collaborating to provide repositories to researchers that meet and exceed these characteristics. To learn more about those activities, upcoming webinar and training opportunities, and selecting a repository for your data visit the NIH Office of Data Science page on improving data access and the GREI initiative.

The OSF enables open science practices, where the roadmap of priorities for improvement is informed by the needs of research teams and stakeholders. These desirable characteristics provided by key stakeholders of research data are aligned with the current functionality supported by OSF, or are key priorities for the infrastructure roadmap.

To further elaborate on the ways OSF supports best practices in providing a public, reliable, secure, and sustainable repository for data:

Organizational Infrastructure

OSF is free to use by research producers and consumers. Signing up for an account on OSF is quick and easy, by providing a name, email, and password, or by using ORCID or institutional credentials. Access to view and download public content on OSF is free and does not require an account. OSF pages are available in English, with ongoing efforts to support internationalization through infrastructure support and community engagement. Content is posted to OSF in many languages, examples in Chinese, German, Japanese, and Portuguese, and content is viewed and accessed across the globe.

Visitors to OSF can access any public content through the user interface and the open API, and can request access to privately held content if research producers enable consumers to discover the project metadata. Research producers have fine control on privacy and security enabling them to keep some components of projects private and make other components publicly accessible. OSF includes native search and a public API to facilitate both deposit and access of OSF content. Since inception, COS has planned for the long-term preservation of the research posted to OSF through a fund set aside solely for that purpose. The initial fund started at $250,000 in 2013 and has a current value of $438,000. This fund will support many years of read-only access to OSF public content if COS is no longer able to maintain the service.

The OSF API enables end users or other service providers to easily access all public metadata, including withdrawn material. These metadata are indexed in the Datacite, Crossref, Google Scholar, SHARE, Europe PMC, and Web of Science indexes.

Research producers have control of data release and facility to manage data curation both before and after making data available. Versioning of files supports effective curation and historical preservation of content. OSF also maintains workflows for managing public reporting of inappropriate data disclosures for breaches of personally identifying information or intellectual property. These include help desk support through social media @OSFSupport, email support@osf.io, DMCA registration and processes, and other data integrity and security policies through our terms of use and privacy policy documentation and internal controls.

OSF individual projects, components, registrations, and preprints include metadata about their own licensing, include the text of the license in the interface and the API, and offer an option to write a custom license to meet content needs.

OSF facilitates research producers/data administrators’ creation of custom data use guidelines for sensitive/protected data that is only available through the “request access” mechanism

The YOUth Community Operated Registry leverages the registry framework to moderate access to their longitudinal data through preregistration submissions. Researchers requesting access to the data must provide a thorough study design and clearly indicate which subsets of the data they want to analyze. The YOUth registry allows only specific licenses to secondary data analysis to encourage open science best practice and ensure compliance from research teams.

The OSF support center provides clear guidance for our users on their account and security. The support center has sections focusing on the protection of research data, permission, licensing, and data management plans. OSF provides clear and comprehensive terms of use and privacy policies.

Along with COS’ Information Classification and Handling Policy, COS has developed a risk management program that identifies risks and perceived impact of the risk. Both within the development process and in the production environment, evaluating the probability and impact of all changes drives the risk management process to protect against activities such as spoofing, tampering, disclosure or denial of services which could expose the Service to attacks, compromise the privacy and confidentiality of customer data, or disrupt the availability of the Service.

The OSF has an internal, secure interface for managing service administration with Google IAM role based access control for user-support staff. Production GCP access is limited to DevOps staff. COS user-support staff have limited access to certain metadata about projects, registrations, preprints, files, and users (including names and email addresses). Only DevOps staff have direct system access, limited to GCloud CLI & Kubectl for Google GKE cluster access via an on-premise VPN using L2TP over IPSec. Operational policies are in place to ensure private data and metadata are only accessed at the request of users, for legal reasons, or when automated systems have detected public spam associated with a user. Additionally, GCP maintains strong physical security measures at data centers, as noted in section H.

COS maintains a Data Retention & Destruction Policy so that business confidentiality is maintained, customer data is protected from unauthorized access, information is maintained only for the required time to reduce risk, and an audit trail is recorded and maintained.

OSF database backups are maintained in encrypted snapshots for 60 days. Logs are retained indefinitely. File backups are hosted in Google Cloud Coldline storage indefinitely. Upon deletion by users, files are retained for 30 days before being removed.

OSF provides the technical facility for effective ethical management and privacy of storing human data. For example, research producers can set sensitive data to private for never sharing outside of approved collaborators or set projects to “request access” control to enable access requests with review for appropriate credentials. Research producers retain control over the management of public, controlled access, and privacy of data that they maintain. As such, research producers bear responsibility for meeting ethical guidelines outlined by their IRBs for appropriate data management and access of data posted to OSF. OSF administrators are not involved in setting access criteria, reviewing, or approving data access requests for individual projects. Simultaneously, COS maintains robust processes to be responsive to potential or actual data breaches that occur, available here.

Digital Object Management

OSF uses Globally Unique Identifiers (GUIDs) on all objects (users, files, projects, components, registrations, and preprints) across the platform, which are citable in scholarly communication. OSF objects receive a dedicated landing page, which remains accessible even if the object is deleted, marked as spam, or made private (i.e., removed from public view). OSF also supports registration of DOIs for projects, components, and research registrations with Datacite, and for preprints with Crossref. OSF collects ORCID iDs for users and contributors, and provides those with metadata sent for DOI minting, as well as ROR identifiers when contributor affiliations are known. If ORCID auto-updates are enabled, OSF content will appear instantly on ORCID records. Objects can also be connected with related DOIs, like research registrations or preprints to published manuscripts.

Metadata is available to describe artifacts related to every part of a research project’s lifecycle. OSF Registries enables the documentation of analysis plans, including advanced metadata to ensure discovery and association with research outcomes and are moderated and led by research disciplines and communities. Preprints utilizes the Crossref metadata schema to gather disciplinary metadata, as well as linked data and author conflict of interest statements.

Community Operated Registries (COR), OSF Collections, and OSF Preprints enable research communities to set standards and best practices that best align with their disciplines and community guidelines. These features provide the flexibility that each discipline and community needs while retaining transparency and rigor. Registrations and preprints are moderated by their respective communities to ensure rigor, high-quality, and community alignment.

Content posted on OSF is licensed by the users who retain the rights. They are encouraged to license posted content with liberal reuse licenses. OSF Projects, Components, Registrations, and Preprints include a variety of metadata that facilitates attribution, credit, and discovery such as contributors (authors), institutional affiliation, keywords, licenses, and DOIs.

OSF files and data may be downloaded locally in their original format, with over 500 file types supported.

OSF also supports registration of DOIs for projects, components, and research registrations with Datacite, and for preprints with Crossref. OSF also collects ORCID iDs for users and contributors. ORCID iDs are provided with metadata sent for DOI minting, as well as ROR identifiers when contributor affiliations are known. Identifiers are available via the OSF API.

OSF projects provide activity logs capturing all changes, the timestamp, and the user. OSF includes versioning of all files and wiki content. Projects can be forked for the use and reuse of data following the same concept in software development for code. Provenance is retained as functional links between projects so that attribution, credit, reuse, and adaptation over time can be tracked. Files on OSF that are connected through a third-party storage integration maintain provenance of the file and display information to the user when a file is no longer accessible on OSF about when it was last accessed through the service.

Technology

In addition, an automated testing program frequently ensures that connections with external data providers are stable so that researchers will always have access to the data they have associated with their OSF content.

COS maintains and updates an Information Security & Access Policy annually. This policy details employees’ responsibilities toward all types of assets, management's role, training, confidentiality of customer data and acceptable use of resources, etc. A compulsory annual security and privacy training requirement ensures employees refresh their knowledge and understanding each year. Annual penetration tests are conducted. And all non-public data collected by COS on behalf of its Users & Partners is classified as Confidential under COS Information Classification and Handling policy. Access to private customer data is restricted to legitimate business use only.

COS follows a shared responsibility model for cloud-hosted applications. COS applications and data are hosted on Google Cloud Platform (GCP) at Google data centers with strong physical and digital security measures. More information about operational and hardware-level security can be found here.

The OSF database uses GCP's at-rest disk encryption. Columns containing sensitive information (such as third-party storage add-on credentials) are encrypted via AES256-GCM. Passwords are one-way encrypted via bcrypt and cannot be decrypted. OSF uses Central Authentication Service (CAS) software to provide users secure, single sign-on (SSO) to access multiple OSF applications (Preprints, Registries, Meetings, Institutions) while providing login and password credentials only once. Because CAS supports multiple authentication protocols including OAuth2, CAS, and SAML, OSF provides institutions and other services the ability for their users to use their institutional credentials to access OSF services. Currently, the only available non-institutional third-party SSO service is ORCID. All traffic is encrypted via TLS, both internally between pods and clusters, and externally from the internet. OSF also offers two-factor authentication for added account security.

OSF has strong controls to maintain the integrity of files and keeps three types of hashes (MD5, SHA-1, SHA-256) for files. Google Cloud Storage, the backing service for OSF Storage, offers 99.999999999% (11 9's) annual durability. Files are stored in GCS Standard Storage and backed up to GCS Coldline to mitigate any risk of data loss in OSF Storage. The OSF database is backed up via streaming replication 24 hours a day, and incremental restore points are made twice daily, retained for 60 days, and manually verified each month. The OSF is currently undergoing a SOC2 audit and will report the results once completed.

We look forward to working closely with Federal agencies, their funded researchers, and the institutions to support data management and sharing through alignment with the desirable characteristics and OSF. Contact us for more information.

How OSF Meets ‘Desirable Characteristics for Data Repositories’

Recent Posts