Best Practices

From BCOeditor Wiki
Revision as of 15:39, 20 May 2024 by ChinweokeO (talk | contribs)

The primary goal of this guide is to offer comprehensive guidance for effectively using and maintaining BioCompute Objects (BCOs) in the fields of bioinformatics and computational biology. This includes promoting consistency, data integrity, and collaboration among professionals, while supporting regulatory compliance and enhancing usability and adoption. The guide also aims to facilitate the ongoing maintenance and updating of BCOs so that they remain relevant and usable across different computational domains.

General

  • The required domains are defined by the IEEE. However, a BioCompute Object is considered complete when an Error Domain exists.
  • Versioning is allowed, but only if the changes do not affect the workflow or output. BCO versioning follows a minor.patch schema; no major versions are allowed (substantial changes result in a new BCO). Minor changes are things like a change of contact information for a contributor; patch changes are things like spelling and grammar fixes.
  • In general, any step that does not transform data (for example, arranging rows and columns in a table, or formatting a figure) does not need to be included in the Description Domain as a formal step and can instead be described in the Usability Domain. Each step that transforms data should be its own step in the Description Domain.
  • The Usability Domain should contain enough information to enable a naïve user generally skilled in bioinformatics to understand the analysis. This means that references to commonly used resources (such as basic Unix commands, well-known databases like NCBI, basic terms like “alignment,” etc.) do not need to be explained, but references to less well-known resources (such as obscure Python packages) should be described. The description should be tailored to the intended audience, and BCOs intended for public consumption should assume a basic level of bioinformatics proficiency.

BioCompute Registry

The BioCompute Registry is a domain registry for BCO IDs in which users can register their institution or organization. As with a website registry, the owner of a domain can use any naming scheme of their choosing within it, and naming collisions between groups are prevented. For example, the owner of “GW” can build BCOs GW_0001.1, GW01A, GW_, or any other naming system of their preference, and these will not conflict with another registered domain, such as FDA_0001.1. Registered prefixes may not exceed five characters, and three characters are recommended. Any alphanumeric characters are acceptable.
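The prefix rule above can be checked before a registration request is sent. The following is a minimal sketch, assuming that "alphanumeric" means ASCII letters and digits; the function names are hypothetical and are not part of any BioCompute tool.

```python
import re

# Rule from the text above: at most five alphanumeric characters.
# Assumption: "alphanumeric" is limited to ASCII letters and digits.
PREFIX_PATTERN = re.compile(r"[A-Za-z0-9]{1,5}")


def is_valid_prefix(prefix: str) -> bool:
    """Return True if the prefix has 1 to 5 ASCII alphanumeric characters."""
    return PREFIX_PATTERN.fullmatch(prefix) is not None


def is_recommended_length(prefix: str) -> bool:
    """Return True if the prefix is valid and has the recommended 3 characters."""
    return is_valid_prefix(prefix) and len(prefix) == 3
```

Under these assumptions, `is_valid_prefix("GW")` and `is_valid_prefix("FDA")` both hold, while a six-character prefix or one containing an underscore is rejected. The underscore in an ID such as GW_0001.1 belongs to the owner's naming scheme, not to the registered prefix.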

A BCO may be registered only by the author of the object, and the domain must be approved by the domain holder. Until automated systems are in place, register a BCO by sending the BCO ID and email of the registrant to the [BioCompute Team](mailto:keeneyjg@gwu.edu). The following institutional domains have been reserved:

  • GWU
  • FDA
  • NIH
  • CDC
  • NCI

Preferred Ontologies

Semantic Versioning

BCO versioning should adhere to semantic versioning to establish how version numbers are assigned and incremented. Given a version number MAJOR.MINOR.PATCH, when versioning a BCO increment the:

1. MAJOR version when you make incompatible API changes

2. MINOR version when you add functionality in a backward-compatible manner

3. PATCH version when you make backward-compatible bug fixes

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.
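The increment rules can be sketched in a few lines of Python. This is an illustrative helper, not part of any BCO tooling: the `Version` class and its method names are hypothetical, and pre-release and build-metadata labels are deliberately not handled. Resetting the lower-order fields to zero on a MAJOR or MINOR bump follows the Semantic Versioning specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Version:
    """A plain MAJOR.MINOR.PATCH version (no pre-release or build labels)."""
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        # Raises ValueError if the text is not three dot-separated integers.
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)

    def bump(self, level: str) -> "Version":
        # Lower-order fields reset to zero when a higher-order field increments.
        if level == "major":
            return Version(self.major + 1, 0, 0)
        if level == "minor":
            return Version(self.major, self.minor + 1, 0)
        if level == "patch":
            return Version(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown level: {level!r}")

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"
```

For example, `str(Version.parse("1.2.3").bump("minor"))` gives `"1.3.0"`, and bumping the patch level of the same version gives `"1.2.4"`.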

PAV Ontology and PROV-O

To preserve the provenance of each BCO, the contribution type of each reviewer and contributor is chosen from the PAV (Provenance, Authoring and Versioning) ontology, which also maps to PROV-O. The following are possible values for the status of an object in the review process:

  • unreviewed flag indicates that the object has been submitted, but no further evaluation or verification has occurred.
  • in-review flag indicates that verification is underway.
  • approved flag indicates that the BCO has been verified and reviewed.
  • suspended flag indicates an object that was once valid is no longer considered valid.
  • rejected flag indicates that an error or inconsistency was detected in the BCO, and it has been removed or rejected.
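Tools that read or write the review status may want to restrict it to the five values listed above. A minimal sketch, assuming the status is stored as one of these lowercase strings; the `ReviewStatus` and `parse_status` names are hypothetical.

```python
from enum import Enum


class ReviewStatus(str, Enum):
    """The five review states listed above, stored as their string values."""
    UNREVIEWED = "unreviewed"
    IN_REVIEW = "in-review"
    APPROVED = "approved"
    SUSPENDED = "suspended"
    REJECTED = "rejected"


def parse_status(value: str) -> ReviewStatus:
    """Convert a status string to a ReviewStatus; raises ValueError if unknown."""
    return ReviewStatus(value)
```

With this sketch, `parse_status("in-review")` returns `ReviewStatus.IN_REVIEW`, and any string outside the list raises a `ValueError` instead of being silently accepted.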

Namespace: CURIE

The external references field contains a list of the databases and/or ontology IDs that are cross-referenced in the BCO. The external references are used to provide more specificity in the information related to BCO entries. Cross-referenced resources need to be available in the public domain. The external references are stored in the form of prefixed identifiers (CURIEs). These CURIEs map directly to the URIs maintained by identifiers.org. See Section 3.5 for a list of the CURIEs used in this example.
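As an illustration of how a CURIE maps to a URI, the sketch below splits a prefixed identifier and builds a resolver URL. It assumes the `https://identifiers.org/<prefix>:<id>` pattern used by the identifiers.org resolver; the function names and the example CURIE are illustrative and are not taken from a particular BCO.

```python
def split_curie(curie: str) -> tuple[str, str]:
    """Split 'prefix:local_id' at the first colon; raise ValueError if malformed."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    return prefix, local_id


def curie_to_uri(curie: str) -> str:
    """Build a resolver URL, assuming the identifiers.org URL pattern."""
    prefix, local_id = split_curie(curie)
    return f"https://identifiers.org/{prefix}:{local_id}"
```

For example, `curie_to_uri("taxonomy:9606")` gives `"https://identifiers.org/taxonomy:9606"`. Whether a given prefix is actually registered with identifiers.org is not checked here.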

BCO Creation and Versioning

Intended Audience: BCO authors

  • BioCompute IDs are used as persistent URLs. A novel usability domain must result in the creation of a new BCO with a new BCO ID. BCO IDs are immutable upon creation and are never deleted or retired. If a BCO is changed while its usability domain (UD) remains unchanged, the change results in a new version of the same BCO. BCO ID example: OMX_000001
  • BCO major and minor versions can be incremented based on project/institution documented policies.
  • The BioCompute consortium maintains a database of registered authorities. Registered authorities are able to assign their reserved prefixes to their own IDs in the object_id field, such as OMX_000001. We encourage everyone to register a prefix at biocomputeobject.org.

BCO Metadata

The three metadata fields (spec_version, etag, and object_id) are filled out at the time of submission. The validity check fills in spec_version with the IEEE URL, offers the option to compute a SHA-256 hash (or to enter your own hash value) for etag, and assigns object_id (with the option to choose any prefix associated with the account).
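For authors who supply their own etag value, a SHA-256 digest can be computed with the Python standard library. The sketch below is one possible approach rather than the portal's implementation: it assumes that the three metadata fields are excluded from the hash and that the remaining content is serialized as JSON with sorted keys, so that the digest does not depend on key order. Both choices are assumptions, and the `compute_etag` name is hypothetical.

```python
import hashlib
import json

# Assumption: the three metadata fields are left out of the hashed content.
METADATA_FIELDS = ("object_id", "spec_version", "etag")


def compute_etag(bco: dict) -> str:
    """Return a SHA-256 hex digest of the BCO content without its metadata fields."""
    body = {key: value for key, value in bco.items() if key not in METADATA_FIELDS}
    # Sorted keys and fixed separators make the serialization deterministic.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the etag field itself is excluded, writing the digest back into the object does not change the result of a later recomputation, which makes the value usable as a simple integrity check.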

Domain-specific guidance

The following fields are optional based on the IEEE-2791-2020 standard: Extension Domain, Parametric Domain, Error Domain.

Provenance Domain

This domain serves as a repository for metadata describing the BCO.

Usability Domain

Authors have access to a text field where they can provide a comprehensive description of the analysis and relevant details.

Extension Domain

Format of how the schema would be defined: Execution domain

Description Domain

It includes a detailed breakdown of the individual steps involved, the external resources essential for each step, and the relationships between input and output objects.

Execution Domain

When recording manual curation, the script field of the execution_domain should link to a Google Document or GitHub markdown file that describes the steps, either programmatically or in a stepwise fashion. Manual curation steps should also be properly documented in the description_domain. An easy way to conceptualize this: the Description Domain is for people, and the Execution Domain is for machines (or programmers).

Parametric Domain

This domain captures any modifications made to parameters from their default values.
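One way to build such entries is to compare the parameters actually used in a step against the tool's defaults and record only the differences. The sketch below assumes each entry carries `param`, `value`, and `step` keys stored as strings; that layout is an assumption to verify against the IEEE 2791-2020 schema. The function name and the sample parameters are hypothetical.

```python
def changed_parameters(step: int, defaults: dict, used: dict) -> list[dict]:
    """List the parameters of one step whose values differ from the defaults."""
    entries = []
    for name, value in used.items():
        # Record a parameter if it has no known default or its value was changed.
        if name not in defaults or defaults[name] != value:
            entries.append({"param": name, "value": str(value), "step": str(step)})
    return entries
```

For example, with defaults `{"seed": 10, "k": 5}` and used values `{"seed": 14, "k": 5}` in step 1, only the `seed` parameter is recorded.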

Input and Output Domain

This domain serves as a catalog of global input and output files used in the analysis.

Error Domain

This domain can support a “QA/QC rules” subdomain that provides rules for the output: if an output file does not pass the appropriate criteria, it is flagged as an error.
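The idea of QA/QC rules can be sketched as a list of named checks applied to metrics computed from an output file. The rule labels, metric names, and thresholds below are hypothetical and only illustrate the pattern; the standard does not prescribe this structure.

```python
from typing import Callable

# A rule pairs a label with a predicate that returns True when the output passes.
Rule = tuple[str, Callable[[dict], bool]]

# Hypothetical rules and metric names, for illustration only.
EXAMPLE_RULES: list[Rule] = [
    ("minimum read count", lambda m: m["read_count"] >= 1000),
    ("maximum duplication rate", lambda m: m["duplication_rate"] <= 0.2),
]


def failed_rules(metrics: dict, rules: list[Rule]) -> list[str]:
    """Return the labels of all rules that the output metrics do not pass."""
    return [label for label, passes in rules if not passes(metrics)]
```

An output whose metrics return an empty list passes every rule; any label in the returned list marks the output as flagged.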

BCO Form-based portal

Intended Audience: BCO tool developers and authors

BCOs can be created using any bioinformatics platform that has BCO read and write functionalities. Users who do not have access to a bioinformatics platform can use the BCO Builder in the BCO Portal, which has some of the basic API functionalities:

  • Create a BCO that is conformant to IEEE-2791
  • Download and install an instance within an organization’s firewall
  • View videos and documentation on tool use