Faq

From BCOeditor Wiki
Revision as of 05:05, 22 February 2024 by ChinweokeO (talk | contribs) (→‎General)
Jump to navigation Jump to search

Go back to BioCompute Objects.

Events

1. Is there an upcoming event related to BioCompute Objects?

Yes, there will be a mini symposium at the FDA on May 10th. Both FDA and regulated industry representatives will participate in mapping the way forward for using BCOs in submissions and establishing best practices. More information can be found here

2. Do I need to register for the BioCompute workshop?

Registration is not required, but it helps in planning the event better. If you'd like to attend, you can register here

General

1. Where would information regarding data sources and standard operating procedures be? Which specific domain?

Data sources should be recorded as described by the input_subdomain in the “io_domain” and the input_list in the “description_domain”. Standard operating procedures and any other information about data transformations SHOULD be elaborated upon in the “usability_domain”.

2. How can a third-party access the URI?

URI’s can be directed to local paths. In these cases, the necessary files are shared with the parties that will require access. If it is a link to a public domain, it will be easily accessible for all.

3. What is a SHA1 Checksum?

A SHA-1 checksum, or Secure Hash Algorithm 1 checksum, is a fixed-size output (160 bits) generated from input data to uniquely identify and verify the integrity of files or documents. In BioCompute Objects (BCOs), it serves to ensure the unchanged state of computational workflows by comparing calculated and original checksums. This allows for accuracy in viewing and downloading BCOs.

4. How do I sign in with an ORCID/What is an ORCID?

ORCID stands for Open Researcher and Contributor ID, and is a free, unique identifier assigned to researchers, providing a standardized way to link researchers to their scholarly activities. To sign in with your ORCID, create an account at: https://orcid.org/. Using the credentials associated with your ORCID account you can log in to view and edit BCOs.

Pipeline Steps

1. Do pipeline steps have to represent sequentially run steps? How can you represent steps also run in parallel?

The standard does not mandate any particular numbering schema, but it’s best practice to pick the most logically intuitive numbering system. For example, a user may run a somatic SNV profiling step at the same time as a structural CNV analysis. So if in the example I mentioned, the alignment is step #2, then you might (arbitrarily) call the SNV profiling step #3, and the CNV analysis step #4. The fact that they pull from the output of the same step (#2) can easily be detected programmatically and represented in whatever way is suitable (e.g. graphically).

Inputs and Outputs

1. What is the relationship and difference between input_list in description_domain and input_subdomain? Does input_subdomain contain all the input files of all the pipeline steps?

Yes. The Input Domain is for global inputs. The input_list/output_list in the pipeline_steps is specific to individual steps and is used to trace data flow if granular detail is needed. If not needed, a user can simply look at the IO domain for the overall view of the pipeline inputs.

2. What is the relationship and difference between output_list in description_domain and output_subdomain? Does output_subdomain contain all the output files of all the pipeline steps?

The Output Domain is for global outputs. The input_list/output_list in the pipeline_steps is specific to individual steps and is used to trace data flow if granular detail is needed. If not needed, a user can simply look at the IO domain for the overall view of the pipeline outputs.

3. There is an access_time property for uri, which is referenced by input_list, output_list, input_subdomain, and output_subdomain. What does access_time mean for output files? Aren’t output files generated by pipeline steps?

Yes they are, the timestamp is used for creation in those cases.

4. Can a script from the execution domain also be considered an input?

This is not usually the case, but it is possible for the script to be assessed as an input if it is used in the workflow to bring about an output.

Extensions

1. What is the role of extension_domain? How does it relate to other domains? Is it required in some pipeline steps? Or does it affect the execution? Or something else?

Extension Domain is never required, it is always optional. It is a user-defined space for capturing anything not already captured in the base BCO. To use it, one generates an extension schema (referenced in the extension_schema), and the associated fields within the BCO. For example, if a user wants to include a specialized ontology with definitions, it can be added here. It’s meant to capture anything idiosyncratic to that workflow not already captured in the standard and is very flexible.

2. How can BCOs be used for knowledgebases?

Using BioCompute’s pre-defined fields and standards, knowledgebases can generate a BioCompute Object (BCO) to document the metadata, quality control, and integration pipelines developed for different workflows. BCOs can be used to document each release. The structured data in a BCO makes it very easy to identify changes between releases (including changes to the curation/data processing pipeline, attribution to curators, or datasets processed), or revert to previous releases.

BCOs can be generated via a user-friendly instance of a BCO editor and can be maintained and shared through versioned, stable IDs stored under a single domain of that knowledgebase. BCOs not only provide complete transparency to their data submitters (authors, curators, other databases, etc.), collaborators, and users but also provide an efficient mechanism to reproduce the complete workflow through the information stored in different domains (such as description, execution, io, error, etc.) in the machine and human-readable formats.

The most common way of adapting BCOs for use in knowledgebases is by leveraging the Extension Domain. In this example, the Extension Domain is used for calling fields based on column headers. Note that the Extension Domain identifies its own schema, which defines the column headers and identifies them as required where appropriate. Because the JSON format of a BCO is human and machine-readable (and can be further adapted for any manner of display or editing through a user interface), BCOs are amendable to either manual or automatic curation processes, such as the curation process that populates those fields in the above example.

Prerequisites

1. What is the difference between software_prerequisites in execution_domain and prerequisites in the description_domain? Is the former global, while the latter only applies to one specific pipeline step?

Correct, Execution Domain is for anything related to the environment in which the pipeline was executed, and the Description Domain is specific to the software in those steps. So if I’ve written a shell script to run the pipeline, and in one step it includes myScript.py to comb through results and pick out elements of interest, myScript.py might be an Execution Domain prerequisite, and any packages or dependencies called from within the script are Description Domain level prerequisites. Alternatively, if I’m using the HIVE platform, any libraries needed to run HIVE are Execution Domain level.

Knowledgebases

1. Can BCOs be used for curating databases?

Yes. BCOs have been used in this capacity, such as in the FDA’s ARGOS database of infectious diseases and the GlyGen database of glycosylation sites. The following recommendations are compiled from these use cases. Although these recommendations are built from practical experience, they may not address the needs of every database. Users are free to make modifications at their own discretion.

Using BioCompute’s pre-defined fields and standards, knowledgebases can generate a BioCompute Object (BCO) to document the metadata, quality control, and integration pipelines developed for different workflows. BCOs can be used to document each release. The structured data in a BCO makes it very easy to identify changes between releases (including changes to the curation/data processing pipeline, attribution to curators, or datasets processed), or revert to previous releases.

BCOs can be generated via a user-friendly instance of a BCO editor and can be maintained and shared through versioned, stable IDs stored under a single domain of that knowledgebase. BCOs not only provide complete transparency to their data submitters (authors, curators, other databases, etc.), collaborators, and users but also provide an efficient mechanism to reproduce the complete workflow through the information stored in different domains (such as description, execution, io, error, etc.) in the machine and human-readable formats.

The most common way of adapting BCOs for use in knowledgebases is by leveraging the Extension Domain. In this example, the Extension Domain is used for calling fields based on column headers. Note that the Extension Domain identifies its own schema, which defines the column headers and identifies them as required where appropriate. Because the JSON format of a BCO is human and machine-readable (and can be further adapted for any manner of display or editing through a user interface), BCOs are amendable to either manual or automatic curation processes, such as the curation process that populates those fields in the above example.

Saving and Publishing a BCO

1. Why is my BCO not saved after clicking SAVE?

The SAVE only saves the entry on the website but it's not saving to the server. For a new draft, after editing, go to Tools, first select a BCODB, then click on GET PREFIXES to choose a prefix, and lastly, click on SAVE PREFIX. For an existing draft, to save properly, click on SAVE first and then under Tools, select UPDATE DRAFT.

BCO Validation and Error messages

The BCO Portal uses a JSON validator to validate the BCOs, and because of the error messages returned may be a little confusing. Below are some common validation results and an explanation of what they mean and how to address them.

1. "[description_domain][pipeline_steps][0][step_number]": "'1' is not of type 'integer'"

The step_number in the BCO JSON needs to be an INTEGER.

This means it can not be in quotes like this:

    "step_number": "1",

Instead, it must be represented like this:

    "step_number": 1,

You may not be able to see this difference in the COLOR-CODED view, and will have to look in the TREE VIEW JSON or RAW JSON VIEW.