FIELD OF THE INVENTION
The invention relates to the structure of an upper ontology and a system and method for utilizing an upper ontology in the creation of one or more multi-relational ontologies.
BACKGROUND OF THE INVENTION
Knowledge within a given domain may be represented in many ways. One form of knowledge representation may comprise a list representing all available values for a given subject. For example, knowledge in the area of “human body tissue types” may be represented by a list including “hepatic tissue,” “muscle tissue,” “epithelial tissue,” and many others. To represent the total knowledge in a given domain, a number of lists may be needed. For instance, one list may be needed for each subject contained in a domain. Lists may be useful for some applications; however, they generally lack the ability to define relationships between the terms comprising the lists. Moreover, the further division and subdivision of subjects in a given domain typically results in the generation of additional lists, which often include repeated terms, and which do not provide comprehensive representation of concepts as a whole.
Some lists, such as structured lists, for example, may enable computer-implemented keyword searching. The shallow information store often contained in list-formatted knowledge, however, may lead to searches that return incomplete representations of a concept in a given domain.
An additional method of representing knowledge is through thesauri. Thesauri are similar to lists, but they further include synonyms provided alongside each list entry. Synonyms may be useful for improving the recall of a search by returning results for related terms not specifically provided in a query. Thesauri still fail, however, to provide information regarding relationships between terms in a given domain.
Taxonomies build on thesauri by adding an additional level of relationships to a collection of terms. For example, taxonomies provide parent-child relationships between terms. “Anorexia is-a eating disorder” is an example of a parent-child relationship via the “is-a” relationship form. Other parent-child relationship forms, such as “is-a-part-of” or “contains,” may be used in a taxonomy. The parent-child relationships of taxonomies may be useful for improving the precision of a search by removing false positive search results. Unfortunately, exploring only hierarchical parent-child relationships may limit the type and depth of information that may be conveyed using a taxonomy. Accordingly, the use of lists, thesauri, and taxonomies presents drawbacks for those attempting to explore and utilize knowledge organized in these traditional formats.
Additional drawbacks may be encountered when searches of electronic data sources are conducted. As an example, searches of electronic data sources typically return a voluminous amount of results, many of which tend to be only marginally relevant to the specific problem or subject being investigated. Researchers or other individuals are then often forced to spend valuable time sorting through a multitude of search results to find the most relevant results. It is estimated, for example, that scientists spend 20% of their time searching for information existing in a particular area. This is time that highly-trained investigative researchers must spend simply uncovering background knowledge. Furthermore, when an electronic search is conducted, data sources containing highly relevant information may not be returned to a researcher because the concept sought by the researcher is identified by a different set of terms in the relevant data source. This may lead to an incomplete representation of the knowledge in a given subject area. These and other drawbacks exist.
SUMMARY OF THE INVENTION
The invention addresses these and other drawbacks. According to one embodiment, the invention relates to the structure of an upper ontology and a system and method for utilizing an upper ontology in the creation of one or more multi-relational ontologies. According to one aspect of the invention, the one or more ontologies may include domain specific ontologies that may be used individually or collectively, in whole or in part, based on user preferences, user access rights, or other criteria.
As used herein, a domain may include a subject matter topic, for example, a disease, an organism, a drug, or other topic. A domain may also include one or more entities such as, for example, a person or group of people, a corporation, a governmental entity, or other entities. A domain involving an organization may focus on the organization's activities. For example, a pharmaceutical company may produce numerous drugs or focus on treating numerous diseases. An ontology built on the domain of that pharmaceutical company may include information on the company's drugs, their target diseases, or both. A domain may also include an entire industry such as, for example, automobile production, pharmaceuticals, legal services, or other industries. Other types of domains may be used.
As detailed herein, the invention may utilize a domain specific upper ontology to provide a framework for ontology creation and use. An upper ontology according to the invention may be created especially for the domain in which it is to be used, and may contain detailed and complex associations representing the body of knowledge in that domain.
As used herein, an ontology may include a collection of assertions. An assertion may include a pair of concepts that have some specified relationship. One aspect of the invention relates to the creation of a multi-relational ontology. A multi-relational ontology is an ontology containing pairs of related concepts. For each pair of related concepts there may be a broad set of descriptive relationships connecting them. As each concept within each pair may also be paired (and thus related by multiple descriptive relationships) with other concepts within the ontology, a complex set of logical connections is formed. These complex connections provide a comprehensive “knowledge network” of what is known directly and indirectly about concepts within a single domain. The knowledge network may also be used to represent knowledge between and among multiple domains. This knowledge network allows discovery of complex relationships between the different concepts or concept types in the ontology. The knowledge network enables, inter alia, queries involving both direct and indirect relationships between multiple concepts such as, for example, “show me all genes expressed-in liver tissue that-are-associated-with diabetes.”
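By way of illustration only, the example query above can be sketched in Python, assuming assertions are held as simple (concept, relationship, concept) tuples. The `Assertion` type, the function names, and the gene names are hypothetical placeholders and are not part of the described system.

```python
from typing import NamedTuple


class Assertion(NamedTuple):
    """One assertion: a pair of concepts joined by a named relationship."""
    subject: str
    relationship: str
    obj: str


# Hypothetical example data; the gene names are placeholders.
ASSERTIONS = [
    Assertion("gene A", "expressed-in", "liver tissue"),
    Assertion("gene A", "is-associated-with", "diabetes"),
    Assertion("gene B", "expressed-in", "liver tissue"),
    Assertion("gene C", "is-associated-with", "diabetes"),
]


def subjects_of(assertions, relationship, obj):
    """Return every concept that has `relationship` to `obj`."""
    return {a.subject for a in assertions
            if a.relationship == relationship and a.obj == obj}


def genes_in_liver_linked_to_diabetes(assertions):
    """Combine two relationships: expressed-in liver AND associated-with diabetes."""
    in_liver = subjects_of(assertions, "expressed-in", "liver tissue")
    with_diabetes = subjects_of(assertions, "is-associated-with", "diabetes")
    return in_liver & with_diabetes
```

In this sketch, only concepts that satisfy both relationships are returned, which is the kind of multi-relationship query that a flat list or taxonomy may not support.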
Another aspect of the invention relates to specifying each concept type and relationship type that may exist in an ontology. These concept types and relationship types may be arranged according to a structured organization. This structured organization may include defining the set of possible relationships that may exist for each pair of concept types (e.g., two concept types that can be related in one or more ways). In one embodiment, this set of possible relationships may be organized as a hierarchy. The hierarchy may include one or more levels of relationships and/or synonyms. In one embodiment, the set of possible concept types and the set of possible relationships that can be used to relate each pair of concept types may be organized as an ontology. As detailed below, these organizational features (as well as other features) enable novel uses of multi-relational ontologies that contain knowledge within a particular domain.
Concept types may themselves be concepts within an ontology (and vice versa). For example, the term “muscle tissue” may exist as a specific concept within an ontology, but may also be considered a concept type within the same ontology, as there may be different kinds of muscle tissue represented within the ontology. As such, a pair of concept types that can be related in one or more ways may be referred to herein as a “concept pair.” Thus, reference herein to “concept pairs” and “concepts” does not preclude these objects from retaining the qualities of both concepts and concept types.
According to one embodiment of the invention, the computer implemented system may include an upper ontology, an extraction module, a rules engine, an editor module, one or more databases and servers, and a user interface module. Additionally, the system may include one or more of a quality assurance module, a publishing module, a path-finding module, an alerts module, and an export manager. Other types of modules may also be used.
According to one embodiment, the upper ontology may store rules regarding the concept types that may exist in an ontology, the relationship types that may exist in an ontology, the specific relationship types that may exist for a given pair of concept types, and the types of properties that those concepts and relationships may have.
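As a non-limiting sketch, the rules described above might be represented as a lookup from a pair of concept types to the relationship types permitted between them. The table contents, the concept names, and the function `is_valid_assertion` are assumptions made for illustration (the relationship “treats,” for instance, is hypothetical).

```python
# Hypothetical upper ontology fragment: for each ordered pair of concept
# types, the set of relationship types permitted between them.
ALLOWED_RELATIONSHIPS = {
    ("gene", "tissue"): {"expressed-in", "is-up-regulated-in"},
    ("compound", "disease"): {"causes", "treats"},
}

# Hypothetical assignment of individual concepts to concept types.
CONCEPT_TYPES = {
    "gene x": "gene",
    "tissue y": "tissue",
    "compound X": "compound",
    "disease Y": "disease",
}


def is_valid_assertion(subject, relationship, obj):
    """Check an assertion against the concept-type and relationship-type rules."""
    try:
        pair = (CONCEPT_TYPES[subject], CONCEPT_TYPES[obj])
    except KeyError:
        return False  # unknown concept: no concept type on record
    return relationship in ALLOWED_RELATIONSHIPS.get(pair, set())
```

Under these assumptions, `is_valid_assertion("gene x", "expressed-in", "tissue y")` is accepted, while an assertion relating a gene to a tissue by “causes” is rejected because that relationship type is not defined for that pair of concept types.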
Separate upper ontologies may be used for specific domains. For example, an upper ontology may include a domain-specific set of possible concept types and relationship types as well as a definition of which relationship types may be associated with a given concept type.
The upper ontology may also store data source information. For example, the data source information may include information regarding which data source(s) evidence one or more assertions. The information may include one or more of the name of the data source, the data source version, and one or more characteristics of the data source (e.g., is it structured, unstructured, or semi-structured; is it public or private; and other characteristics). The data source information may also include content information that indicates what content is contained in the data source and what can be pulled from the data source. Data source information may also include data regarding licenses (term, renewal dates, or other information) for access to a data source. Other data source information may also be used.
The system may have access to various data sources. These data sources may be structured, semi-structured, or unstructured data sources. The data sources may include public or private databases; books, journals, or other textual materials in print or electronic format; websites; or other data sources. In one embodiment, data sources may also include one or more searches of locally or remotely available information stores, including, for example, hard drives, email repositories, shared file systems, or other information stores. These information stores may be useful when utilizing an organization's internal information to provide ontology services to the organization. From this plurality of data sources, a “corpus” of documents may be selected. A corpus may include a body of documents within the specific domain from which one or more ontologies are to be constructed. As used herein, the term “document” is used broadly and is not limited to text-based documents. For example, it may include database records, web pages, and other items.
A variety of techniques may be used to select the corpus from the plurality of data sources. For example, the techniques may include one or more of manual selection, a search of metadata associated with documents (metasearch), an automated module for scanning document content (e.g., spider), or other techniques. A corpus may be specified for any one or more ontologies, out of the data sources available, through any variety of techniques. For example, in one embodiment, a corpus may be selected using knowledge regarding valid contexts and relationships in which the concepts within the documents can exist. This knowledge may be iteratively supplied by an existing ontology.
The upper ontology may also include curator information. As detailed below, one or more curators may interact with the system. The upper ontology may store information about the curator and curator activity.
In one embodiment of the invention, a data extraction module may be used to extract data, including assertions, from one or more specified data sources. For different ontologies, different data sources may be specified. The rules engine, and rules included therein, may be used by the data extraction module for this extraction. According to one embodiment, the data extraction module may perform a series of steps to extract “rules-based assertions” from one or more data sources. These rules-based assertions may be based on concept types and relationship types specified in the upper ontology, rules in the rules engine, or other rules.
Some rules-based assertions may be “virtual assertions.” Virtual assertions may be created when data is extracted from certain data sources (usually structured data sources). In one embodiment, one or more structured data sources may be mapped to discern their structure. The resultant “mappings” may be considered rules that may be created using, and/or utilized by, the rules engine. Mappings may include rules that bind two or more data fields from one or more data sources (usually structured data sources). The specific assertions created by mappings may not physically exist in the data sources in explicit linguistic form (hence, the term “virtual assertion”), but may instead be created by applying a mapping to the structured data sources.
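For illustration, a mapping that binds two data fields may be sketched as follows, assuming each structured record is available as a Python dictionary. The `Mapping` class, the field names, and the sample records are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """A rule binding two fields of a structured record with a relationship."""
    subject_field: str
    relationship: str
    object_field: str


def virtual_assertions(records, mapping):
    """Yield (subject, relationship, object) triples implied by the mapping."""
    for record in records:
        subject = record.get(mapping.subject_field)
        obj = record.get(mapping.object_field)
        if subject and obj:  # skip rows missing either bound field
            yield (subject, mapping.relationship, obj)


# Hypothetical rows from a structured data source.
RECORDS = [
    {"gene": "gene x", "tissue": "tissue y"},
    {"gene": "gene z", "tissue": None},
]
MAPPING = Mapping("gene", "expressed-in", "tissue")
```

Here the triple ("gene x", "expressed-in", "tissue y") appears nowhere in the records as a sentence; it arises only when the mapping is applied, which is the sense in which such an assertion is “virtual.”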
Virtual assertions and other rules-based assertions extracted by the extraction module may be stored in one or more databases. For convenience, this may be referred to as a “rules-based assertion store.” According to another aspect of the invention, various types of information related to an assertion may be extracted by the extraction module and stored with the virtual assertions or other assertions within the rules-based assertion store.
In one embodiment, properties may be extracted from the corpus and stored with concept, relationship and assertion data. Properties may include one or more of the data source from which a concept was extracted, the type of data source from which it was extracted, the mechanism by which it was extracted, when it was extracted, the evidence underlying concepts and assertions, confidence weights associated with concepts and assertions, and/or other information. In addition, each concept within an ontology may be associated with a label, at least one relationship, at least one concept type, and/or any number of other properties. In some embodiments, properties may indicate specific units of measurement.
In one embodiment, certain concepts may also be associated with quantitative values. For example, a concept that happens to be a chemical compound may be associated with its particular molecular weight. In some embodiments, quantitative values may also be associated with whole assertions (rather than individual concepts). For example, a statement “gene x is up-regulated in tissue y, by five times” may lead to the assertion “gene x is-up-regulated-in tissue y,” which is itself associated with the quantitative value “5X.”
In some embodiments, concepts in a multi-relational ontology may include documents themselves or parts thereof. Concepts may also include a person. Concepts that are persons may be associated with various characteristics of that person such as, for example, the person's name, telephone number, business address, education history, employment history, or other characteristics.
Depending on the type of data source, different steps or combinations of steps may be performed to extract assertions (and related information) from the data sources. For example, for documents originating from structured data sources, the data extraction module may discern (or rules may be stored to map) the structure of a particular structured data source, parse the structured data source, apply mappings, and extract concepts, relationships, assertions, and other information therefrom.
For documents originating from unstructured data and/or semi-structured data sources, a more complex procedure may be necessary or desired. This may include various automated text mining techniques. As one example, it may be particularly advantageous to use ontology seeded natural language processing. Other steps may be performed. For example, if the document is in paper form or hard copy, optical character recognition (OCR) may be performed on the document to produce electronic text. Once the document is formatted as electronic text, linguistic analysis may be performed. Linguistic analysis may include natural language processing (NLP) or other text-mining techniques. Linguistic analysis may identify potentially relevant concepts, relationships, or assertions by tagging parts of speech within the document such as, for example, subjects, verbs, objects, adjectives, pronouns, or other parts of speech.
In one embodiment, rules may be applied to the documents to generate rules-based assertions from the tagged and/or parsed concept, relationship, assertion, or other information within the corpus. The upper ontology of concept and relationship types may be used by the rules to guide the generation of these rules-based assertions.
As mentioned above, the application of rules may be directed by the upper ontology. In defining relationship types that can exist in one or more domain specific ontologies and the rules that can be used for extraction and creation of rule-based assertions, the upper ontology may factor in semantic variations of relationships. Semantic variations may dictate that different words may be used to describe the same relationship. The upper ontology may take this variation into account. Additionally, the upper ontology may take into account the inverse of each relationship type used. As a result, the vocabulary for assertions being entered into the system is accurately controlled. By enabling this rich set of relationships for a given concept, the system of the invention may connect concepts within and across domains, and may provide a comprehensive knowledge network of what is known directly and indirectly about each particular concept.
The upper ontology may also enable flags that factor negation and inevitability of relationships into specific instances of assertions. In some embodiments, certain flags (e.g., negation, uncertainty, or others) may be used with a single form of a relationship to alter the meaning of the relationship. For example, instead of storing all the variations of the relationship “causes” (e.g., does-not-cause, may-cause) the upper ontology may simply add one or more flags to the root form “causes” when specific assertions require one of the variations. For example, a statement from a document such as “compound X does not cause disease Y” may be initially generated as the assertion “compound X causes disease Y.” The assertion may be tagged with a negation flag to indicate that the intended sense is “compound X does-not-cause disease Y.” Similarly, an inevitability flag may be used to indicate that there is a degree of uncertainty or lack of complete applicability about an original statement, e.g., “compound X may-cause disease Y.” These flags can be used together to indicate that “compound X may-not-cause disease Y.” Inverse relationship flags may also be utilized for assertions representing inverse relationships. For example, applying an inverse relationship flag to the relationship “causes” may produce the relationship “is-caused-by.” Other flags may be used alone or in combination with one another.
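The flag scheme described above may be sketched, by way of example only, with Python's standard `enum.Flag`. The names `RelFlag`, `BASE_FORM`, `INVERSE_FORM`, and `render` are assumptions for illustration, and the uncertainty flag below corresponds to the flag that yields “may-cause” in the example above.

```python
from enum import Flag, auto


class RelFlag(Flag):
    """Flags that modify a single stored root form of a relationship."""
    NONE = 0
    NEGATION = auto()
    UNCERTAINTY = auto()


# Hypothetical lookup tables: only the root form "causes" is stored, along
# with the word forms needed to render its variations.
BASE_FORM = {"causes": "cause"}
INVERSE_FORM = {"causes": "is-caused-by"}


def render(root, flags=RelFlag.NONE):
    """Render the intended sense of a root relationship under the given flags."""
    base = BASE_FORM[root]
    negated = RelFlag.NEGATION in flags
    uncertain = RelFlag.UNCERTAINTY in flags
    if negated and uncertain:
        return f"may-not-{base}"
    if uncertain:
        return f"may-{base}"
    if negated:
        return f"does-not-{base}"
    return root
```

In this sketch, `render("causes", RelFlag.NEGATION | RelFlag.UNCERTAINTY)` yields "may-not-cause", so the three variations need not be stored as separate relationship types. The inverse form is kept in a separate table because the text above does not specify how an inverse flag combines with the other flags.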
In one embodiment, the system and/or a curator may curate assertions by undertaking one or more actions regarding assertions within the rules-based assertion store. Examples of actions/processes of curation may include, for example, reifying/validating rules-based assertions (which entails accepting individual, many, or all assertions created by a rule or mapping), identifying new assertions (including those created by inferencing methods), editing assertions, or other actions.
In some embodiments, the actions undertaken in curation may be automated, manual, or a combination of both. For example, manual curation processes may be used when a curator has identified a novel association between two concepts in an ontology that has not previously been present at any level. The curator may directly enter these novel assertions into an ontology in a manual fashion. Manually created assertions are considered automatically validated because they are the product of human thought. However, they may still be subject to the same or similar semantic normalization and quality assurance processes as rules-based assertions. Automated curation processes may be conducted by rules stored by the rules engine. Automated curation may also result from the application of other rules, such as extraction rules. For example, one or more rules may be run against a corpus of documents to identify and extract rules-based assertions. If a rule has been identified as sufficiently accurate (e.g., >98% accurate as determined by application against a test-corpus), the rules-based assertions that it extracts/generates may be automatically considered curated without further validation. If a rule falls below this (or other) accuracy threshold, the assertions it extracts/generates may be identified as requiring further attention. A curator may choose to perform further validation by applying a curation rule or by validating the assertions manually. Automated curation of virtual assertions may be accomplished in a similar fashion. If a mapping (rule) is identified as performing above a certain threshold, a curator may decide to reify or validate all of the virtual assertions in one step. A curator may also decide to reify them individually or in groups.
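The accuracy threshold described above may be sketched as follows. This is a simplified illustration that assumes a rule's accuracy is measured as the fraction of its extracted assertions found in a hand-validated set for the test corpus; the function names and the default threshold parameter are hypothetical.

```python
def rule_accuracy(extracted, validated):
    """Fraction of a rule's extracted assertions present in the validated set."""
    extracted = set(extracted)
    if not extracted:
        return 0.0  # a rule that extracts nothing earns no credit
    return len(extracted & set(validated)) / len(extracted)


def triage(extracted, validated, threshold=0.98):
    """Label a rule's output by its accuracy against a test corpus.

    Output is 'curated' only when accuracy is strictly greater than the
    threshold, mirroring the ">98% accurate" example; otherwise it is
    marked 'needs-review' for further attention by a curator.
    """
    if rule_accuracy(extracted, validated) > threshold:
        return "curated"
    return "needs-review"
```

For example, a rule that extracts 100 assertions of which 99 are validated would be labeled "curated," while a rule with exactly 98 validated assertions would be labeled "needs-review" under the strict comparison assumed here.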
In some embodiments, curators may also work with and further annotate reified assertions in the same way as rule-based assertions.
Throughout the invention, it may be desirable to document, through evidence and properties, the mechanisms by which assertions were created and curated. As such, curator information (e.g., who curated and what they did) may be associated with assertions. Accordingly, curators or other persons may filter out some or all assertions based on curator information, confidence scores, inference types, rules, mechanisms, and/or other properties.
In one embodiment, curation may also include identification of new relationship types, identification of new concept types, and identification of new descendants (instances or parts) of concept types. Assuming a curator or administrative curator is authorized, the curator or administrative curator may edit the upper ontology according to the above identifications using the editor module described below. Editing of the upper ontology may take place during curation of one or more assertions, or at another time.
In one embodiment, curation processes may utilize an editor module. The editor module may include an interface through which a curator interacts with various parts of the system and the data contained therein. The editor module may be used to facilitate various functions. For example, the editor module may enable a curator or suitably authorized individual to engage in various curation processes. Through these curation processes, one or more curators may interact with rules-based assertions and/or create new assertions. Interacting with rules-based assertions may include one or more of viewing rules-based assertions and related information (e.g., evidence sets), reifying rules-based assertions, editing assertions, rejecting the validity of assertions, or performing other tasks. In one embodiment, assertions whose validity has been rejected may be retained in the system alongside other “dark nodes” (assertions considered to be untrue), which are described in greater detail below. The curator may also use the editor module to create new assertions. In some embodiments, the editor module may be used to define and coordinate some or all automated elements of data (e.g., concept, relationship, assertion) extraction.
In one embodiment, the editor module may also enable an authorized individual (e.g., an administrative curator) to create, edit, and/or maintain a domain-specific upper ontology. For example, an administrative curator may specify the set of concept and relationship types and the rules that govern valid relationships for a given concept type. The administrative curator may add or delete concept or relationship types, as well as the set of possible associations between them. The editor module may also enable the management of the propagation of effects from these changes.
In one embodiment, the editor module may also enable an authorized individual, such as an administrative curator, to create, edit, or remove any of the rules associated with the system such as, for example, rules associated with identifying, extracting, curating, or inferring assertions, or other rules. The editor module may also enable an authorized individual to manage the underlying data sources or curator information associated with the system. Managing the underlying data sources may include managing what type of data sources can be used for ontology creation, what specific data sources can be used for specific ontology creation, or other data source management. Managing curator information may include specifying the access rights of curators, specifying what curators are to operate on what data, or other curator specific management.
In one embodiment, the editor module may have a multi-curator mode that enables more than one curator to operate on a particular data set. As with any curation process (single or multiple curator, automated or manual), tags may be placed on the data (e.g., as properties of concepts) regarding who worked on the data, what was done to the data, or other information. This tagging process may enable selective use and review of data based on curator information.
Curation processes may produce a plurality of reified assertions. Reified assertions may be stored in one or more databases. For convenience, this may be referred to as the reified assertion store. The reified assertion store may also include assertions resulting from manual creation/editing, and other non-rule based assertions. The rules-based assertion store and the reified assertion store may exist in the same database or may exist in separate databases. Both the rules-based assertion store and the reified assertion store may be queried by SQL or other procedures. Additionally, both the rules-based and reified assertion stores may contain version information. Version information may include information regarding the contents of the rules-based and/or reified assertion stores at particular points in time.
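As one hedged illustration of querying the stores by SQL, the sketch below keeps both stores in a single in-memory SQLite table distinguished by a `store` column. The table layout, column names, and sample rows are assumptions and do not reflect any particular schema of the system.

```python
import sqlite3


def make_store():
    """Create a hypothetical single-table layout holding both assertion stores."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE assertions ("
        " subject TEXT, relationship TEXT, object TEXT,"
        " store TEXT CHECK (store IN ('rules-based', 'reified')))"
    )
    conn.executemany(
        "INSERT INTO assertions VALUES (?, ?, ?, ?)",
        [
            ("gene x", "is-up-regulated-in", "tissue y", "reified"),
            ("compound X", "causes", "disease Y", "rules-based"),
        ],
    )
    return conn


def assertions_in(conn, store):
    """Return all (subject, relationship, object) rows held in the named store."""
    rows = conn.execute(
        "SELECT subject, relationship, object FROM assertions"
        " WHERE store = ? ORDER BY subject",
        (store,),
    )
    return rows.fetchall()
```

Keeping both stores in one table is only one of the arrangements contemplated above; separate databases could be queried in the same manner.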
In one embodiment, a quality assurance module may perform various quality assurance operations on the reified assertion store. The quality assurance module may include a series of rules, which may be utilized by the rules engine to test the internal and external consistency of the assertions that comprise an ontology. The tests performed by these rules may include certain “mundane” tests such as, for example, tests for proper capitalization or connectedness of individual concepts (in some embodiments, concepts may be required to be connected to at least one other concept). Other tests may exist such as, for example, tests to ensure that concept typing is consistent with the relationships for individual concepts (upstream processes/elements such as, for example, various rules and/or the upper ontology generally ensure that these will already be correct, but they still may be checked). More complex tests may include those that ensure semantic consistency. For example, if an individual concept shares 75% of its synonyms with another individual concept, they may be candidates for semantic normalization, and therefore may be flagged for manual curation.
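The synonym-overlap test described above may be sketched as follows, assuming each concept's synonyms are available as a set of strings and that the 75% figure is treated as an inclusive threshold. The function names and sample data are hypothetical.

```python
def shared_synonym_fraction(synonyms_a, synonyms_b):
    """Fraction of concept A's synonyms that concept B also lists."""
    a, b = set(synonyms_a), set(synonyms_b)
    return len(a & b) / len(a) if a else 0.0


def normalization_candidates(synonyms_by_concept, threshold=0.75):
    """Return sorted (concept, other) pairs to flag for manual curation.

    A pair is flagged when `concept` shares at least `threshold` of its
    own synonyms with `other`.
    """
    flagged = []
    for concept, synonyms in synonyms_by_concept.items():
        for other, other_synonyms in synonyms_by_concept.items():
            if concept == other:
                continue
            if shared_synonym_fraction(synonyms, other_synonyms) >= threshold:
                flagged.append((concept, other))
    return sorted(flagged)


# Hypothetical synonym sets; the labels are placeholders.
SYNONYMS = {
    "concept A": {"s1", "s2", "s3", "s4"},
    "concept B": {"s1", "s2", "s3", "s9", "s10", "s11"},
    "concept C": {"s7"},
}
```

Note that the measure is asymmetric in this sketch: "concept A" shares three of its four synonyms with "concept B" and is flagged, while "concept B" shares only half of its synonyms with "concept A" and is not.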
A publishing module may then publish reified assertions as a functional ontology. In connection with publication of reified assertions, the reified assertion store may be converted from a node-centered edit schema to a graph-centered browse schema. In some embodiments, virtual assertions derived from structured data sources may not be considered “reified.” However, if these virtual assertions are the product of high percentage rules/mappings, they may not require substantive reification during curation and may achieve a nominal “reified” status upon preparation for publication. As such, the conversion from edit schema to browse schema may also serve to reify any of the remaining un-reified virtual assertions in the system (at least those included in publication).
Publication and/or conversion (from edit to browse schema) may occur whenever it is desired to “freeze” a version of an ontology as it exists with the information accumulated at that time and use the accumulated information according to the systems and methods described herein (or with other systems or methods). In some embodiments, the publishing module may enable an administrative curator or other person with appropriate access rights to indicate that the information as it exists is to be published and/or converted (from edit to browse schema). The publishing module may then perform the conversion (from edit to browse schema) and may load a new set of tables (according to the browse schema) in a database. In some embodiments, data stored in the browse schema may be stored in a separate database from the data stored in an edit schema. In other embodiments, it may be stored in the same database.
During extraction and curation, assertions may be stored in an edit schema using a node-centered approach. Node-centered data focuses on the structural and conceptual framework of the defined logical connection between concepts and relationships. In connection with publication, however, assertions may be stored in a browse schema using a graph-centered approach.
Graph-centered views of ontology data may include the representation of assertions as concept-relationship-concept (CRC) “triplets.” In these triplets, two nodes are connected by an edge, wherein the nodes correspond to concepts and the edge corresponds to a relationship.
In one embodiment, CRC triplets may be used to produce a directed graph representing the knowledge network contained in one or more ontologies. A directed graph may include two or more interconnected CRC triplets that potentially form cyclic paths of direct and indirect relationships between concepts in an ontology or part thereof.
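A minimal sketch of a directed graph built from CRC triplets, together with a simple breadth-first search over it, is given below. The adjacency-list layout, the function names, and the sample triplets are assumptions for illustration; breadth-first search is used here only as one familiar way to navigate such paths.

```python
from collections import defaultdict, deque


def build_graph(triplets):
    """Index CRC triplets as adjacency lists: concept -> [(relationship, concept)]."""
    graph = defaultdict(list)
    for subject, relationship, obj in triplets:
        graph[subject].append((relationship, obj))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search; returns a shortest chain of triplets, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relationship, neighbor in graph.get(node, []):
            if neighbor not in visited:  # guards against cyclic paths
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relationship, neighbor)]))
    return None


# Hypothetical triplets forming a cycle of direct and indirect relationships.
TRIPLETS = [
    ("compound X", "causes", "disease Y"),
    ("disease Y", "is-associated-with", "gene x"),
    ("gene x", "expressed-in", "tissue y"),
    ("tissue y", "contains", "compound X"),
]
```

In this sketch, a search from "compound X" to "tissue y" returns the chain of three triplets linking them, which represents an indirect relationship between two concepts that share no direct edge.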
The elements and processes described above may be utilized in whole or in part to generate and publish one or more multi-relational, domain-specific ontologies. In some embodiments, not all elements or processes may be necessary. The one or more ontologies may be then used, collectively or individually, in whole or in part, as described below.
Once one or more ontologies are published, they can be used in a variety of ways. For example, one or more users may view one or more ontologies and perform other knowledge discovery processes via a graphical user interface (GUI) as enabled by a user interface module. A path-finding module may enable the paths of assertions existing between concepts of an ontology to be selectively navigated. A chemical support module may enable the storage, manipulation, and use of chemical structure information within an ontology. Also, the system may enable a service provider to provide various ontology services to one or more entities, including exportation of one or more ontologies (or portions thereof), the creation of custom ontologies, knowledge capture services, ontology alert services, merging of independent taxonomies or existing ontologies, optimization of queries, integration of data, and/or other services.
These and other objects, features, and advantages of the invention will be apparent through the detailed description of the preferred embodiments and the drawings attached hereto. It is also to be understood that both the foregoing general description and the following detailed description are exemplary and not restrictive of the scope of the invention.