What is Datacentric?

Eight foundational points from the Estes Park Group

  1. Importance. Data is a vital asset of any organization.
  2. Permanence and preserved full fidelity. Data migrations are costly, both in the effort they require and in the information they lose. Information loss can mean opportunity loss. Storing data for an unlimited amount of time should be possible.
  3. Contextually derived information. Information is data with context.
  4. Logical, temporal and spatial semantics. Context is the combination of ontological, temporal and spatial meaning about when, where and how the data was collected.
  5. Semantically-derived knowledge. Knowledge is derived from information managed over time.
  6. Model flexibility and openness. An information provider today cannot know the use cases of information consumers of tomorrow. Therefore, creating models with the complete context that fits all the use cases forever is impossible.
  7. Managed for sharability and traceability. Data models and information instances must be computable, sharable, immutable, traceable and uniquely identifiable (a minimal sketch follows this list).
  8. Preserved and immutable. Proper information modeling must be future-proof; no data migrations are required.
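
As an illustration of points 7 and 8, the sketch below shows one way to make a data instance uniquely identifiable and tamper-evident by deriving its identifier from its content. This is only an illustrative technique, not S3Model's own identifier scheme; the field names and model URL are hypothetical.

   import hashlib
   import json

   def identify(instance: dict) -> str:
       """Derive a content-based identifier: the same content always yields
       the same id, and any mutation produces a different one."""
       canonical = json.dumps(instance, sort_keys=True, separators=(",", ":"))
       return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

   # A hypothetical, immutable data instance with its model reference and
   # temporal context attached.
   reading = {
       "model": "https://example.org/models/blood-pressure/1.0.0",
       "systolic_mmHg": 120,
       "diastolic_mmHg": 80,
       "recorded_at": "2024-05-01T09:30:00+00:00",
   }
   print(identify(reading))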

Summary

The data precedes the code, and in the model-backed, data-centric world, the model precedes the data.

This process is no different from how we currently approach information system design.

Today:

  • we think about the data
  • we document the data, and then
  • we build the application to produce the data.

This approach defines the application-centric world.

The difference between the approaches is evident when we go to implementation. In the model-backed, datacentric world:

  • we think about the data
  • we document the data, and then
  • we build an executable, sharable, structured, semantic model that thoroughly describes the information about a dataset (a minimal sketch follows this list), and then
  • we build applications that produce and process data that meets the requirements of the model.
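
A minimal sketch of such a model, built with the rdflib library (assuming rdflib 6+). The namespace, property names and constraint vocabulary here are hypothetical placeholders, not the representation S3Model itself uses; serializing to Turtle is what makes the model sharable across applications.

   from rdflib import Graph, Literal, Namespace
   from rdflib.namespace import RDF, RDFS, XSD

   EX = Namespace("https://example.org/model/")  # hypothetical namespace

   g = Graph()
   g.bind("ex", EX)

   # Describe one field of the dataset: its meaning, datatype, units and range.
   field = EX["systolic_pressure"]
   g.add((field, RDF.type, EX.DataField))
   g.add((field, RDFS.label, Literal("Systolic blood pressure", lang="en")))
   g.add((field, EX.datatype, XSD.integer))
   g.add((field, EX.units, Literal("mmHg")))
   g.add((field, EX.minInclusive, Literal(0)))
   g.add((field, EX.maxInclusive, Literal(300)))

   # Serializing to Turtle produces a standard, sharable artifact.
   print(g.serialize(format="turtle"))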

Now we can build as many varied, purpose-specific applications as needed, knowing that the full meaning of the data is available to each one rather than tied up in the source code and database structures of a single software application.

What does this mean?

  • Shareable - the models must be shareable across applications. Whether those applications are internal to your enterprise or publicly available via the Internet, interoperability is paramount.
  • Structured - fine-grained structural context tells us a lot about the meaning of data. What were the options from which to choose? Was there a minimum or maximum value allowed? What language is used to record the information? These can give us a closed-world view of the constraints.
  • Semantic - what are the semantics of the data? The temporal, ontological and spatial contexts, as well as definitions and open-world constraints, expressed via Semantic Web technologies.
  • Executable - the model must be machine processable using standard, openly available technology (illustrated in the sketch below).
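
A minimal sketch of what "executable" means in practice: the same closed-world constraints that document the data (allowed options, minimum, maximum) can be evaluated by a machine. The constraint names and example values are hypothetical.

   from typing import Any, Optional

   def check(value: Any, *, options: Optional[list] = None,
             minimum: Optional[float] = None,
             maximum: Optional[float] = None) -> list[str]:
       """Return a list of closed-world constraint violations (empty if valid)."""
       errors = []
       if options is not None and value not in options:
           errors.append(f"{value!r} is not one of {options!r}")
       if minimum is not None and value < minimum:
           errors.append(f"{value!r} is below the minimum {minimum!r}")
       if maximum is not None and value > maximum:
           errors.append(f"{value!r} is above the maximum {maximum!r}")
       return errors

   print(check(310, minimum=0, maximum=300))       # one violation reported
   print(check("es", options=["en", "pt", "es"]))  # [] - value is allowed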

For more background on datacentricity, see the Footnotes [1].

See the S3Model documentation for details about the underlying technology.

The Future

Now that we can enrich data in a way that is wholly shareable and machine processable, data scientists are motivated to continue migrating and improving current algorithms to use Semantic Web/linked data technologies. The availability of this more comprehensive information will also lead to the development of new algorithms.
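
A minimal sketch of what this looks like with today's linked data tooling: a handful of semantically typed observations queried with SPARQL through rdflib. The vocabulary and values are hypothetical; any SPARQL-aware tool could run the same query, because the meaning travels with the data.

   from rdflib import Graph, Literal, Namespace
   from rdflib.namespace import RDF

   EX = Namespace("https://example.org/data/")  # hypothetical vocabulary

   g = Graph()
   for i, value in enumerate([118, 126, 134]):
       obs = EX[f"obs-{i}"]
       g.add((obs, RDF.type, EX.BloodPressureObservation))
       g.add((obs, EX.systolic_mmHg, Literal(value)))

   # A standard SPARQL aggregate query over the semantically typed data.
   query = """
   PREFIX ex: <https://example.org/data/>
   SELECT (AVG(?v) AS ?mean) WHERE {
       ?obs a ex:BloodPressureObservation ;
            ex:systolic_mmHg ?v .
   }
   """
   for row in g.query(query):
       print(f"mean systolic: {row.mean}")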

This approach solves the data quality issues that hamper the growth of Generalized Artificial Intelligence.

Footnotes

[1]