What is Datacentric?

Eight foundational points from the Estes Park Group

  1. Importance. Data is a key asset of any organization.
  2. Permanence and preserved full fidelity. Data migrations are costly, both in process and in information loss. Information loss can mean opportunity loss. It should be possible to store data for an unlimited amount of time.
  3. Contextually derived information. Information is data with context.
  4. Logical, temporal and spatial semantics. Context is the combination of ontological, temporal and spatial semantics about when, where and how the data was collected.
  5. Semantically-derived knowledge. Knowledge is derived from information managed over time.
  6. Model flexibility and openness. An information provider today cannot know the use cases of information consumers of tomorrow. Therefore, creating models with complete context that will fit all the use cases forever is impossible.
  7. Managed for sharability and auditability. Data models and information instances must be computable, sharable, immutable, traceable and uniquely identifiable.
  8. Preserved and immutable. Proper information modeling must be future proof; no data is ever left behind.
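
As a rough illustration of point 7, the Python sketch below shows one way an information instance could be made immutable, uniquely identifiable and traceable. It is not taken from S3Model or from the Estes Park Group. The class name `InformationInstance`, its fields, the random UUID and the SHA-256 fingerprint are all assumptions made for this example.

```python
import hashlib
import json
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class InformationInstance:
    """Hypothetical record: a data value plus the context that makes it information."""

    model_id: str      # identifier of the model this value conforms to (illustrative)
    value: float
    collected_at: str  # temporal context, ISO 8601 timestamp
    location: str      # spatial context
    # Unique identifier; a random UUID is an assumption, not a prescribed scheme.
    instance_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def fingerprint(self) -> str:
        """Return a SHA-256 digest of the content, usable in an audit trail."""
        payload = json.dumps(
            {
                "model_id": self.model_id,
                "value": self.value,
                "collected_at": self.collected_at,
                "location": self.location,
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


reading = InformationInstance(
    model_id="example-temperature-model",
    value=21.5,
    collected_at="2020-01-01T00:00:00+00:00",
    location="Estes Park, CO",
)
```

Assigning to `reading.value` raises an error because the dataclass is frozen. A corrected reading would be stored as a new instance with its own identifier, leaving the original in place.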

Summary

In effect, the data precedes the code, and in the model-backed, datacentric world the model precedes the data.

This is no different from how we approach information system design today. Today:

  • we think about the data,
  • we document the data, and then
  • we build the application to produce the data.

This is the application-centric world of now.

The difference is seen when we go to implementation. In the model-backed, datacentric world:

  • we think about the data,
  • we document the data,
  • we build an executable, sharable, structured, semantic model that fully describes the information about a dataset, and then
  • we build applications that produce and process data that meets the requirements of the model.

Now we can build as many varied, purpose-specific applications as needed, knowing that the full meaning of the data is available to each one and not tied up in the source code and database structures of one application.

What does this mean?

  • Sharable - the models must be sharable across applications, whether those applications are internal to your enterprise or publicly available via the Internet.
  • Structured - fine-grained structural context tells us a lot about the context of data. What were the options to choose from? Were there minimum or maximum values allowed? What language is used? These can give us a closed-world view of the constraints.
  • Semantic - what are the semantics of the data? These are the temporal, ontological and spatial contexts, as well as definitions and open-world constraints, expressed via Semantic Web technologies.
  • Executable - the model must be machine processable using standard, openly available technology.
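
A minimal Python sketch of these four properties, under stated assumptions, might look like the following. It is not the S3Model implementation. The names `QuantityModel`, `validate` and `BODY_TEMP`, the example.org URI and the numeric limits are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class QuantityModel:
    """Hypothetical model describing a single numeric data element."""

    label: str
    language: str                 # structured: language of the label
    definition_uri: str           # semantic: link to an ontology term (placeholder URI)
    units: str
    min_value: Optional[float]    # structured: closed-world lower bound
    max_value: Optional[float]    # structured: closed-world upper bound
    sites: Tuple[str, ...]        # structured: the options to choose from


def validate(model: QuantityModel, value: float, site: str) -> List[str]:
    """Return a list of constraint violations; an empty list means the data conforms."""
    errors = []
    if site not in model.sites:
        errors.append(f"site {site!r} is not one of {model.sites}")
    if model.min_value is not None and value < model.min_value:
        errors.append(f"value {value} is below minimum {model.min_value}")
    if model.max_value is not None and value > model.max_value:
        errors.append(f"value {value} is above maximum {model.max_value}")
    return errors


# Illustrative model instance; every value here is an assumption.
BODY_TEMP = QuantityModel(
    label="Body temperature",
    language="en-US",
    definition_uri="http://example.org/ontology/body-temperature",
    units="Cel",
    min_value=25.0,
    max_value=45.0,
    sites=("oral", "axillary", "tympanic"),
)

print(validate(BODY_TEMP, 37.0, "oral"))
print(validate(BODY_TEMP, 60.0, "forehead"))
```

Because the constraints live in the model object and not in any one application's source code, any number of producing or consuming applications could import the same model and apply the same checks.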

For more background on datacentricity, see the Footnotes [1].

See the S3Model documentation for details about the underlying technology.

The Future

Now that we have the ability to enrich data in a way that is fully sharable and machine processable, data scientists will be motivated to continue migrating and improving current algorithms to use Semantic Web and linked data technologies. This will also lead to the development of new algorithms that can only be imagined once rich information is available.

This approach solves the data quality issues that hamper the growth of Generalized Artificial Intelligence.

Footnotes

[1]