Advanced Tutorials

Below are additional tutorials to perform after the Getting Started Tutorial/Demo.

It is suggested that you install one or more of the following databases for these tutorials.

  • AllegroGraph database for creating your semantic graph.
  • BaseX database for storing your XML documents.
  • MarkLogic 9 database, which can store all three types of content (XML, RDF, and JSON) and has many additional features.

If you are not familiar with semantic graph databases, please see the following links:

Prerequisites

MarkLogic 9

See the Installation instructions for details.

BaseX

BaseX requires Java 8 for your platform.

Please download the ZIP file and extract it into your home directory.

Start the server using the Client/Server instructions. You will use the client in later parts of the tutorial.
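For reference, on Linux/MacOSX the server is typically started from the extracted directory as shown below (on Windows, use bin\basexserver.bat); the directory name may vary by BaseX version:

cd ~/basex
bin/basexserver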

AllegroGraph

Download and install the server for your platform based on these instructions.

When asked for the superuser username and password use these:

user: admin
password: admin

If you use another username or password, you must edit the entries in kunteksto.conf using a text editor. See below for editing kunteksto.conf.

When the server is installed and running, install the Gruff GUI client for AllegroGraph. You will use this later in the tutorials.

Caution

Only edit the configuration file with a text editor. Do not use a word processing application such as MS Word or LibreOffice. There are many great text editors from which to choose. Some favorites, in no particular order, are:

Configuration

Using a text editor, edit the status entries in kunteksto.conf for [BASEX] and [ALLEGROGRAPH]. Change them from INACTIVE to ACTIVE. When completed they should look like this:

For BaseX:

[BASEX]
status: ACTIVE
host: localhost
port: 1984
dbname: Kunteksto
user: admin
password: admin

For AllegroGraph:

[ALLEGROGRAPH]
status: ACTIVE
host: localhost
port: 10035
repo: Kunteksto
user: admin
password: admin

For MarkLogic:

[MARKLOGIC]
status: ACTIVE
loadxml: True
loadrdf: True
loadjson: True
hostip: 192.168.25.120
hostname: localhost.localdomain
port: 8020
dbname: Kunteksto
forests: 2
user: admin
password: admin

Most users will run MarkLogic on another networked machine or in a VirtualBox CentOS installation. Use the IP address of that machine or virtual machine for the hostip value.

The port is the one that Kunteksto will use to create your REST API for the database.

For the tutorials just leave the number of forests at 2.

The host name can be found by going to http://<hostip>:8001/default.xqy and looking in the box in the lower right labeled Hosts. You probably have only one, labeled Default; use this value.
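Alternatively, if the MarkLogic Management API is enabled (port 8002 by default), a short Python sketch can list the host names; the IP address and digest credentials here are just the example values from this page:

# A sketch using the MarkLogic Management API to list host names.
# Assumes the default Management API port (8002) and digest authentication.
import requests
from requests.auth import HTTPDigestAuth

resp = requests.get('http://192.168.25.120:8002/manage/v2/hosts',
                    params={'format': 'json'},
                    auth=HTTPDigestAuth('admin', 'admin'))
for item in resp.json()['host-default-list']['list-items']['list-item']:
    print(item['nameref'])   # e.g. localhost.localdomain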

Unless you are using MarkLogic for JSON persistence, you will likely want to turn off JSON generation.

; Default data formats to create. Values are True or False.
; These can be changed in the UI before generating data.
xml: True
rdf: True
json: False
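If you want to double-check your edits without opening the file again, here is a minimal Python sketch; run it from the kunteksto directory, and note that the section and key names are simply the ones shown in the examples above:

# Sanity-check kunteksto.conf; section/key names come from the examples above.
from configparser import ConfigParser

config = ConfigParser()
config.read('kunteksto.conf')
for section in ('BASEX', 'ALLEGROGRAPH', 'MARKLOGIC'):
    if config.has_section(section):
        print(section, 'status:', config.get(section, 'status'))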

Database Checks

From the kunteksto directory, run:

python utils/db_setup.py

Warning

This Python script tests the database connections and installs the S3Model ontology and the 3.1.0 Reference Model RDF.

It clears any previously stored data in the databases and reinstalls the required files.

During execution, the script displays several lines of output in the terminal. If you are using AllegroGraph and BaseX, look for the messages "AllegroGraph connections are okay." and "BaseX connections are okay.", as well as any lines that start with ERROR:.

The MarkLogic checks will display several lines of information as well. As long as the script ends with the message "Database Setup is finished.", everything went okay.

Caution

If you see the okay output lines and no ERROR: lines, then all went well. Otherwise, you must troubleshoot these issues before continuing.
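If the script reports connection problems, a quick manual check from Python can help isolate them. A minimal sketch, assuming the BaseXClient.py module that ships with BaseX is importable and the agraph-python client is installed (pip install agraph-python):

from BaseXClient import BaseXClient
from franz.openrdf.connect import ag_connect

# BaseX: list the databases; Kunteksto should appear after db_setup.py has run
session = BaseXClient.Session('localhost', 1984, 'admin', 'admin')
try:
    print(session.execute('LIST'))
finally:
    session.close()

# AllegroGraph: report the repository size (the number of stored triples)
with ag_connect('Kunteksto', host='localhost', port=10035,
                user='admin', password='admin') as conn:
    print('triples:', conn.size())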

Viewing the AllegroGraph RDF Repository

You can view the Kunteksto repository by using this link in a browser. Right-click and open it in a new tab. Then, under Explore the Repository, click the View Triples link. These triples are the S3Model ontology and the S3Model 3.1.0 RDF. They connect all of your RDF into a graph, even when no other semantics link them.
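You can also query the repository directly from Python. A minimal SPARQL sketch via the agraph-python client (pip install agraph-python), using the admin/admin credentials from kunteksto.conf:

from franz.openrdf.connect import ag_connect

with ag_connect('Kunteksto', host='localhost', port=10035,
                user='admin', password='admin') as conn:
    # peek at a handful of the ontology and reference-model triples
    query = 'SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10'
    with conn.executeTupleQuery(query) as result:
        for binding_set in result:
            print(binding_set)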

You may also use the Gruff GUI client to explore the repository at any time. See the Franz, Inc. Learning Center for more information.

Using Your Data in MarkLogic

The MarkLogic Developer Network is extensive. They provide an enormous amount of high-quality training as well as a number of open source tools to assist with data exploration and application design.

Now you have high-quality data in a knowledge graph. BI Tools and MarkLogic NoSQL demonstrates, in less than 7 minutes, how to use external tools to work with your data.
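As a taste of what is possible, here is a minimal Python sketch against the REST instance Kunteksto created. The /v1/search endpoint is part of the standard MarkLogic REST API; the host, port, and credentials are the example values from this page, and the search term is just a placeholder:

import requests
from requests.auth import HTTPDigestAuth

# MarkLogic REST instances use digest authentication by default
resp = requests.get('http://192.168.25.120:8020/v1/search',
                    params={'q': 'Kunteksto', 'format': 'json'},
                    auth=HTTPDigestAuth('admin', 'admin'))
print(resp.json().get('total'), 'matching documents')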

The Tutorials begin here.

US Honey Production

The source of this data is the Kaggle project.

The dataset is available here.

Download the honeyproduction.csv data set for this tutorial and place it in the kunteksto/example_data directory.

For those without an account on Kaggle, we have included a copy in the example_data directory.

The metadata information (click on Data, then on the Column Metadata tab) is useful when filling in the database model and record tables.

You can find more metadata information about this dataset in Wrangling The Honey Production Dataset.

Follow the same step-by-step procedures outlined in the Getting Started section:

  • Navigate to the directory where you installed Kunteksto.
  • Be certain the virtual environment is active.

Caution

If you closed the window and opened a new one, you need to activate the environment again. Also, be sure that you are in the kunteksto directory.

Windows

activate <path/to/directory>

or Linux/MacOSX

source activate <path/to/directory>

For this tutorial, you will run Kunteksto in commandline mode.

kunteksto -m all -i example_data/honeyproduction.csv

Kunteksto takes a few minutes to analyze the input file and create a results database in the output directory.

The database editor opens and, just like in the previous tutorial, prompts you for model metadata, which you can collect from the links above. After you click the Save & Exit button, the column editor opens.

Caution

As you edit the data for each column, be sure to persist your changes using the Save button before advancing to the Next column.

As before, each column is presented for you to add constraints and metadata from the information you collect from the links above or from your own personal knowledge. Remember, this is your model of this data. Using the best details creates the best models.

Often we must be creative when deciding which URI to use for a Defining URL. When you do not have a specific online vocabulary or ontology, our suggested approach is to use resources such as the metadata mentioned above.

For the state column we might use https://www.kaggle.com/jessicali9530/honey-production/data#state for the Defining URL and then copy the description from that row in the Column Metadata tab.

For additional semantics (the Predicates & Objects box), it is best to use open vocabularies when possible. This gives you the ability to easily connect data across models. If you go to the link for open vocabularies and type "State" into the search box, you will see a list of options to choose from. A good choice here is the one from Schema.org, because this is a popular vocabulary for website markup. We now have an Object; next we need a Predicate. Since we want to indicate that this is the meaning of this item, type "meaning of" into the search box on the open vocabularies site. Notice that rdf:type is one of the first choices and its description makes sense. If you put the two description phrases together you get: "The subject is an instance of a class" and "A state or province of a country". The values in this column are instances (representations) of a state or province, so we have a good match.

In the Predicates & Objects box, enter:

rdf:type http://schema.org/State

Click the Save button, then the Next button to move to the numcol column. Looking at the metadata, you may choose to change the label to something more readable, like Colonies.

Go through each of the column definitions and fill in as many sensible data points for each column as you can. Feel free to use meaningful names as the labels.

Remember also that numeric columns need a Units designator. Some columns may be detected as integer or decimal even though the range of values falls outside the boundaries of those types; in that case, be sure to change the type to Float.

A column like year is detected as an integer. However, it is really a temporal value. In Kunteksto we cannot have a temporal datatype with just a year, so change this column to String and, in the Predicates & Objects box, use

rdf:type http://www.w3.org/2001/XMLSchema#gYear

Note

In S3Model it is possible to have all of the temporal types. The Datacentric Tools Suite provides facilities to create these datatypes.

Once you complete editing all of the columns, click the Exit button. The GUI will remain on the screen while the data generation process is running. The terminal where you started Kunteksto will scroll messages about the progress.

After the processing is complete, review output/honeyproduction/honeyproduction_validation_log.csv to see which files are invalid. The error messages from the validator may be a bit cryptic, but they are what we have to work with. Just like with the Demo tutorial, the errors are also included in the Semantic Graph via the RDF.
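If the log is long, a short Python sketch can give you a quick tally. This assumes the log is a plain CSV; the exact column layout may differ:

import csv

with open('output/honeyproduction/honeyproduction_validation_log.csv',
          newline='') as f:
    rows = list(csv.reader(f))

print(len(rows), 'log entries')
for row in rows[:5]:   # peek at the first few entries
    print(row)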

The output RDF will be in the Kunteksto repository in AllegroGraph, which you can explore through the AllegroGraph WebView browser tool or with Gruff, which we highly recommend. You can also explore the XML using the BaseX GUI.

There are many written and video tutorials on using these tools. Check the AllegroGraph YouTube Channel and the BaseX Getting Started.
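For a quick look at the stored XML without the GUI, here is a minimal sketch using the BaseXClient.py module that ships with BaseX; the database name Kunteksto is the one configured in kunteksto.conf:

from BaseXClient import BaseXClient

session = BaseXClient.Session('localhost', 1984, 'admin', 'admin')
try:
    # count the XML documents Kunteksto stored in the Kunteksto database
    print(session.execute('XQUERY count(db:open("Kunteksto"))'))
finally:
    session.close()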

Global Commodity Trade Statistics

Warning

This dataset contains more than 8 million rows of data. If you are using the free version of AllegroGraph, processing this file will exceed the 5-million-triple limit many times over. The file will still process, and all of the XML files will be generated; however, most of the triples will not be stored in AllegroGraph.

The original data set is provided at the UNdata Comtrade site.

The best source (easiest download) of this data is the Kaggle Competition.

Download the dataset, extract the CSV data, and place it in the kunteksto/example_data directory.

The metadata information (click on Data, then on the Column Metadata tab) may be useful in filling in the database model and record tables. However, it is somewhat incomplete. You can find more metadata about this dataset in the UNdata Glossary. There is also a knowledgebase that describes how the data was collected, with some hints on how to use it. As you can see, the metadata is not very organized, nor is it computable. S3Model and related datacentric tools allow you to solve this issue with any data of interest.

After you have downloaded the dataset from Kaggle, or even a subset from the UNdata site, you are ready to proceed with the tutorial.
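Given the warning above about the free AllegroGraph triple limit, you may prefer to work with a sample of the file first. Below is a minimal Python sketch; the sample size and the output file name commodity_sample.csv are just examples:

import csv

SAMPLE_ROWS = 100_000   # adjust as needed; each row generates multiple triples

with open('example_data/commodity_trade_statistics_data.csv', newline='') as src, \
     open('example_data/commodity_sample.csv', 'w', newline='') as dst:
    reader, writer = csv.reader(src), csv.writer(dst)
    writer.writerow(next(reader))        # keep the header row
    for i, row in enumerate(reader):
        if i >= SAMPLE_ROWS:
            break
        writer.writerow(row)

If you use the sample, point the -i option in the command below at commodity_sample.csv instead.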

Follow the same step-by-step procedures outlined in the Getting Started section:

  • Navigate to the directory where you installed Kunteksto.
  • Be certain the virtual environment is active.

Caution

If you closed the window and opened a new one, you need to activate the environment again. Also, be sure that you are in the kunteksto directory.

Windows

activate <path/to/directory>

or Linux/MacOSX

source activate <path/to/directory>

For this tutorial, you start Kunteksto in commandline mode.

kunteksto -m all -i example_data/commodity_trade_statistics_data.csv

Kunteksto takes a few minutes to analyze the input file and create a results database in the output directory.

The database editor opens and, just like in the previous tutorial, prompts you for model metadata, which you can collect from the links above.

Caution

As you edit the data for each column, be sure to persist your changes using the Save button before advancing to the Next column.

As before, each column is presented for you to add constraints and metadata from the information you collect from the links above or from your own personal knowledge. Remember, this is your model of this data. Using the best details creates the best models.

Be sure to check the detected datatype of each column as well as the value constraints. For example, the year column is detected as an integer column. Obviously this is not valid. For temporals, Kunteksto only offers date, time, and datetime options. Using the Datacentric Tools Suite would allow you to create this properly as a Year datatype column. So, using Kunteksto, you must choose the most appropriate type, which in this case is String.

Often we must be creative when deciding which URI to use for a Defining URL. When you do not have a specific online vocabulary or ontology, our suggested approach is to use resources such as the glossary mentioned above. For the year column we might use https://comtrade.un.org/db/mr/rfGlossaryList.aspx#Year for the Defining URL and then copy the description from that row in the table.

In the Predicates & Objects box we can use:

skos:exactMatch http://www.w3.org/2001/XMLSchema#gYear

Go through each of the column definitions and fill in as many sensible data points for each column as you can. For example, changing the weight kg column from String to Decimal will help detect missing or invalid values. Then add 'kg' for the units.

The output RDF will be in the Kunteksto repository in AllegroGraph, which you can explore through the AllegroGraph WebView browser tool or with Gruff, which we highly recommend. You can also explore the XML using the BaseX GUI.

There are many written and video tutorials on using these tools. Check the AllegroGraph YouTube Channel and the BaseX Getting Started.