Cognitive Computation Group

Curator Data Structures

This page describes the data structures used by the Curator architecture. For the most up-to-date definitions, check the Curator interfaces documentation.

Record

Records are the main container of the Curator architecture: every service and every user interacts with a Record, which stores all the annotations associated with a given text. Here a text is a String that may contain multiple sentences; the size of the utterance a Record represents is not restricted. Typically, Records follow the natural segmentation of your input collection (for example, paragraphs, if paragraphs are easy to detect in your input).

#!c

/**

 * Records are the objects that hold all annotations on a text.

 */

struct Record {

   /** how to identify this record. */

   1: required string identifier,

   /** The raw text string. */

   2: required string rawText,

   /** Label views.  Contains all the Labelings. */

   3: required map<string, base.Labeling> labelViews,

   /** Cluster views.  Contains all the Clusterings. */

   4: required map<string, base.Clustering> clusterViews,

   /** Parse views.  Contains all the Forests. */

   5: required map<string, base.Forest> parseViews,

   /** General views.  Contains all the Views. */

   6: required map<string, base.View> views,

   /** Was this Record created using a ws* method. */

   7: required bool whitespaced,

}
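
To make the layout of a Record concrete, here is a minimal Python sketch that mirrors the struct above as a dataclass. It is an illustration only, not the generated Thrift bindings: the helper `available_views` and the view names `"pos"` and `"parse"` are hypothetical, and the view values are left as untyped placeholders.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Record:
    """Hypothetical Python mirror of the Thrift Record struct."""
    identifier: str
    rawText: str
    # Values are typed as Any to keep the sketch self-contained; in the
    # Thrift definition they are Labeling, Clustering, Forest and View.
    labelViews: Dict[str, Any] = field(default_factory=dict)
    clusterViews: Dict[str, Any] = field(default_factory=dict)
    parseViews: Dict[str, Any] = field(default_factory=dict)
    views: Dict[str, Any] = field(default_factory=dict)
    whitespaced: bool = False


def available_views(record: Record) -> List[str]:
    """Return the sorted names of every view stored on the record."""
    names = set()
    for view_map in (record.labelViews, record.clusterViews,
                     record.parseViews, record.views):
        names.update(view_map)
    return sorted(names)


record = Record(identifier="doc-1", rawText="John saw Mary. She waved.")
record.labelViews["pos"] = object()    # placeholder for a Labeling
record.parseViews["parse"] = object()  # placeholder for a Forest
print(available_views(record))
```

Each kind of annotation lives in its own map, so a consumer can look up a view by name without inspecting the other maps.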

Span

The Span is the most basic unit within the Curator architecture and corresponds to a portion of the raw text. Spans can have a label and attributes associated with them.

 #!c

 /**

  * Span covers a portion of text. Spans can have labels and attributes.

  */

 struct Span {

   /**  start index of span in the raw text (inclusive). */

   1: required i32 start,

   /**  ending index of span in the raw text (exclusive). */

   2: required i32 ending,

   /** label for span. */  

   3: optional string label,

   /** score of span. */

   4: optional double score,

   /** source of span. */

   5: optional Source source,

   /** any additional attributes associated with this span. */

   6: optional map<string, string> attributes,

   /** index of the text (in the multirecord) to which this span refers. */

   7: optional i32 multiIndex,

 }
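
The inclusive `start` and exclusive `ending` follow the same convention as Python slicing, which the sketch below uses to recover the text a Span covers. This is a hypothetical Python mirror of the struct, not the generated bindings; it omits `source` and `multiIndex`, and it assumes the offsets are character offsets into the raw text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Span:
    """Hypothetical Python mirror of the Thrift Span struct (partial)."""
    start: int                      # inclusive
    ending: int                     # exclusive
    label: Optional[str] = None
    score: Optional[float] = None
    attributes: Dict[str, str] = field(default_factory=dict)


def covered_text(raw_text: str, span: Span) -> str:
    """Return the portion of raw_text that the span covers."""
    return raw_text[span.start:span.ending]


raw_text = "John saw Mary."
span = Span(start=9, ending=13, label="PER")
print(covered_text(raw_text, span))  # prints "Mary"
```

Because `ending` is exclusive, the length of the covered text is simply `ending - start`.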

Labeling

Labels usually come as a coherent set, for example all the part-of-speech tags for a text. The Labeling data structure represents such a set as a list of Spans.

 #!c

 /**

  * A labeling of text.  Really a list of Spans.

  */

 struct Labeling {

   /**  the labels as spans. */

   1: required list<Span> labels,

   /**  the source this labeling came from. */

   2: optional Source source,

   /** score for this labeling. */

   3: optional double score,

   /** the raw text for this labeling (if null then consult the labeling's parent's rawText field) */

   4: optional string rawText,

 }

Part-of-Speech example

Here we can see a part-of-speech labeling represented visually.
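
The same idea can be sketched in code. The example below builds a part-of-speech Labeling over a three-word text, using hypothetical Python mirrors of the Span and Labeling structs. The Penn Treebank style tags and the helper `tagged_tokens` are assumptions made for illustration; the actual tag set depends on the annotator.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Span:
    start: int                   # inclusive
    ending: int                  # exclusive
    label: Optional[str] = None


@dataclass
class Labeling:
    labels: List[Span]
    rawText: Optional[str] = None


def tagged_tokens(raw_text: str, labeling: Labeling) -> List[Tuple[str, Optional[str]]]:
    """Pair each labeled span's text with its label."""
    return [(raw_text[s.start:s.ending], s.label) for s in labeling.labels]


raw_text = "John saw Mary"
pos = Labeling(labels=[
    Span(0, 4, "NNP"),    # John
    Span(5, 8, "VBD"),    # saw
    Span(9, 13, "NNP"),   # Mary
])
print(tagged_tokens(raw_text, pos))
```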

Clustering

A Clustering over a text is represented as a list of Labelings, where each Labeling corresponds to one cluster.

 #!c

 /**

  * A clustering of labels for the text.  Each cluster is represented 

  * as a Labeling which in turn will have labels (list<Span>) 

  * representing each item in the cluster.

  */

 struct Clustering {

   /** the clusters, each cluster is a Labeling. */

   1: required list<Labeling> clusters,

   /** the source of this Clustering */

   2: optional Source source,

   /** score for this clustering */

   3: optional double score,

   /**  the raw text for this clustering (if null then consult the clustering's parent's rawText field) */

   4: optional string rawText,

 }
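
As an illustration, a coreference-style clustering could place the mentions of each entity in one Labeling. The sketch below uses hypothetical Python mirrors of the structs; the choice of coreference as the use case and the helper `cluster_mentions` are assumptions, not part of the Curator definitions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    start: int                   # inclusive
    ending: int                  # exclusive
    label: Optional[str] = None


@dataclass
class Labeling:
    labels: List[Span]


@dataclass
class Clustering:
    clusters: List[Labeling]


def cluster_mentions(raw_text: str, clustering: Clustering) -> List[List[str]]:
    """Return the covered text of every span, grouped by cluster."""
    return [[raw_text[s.start:s.ending] for s in cluster.labels]
            for cluster in clustering.clusters]


raw_text = "John saw Mary. She waved."
coref = Clustering(clusters=[
    Labeling(labels=[Span(0, 4)]),                  # John
    Labeling(labels=[Span(9, 13), Span(15, 18)]),   # Mary, She
])
print(cluster_mentions(raw_text, coref))
```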

Forest

A Forest represents a collection of Trees. For example, a Forest could represent the parse trees for every sentence within a text.

 #!c

 /**

  * Forest is a set of trees.

  */

 struct Forest {

   /** the trees in this Forest */

   1: required list<Tree> trees,

   /** the raw text for this Forest (if null then consult the Forest's parent's rawText field) */

   2: optional string rawText,

   /** the source of this Forest */

   3: optional Source source,

 }
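
The optional `rawText` field suggests a simple fallback rule, sketched below with hypothetical Python mirrors of the structs. The sketch assumes that the "parent" is the Record that holds the Forest in its `parseViews` map; the helper `effective_text` and the view names are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Forest:
    trees: List[object] = field(default_factory=list)  # Trees omitted here
    rawText: Optional[str] = None


@dataclass
class Record:
    identifier: str
    rawText: str
    parseViews: Dict[str, Forest] = field(default_factory=dict)


def effective_text(record: Record, view_name: str) -> str:
    """Use the Forest's own rawText if set, otherwise the Record's."""
    forest = record.parseViews[view_name]
    return forest.rawText if forest.rawText is not None else record.rawText


record = Record(identifier="doc-1", rawText="John saw Mary.")
record.parseViews["plain"] = Forest()
record.parseViews["retokenized"] = Forest(rawText="John saw Mary .")

print(effective_text(record, "plain"))        # falls back to the Record
print(effective_text(record, "retokenized"))  # uses the Forest's own text
```

The same rule would apply to the `rawText` fields of Labeling and Clustering.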

Tree

The Tree data structure is a collection of Nodes with a pointer (index in the list of nodes) to the top node of the tree.

 #!c

 /**

  * Trees are a set of connected nodes with a top node.

  */

 struct Tree {

   /** list of labeled nodes. */

   1: required list<Node> nodes,

   /** the index of top/root node in nodes. */

   2: required i32 top,

   /** the source of this tree. */

   3: optional Source source,

   /** the score of this tree. */

   4: optional double score,

 }

Node

Nodes are labeled components of a Tree. Each Node stores its children as indices into the Tree's list of nodes, and the link to each child can carry an edge label. Nodes may also contain a Span and thus cover a portion of a text.

 #!c

 /**

  * Nodes store their children.  Referenced as index into list<Node> in

  * the containing struct.

  * The link between a Node and each of its children can be labeled.

  */

 struct Node {

   /** the label of the node. */

   1: required string label,

   /** the span this node covers. */

   2: optional Span span,

   /** the children of the node represented as a map of <child index, edge label>. Empty string implies no label. */

   3: optional map<i32, string> children,

   /** source of the node. */

   4: optional Source source,

   /** the score for this node. */

   5: optional double score,

 }
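
To show how the index pointers in Tree and Node fit together, the sketch below walks a small tree from its `top` node and prints it in bracketed form. The classes are hypothetical Python mirrors of the structs, the helper `bracketed` is made up for illustration, and visiting children in ascending index order is an assumption (the `children` map itself does not define an order).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    start: int   # inclusive
    ending: int  # exclusive


@dataclass
class Node:
    label: str
    span: Optional[Span] = None
    # Maps child index (into Tree.nodes) to edge label; "" means no label.
    children: Dict[int, str] = field(default_factory=dict)


@dataclass
class Tree:
    nodes: List[Node]
    top: int  # index of the root node in nodes


def bracketed(tree: Tree, raw_text: str, index: Optional[int] = None) -> str:
    """Render the subtree rooted at index (default: the top node)."""
    node = tree.nodes[tree.top if index is None else index]
    if not node.children:
        if node.span is None:
            return f"({node.label})"
        return f"({node.label} {raw_text[node.span.start:node.span.ending]})"
    # Assumption: ascending child index matches left-to-right order.
    parts = [bracketed(tree, raw_text, child) for child in sorted(node.children)]
    return f"({node.label} {' '.join(parts)})"


raw_text = "John saw Mary"
tree = Tree(
    nodes=[
        Node("S", children={1: "", 2: ""}),
        Node("NNP", span=Span(0, 4)),
        Node("VP", children={3: "", 4: ""}),
        Node("VBD", span=Span(5, 8)),
        Node("NNP", span=Span(9, 13)),
    ],
    top=0,
)
print(bracketed(tree, raw_text))  # (S (NNP John) (VP (VBD saw) (NNP Mary)))
```

Storing children as indices rather than nested objects keeps the structure flat, which is what allows a Tree to be expressed as a plain list of Nodes plus a single `top` index.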