Dagoba: in-memory Graph DB

A graph database is a database system designed to model and traverse complex relationships and networks effectively. It is especially useful when relationships are a central aspect of the data. The main use cases for graph databases are as follows:

Social networking: Social networks are ideal for representing relationships between people as a graph. Users, friendships, group memberships, and so on can be modeled as nodes and edges, which makes it possible to implement social network analysis, recommendation systems, and more.
Recommendation systems: By storing products, users, purchase histories, and the like in a graph database, you can analyze the complex relationships between users and products and provide personalized recommendations.

The problems above can also be solved with a relational database. A graph database, however, tends to express relationship-heavy queries more directly, and multi-hop traversals often perform better because each hop follows a stored edge instead of computing a join.

What if you wanted to find out who David’s parent’s parent’s parent’s child’s child’s children are, and how many of them there are? Or, in a social networking service, how would you find someone’s second-degree connections that are not also first-degree connections?

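In SQL, a query for David’s relatives would look roughly like this. Note that this version walks up through all of David’s ancestors and then down through all of their descendants; stopping at exactly three generations would need additional depth bookkeeping.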

WITH RECURSIVE Ancestors AS (
    SELECT parent_id AS ancestor
    FROM ParentChild
    WHERE child_id = (
        SELECT id
        FROM Person
        WHERE name = 'David'
    )
    UNION ALL
    SELECT pc.parent_id
    FROM ParentChild pc
    INNER JOIN Ancestors a ON pc.child_id = a.ancestor
),
Descendants AS (
    SELECT child_id AS descendant
    FROM ParentChild
    WHERE parent_id IN (SELECT ancestor FROM Ancestors)
    UNION ALL
    SELECT pc.child_id
    FROM ParentChild pc
    INNER JOIN Descendants d ON pc.parent_id = d.descendant
)
SELECT DISTINCT p.name
FROM Descendants d
JOIN Person p ON d.descendant = p.id;

By contrast, the query in a graph database looks like this.

graph
  .v('David')
  .out('parent')
  .out('parent')
  .out('parent')
  .in('parent')
  .in('parent')
  .in('parent')
  .unique()
  .run()

The graph database query is far easier to read, because each step names exactly one hop. Depending on the data and the engine, it can also use less memory and run faster. In the same style, finding a user’s second-degree connections that are not first-degree connections looks like this.

graph
  .v(1)
  .out('friend')
  .aggregate('firstDegree') // mark as first-degree friends
  .out('friend')
  .except('firstDegree') // exclude first-degree
  .unique()
  .property('name')
  .run()

So, on what principles can a graph database be built? Let’s take a look at those principles.

Build Graph

Before creating the Query class that interprets the database’s query statements, let’s first build the class that makes up the basic Graph.

type VertexId = number | string

interface Vertex {
  _id: VertexId
  name: string
  _in: Edge[]
  _out: Edge[]
}
interface Edge {
  _in: Vertex
  _out: Vertex
  _label?: string
}

type PartialVertex = {
  _id?: VertexId
  name: string
}
type PartialEdge = {
  _in: VertexId
  _out: VertexId
  _label?: string
}

class Graph {
  vertices: Vertex[] = []
  edges: Edge[] = []
  private vertexIndex: Record<VertexId, Vertex> = {}
  private autoid: number = 1

  constructor(V: PartialVertex[], E: PartialEdge[]) {
    this.addVertices(V)
    this.addEdges(E)
  }
}

It is simply a directed graph made up of vertices and edges. If no _id value is provided when a vertex is created, one is filled in automatically via autoid.

class Graph {
  // Build a full vertex from partial input, assigning an id when none is given.
  private createVertex({ _id, name, ...rest }: PartialVertex): Vertex {
    if (!_id) _id = this.autoid++

    return {
      _id,
      name,
      ...rest,
      _in: [],
      _out: [],
    }
  }

  public addVertex(partialVertex: PartialVertex): VertexId {
    const vertex = this.createVertex(partialVertex)

    const existingVertex = this.findVertexById(vertex._id)
    if (existingVertex)
      throw new Error('A vertex with id ' + vertex._id + ' already exists')

    this.vertices.push(vertex)
    this.vertexIndex[vertex._id] = vertex
    return vertex._id
  }

  public addEdge(partialEdge: PartialEdge): void {
    const inVertex = this.findVertexById(partialEdge._in)
    const outVertex = this.findVertexById(partialEdge._out)

    if (!inVertex || !outVertex)
      throw new Error("One of the vertices for the edge wasn't found")

    const edge: Edge = {
      _in: inVertex,
      _out: outVertex,
      _label: partialEdge._label,
    }

    // The same edge object is registered on both of its endpoints.
    outVertex._out.push(edge)
    inVertex._in.push(edge)
    this.edges.push(edge)
  }

  private addVertices(vertices: PartialVertex[]): void {
    vertices.forEach((vertex) => this.addVertex(vertex))
  }

  private addEdges(edges: PartialEdge[]): void {
    edges.forEach((edge) => this.addEdge(edge))
  }

  // Returns undefined when no vertex has this id.
  private findVertexById(vertex_id: VertexId): Vertex | undefined {
    return this.vertexIndex[vertex_id]
  }
}

This is the actual code that creates each vertex and edge. Since today’s topic is Query, we’ll just take a quick look and move on.

Query

Now let’s look at the part that performs a Query against a graph containing data.

type GremlinState = Record<string, any>

interface Gremlin {
  vertex: Vertex
  state: GremlinState
  result?: any // set by pipetypes such as property
}

// What a pipetype can hand back to the interpreter
// (inferred from how run() uses it later in this article).
type MaybeGremlin = Gremlin | false | 'pull' | 'done'

interface State {
  vertices?: Vertex[]
  edges?: Edge[]
  gremlin?: Gremlin
}

type Step = [string, any[]] // [pipetype, args]

class Query {
  graph: Graph
  state: State[]
  program: Step[]
  gremlins: Gremlin[]

  constructor(graph: Graph) {
    this.graph = graph
    this.state = []
    this.program = []
    this.gremlins = []
  }
}

The basic structure of a query looks like the above. It’s quite unfamiliar. Let’s go through it one piece at a time.

program is the variable that stores each step of the query. Each step consists of a pipe, which we’ll explain later.

Each step can keep its own state. The query’s state array holds that state at the same index the step has in program.

A gremlin is a creature that moves through the graph according to our commands. The gremlin remembers where it has been and helps us find the answer to our query.
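
The pipetypes later in this article call Dagoba.makeGremlin and Dagoba.gotoVertex without showing them. The following is a minimal sketch of what such helpers might look like, assuming a gremlin is nothing more than a vertex plus a state bag. The simplified Vertex type and both function bodies are assumptions made for illustration, not the article's actual code.

```typescript
// Simplified stand-ins for the article's types (assumed for illustration).
interface Vertex {
  _id: number | string
  name: string
}
type GremlinState = Record<string, unknown>
interface Gremlin {
  vertex: Vertex
  state: GremlinState
}

// Hypothetical helper: create a gremlin at a vertex, with fresh state
// unless an existing state object is passed in.
function makeGremlin(vertex: Vertex, state?: GremlinState): Gremlin {
  return { vertex, state: state ?? {} }
}

// Hypothetical helper: "move" a gremlin by creating a new gremlin at the
// target vertex that shares the original's state (its memory).
function gotoVertex(gremlin: Gremlin, vertex: Vertex): Gremlin {
  return makeGremlin(vertex, gremlin.state)
}

const alice: Vertex = { _id: 1, name: 'alice' }
const bob: Vertex = { _id: 2, name: 'bob' }

const start = makeGremlin(alice)
start.state.visitedFrom = 'alice' // the gremlin remembers where it has been
const moved = gotoVertex(start, bob)
```

Sharing the state object rather than copying it is one possible design: it keeps moves cheap, at the cost that every gremlin derived from the same origin sees the same memory.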

In graph DBs, the origin of the creature called a gremlin traces back to the Gremlin language.

The graph language we wrote earlier to find David’s relatives is the Gremlin query language.

graph
  .v('David')
  .out('parent')
  .out('parent')
  .out('parent')
  .in('parent')
  .in('parent')
  .in('parent')

The function behind each step of the pipeline is called a pipe; the example above uses the v, out, and in pipes. To record the pipe for each step, we give the Query class an add method that appends a step to program, and we give Graph a v method that starts a query.

class Graph {
  // ...
  public v(...args: any[]) {
    const query = new Query(this)
    query.add('vertex', args)
    return query
  }
}

class Query {
  // ...
  add(pipetype: string, args: any[]) {
    const step: Step = [pipetype, args]
    this.program.push(step)
    return this
  }
}
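
To make the effect of add concrete, here is a small standalone sketch showing that method chaining only records steps and runs nothing. QuerySketch, the free function v, and the hand-written out and take wrappers are hypothetical stand-ins for the article's Query class and Graph.v.

```typescript
type Step = [string, unknown[]] // [pipetype, args]

// Stripped-down query builder: it only records steps, it runs nothing.
class QuerySketch {
  program: Step[] = []

  add(pipetype: string, args: unknown[]): this {
    this.program.push([pipetype, args])
    return this // returning `this` is what makes chaining possible
  }

  // Hand-written wrappers, assumed here for illustration.
  out(...args: unknown[]): this {
    return this.add('out', args)
  }
  take(...args: unknown[]): this {
    return this.add('take', args)
  }
}

// Stand-in for Graph.v: start a query with a 'vertex' step.
function v(...args: unknown[]): QuerySketch {
  return new QuerySketch().add('vertex', args)
}

const query = v('David').out('parent').out('parent').take(20)
const recorded = JSON.stringify(query.program)
// recorded is:
// [["vertex",["David"]],["out",["parent"]],["out",["parent"]],["take",[20]]]
```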

So what does pipetype mean here? We’ll find the answer in the next section.

Eager Loading Problem

Before looking at pipetype, we need to consider the strategy for executing the query statement we’re writing.

There are two options. The first executes each step as soon as it is encountered; the second defers execution until a result is actually needed. We’ll call the first eager loading and the second lazy loading.

With eager loading, looking up David’s relatives can become a performance problem. There may be tens of thousands of relatives, but we rarely want all of them. A service usually shows only as many as fit on the screen, limiting the result with something like .take(20). The query looks like this.

graph
  .v('David')
  .out('parent')
  .out('parent')
  .out('parent')
  .in('parent')
  .in('parent')
  .in('parent')
  .take(20)

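Under eager loading, the .take(20) at the very end is the problem: every earlier step has already computed its full result, possibly tens of thousands of vertices, before take discards all but 20 of them. With lazy loading, each step produces a result only when the next step asks for one, so the traversal stops as soon as 20 results have been emitted and no work is wasted. For a graph database, lazy execution is the sensible choice.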
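
The difference can be measured with a toy pipeline that has nothing to do with graphs. The sketch below doubles 10,000 numbers and keeps 20 of them, once eagerly and once lazily, and counts how many doublings each strategy performs. The Pull type and all function names are assumptions chosen for this illustration.

```typescript
// A lazy stream is modeled as a function that returns the next value,
// or undefined once it is exhausted.
type Pull<T> = () => T | undefined

let eagerWork = 0
let lazyWork = 0

// Eager: the step materializes its whole output before anyone consumes it.
function eagerDouble(xs: number[]): number[] {
  return xs.map((x) => {
    eagerWork++
    return x * 2
  })
}

function fromArray<T>(xs: T[]): Pull<T> {
  let i = 0
  return () => (i < xs.length ? xs[i++] : undefined)
}

// Lazy: one value is doubled each time the consumer asks for one.
function lazyDouble(upstream: Pull<number>): Pull<number> {
  return () => {
    const x = upstream()
    if (x === undefined) return undefined
    lazyWork++
    return x * 2
  }
}

// Lazy take: stop asking upstream once n values have been handed out.
function take<T>(upstream: Pull<T>, n: number): Pull<T> {
  let taken = 0
  return () => (taken++ < n ? upstream() : undefined)
}

function collect<T>(pull: Pull<T>): T[] {
  const out: T[] = []
  for (let x = pull(); x !== undefined; x = pull()) out.push(x)
  return out
}

const source: number[] = []
for (let i = 0; i < 10000; i++) source.push(i)

const eagerResult = eagerDouble(source).slice(0, 20) // eagerWork is now 10000
const lazyResult = collect(take(lazyDouble(fromArray(source)), 20)) // lazyWork is now 20
```

Both strategies return the same 20 values; only the amount of work differs.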

Now let’s build implementations of various pipetypes such as vertex, in, out, and take so they can be used when actually running queries. Before that, there is one thing we need to look at: the implementation of lazy loading for each pipeline.

Query run: interpreter

As discussed earlier, we want to build a query execution strategy based on lazy loading. To do that, we first need to read the query’s method chaining all the way to the end while storing every pipetype, and we also need to be able to trace back according to the execution plan. Writing the run function partially, it looks like this.

class Query {
  graph: Graph
  state: State[]
  program: Step[]
  gremlins: Gremlin[]

  constructor(graph: Graph) {
    this.graph = graph
    this.state = []
    this.program = []
    this.gremlins = []
  }

  run() {
    const max = this.program.length - 1
    let currentResult: MaybeGremlin = false
    const results = []
    let done = -1
    let pc = max

    while (done < max) {
      const [operationType, parameters] = this.program[pc]
      const state = (this.state[pc] = this.state[pc] || {})
      const operate = Dagoba.getPipetype(operationType)

      currentResult = operate(this.graph, parameters, currentResult,
                              state, pc - 1 <= done)
      // ... (the rest of the loop follows below)

Variable descriptions

graph: The structure of the graph database.
state: An array that stores the current state information for each step.
program: An array that holds the sequence of operations to execute. For example, for the query graph.v(1).out('knows').out().take(2), program would be [['vertex', [1]], ['out', ['knows']], ['out', []], ['take', [2]]]. The first entry is 'vertex' rather than 'v' because Graph.v registers its step under the vertex pipetype.
max: The index of the last operation to execute.
currentResult: A temporary variable that stores the result of the current operation.
results: An array that stores the final results.
done: The index of the last completed operation.
pc: The program counter, the index of the operation currently being executed.

Execution flow

Initialization: pc is initialized to the index of the last operation (max). done is set to -1, indicating that initially no operations have been completed.
Main loop: The loop runs while done is less than max. It continues until all operations are completed.
Operation execution: The operation type and parameters are extracted from program[pc], and the function matching that operation type (Dagoba.getPipetype(operationType)) is called and executed. The result of the operation is stored in currentResult.

The next part of the logic looks like this.


class Query {
  graph: Graph
  state: State[]
  program: Step[]
  gremlins: Gremlin[]

  constructor(graph: Graph) {
    this.graph = graph
    this.state = []
    this.program = []
    this.gremlins = []
  }

  run() {
    const max = this.program.length - 1
    let currentResult: MaybeGremlin = false
    const results = []
    let done = -1
    let pc = max

    while (done < max) {
      const [operationType, parameters] = this.program[pc]
      const state = (this.state[pc] = this.state[pc] || {})
      const operate = Dagoba.getPipetype(operationType)

      currentResult = operate(
        this.graph,
        parameters,
        currentResult,
        state,
        pc - 1 <= done,
      )
      if (currentResult == 'pull') {
        currentResult = false
        if (pc - 1 > done) {
          pc--
          continue
        } else {
          done = pc
        }
      }

      if (currentResult == 'done') {
        currentResult = false
        done = pc
      }

      pc++

      if (pc > max) {
        if (currentResult) results.push(currentResult)
        currentResult = false
        pc--
      }
    }

    return results.map((gremlin) =>
      gremlin.result != null ? gremlin.result : gremlin.vertex,
    )
  }
}

Result handling:
- ‘pull’ result: If the current operation requires more input (pull), currentResult is set to false, and we go back to the previous step (pc - 1) and continue the operation.
- ‘done’ result: If the current operation is complete (done), currentResult is set to false, and done is updated to the current pc value.
Index adjustment: The pc value is incremented to move to the next operation. If pc is greater than max, then if the current result (currentResult) is valid it is added to the results array, currentResult is set to false, and pc is decremented to go back to the previous step.
Returning results: The final results array is mapped to extract the result of each Gremlin and return it. If there is no result (null), the Gremlin’s vertex is returned.

In other words, depending on the result of each pipetype, pc can continuously move left or right. When pc moves to the right, the result of the previously performed operation is stored in currentResult as a gremlin (or as false, etc.).

The role of `done`

A done situation arises when a particular operation is complete and there is no longer any need to proceed. Looking closely at the role of done in the code and when it is set, done is updated in the following two main situations:

When ‘pull’ is returned, and when we cannot go back to a previous step:
- When currentResult returns ‘pull’, it means the current operation needs additional data or input. In this case, pc (the program counter) is decremented to perform the operation of the previous step again.
- If pc - 1 is less than or equal to done (if (pc - 1 <= done)), all of the previous steps’ operations are considered already complete, so there is no previous step left to go back to. At this point, done is set to the current pc value, marking everything up to the current operation as complete.
When an operation returns ‘done’:
- When an operation function returns a ‘done’ result, it means that operation is fully complete. Therefore, currentResult is set to false and done is updated to the current pc value. This indicates that the operation is complete and there is no longer any need to execute it.

These two situations indicate the points where an operation should stop rather than proceed further, and they are important checkpoints in the query execution process.

As a result, running the run function returns the result nodes we want, contained in the results variable.
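
To watch pc and done at work without the graph machinery, the sketch below reuses the same loop on a toy pipeline over numbers. Everything here is an assumption made for illustration: the Packet type stands in for a gremlin, and the count, even, and take pipes are invented. Only the control flow mirrors run.

```typescript
type Control = 'pull' | 'done'
type Packet = { value: number } // plays the role of a gremlin
type Result = Packet | Control | false
type StepState = Record<string, any>
type Pipe = (args: number[], input: Packet | false, state: StepState) => Result
type Step = [string, number[]]

const pipes: Record<string, Pipe> = {
  // Source: emits 1..args[0], then reports 'done'.
  count(args, _input, state) {
    state.next = state.next || 1
    if (state.next > args[0]) return 'done'
    return { value: state.next++ }
  },
  // Filter: passes even values and pulls a replacement for odd ones.
  even(_args, input) {
    if (!input) return 'pull'
    return input.value % 2 === 0 ? input : 'pull'
  },
  // Limit: passes args[0] packets, then reports 'done'.
  take(args, input, state) {
    state.taken = state.taken || 0
    if (state.taken === args[0]) return 'done'
    if (!input) return 'pull'
    state.taken++
    return input
  },
}

function run(program: Step[]): { results: number[]; calls: string[] } {
  const max = program.length - 1
  const states: StepState[] = []
  const results: number[] = []
  const calls: string[] = [] // trace of which pipe ran, in order
  let current: Result = false
  let done = -1
  let pc = max // start at the LAST step and let it pull from earlier ones

  while (done < max) {
    const [name, args] = program[pc]
    const state = (states[pc] = states[pc] || {})
    calls.push(name)
    current = pipes[name](args, current as Packet | false, state)

    if (current === 'pull') {
      current = false
      if (pc - 1 > done) {
        pc-- // ask the previous step for input
        continue
      }
      done = pc // every earlier step is finished, so this one is too
    }
    if (current === 'done') {
      current = false
      done = pc
    }
    pc++
    if (pc > max) {
      if (current) results.push(current.value) // fell off the end: a result
      current = false
      pc--
    }
  }
  return { results, calls }
}

const traced = run([['count', [6]], ['even', []], ['take', [2]]])
const countCalls = traced.calls.filter((c) => c === 'count').length
// traced.results is [2, 4]; count ran 4 times, not 6, because take
// reported 'done' before the values 5 and 6 were ever requested.
```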

Pipetypes

Pipetypes are the core of the dagoba system. Understanding how each one works lets you better understand how they are called and ordered in the interpreter.

class Dagoba {
  static Pipetypes: Record<string, Function> = {}

  static addPipetype(name: string, fun: Function) {
    Dagoba.Pipetypes[name] = fun
    Query.prototype[name] = function () {
      return this.add(name, [].slice.apply(arguments))
    }
  }
  static getPipetype(name: string) {
    return Dagoba.Pipetypes[name]
  }
}

Calling the addPipetype function adds a pipe type that can be invoked in a query. And inside the Query class, when each pipeline is evaluated, the pipetype function is retrieved by calling the getPipetype function.
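
The registration mechanism can be tried in isolation. The sketch below is a self-contained variant under a few assumptions: MiniQuery is a stand-in for Query, the shout pipetype is invented, and getPipetype throws on an unknown name, which is a choice made here rather than something the article specifies.

```typescript
type PipeFn = (...args: unknown[]) => unknown
type Step = [string, unknown[]]

class MiniQuery {
  program: Step[] = []
  add(name: string, args: unknown[]): this {
    this.program.push([name, args])
    return this
  }
}

const registry: Record<string, PipeFn> = {}

function addPipetype(name: string, fn: PipeFn): void {
  registry[name] = fn
  // Attach a chainable method named after the pipetype. The cast is needed
  // because TypeScript cannot know about methods added at runtime.
  const proto = MiniQuery.prototype as unknown as Record<string, unknown>
  proto[name] = function (this: MiniQuery, ...args: unknown[]) {
    return this.add(name, args)
  }
}

function getPipetype(name: string): PipeFn {
  const fn = registry[name]
  if (!fn) throw new Error('Unrecognized pipetype: ' + name)
  return fn
}

addPipetype('shout', (text) => String(text).toUpperCase())

// The new method exists at runtime but not in the static type, hence `any`.
const q = (new MiniQuery() as any).shout('hello').shout('again') as MiniQuery
const shouted = getPipetype('shout')('hi') // 'HI'
```

The cast to any is the price of attaching methods at runtime, and it is why the examples later in this article need @ts-ignore before calling pipetype methods.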

pipetype: vertex

The vertex pipetype is responsible for creating gremlins at the relevant node. It gathers all vertices that match the condition and creates gremlins one at a time.

Dagoba.addPipetype('vertex', function (graph, args, gremlin, state) {
  if (!state.vertices) state.vertices = graph.findVertices(args) // state initialization

  if (!state.vertices.length)
    // all done
    return 'done'

  const vertex = state.vertices.pop() // OPT: this relies on cloning the vertices
  return Dagoba.makeGremlin(vertex, gremlin.state) // we can have incoming gremlins from as/back queries
})
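
The vertex pipetype relies on graph.findVertices(args), which this article does not show. Because a later example starts with .v() and no arguments, an empty argument list presumably means "every vertex". The sketch below is one hypothetical way to write such a lookup; GraphSketch and its three lookup modes are assumptions, not the article's implementation.

```typescript
type VertexId = number | string
interface Vertex {
  _id: VertexId
  name: string
  [key: string]: unknown
}

class GraphSketch {
  vertices: Vertex[] = []
  private index: Record<string, Vertex> = {}

  add(vertex: Vertex): void {
    this.vertices.push(vertex)
    this.index[String(vertex._id)] = vertex
  }

  // Hypothetical lookup with three modes, chosen by the shape of args:
  //   []              -> every vertex
  //   [{ key: val }]  -> vertices whose properties match the object
  //   [id, id, ...]   -> vertices with those ids (unknown ids are skipped)
  findVertices(args: unknown[]): Vertex[] {
    // Return a copy: the vertex pipetype pop()s from the returned array.
    if (args.length === 0) return this.vertices.slice()

    const first = args[0]
    if (typeof first === 'object' && first !== null) {
      const filter = first as Record<string, unknown>
      return this.vertices.filter((vertex) =>
        Object.keys(filter).every((key) => vertex[key] === filter[key]),
      )
    }

    return args
      .map((id) => this.index[String(id)])
      .filter((vertex): vertex is Vertex => vertex !== undefined)
  }
}

const g = new GraphSketch()
g.add({ _id: 1, name: 'Alice', type: 'student' })
g.add({ _id: 2, name: 'Bob', type: 'student' })
g.add({ _id: 4, name: 'Mathematics', type: 'course' })

const everyone = g.findVertices([])
const students = g.findVertices([{ type: 'student' }])
const byId = g.findVertices([4, 99])
```

Returning a copy in the no-argument case matters because the pipetype consumes the array with pop(); handing out the graph's own array would empty it.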

pipetype: in-out

Traversing to the next node along the edges connected to a graph node is very easy.

Dagoba.addPipetype('in', simpleTraversal('in'))
Dagoba.addPipetype('out', simpleTraversal('out'))

In the simpleTraversal function, the ‘in’ and ‘out’ directions represent the edges coming into and going out of the relevant node, respectively. This function is responsible for the process of traversing to the next node in the graph structure.

const simpleTraversal = function (dir) {
  // handles basic in and out pipetypes
  const find_method = dir == 'out' ? 'findOutEdges' : ('findInEdges' as const)
  const edge_list = dir == 'out' ? '_in' : ('_out' as const)

  return function (graph, args, gremlin, state) {
    if (!gremlin && (!state.edges || !state.edges.length))
      // query initialization
      return 'pull'

    if (!state.edges || !state.edges.length) {
      // state initialization
      state.gremlin = gremlin
      state.edges = graph[find_method](gremlin.vertex) // get edges that match our query
        .filter(Dagoba.filterEdges(args[0]))
    }

    if (!state.edges.length)
      // all done
      return 'pull'

    const vertex = state.edges.pop()[edge_list] // use up an edge
    return Dagoba.gotoVertex(state.gremlin, vertex)
  }
}

The function starts with query and state initialization. If no gremlin has arrived and there are no edges left over from an earlier gremlin, it returns ‘pull’ to ask the previous step for input. If a gremlin has arrived and no edges are pending, it stores the gremlin in its state, finds the edges leaving (for out) or entering (for in) the gremlin’s vertex, and keeps only those that match the requested label.

Next comes edge processing. If the filtered edge list is empty, it returns ‘pull’ again to request another gremlin. Otherwise it pops one edge, takes the vertex at the far end of that edge (edge._in when following out edges, edge._out when following in edges), and returns a new gremlin at that vertex for the next pipetype to handle.
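
simpleTraversal depends on three helpers that are not shown here: graph.findOutEdges, graph.findInEdges, and Dagoba.filterEdges. The sketch below is one hypothetical way to write them as free functions, assuming the edge lists stored on each vertex are the source of truth and that a missing label means "accept every edge". The link helper and the sample vertices are invented for the demonstration.

```typescript
interface Vertex {
  _id: number
  name: string
  _in: Edge[]
  _out: Edge[]
}
interface Edge {
  _in: Vertex // the vertex the edge points TO
  _out: Vertex // the vertex the edge comes FROM
  _label?: string
}

// Assumed helpers: the edges are already stored on each vertex.
const findOutEdges = (vertex: Vertex): Edge[] => vertex._out
const findInEdges = (vertex: Vertex): Edge[] => vertex._in

// Assumed helper: build a predicate from an optional label or list of labels.
function filterEdges(label?: string | string[]): (edge: Edge) => boolean {
  return (edge) => {
    if (label === undefined) return true // no label: every edge matches
    if (typeof label === 'string') return edge._label === label
    return edge._label !== undefined && label.indexOf(edge._label) !== -1
  }
}

function makeVertex(id: number, name: string): Vertex {
  return { _id: id, name, _in: [], _out: [] }
}

function link(from: Vertex, to: Vertex, label: string): void {
  const edge: Edge = { _out: from, _in: to, _label: label }
  from._out.push(edge)
  to._in.push(edge)
}

const alice = makeVertex(1, 'alice')
const bob = makeVertex(2, 'bob')
const carol = makeVertex(3, 'carol')
link(alice, bob, 'knows')
link(alice, carol, 'parent')

// out('knows') from alice: outgoing edges, keep 'knows', read edge._in.
const known = findOutEdges(alice)
  .filter(filterEdges('knows'))
  .map((edge) => edge._in.name)

// in() at bob: incoming edges, no label filter, read edge._out.
const knowers = findInEdges(bob)
  .filter(filterEdges())
  .map((edge) => edge._out.name)
```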

pipetype: property

The property pipetype is responsible for extracting the value of a specific property of a node.

Dagoba.addPipetype('property', function (graph, args, gremlin, state) {
  if (!gremlin) return 'pull' // query initialization
  gremlin.result = gremlin.vertex[args[0]]
  return gremlin.result == null ? false : gremlin // undefined or null properties kill the gremlin
})

The function’s execution flow first includes a query-initialization step. In this step, if no Gremlin object exists, the function returns ‘pull’.

It extracts the desired property from the Gremlin object’s current node and sets it as the result. If the property value that gets set is null or undefined, the function returns false, which terminates the Gremlin’s traversal.

pipetype: unique

The unique pipetype lets each vertex through only once. A gremlin that arrives at an already-seen vertex is dropped, which removes duplicate results and avoids repeating work in the steps that follow.

Dagoba.addPipetype('unique', function (graph, args, gremlin, state) {
  if (!gremlin) return 'pull' // query initialization
  if (state[gremlin.vertex._id]) return 'pull' // we've seen this gremlin, so get another instead
  state[gremlin.vertex._id] = true
  return gremlin
})

First, the function checks whether a Gremlin object exists. If there is no Gremlin object, it returns ‘pull’ to request additional data in the query-initialization step.

Next, the function checks whether the node visited by the current Gremlin has already been visited. To do this, it uses the state object state, storing whether each node has been visited using the node’s unique ID as the key. If the current node has already been visited, it returns ‘pull’ to request another Gremlin.

If the current node has not been visited before, it stores that node’s ID in the state object and returns the Gremlin object as is. By doing so, the function ensures each node is traversed only once.
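
The same logic can be exercised on its own by feeding it a sequence of arrivals. In this sketch the function takes only the gremlin and the step state, and the simplified Gremlin type and sample data are assumptions for illustration.

```typescript
interface Gremlin {
  vertex: { _id: number | string; name: string }
}
type SeenState = Record<string, boolean>

// Same decision logic as the unique pipetype, minus the unused parameters.
function unique(gremlin: Gremlin | false, state: SeenState): Gremlin | 'pull' {
  if (!gremlin) return 'pull' // nothing to inspect yet
  const key = String(gremlin.vertex._id)
  if (state[key]) return 'pull' // already seen: ask for another gremlin
  state[key] = true
  return gremlin
}

const atAlice: Gremlin = { vertex: { _id: 1, name: 'alice' } }
const atBob: Gremlin = { vertex: { _id: 2, name: 'bob' } }
const atCarol: Gremlin = { vertex: { _id: 3, name: 'carol' } }

const seen: SeenState = {}
const outcomes = [atAlice, atBob, atAlice, atCarol].map((gremlin) => {
  const outcome = unique(gremlin, seen)
  return outcome === 'pull' ? 'pull' : outcome.vertex.name
})
// outcomes is ['alice', 'bob', 'pull', 'carol']
```

Because object keys are strings, the ids 30 and '30' map to the same entry, so vertices with those two ids would be treated as the same vertex.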

pipetype: filter

The filter pipetype lets you specify filtering conditions in various ways, and it continues the traversal only for nodes that satisfy the condition.

Dagoba.addPipetype('filter', function (graph, args, gremlin, state) {
  if (!gremlin) return 'pull' // query initialization

  if (typeof args[0] == 'object')
    // filter by object
    return Dagoba.objectFilter(gremlin.vertex, args[0]) ? gremlin : 'pull'

  if (typeof args[0] != 'function') {
    Dagoba.error('Filter arg is not a function: ' + args[0])
    return gremlin // keep things moving
  }

  if (!args[0](gremlin.vertex, gremlin)) return 'pull' // gremlin fails filter function
  return gremlin
})

As with the other pipetype functions, if there is no Gremlin it returns ‘pull’. After that, filtering is done depending on the type of the argument (object or function), and if it passes, the Gremlin is returned as is for the next traversal.
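
The object form of the filter relies on Dagoba.objectFilter, which is not shown in this article. A plausible minimal version, assuming that "matches" means every key in the filter is strictly equal to the same key on the vertex, could look like this:

```typescript
// Hypothetical objectFilter: true when every key in `filter` has a strictly
// equal value on `thing`. Keys that appear only on `thing` are ignored.
function objectFilter(
  thing: Record<string, unknown>,
  filter: Record<string, unknown>,
): boolean {
  for (const key of Object.keys(filter)) {
    if (thing[key] !== filter[key]) return false
  }
  return true
}

const course = { _id: 4, name: 'Mathematics', type: 'course' }

const isCourse = objectFilter(course, { type: 'course' }) // true
const isStudent = objectFilter(course, { type: 'student' }) // false
const isMathCourse = objectFilter(course, { type: 'course', name: 'Mathematics' }) // true
const matchesEmpty = objectFilter(course, {}) // true: an empty filter matches anything
```

Strict equality means nested objects or arrays in a filter match only by reference, so this form is best suited to primitive properties such as type or name.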

pipetype: take

The take pipetype selects only a specified number of nodes from among the nodes being traversed in the graph database, and stops the traversal after that.

Dagoba.addPipetype('take', function (graph, args, gremlin, state) {
  state.taken = state.taken || 0 // state initialization

  if (state.taken == args[0]) {
    state.taken = 0
    return 'done' // all done
  }

  if (!gremlin) return 'pull' // query initialization
  state.taken++ // THINK: if this didn't mutate state, we could be more
  return gremlin // cavalier about state management (but run the GC hotter)
})

The function first checks the initialization of the state object state. Here, state.taken is the variable that stores the number of nodes taken so far, and it is initially set to 0.

During the traversal, it first checks whether it has already processed as many nodes as the specified count (args[0]). If it has already processed the specified number of nodes, it resets state.taken to 0 and returns ‘done’ to terminate the traversal.

Then it checks whether a Gremlin object exists. If there is no Gremlin object, it returns ‘pull’ to request additional data and proceeds with query initialization.

If a Gremlin object exists, it increments state.taken by 1 to update the number of nodes processed so far. After that, it returns the current Gremlin object to continue the traversal.
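
Calling the same logic by hand makes the state transitions visible, including the reset that happens just before ‘done’. In this sketch the limit is passed directly instead of through args, and the simplified Gremlin type and sample names are assumptions for illustration.

```typescript
type Gremlin = { name: string }
type TakeState = { taken?: number }
type TakeResult = Gremlin | 'pull' | 'done'

// Same decisions as the take pipetype, written against a typed state object.
function take(limit: number, gremlin: Gremlin | false, state: TakeState): TakeResult {
  const taken = state.taken || 0
  if (taken === limit) {
    state.taken = 0 // reset, so the step can take again later
    return 'done'
  }
  if (!gremlin) {
    state.taken = taken
    return 'pull'
  }
  state.taken = taken + 1
  return gremlin
}

const label = (result: TakeResult): string =>
  typeof result === 'string' ? result : result.name

const takeState: TakeState = {}
const trace = [
  take(1, false, takeState), // nothing taken, no input: 'pull'
  take(1, { name: 'charlie' }, takeState), // first gremlin passes
  take(1, false, takeState), // limit reached: 'done', counter reset
  take(1, false, takeState), // counter is 0 again: 'pull'
  take(1, { name: 'delta' }, takeState), // a second gremlin passes
].map(label)
// trace is ['pull', 'charlie', 'done', 'pull', 'delta']
```

The reset is what allows a query object containing take to be run more than once and hand out a fresh batch each time, as long as the earlier steps still have results left in their own state.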

pipetype: as

The as pipetype provides the ability to tag a node being traversed in the graph database with a specific label. This makes it possible to reference or revisit a specific point later.

Dagoba.addPipetype('as', function (graph, args, gremlin, state) {
  if (!gremlin) return 'pull' // query initialization
  gremlin.state.as = gremlin.state.as || {} // initialize gremlin's 'as' state
  gremlin.state.as[args[0]] = gremlin.vertex // set label to the current vertex
  return gremlin
})

First, the function checks whether a Gremlin object exists. If there is no Gremlin object, it returns ‘pull’ to request the additional data needed in the query-initialization step.

Next, it initializes the Gremlin object’s ‘as’ state. Here, gremlin.state.as is an object that stores the nodes tagged so far. If this state has not yet been set, it is initialized to a new object.

The function assigns the node currently being traversed (gremlin.vertex) to a specific label (args[0]). This label is the argument the user specified when calling the function.

pipetype: except

The except pipetype excludes specific nodes during the graph database’s traversal process.

Dagoba.addPipetype('except', function (graph, args, gremlin, state) {
  if (!gremlin) return 'pull' // query initialization

  if (
    gremlin.state.as &&
    gremlin.state.as[args[0]] &&
    gremlin.vertex == gremlin.state.as[args[0]]
  )
    return 'pull' // TODO: check for nulls
  return gremlin
})

As with the other pipetypes, the function first checks whether a gremlin exists and returns ‘pull’ if it does not.

It then compares the current vertex with the vertex stored in gremlin.state.as under the label given as args[0]. If they are the same vertex, it returns ‘pull’, which drops the gremlin and excludes that vertex from the results. Otherwise the gremlin passes through unchanged.

pipetype: back

The back pipetype provides the ability to return, during the graph database’s traversal process, to a node tagged with a specific label.

Dagoba.addPipetype('back', function (graph, args, gremlin, state) {
  if (!gremlin) return 'pull' // query initialization
  return Dagoba.gotoVertex(gremlin, gremlin.state.as[args[0]]) // TODO: check for nulls
})

The function looks up the as object stored in the gremlin’s state, which maps each label to the vertex tagged with it, and finds the vertex for the label given as args[0].

It then calls Dagoba.gotoVertex to move the gremlin to that vertex. The code does not yet handle a missing label (see the TODO), so back should only be used with a label that an earlier as step has set.
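
The three label-based pipetypes (as, except, and back) all work through the same gremlin.state.as map, which the following standalone sketch demonstrates. The function names, the simplified types, and the choice to return ‘pull’ when a label is missing are assumptions; the last of these is one possible way to resolve the TODO, not the article's behavior.

```typescript
interface Vertex {
  _id: number
  name: string
}
interface Gremlin {
  vertex: Vertex
  state: { as?: Record<string, Vertex> }
}

// as: remember the current vertex under a label.
function tagAs(label: string, gremlin: Gremlin): Gremlin {
  const tags = gremlin.state.as || {}
  tags[label] = gremlin.vertex
  gremlin.state.as = tags
  return gremlin
}

// back: return a gremlin at the labeled vertex, keeping the same state.
function back(label: string, gremlin: Gremlin): Gremlin | 'pull' {
  const target = gremlin.state.as ? gremlin.state.as[label] : undefined
  if (!target) return 'pull' // assumed handling for an unknown label
  return { vertex: target, state: gremlin.state }
}

// except: drop the gremlin if it is standing on the labeled vertex.
function except(label: string, gremlin: Gremlin): Gremlin | 'pull' {
  const tagged = gremlin.state.as ? gremlin.state.as[label] : undefined
  return tagged && tagged === gremlin.vertex ? 'pull' : gremlin
}

const alice: Vertex = { _id: 1, name: 'Alice' }
const maths: Vertex = { _id: 4, name: 'Mathematics' }

const atAlice = tagAs('student', { vertex: alice, state: {} })
// Moving along an edge creates a new gremlin that shares the same state.
const atMaths: Gremlin = { vertex: maths, state: atAlice.state }

const returned = back('student', atMaths) // a gremlin at Alice again
const keptAtMaths = except('student', atMaths) // passes: not the tagged vertex
const droppedAtAlice = except('student', atAlice) // 'pull': tagged vertex
const unknownLabel = back('teacher', atMaths) // 'pull': label never set
```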

Examples

Using the query and pipetypes implemented above, we can build several examples.

Acquaintance relationships

class DagobaTest1 {
  static run() {
    const V = [
      { name: 'alice' }, // alice gets auto-_id (probably 1)
      { _id: 10, name: 'bob', hobbies: ['asdf', { x: 3 }] },
    ]
    const E = [{ _out: 1, _in: 10, _label: 'knows' }]

    const graph = Dagoba.graph(V, E)

    graph.addVertex({ name: 'charlie', _id: 'charlie' })
    graph.addVertex({ name: 'delta', _id: '30' })
    graph.addEdge({ _out: 10, _in: 30, _label: 'parent' })
    graph.addEdge({ _out: 10, _in: 'charlie', _label: 'knows' })

    // @ts-ignore
    const result1 = graph.v(1).out('knows').out().run()
    console.log(result1)

    // @ts-ignore
    const qb = graph.v(1).out('knows').out().take(1).property('name')
    const result2 = qb.run()
    const result3 = qb.run()
    const result4 = qb.run()
    console.log(result2, result3, result4)
  }
}

DagobaTest1.run()

Building the graph, it looks like this.

First query execution (result1):

graph.v(1).out('knows').out().run(): Starting from the node with ID 1 (Alice), it moves to the node connected via the ‘knows’ relationship (Bob), and then traverses the nodes related to Bob. Since Bob has relationships with two nodes, Delta and Charlie, these nodes are returned as the result.

Second query series (result2, result3, result4):

graph.v(1).out('knows').out().take(1).property('name'): This query chain starts from Alice, goes through Bob, selects one (take(1)) of the nodes connected to him, and extracts the name property of the selected node. Since take(1) is applied, the first execution returns the name of the first node related to Bob (either Delta or Charlie) as the result. The second run() returns the name of the second node, and the third returns an empty array because there are no more nodes to return.

result1: [
  { _id: 'charlie', name: 'charlie', _in: [[Object]], _out: [] },
  { _id: '30', name: 'delta', _in: [[Object]], _out: [] },
]
result2: ['charlie']
result3: ['delta']
result4: []

Course enrollment

class DagobaTest2 {
  static run() {
    const V = [
      { name: 'Alice', _id: 1, type: 'student' },
      { name: 'Bob', _id: 2, type: 'student' },
      { name: 'Charlie', _id: 3, type: 'student' },
      { name: 'Mathematics', _id: 4, type: 'course' },
      { name: 'Computer Science', _id: 5, type: 'course' },
    ]

    const E = [
      { _out: 1, _in: 4, _label: 'enrolled' },
      { _out: 1, _in: 5, _label: 'enrolled' },
      { _out: 2, _in: 4, _label: 'enrolled' },
      { _out: 3, _in: 5, _label: 'enrolled' },
    ]

    const graph = Dagoba.graph(V, E)

    const dualEnrolledStudents = graph
      .v()
      // @ts-ignore
      .filter({ type: 'student' })
      .as('student')
      .out('enrolled')
      .filter({ name: 'Mathematics' })
      .back('student')
      .out('enrolled')
      .filter({ name: 'Computer Science' })
      .back('student')
      .unique()
      .property('name')
      .run()
    console.log(
      'Students enrolled in both Mathematics and Computer Science:',
      dualEnrolledStudents,
    )
  }
}
DagobaTest2.run()

Building the graph, it looks like this.

The query’s purpose is to find students who are enrolled in both Mathematics and Computer Science. It may look a bit complex, but let’s interpret it step by step.

Filtering students: Among all vertices, select only those whose type is ‘student’.
Setting the student tag: Tag each student vertex with the label ‘student’.
Filtering the Mathematics course: Follow each student’s ‘enrolled’ edges and keep only the gremlins that arrive at “Mathematics”.
Returning to the original student vertex: Use the ‘student’ tag to return to the student. A student who is not enrolled in Mathematics never gets this far, because the filter answered ‘pull’ and no gremlin reached back.
Filtering the Computer Science course: Follow the same student’s ‘enrolled’ edges again, keep only the gremlins that arrive at “Computer Science”, and return to the student once more. As before, a student without that course is dropped at the filter.
Removing duplicates: Reaching this point means the student is enrolled in both courses. Use unique so the same student is not reported more than once.
Extracting the student’s name: Finally, extract the ‘name’ property of the selected students.

Students enrolled in both Mathematics and Computer Science: [ 'Alice' ]

Second-degree but not first-degree

class DagobaTest3 {
  static run() {
    const V = [
      { _id: 1, name: 'User1' },
      { _id: 2, name: 'User2' },
      { _id: 3, name: 'User3' },
      { _id: 4, name: 'User4' },
      { _id: 5, name: 'User5' },
      { _id: 6, name: 'User6' },
      { _id: 7, name: 'User7' },
      { _id: 8, name: 'User8' },
      { _id: 9, name: 'User9' },
      { _id: 10, name: 'User10' },
    ]

    const E = [
      { _out: 1, _in: 2, _label: 'friend' },
      { _out: 1, _in: 3, _label: 'friend' },
      { _out: 1, _in: 4, _label: 'friend' },
      { _out: 1, _in: 9, _label: 'friend' },
      { _out: 1, _in: 10, _label: 'friend' },
      { _out: 2, _in: 5, _label: 'friend' },
      { _out: 2, _in: 6, _label: 'friend' },
      { _out: 3, _in: 7, _label: 'friend' },
      { _out: 3, _in: 8, _label: 'friend' },
      { _out: 4, _in: 9, _label: 'friend' },
      { _out: 4, _in: 10, _label: 'friend' },
      { _out: 5, _in: 10, _label: 'friend' },
      { _out: 6, _in: 10, _label: 'friend' },
      { _out: 7, _in: 1, _label: 'friend' },
      { _out: 8, _in: 9, _label: 'friend' },
      { _out: 9, _in: 2, _label: 'friend' },
      { _out: 9, _in: 6, _label: 'friend' },
      { _out: 10, _in: 3, _label: 'friend' },
    ]

    const graph = Dagoba.graph(V, E)

    const secondDegreeFriendsOfUser1 = graph
      .v(1) // start from user 1
      // @ts-ignore
      .out('friend') // first-degree friends
      .aggregate('firstDegree') // mark as first-degree friends
      .out('friend') // second-degree friends
      .except('firstDegree') // exclude first-degree
      .unique() // remove duplicates
      .property('name') // extract name
      .run() // run the query

    console.log('2nd degree friends of User 1:', secondDegreeFriendsOfUser1)
  }
}
DagobaTest3.run()

Once built, the graph looks like this.

This is the query mentioned at the beginning of this article. The query itself is simple.

const secondDegreeFriendsOfUser1 = graph
  .v(1) // start from user 1
  // @ts-ignore
  .out('friend') // first-degree friends
  .aggregate('firstDegree') // mark as first-degree friends
  .out('friend') // second-degree friends
  .except('firstDegree') // exclude first-degree
  .unique() // remove duplicates
  .property('name') // extract name
  .run() // run the query

However, there is one pipetype here that we haven’t seen before: aggregate. Why couldn’t we just designate the first-degree friends with as and exclude them?

The aggregate pipetype exists to collect every item that passes through a particular step into a named collection, so that later steps can refer to the whole set. The as pipetype and the aggregate pipetype look similar, but they differ in purpose and usage.

as labels the current node at a specific point in the path so that a later step can return to it with a “back” operation. It stores a reference to a single element, so each gremlin can tag only one node per label. It is therefore not suitable for aggregating many nodes or for grouping multiple nodes to be reused in a later query step.
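To see why a single label is not enough here, the following is a rough sketch in plain JavaScript, not Dagoba itself. It assumes a simplified model in which each traverser remembers only the one first-degree friend it passed through, which is roughly what labelling with as would give us. The helper names (friendEdges, outOf, excludeOnlyOwnLabel) are made up for this illustration.

```javascript
// Edge list copied from the example above as [_out, _in] pairs.
const friendEdges = [
  [1, 2], [1, 3], [1, 4], [1, 9], [1, 10],
  [2, 5], [2, 6], [3, 7], [3, 8], [4, 9], [4, 10],
  [5, 10], [6, 10], [7, 1], [8, 9], [9, 2], [9, 6], [10, 3],
]

// Hypothetical helper: ids reachable over one outgoing 'friend' edge.
const outOf = (id) =>
  friendEdges.filter(([from]) => from === id).map(([, to]) => to)

// Simplified model of "as + except": each traverser carries a single
// labelled vertex (the first-degree friend it came through), and only
// that one vertex can be excluded at the second hop.
function excludeOnlyOwnLabel(start) {
  const result = new Set()
  for (const first of outOf(start)) {
    const label = first // the only vertex this traverser remembers
    for (const second of outOf(first)) {
      if (second !== label) result.add(second)
    }
  }
  return [...result].sort((a, b) => a - b)
}

console.log(excludeOnlyOwnLabel(1))
// [ 2, 3, 5, 6, 7, 8, 9, 10 ]
```

Users 2, 3, 9 and 10 are first-degree friends of User 1, yet they slip through, because the traverser that reaches them came through a different first-degree friend. Excluding them requires knowing the whole first-degree set at once.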

aggregate can collect multiple nodes into a collection, and this collection can become the target of filtering or other operations in later query steps. So let’s additionally implement the aggregate pipetype and tweak the code a bit so that except can also recognize this aggregate.

Dagoba.addPipetype('aggregate', function (graph, args, gremlin, state, done) {
  if (!done && !gremlin) return 'pull' // query initialization
  state.aggregate = state.aggregate || {} // initialize this pipe's 'aggregate' state
  if (!state.aggregate[args[0]]) state.aggregate[args[0]] = []

  if (done) {
    // first call after collection has finished: copy the collected vertices
    if (!state.aggregated) {
      state.aggregated = true
      state.vertices = [...state.aggregate[args[0]]]
    }

    // emit the collected vertices one at a time
    const vertex = state.vertices.pop()
    if (!vertex) return 'pull'
    return Dagoba.makeGremlin(vertex, state)
  }

  // collection phase: remember the vertex and ask for the next gremlin
  state.aggregate[args[0]].push(gremlin.vertex)
  gremlin.state.aggregate = gremlin.state.aggregate || {}
  gremlin.state.aggregate[args[0]] = state.aggregate[args[0]]

  return 'pull'
})

Dagoba.addPipetype('except', function (graph, args, gremlin, state) {
  if (!gremlin) return 'pull' // query initialization

  // drop the gremlin if its vertex is in the aggregated collection
  // (guarded, so a gremlin without any aggregate state does not throw)
  const aggregated = gremlin.state.aggregate && gremlin.state.aggregate[args[0]]
  if (aggregated && aggregated.includes(gremlin.vertex)) return 'pull'

  // drop the gremlin if its vertex is the one labelled with 'as'
  if (
    gremlin.state.as &&
    gremlin.state.as[args[0]] &&
    gremlin.vertex == gremlin.state.as[args[0]]
  )
    return 'pull' // TODO: check for nulls
  return gremlin
})

Data collection:

Each node that reaches this step is added to the collection named by args[0]. The name is supplied as the argument when the pipetype is called. In the query above the argument is ‘firstDegree’, so the nodes are aggregated under that name.

Conditional processing and returning results:

Once all data has been collected — that is, once the done flag is set to true — the collected nodes are popped one at a time, a Gremlin is created from each, and it is returned. This process repeats until all nodes have been processed.
If there are no more nodes to return, it returns ‘pull’ so the traversal can continue.

State management

The state.aggregated flag tracks whether the collected nodes have already been copied for output. The first time the pipetype is called with done set, the flag is still unset, so the collected nodes are copied into the state.vertices array to prepare for emitting them one by one.
On each call after that, one node is popped from the state.vertices array, and a new Gremlin is created from that node and returned.
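The two phases described above can be tried in isolation. The following is a minimal sketch that re-creates the aggregate logic as a standalone function and drives it by hand. It makes several assumptions: makeGremlin is a stand-in for Dagoba.makeGremlin, the done argument is passed manually rather than by the interpreter, and the graph argument is omitted because the logic does not use it.

```javascript
// Stand-in for Dagoba.makeGremlin (assumption): a gremlin is a vertex plus state.
const makeGremlin = (vertex, state) => ({ vertex, state })

// Same logic as the aggregate pipetype, written as a plain function.
function aggregate(args, gremlin, state, done) {
  if (!done && !gremlin) return 'pull'
  state.aggregate = state.aggregate || {}
  state.aggregate[args[0]] = state.aggregate[args[0]] || []

  if (done) {
    if (!state.aggregated) {
      state.aggregated = true
      state.vertices = [...state.aggregate[args[0]]] // copy once
    }
    const vertex = state.vertices.pop()
    if (!vertex) return 'pull' // nothing left to emit
    return makeGremlin(vertex, state)
  }

  state.aggregate[args[0]].push(gremlin.vertex) // collection phase
  return 'pull'
}

const pipeState = {}
const args = ['firstDegree']
const users = [{ _id: 2 }, { _id: 3 }, { _id: 4 }]

// Phase 1: feed gremlins in; every call answers 'pull'.
const collectPhase = users.map((vertex) =>
  aggregate(args, makeGremlin(vertex, {}), pipeState, false)
)

// Phase 2: call with done = true until the collection is exhausted.
const emitPhase = []
for (let i = 0; i < 4; i++) {
  const result = aggregate(args, undefined, pipeState, true)
  emitPhase.push(result === 'pull' ? 'pull' : result.vertex._id)
}

console.log(collectPhase) // [ 'pull', 'pull', 'pull' ]
console.log(emitPhase) // [ 4, 3, 2, 'pull' ]
```

Because the vertices are popped from the end of the copied array, they come out in the reverse of the order in which they were collected. The original collection in pipeState.aggregate stays intact, which is what lets except consult it later.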

The result is as follows.

2nd degree friends of User 1: [ 'User6', 'User5', 'User8', 'User7' ]
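As a sanity check, the same set can be computed without Dagoba at all. The sketch below uses plain arrays and Sets over the same edge list. The function name secondDegreeOnly is made up for this check, and the result is sorted by id, so only the membership (not the order) is comparable with the output above.

```javascript
// Edge list copied from the example above as [_out, _in] pairs.
const friendEdges = [
  [1, 2], [1, 3], [1, 4], [1, 9], [1, 10],
  [2, 5], [2, 6], [3, 7], [3, 8], [4, 9], [4, 10],
  [5, 10], [6, 10], [7, 1], [8, 9], [9, 2], [9, 6], [10, 3],
]

// Hypothetical helper: names of users two 'friend' hops away from `start`
// that are not also one hop away. Mirrors the query: out, aggregate,
// out, except, unique, property('name').
function secondDegreeOnly(start) {
  const outOf = (id) =>
    friendEdges.filter(([from]) => from === id).map(([, to]) => to)

  const firstDegree = new Set(outOf(start)) // aggregate('firstDegree')
  const secondDegree = new Set() // the Set plays the role of unique()

  for (const first of firstDegree) {
    for (const second of outOf(first)) {
      if (!firstDegree.has(second)) secondDegree.add(second) // except
    }
  }

  return [...secondDegree].sort((a, b) => a - b).map((id) => `User${id}`)
}

console.log(secondDegreeOnly(1))
// [ 'User5', 'User6', 'User7', 'User8' ]
```

The four users match the Dagoba result. Like the query, this check does not exclude the starting user; in this data set none of User 1's first-degree friends link back to User 1, so it makes no difference.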

Summary

While studying the principles of graph databases, I was able to look more deeply into the concept of lazy loading and into the basic logic of an interpreter. I was also fascinated by being able to write relationship-based queries so intuitively. With more time, I think I could write a variety of additional pipetypes that make queries shorter or more versatile while maintaining, or even improving, performance.

The code I wrote while doing this case study is available here.

References

https://aosabook.org/en/500L/dagoba-an-in-memory-graph-database.html
