Thursday, May 2, 2013

Labels and Schema Indexes in Neo4j

Neo4j recently introduced the concept of labels and their sidekick, schema indexes. Labels are a way of attaching one or more simple types to nodes, while schema indexes automatically index labelled nodes by one or more of their properties. Cypher then uses those indexes implicitly as secondary indexes and to infer the starting point(s) of a query.

In this blog post I would like to shed some light on how these new constructs work together. Some details will inevitably be specific to the current version of Neo4j and might change in the future, but I still think it’s an interesting exercise.

Before we start, though, I need to populate the graph with some data. I’m more into cartoons for toddlers than second-rate sci-fi, so Peppa Pig shall be my universe.
So let's create some labeled graph resources.
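
A minimal sketch of what such statements could look like, assuming the label syntax of the Neo4j 2.0 milestone releases. Only "first_name", "last_name", "George" and "Bob the bat" come from this post; the other first names, the "name" property and the location value are hypothetical.

```cypher
// Four "Pig" family members, one location, and the odd one out.
// Bob deliberately has no first_name or last_name property.
CREATE (peppa:Person  {first_name: 'Peppa',  last_name: 'Pig'}),
       (george:Person {first_name: 'George', last_name: 'Pig'}),
       (mummy:Person  {first_name: 'Mummy',  last_name: 'Pig'}),
       (daddy:Person  {first_name: 'Daddy',  last_name: 'Pig'}),
       (home:Location {name: 'Family house'}),   // hypothetical property and value
       (bob:Person    {name: 'Bob the bat'})     // hypothetical property name
```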


The previous Cypher statements insert four nodes that represent the "Pig" family, all labeled as "Person", and a "Location" node. If you are paying attention, you might have noticed that I also added a dubious “Bob the bat” node, also labeled as "Person". This node bears different properties from the other "Person" nodes and is intended to help me illustrate a point below.

At this point Cypher knows how to find nodes by their labels but nothing has been indexed yet. Let's try a very simple query.
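
Such a query might look like the sketch below, which assumes we are looking for the "George" node by its "first_name" property.

```cypher
// Find a Person by first name; no index exists yet,
// so every node carrying the Person label is visited.
MATCH (n:Person)
WHERE n.first_name = 'George'
RETURN n
```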


This will cause Cypher to scan through all the nodes labelled as "Person" and will fail in this case with a message similar to "The property 'first_name' does not exist on Node[6]". The reason is that when "Bob the bat" is encountered, Cypher can't find a "first_name" property on it.

We can fix the query by adding "!" on the "first_name" property in the "where" clause to instruct Cypher to disregard any node that doesn't have that property.
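
As an illustration, here is how the "!" suffix could be applied, assuming the same "George" lookup. The operator belongs to the Cypher versions of this period and may not be available in later releases.

```cypher
// The "!" makes the comparison evaluate to false (instead of failing)
// for nodes that lack a first_name property, such as "Bob the bat".
MATCH (n:Person)
WHERE n.first_name! = 'George'
RETURN n
```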


This query will give the expected result but we can do better. Let's create a schema index on "first_name".
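
A sketch of the index creation, using the label and property discussed above:

```cypher
// Index all nodes labelled Person by their first_name property.
// Nodes without that property are simply not part of the index.
CREATE INDEX ON :Person(first_name)
```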


Now we can rerun the first query (the one without the "!"), which this time should return the expected result. The "Bob the bat" node will no longer cause us any trouble because the "George" node is returned by an index lookup on "first_name". "Bob the bat" isn't even in the index! Obviously, for a bigger graph, performance will also be significantly better.

It would be nice to verify that the index was really used for the last query. Cypher lets us profile queries: for the previous query, we can invoke the Cypher endpoint with the profile=true flag.
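
The sketch below shows the query that would be profiled. The endpoint path in the comment assumes the default REST layout of a local server and may differ in your setup.

```cypher
// Sent as the "query" field of a JSON body, via HTTP POST, to the
// Cypher REST endpoint with profiling switched on, for example:
//   /db/data/cypher?profile=true   (assumed default path)
MATCH (n:Person)
WHERE n.first_name = 'George'
RETURN n
```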


This query will return a rather complex JSON object detailing how the query was executed. Look for "name" : "SchemaIndex" near the bottom, which indicates that the index was actually used, as we expected.

But what happens if we rerun the second version of the query - the one that contained the "!" operator - now that we have created the schema index? Will the index be used? It turns out that the answer is no. Even with the index in place, Cypher will still scan through all the labeled nodes because of the use of the "!" operator. You can easily see that if you inspect the profiling data, which this time will contain "name" : "NodeByLabel" and no evidence of an index lookup.

This subtle difference might appear confusing at first, but the impact on performance can be real (can we hope that Cypher will address this at some point?).
Imagine that you are initialising the graph with the nodes and relationships of your domain. You might have queries of this type.
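
One plausible shape for such a query is sketched here. The "Peppa" value and the relationship type are hypothetical and only serve to show two "!" lookups feeding a relationship creation.

```cypher
// Look up two existing Person nodes and connect them.
// Both lookups use "!", so both fall back to a scan of all Person nodes.
MATCH (a:Person), (b:Person)
WHERE a.first_name! = 'Peppa' AND b.first_name! = 'George'
CREATE (a)-[:SIBLING_OF]->(b)   // hypothetical relationship type
```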


The performance of this query will gradually deteriorate as the number of "Person" nodes increases. It is better in this case to start by creating an index on "first_name" and to discard the "!" operator.

On a different note, we can create additional schema indexes to target different properties.
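
For example, an index on "last_name" could be declared along these lines:

```cypher
// A second, independent index on the same label.
CREATE INDEX ON :Person(last_name)
```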


What happens if we combine two conditions on the same node like in the following query?
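
A sketch of a query with two conditions, assuming "last_name" is tested first and "first_name" second:

```cypher
// Both properties are indexed, but only one index can seed the match.
MATCH (n:Person)
WHERE n.last_name = 'Pig' AND n.first_name = 'George'
RETURN n
```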


Cypher will use the first property in the condition, "last_name", to hit the index. You can profile this query and observe that the index lookup returns four hits with “Pig” as last name, as expected. We can improve things by giving Cypher a hint that the "first_name" index should be used instead, which results in a single index hit with the desired result.
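
The hint could be expressed as in the following sketch, which assumes the same two conditions on "last_name" and "first_name":

```cypher
// USING INDEX sits between MATCH and WHERE and names the index to seed from.
MATCH (n:Person)
USING INDEX n:Person(first_name)
WHERE n.last_name = 'Pig' AND n.first_name = 'George'
RETURN n
```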


Last but not least, thanks to @mesirii for the tips on profiling Cypher queries!