2.20. OpenCypher Fragments¶
OpenCypher fragments provide a mechanism to evaluate OpenCypher expressions as part of another operation.
load()
and insert()
use OpenCypher fragments to transform raw input data and map input columns to frame columns using the row_filter
parameter.
save()
and get_data()
use OpenCypher fragments to transform the frame’s data by mapping to output columns using the row_filter
parameter.
The fragment must be passed as a Python string:
frame.load(file, row_filter = '<fragment code>')
frame.insert([ [ 0, 1 ], [ 1, 2 ]], row_filter = '<fragment code>')
frame.save(file, row_filter = '<fragment code>')
frame.get_data(row_filter = '<fragment code>')
An OpenCypher fragment supports a subset of OpenCypher’s clauses, expressions and functions that can be evaluated in a stateless manner.
An OpenCypher fragment can only contain the clauses WHERE
and RETURN
to support filtering and output mapping respectively.
Other OpenCypher clauses are not allowed in a fragment.
For input operations, an OpenCypher fragment is applied independently to each row of the input data to produce the final output data stored in the frame.
For output operations, an OpenCypher fragment is applied to each row of the frame (subject to offset and length selections) to produce a row of the output data.
2.20.1. Input OpenCypher Fragments¶
The example below shows a simple input OpenCypher fragment that filters raw input data based on the value of one of the columns and computes the output columns of the target frame. The first column has ten added to it, while the second column is passed through unmodified.
WHERE a.value > 0
RETURN a.value + 10, a.text
The input to an OpenCypher fragment is the raw data coming from the input operation (file or Python client data) organized in a single row. All columns present in the input are available to the fragment for evaluation. A schema is derived for the raw input data using type inference to map the input data types to xGT’s types. Column names are automatically computed from the schema.
2.20.2. Output OpenCypher Fragments¶
The example below shows a simple output OpenCypher fragment that filters frame data based on the value of one of the columns and computes the output of the target. The first output column has a string concatenated to it, while the second column is passed through unmodified.
WHERE a.value > 0
RETURN a.text + "_from_fragment" AS mytext, a.value AS myvalue
The input to an OpenCypher fragment, in this case, are the frame’s rows that are within the range of rows selected by the offset
and length
parameters to the operation.
All columns present in the frame are available to the fragment for evaluation.
Output column names are automatically computed if not provided by an AS
clause.
2.20.3. Bound Variables¶
OpenCypher requires a bound variable to refer to columns that are part of the row.
OpenCypher fragments can use any name for this implied bound variable, but only a single name is allowed.
In the example above, the bound variable a
is the only name allowed to refer to the input row.
An expression b.property
in the same OpenCypher fragment will produce an error as it refers to a different bound variable.
2.20.4. Interaction with Different Data Sources¶
OpenCypher fragments are supported for all of xGT’s input data sources:
Parquet files.
CSV files with and without headers.
Client data as Python list of lists, pandas DataFrames, or pyarrow Tables.
While the behavior of OpenCypher fragments is uniform across all sources, there are slight differences between the input sources with respect to schema data types and column names.
2.20.4.1. Parquet Files¶
Parquet files are the simplest case, as they contain a schema as part of their metadata. The column names contained in the file’s schema are used to refer to the input data’s columns. All xGT data types are inferred automatically from the native Parquet data types except for the WGSPOINT, CARTESIANPOINT, and IPADDRESS types. Columns that potentially contain IP address data are presented to the fragment as strings (or lists of strings for list types) and the OpenCypher fragment must treat them as such. xGT supports passing those strings directly to the output frame’s IP address column if no processing is required.
frame = conn.create_table_frame(
name = 'graph__Table',
schema = [['col0', xgt.INT],
['col1', xgt.IPADDRESS]])
WHERE input.value > 0
RETURN input.value + 10, input.ipaddress_as_string
2.20.4.2. CSV Files¶
To process CSV files using an OpenCypher fragment, xGT automatically infers a schema from the raw data. All xGT types (including IP addresses and lists of IP addresses) are inferred from the CSV input, with the exception of point types which will be inferred as lists.
CSV files that include headers will use the column names declared in the header as the column names available in the OpenCypher fragment.
CSV files without headers will use automatically generated column names starting with the prefix f
.
This code shows the frame’s schema as declared on creation.
frame = conn.create_table_frame(
name = 'graph__Table',
schema = [['column_a', xgt.INT],
['column_b', xgt.IPADDRESS]])
The raw CSV input data is:
0, 128.0.0.1
1, 128.0.0.2
2, 128.0.0.3
Because there are no headers in the CSV file, the inferred schema is:
schema = [['f0', xgt.INT],
['f1', xgt.IPADDRESS]]
The fragment must be written using those column names:
WHERE b.f0 > 0
RETURN b.f0 + 10, b.f1
2.20.4.3. Python Data¶
Python data is treated in a similar manner to Parquet data but without column names (as there is no schema metadata). Columns that contain IP address data will be inferred as columns of strings. Columns that contain point data will be inferred as lists of floats or integers. Column names are automatically generated in the same manner as for CSV files without headers.
The code below shows the frame’s schema as declared on creation.
frame = conn.create_table_frame(
name = 'graph__Table',
schema = [['column_a', xgt.INT],
['column_b', xgt.IPADDRESS],
['column_c', xgt.LIST, xgt.TEXT]])
The raw Python input data is:
[[ 1, 128.0.0.1, [ "one" ]],
[ 2, 128.0.0.2, [ "one", "two" ]],
[ 3, 128.0.0.3, [ "one", "two", "three" ]]]
The inferred schema is:
schema = [['f0', xgt.INT],
['f1', xgt.TEXT],
['f2', xgt.LIST, xgt.TEXT]]
The fragment must be written using those column names:
WHERE python_data.f0 > 0
RETURN python_data.f0 + 10, python_data.f1, tail(python_data.f2)
The resulting data inserted into the frame is:
11, 128.0.0.1, [ ]
12, 128.0.0.2, [ "two" ]
13, 128.0.0.3, [ "two", "three" ]
Note that this example also illustrates how to process columns that contain lists. In this case, lists of strings.
2.20.5. Requirements for Output Columns¶
The RETURN clause in an OpenCypher fragment specifies how the target frame’s columns will be populated for each row of raw input. The RETURN clause must have one entry per target frame column. Each entry must be of a type that is compatible with the exact type of the column via text encoding (see Encoding Text to Types).
If an output column does not matter or the fragment cannot compute it, then use the value NULL
for that column.
The exception is that key columns in vertex and edge frames cannot be NULL and must have a value.
Consider the following frame schema:
frame = conn.create_vertex_frame(
name = 'graph__Table',
schema = [['col0', xgt.INT],
['col1', xgt.IPADDRESS],
['col2', xgt.LIST, xgt.TEXT]],
key = 'col0')
Some examples of acceptable fragment output for this frame are:
RETURN a.f0, a.f1, a.f2
RETURN a.f0 + 2, "128.0.0.1", NULL
RETURN a.f0 / 4, a.f2, [ "one", "two" ]
2.20.6. Putting It All Together¶
Here is a full example of using OpenCypher fragments to process input data as it is being inserted into a target frame.
Start with creating an empty vertex frame:
frame = conn.create_vertex_frame(
name = 'Systems',
schema = [['id', xgt.INT],
['ipaddr', xgt.IPADDRESS],
['users', xgt.LIST, xgt.TEXT]],
key = 'id')
We want to use IPv6 addresses for the entries in the Systems
vertex frame.
The source data has two 64-bit integer columns that will be used to construct the IPv6 address:
data = [[ 0, 1, 1, [ "alice" ] ],
[ 1, 2, 2, [ "alice", "bob" ] ],
[ 2, 3, 3, [ "bob", "charlie", "diana" ] ]]
This code inserts the data into the frame and process it into its target form using an OpenCypher fragment:
frame.insert(data, row_filter = """
WHERE size(input.f3) > 1
RETURN input.f0, ipaddress(input.f1, input.f2), tail(input.f3)
""")
The resulting entries in the Systems
frame are:
1, ::2:0:0:0:2, [ "bob" ]
2, ::3:0:0:0:3, [ "charlie", "diana" ]
This example created rows with IPv6 addresses from the numeric entries as well as the tail of the input lists.
Consider another example using the same target frame and a CSV file with headers:
machine_id, ipaddr, users
10, '255.255.255.0', [ "u" ]
11, '255.255.255.1', [ "u", "v", "w" ]
12, '255.255.255.2', [ "x", "y", "z" ]
13, '255.255.255.3', [ "u", "a" ]
This code loads the data into the frame and filters it using an OpenCypher fragment:
frame.load('xgtd://file.csv', row_filter = """
WHERE "u" IN input.users
RETURN input.machine_id, input.ipaddr, input.users
""")
The resulting entries in the Systems
frame are:
10, 255.255.255.0, [ "u" ]
11, 255.255.255.1, [ "u", "v", "w" ]
13, 255.255.255.3, [ "u", "a" ]
2.20.7. Interaction with Different Data Targets¶
All data targets supported by the save()
and get_data()
operations are supported directly when using OpenCypher fragments.
Client CSV and Parquet files
Server CSV and Parquet files
Client data as Python list of lists, pandas DataFrames, or pyarrow Tables.
The schema for the frame automatically drives the data types of the computed output columns in the RETURN
clause.
The OpenCypher AS
clause can be used to rename output columns for targets like Parquet files, pandas DataFrames, pyarrow Tables or CSV files with headers.