Query
The query feature performs a query on a dataset stored on disk or in memory.
You can write a Geist template with the query tag, or use the CLI or the Python API step by step as follows:
The create command has two subcommands, both of which create a new dataset on disk. The dataset name :memory:
is a reserved value for datasets that exist only in memory and is not allowed in the CLI.
Usage: geist create [OPTIONS] COMMAND [ARGS]...
Create a new dataset
Options:
  --help  Show this message and exit.
Commands:
  duckdb  Create a new SQL dataset using DuckDB
  rdflib  Create a new RDF dataset using RDFLib
geist create duckdb [OPTIONS]
Usage: geist create duckdb [OPTIONS]
Create a new SQL dataset using DuckDB
Options:
  -d, --dataset TEXT              Name of SQL dataset to create (default "kb")
  -ifile, --inputfile FILENAME    Path of the file to be loaded as a Pandas
                                  DataFrame [required]
  -iformat, --inputformat [csv|json]
                                  Format of the file to be loaded as a Pandas
                                  DataFrame (default csv)
  -t, --table TEXT                Name of the table to be created (default
                                  "df")
  --help                          Show this message and exit.
Example 1: create a test SQL dataset from stdin
geist create duckdb --dataset test --inputformat csv --table df << __END_INPUT__
v1,v2,v3
1,2,3
4,5,6
7,8,9
__END_INPUT__
Example 2: create a test dataset from a file
Here is the test.csv file:
v1,v2,v3
1,2,3
4,5,6
7,8,9
Code:
geist create duckdb --dataset test --inputfile test.csv --inputformat csv --table df
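The two examples above differ only in whether --inputfile is passed. As a minimal sketch, the same command lines can be assembled from Python; the helper name build_create_duckdb_cmd is hypothetical and not part of Geist, and only the flags shown in the help text above are used.

```python
import shlex


def build_create_duckdb_cmd(dataset="kb", table="df", inputformat="csv", inputfile=None):
    """Return the argument list for `geist create duckdb` (hypothetical helper)."""
    cmd = ["geist", "create", "duckdb", "--dataset", dataset]
    if inputfile is not None:
        # Example 2: load the data from a file on disk
        cmd += ["--inputfile", inputfile]
    # Without --inputfile, Example 1 supplies the data on stdin instead
    cmd += ["--inputformat", inputformat, "--table", table]
    return cmd


stdin_cmd = build_create_duckdb_cmd(dataset="test")
file_cmd = build_create_duckdb_cmd(dataset="test", inputfile="test.csv")
print(shlex.join(file_cmd))

# To actually run the stdin variant (assumes the geist executable is on PATH):
#   import subprocess
#   subprocess.run(stdin_cmd, input="v1,v2,v3\n1,2,3\n", text=True, check=True)
```

Building the argument list separately from running it keeps the command easy to inspect before anything is written to disk.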
geist create rdflib [OPTIONS]
Usage: geist create rdflib [OPTIONS]
Create a new RDF dataset
Options:
  -d, --dataset TEXT              Name of RDF dataset to create (default "kb")
  -ifile, --inputfile FILENAME    Path of the file to be loaded as triples
                                  [required]
  -iformat, --inputformat [xml|n3|turtle|nt|pretty-xml|trix|trig|nquads|json-ld|hext|csv]
                                  Format of the file to be loaded as triples
                                  (default json-ld)
  --colnames TEXT                 Column names of triples with the format of
                                  [[subject1, predicate1, object1], [subject2,
                                  predicate2, object2], ...] when the input
                                  format is csv
  --infer [none|rdfs|owl|rdfs_owl]
                                  Inference to perform on update [none, rdfs,
                                  owl, rdfs_owl] (default "none")
  --help                          Show this message and exit.
Example 1: create a test RDF dataset from stdin
geist create rdflib --dataset test --inputformat nt --infer none << __END_INPUT__
<http://example.com/drewp> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://example.com/drewp> <http://example.com/says> "Hello World" .
__END_INPUT__
Example 2: create a test dataset from a file
Here is the test.nt file:
<http://example.com/drewp> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://example.com/drewp> <http://example.com/says> "Hello World" .
geist create rdflib --dataset test --inputfile test.nt --inputformat nt --infer none
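The --colnames option above describes triples as lists of three column names, used when the input format is csv. The following sketch illustrates that idea only: it reads a CSV with the standard library and emits one (subject, predicate, object) tuple per row for each column-name triple. The column names s, p, o and the cell values are made up for illustration, and how Geist itself converts cell values into IRIs or literals, or parses the option string, is not covered here.

```python
import csv
import io


def rows_to_triples(csv_text, colnames):
    """Collect one (subject, predicate, object) tuple per row and column-name triple."""
    reader = csv.DictReader(io.StringIO(csv_text))
    triples = []
    for row in reader:
        for s_col, p_col, o_col in colnames:
            triples.append((row[s_col], row[p_col], row[o_col]))
    return triples


# Hypothetical CSV input: one triple per row
csv_text = "s,p,o\nex:drewp,rdf:type,foaf:Person\nex:drewp,ex:says,Hello World\n"
triples = rows_to_triples(csv_text, colnames=[["s", "p", "o"]])
for triple in triples:
    print(triple)
```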
The query function performs a query on a dataset.
Parameter descriptions for query():
Name | Type | Description | Default |
---|---|---|---|
datastore | string | A backend datastore, i.e., 'rdflib' or 'duckdb' | REQUIRED |
dataset | string OR DuckPyConnection object OR GeistGraph object | (1) A string indicating the name of a dataset stored on disk, OR (2) a DuckPyConnection object or a GeistGraph object for a dataset in memory | REQUIRED |
inputfile | string | File containing the query | REQUIRED |
isinputpath | bool | True if the inputfile is the file path, otherwise the inputfile is the content | REQUIRED |
hasoutput | bool | True to store the query results as a CSV file or print them out | REQUIRED |
config | dict | A dictionary with configurations when hasoutput=True | see below |
Description for the config parameter:
Name | Type | Description | Default |
---|---|---|---|
outputroot | string | Path of the directory to store the query results | './' |
outputfile | string | Path of the file to store the query results | None |
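The tables above give the defaults for config but do not spell out how hasoutput, outputroot, and outputfile interact. The sketch below encodes one plausible reading, stated as an assumption: with no outputfile the results are printed, otherwise they are written under outputroot. The function describe_output is hypothetical and is not Geist's own code.

```python
import os

# Defaults taken from the config table above
DEFAULT_CONFIG = {"outputroot": "./", "outputfile": None}


def describe_output(hasoutput, config=None):
    """Describe where results would go under the assumed interpretation."""
    if not hasoutput:
        return "results are only returned"
    # User-supplied keys override the documented defaults
    merged = {**DEFAULT_CONFIG, **(config or {})}
    if merged["outputfile"] is None:
        # Assumption: no file name means the results are printed
        return "results are printed"
    path = os.path.join(merged["outputroot"], merged["outputfile"])
    return f"results are written to {path}"


print(describe_output(True, {"outputroot": "out", "outputfile": "res.csv"}))
```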
Example 1: all rows of the df table in the test dataset on disk (query from a string)
There exists a file at .geistdata/duckdb/test.duckdb. The following code returns a Pandas data frame named res with the query results, and a DuckPyConnection object.
import geist
# Query the df table of the test dataset
(res, conn) = geist.query(datastore='duckdb', dataset='test', inputfile="SELECT * FROM df;", isinputpath=False, hasoutput=False)
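To make the expected result concrete, the sketch below runs the same SELECT statement against the three-row test data from the create examples. It uses Python's built-in sqlite3 module purely as a stand-in for DuckDB, so it returns a list of tuples rather than the Pandas data frame that geist.query returns.

```python
import sqlite3

# In-memory stand-in database holding the same rows as test.csv
standin = sqlite3.connect(":memory:")
standin.execute("CREATE TABLE df (v1 INTEGER, v2 INTEGER, v3 INTEGER)")
standin.executemany(
    "INSERT INTO df VALUES (?, ?, ?)",
    [(1, 2, 3), (4, 5, 6), (7, 8, 9)],
)

# The same query string that is passed as inputfile above
rows = standin.execute("SELECT * FROM df;").fetchall()
print(rows)
standin.close()
```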
Example 2: all rows of the df table in the test dataset on disk (query from a file)
There exists a file at .geistdata/duckdb/test.duckdb. The following code returns a Pandas data frame named res with the query results, and a DuckPyConnection object.
Here is the query.txt file:
SELECT * FROM df;
Code:
import geist
# Query the df table of the test dataset
(res, conn) = geist.query(datastore='duckdb', dataset='test', inputfile="query.txt", isinputpath=True, hasoutput=False)
Example 3: all rows of the df table in the test dataset in memory (query from a string)
Suppose conn is a DuckPyConnection object that points to a DuckDB dataset in memory. The following code returns a Pandas data frame named res with the query results, and the same DuckPyConnection object.
import geist
# Query the df table of the test dataset
(res, conn) = geist.query(datastore='duckdb', dataset=conn, inputfile="SELECT * FROM df;", isinputpath=False, hasoutput=False)
Example 4: all rows of the df table in the test dataset in memory (query from a file)
Suppose conn is a DuckPyConnection object that points to a DuckDB dataset in memory. The following code returns a Pandas data frame named res with the query results, and the same DuckPyConnection object.
Here is the query.txt file:
SELECT * FROM df;
Code:
import geist
# Query the df table of the test dataset
(res, conn) = geist.query(datastore='duckdb', dataset=conn, inputfile="query.txt", isinputpath=True, hasoutput=False)
The query method of the Connection class can query a dataset stored on disk or in memory. It is very similar to the query() function. The only difference is that the datastore and dataset parameters do not need to be passed, as they were already specified when the Connection class was initialized.
Parameter descriptions for the query method of the Connection class:
Name | Type | Description | Default |
---|---|---|---|
inputfile | string | File containing the query | REQUIRED |
isinputpath | bool | True if the inputfile is the file path, otherwise the inputfile is the content | REQUIRED |
hasoutput | bool | True to store the query results as a CSV file or print them out | REQUIRED |
config | dict | A dictionary with configurations when hasoutput=True | see below |
Description for the config parameter:
Name | Type | Description | Default |
---|---|---|---|
outputroot | string | Path of the directory to store the query results | './' |
outputfile | string | Path of the file to store the query results | None |
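The inputfile and isinputpath parameters work as a pair in both query() and the Connection query method: the same argument carries either a file path or the query text itself. A minimal sketch of that convention follows; load_query_text is a hypothetical helper written for illustration, not a Geist function, and the temporary query.txt file exists only for the demonstration.

```python
import tempfile
from pathlib import Path


def load_query_text(inputfile, isinputpath):
    """Return the query text, reading it from disk only when inputfile is a path."""
    if isinputpath:
        return Path(inputfile).read_text(encoding="utf-8")
    return inputfile


# Write a temporary query.txt so the file-based branch can be exercised
with tempfile.TemporaryDirectory() as tmp:
    query_path = Path(tmp) / "query.txt"
    query_path.write_text("SELECT * FROM df;\n", encoding="utf-8")
    from_file = load_query_text(str(query_path), isinputpath=True)

from_string = load_query_text("SELECT * FROM df;", isinputpath=False)
print(from_file.strip() == from_string)
```

Either way the same SQL reaches the datastore, which is why the string and file variants of each example return the same data frame.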
Example 1: all rows of the df table in the test dataset on disk (query from a string)
There exists a file at .geistdata/duckdb/test.duckdb. The following code returns a Pandas data frame named res with the query results.
import geist
# Create a Connection instance
connection = geist.Connection.connect(datastore='duckdb', dataset='test')
# Query the df table of the test dataset
res = connection.query(inputfile="SELECT * FROM df;", isinputpath=False, hasoutput=False)
Example 2: all rows of the df table in the test dataset on disk (query from a file)
There exists a file at .geistdata/duckdb/test.duckdb. The following code returns a Pandas data frame named res with the query results.
Here is the query.txt file:
SELECT * FROM df;
Code:
import geist
# Create a Connection instance
connection = geist.Connection.connect(datastore='duckdb', dataset='test')
# Query the df table of the test dataset
res = connection.query(inputfile="query.txt", isinputpath=True, hasoutput=False)
Example 3: all rows of the df table in the test dataset in memory (query from a string)
Suppose conn is a DuckPyConnection object that points to a DuckDB dataset in memory. The following code returns a Pandas data frame named res with the query results.
import geist
# Create a Connection instance
connection = geist.Connection(datastore='duckdb', dataset=':memory:', conn=conn)
# Query the df table of the test dataset
res = connection.query(inputfile="SELECT * FROM df;", isinputpath=False, hasoutput=False)
Example 4: all rows of the df table in the test dataset in memory (query from a file)
Suppose conn is a DuckPyConnection object that points to a DuckDB dataset in memory. The following code returns a Pandas data frame named res with the query results.
Here is the query.txt file:
SELECT * FROM df;
Code:
import geist
# Create a Connection instance
connection = geist.Connection(datastore='duckdb', dataset=':memory:', conn=conn)
# Query the df table of the test dataset
res = connection.query(inputfile="query.txt", isinputpath=True, hasoutput=False)