FaunaDB Importer
FaunaDB Importer is a command line utility to help you import static data into FaunaDB. It can import data into FaunaDB Cloud or an on-premises FaunaDB Enterprise cluster.
Supported input file formats:
- JSON
- CSV
- TSV
Requirements:
- Java 8
Usage
Download the latest version and extract the zip file. Inside the extracted folder, run:
./bin/faunadb-importer \
import-file \
--secret <keys-secret> \
--class <class-name> \
<file-to-import>
NOTE: The command line arguments are the same on Windows, but you must use a different startup script. For example:
.\bin\faunadb-importer.bat import-file --secret <keys-secret> --class <class-name> <file-to-import>
For example:
./bin/faunadb-importer \
import-file \
--secret "abc" \
--class users \
data/users.json
The importer will load all data into the specified class, preserving the field names and types as described in the import file.
You can also type ./bin/faunadb-importer --help for more detailed information.
How it works
The importer is a stateful process separated into two phases: ID generation and data import.
First, the importer will parse all records and generate unique IDs by calling the next_id function for each record. Pre-generating IDs allows us to import schemas containing relational data while keeping foreign keys consistent. It also ensures that we can safely re-run the process without the risk of duplicating information.
In order to map legacy IDs to newly generated Fauna IDs, the importer will:
- Check if there is a field configured with the ref type. The field's value will be used as the lookup term for the new Fauna ID.
- If no field is configured with the ref type, the importer will assign a sequential number to each record as the lookup term for the new Fauna ID.
Once this phase completes, the pre-generated IDs will be stored in the cache directory. In case of a re-run, the importer will load the IDs from disk and skip this phase.
Second, the importer will insert all records into FaunaDB, using the pre-generated IDs from the first step as their ref field.
During this phase, if the import fails to run due to data inconsistency, it is:
- SAFE to fix data inconsistencies in any field except fields configured with the ref type.
- NOT SAFE to change fields configured with the ref type, as they will be used as the lookup term for the pre-generated ID from the first phase.
- NOT SAFE to remove entries from the import file if you don't have a field configured as a ref field; this will alter the sequential number assigned to each record.
As long as you keep the cache directory intact, it is safe to re-run the
process until the import completes. If you want to use the importer again with a
different input file, you must empty the cache directory first.
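For example, here is a minimal sketch of starting over with a different file; the cache path matches the default layout shown below, and data/new-users.json is just a placeholder name:
# Remove the IDs pre-generated for the previous input file
rm -rf cache/*
# Run the importer again with the new input file
./bin/faunadb-importer \
import-file \
--secret "abc" \
--class users \
data/new-users.json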
File structure
.
├── README.md # This file
├── bin #
│ ├── faunadb-importer # Unix startup script
│ └── faunadb-importer.bat # Windows startup script
├── cache # Where the importer saves its cache
├── data # Where you should copy the files you wish to import
├── lib #
│ └── faunadb-importer-1.0.jar # The importer library
└── logs # Logs for each execution
Advanced usage
Configuring fields
When importing JSON files, field names and types are optional; when
importing text files, you must specify each field's name and type in order using
the --format option:
./bin/faunadb-importer \
import-file \
--secret "<your-keys-secret-here>" \
--class <your-class-name> \
--format "<field-name>:<field-type>,..." \
<file-to-import>
For example:
./bin/faunadb-importer \
import-file \
--secret "abc" \
--class users \
--format "id:ref, username:string, vip:bool" \
data/users.csv
Supported types:
| Name | Description |
|---|---|
| string | A string value |
| long | A numeric value |
| double | A double precision numeric value |
| bool | A boolean value |
| ref | A ref value. It can be used to mark the field as a primary key or to reference another class when importing multiple files. For example: city:ref(cities) |
| ts | A numeric value representing the number of milliseconds passed since 1970-01-01 00:00:00. You can also specify your own format as a parameter. For example: ts("dd/MM/yyyyTHH:mm:ss.000Z") |
| date | A date value formatted as yyyy-MM-dd. You can also specify your own format as a parameter. For example: date("dd/MM/yyyy") |
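To show how the ts and date parameters fit into a --format string, here is a sketch; the field names last_login and signup_date are placeholders, and single quotes are used only to keep the inner double quotes intact in the shell:
./bin/faunadb-importer \
import-file \
--secret "abc" \
--class users \
--format 'id:ref, username:string, last_login:ts("dd/MM/yyyyTHH:mm:ss.000Z"), signup_date:date("dd/MM/yyyy")' \
data/users.csv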
Renaming fields
You can rename fields from the input file as they are inserted into FaunaDB with the following syntax:
<field-name>-><new-field-name>:<field-type>
For example:
./bin/faunadb-importer \
import-file \
--secret "abc" \
--class users \
--format "id:ref, username->userName:string, vip->VIP:bool" \
data/users.csv
Ignoring root element
When importing a JSON file where the root element of the file is an array, or
when importing a text file where the first line is the file header, you can skip
the root element with the --skip-root option. For example:
./bin/faunadb-importer \
import-file \
--secret "abc" \
--class users \
--skip-root true \
data/users.csv
Ignoring fields
You can ignore fields with the --ignore-fields option. For example:
./bin/faunadb-importer \
import-file \
--secret "abc" \
--class users \
--format "id:ref, username->userName:string, vip->VIP:bool" \
--ignore-fields "id" \
data/users.csv
NOTE: In the above example, we omit the id field when importing the data into FaunaDB, but we still use the id field as the ref type so that the importer tool will properly map the newly-generated Fauna ID for that specific user.
How to maintain data in chronological order
You can maintain chronological order when importing data by using the
--ts-field option. For example:
./bin/faunadb-importer \
import-file \
--secret "abc" \
--class users \
--ts-field "created_at" \
data/users.csv
The value configured in the --ts-field option will be used as the ts
field for the imported instance.
Importing to your own cluster
By default, the importer will load your data into FaunaDB Cloud. If you wish to
import the data to your own cluster, you can use the --endpoints option. For
example:
./bin/faunadb-importer \
import-file \
--secret "abc" \
--class users \
--endpoints "http://10.0.0.120:8443, http://10.0.0.121:8443" \
data/users.csv
NOTE: The importer will load balance requests across all configured endpoints.
Importing multiple files
In order to import multiple files, you must run the importer with a schema definition file. For example:
./bin/faunadb-importer \
import-schema \
--secret "abc" \
data/my-schema.yaml
Schema definition syntax
<file-address>:
class: <class-name>
skipRoot: <boolean>
tsField: <field-name>
fields:
- name: <field-name>
type: <field-type>
rename: <new-field-name>
ignoredFields:
- <field-name>
For example:
data/users.json:
class: users
fields:
- name: id
type: ref
- name: name
type: string
ignoredFields:
- id
data/tweets.csv:
class: tweets
tsField: created_at
fields:
- name: id
type: ref
- name: user_id
type: ref(users)
rename: user_ref
- name: text
type: string
rename: tweet
ignoredFields:
- id
- created_at
Performance considerations
The importer's default settings should provide good performance in most cases. Still, there are a few things worth mentioning:
Memory
You can set the maximum amount of memory available to the import tool with
-J-Xmx. For example:
./bin/faunadb-importer \
-J-Xmx10G \
import-schema \
--secret "abc" \
data/my-schema.yaml
NOTE: Parameters prefixed with -J must be placed as the first parameters for
the import tool.
Batch sizes
The size of each individual batch is controlled by the --batch-size parameter.
In general, individual requests will have a higher latency with a larger batch size. However, the overall throughput of the import process may increase by inserting more records in a single request.
Large batches can exceed the maximum size of an HTTP request, forcing the import tool to split the batch into smaller requests and degrading overall performance.
Default: 50 records per batch.
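For example, to try larger batches (the value 200 here is arbitrary and should be tuned against your own data and latency budget):
./bin/faunadb-importer \
import-file \
--secret "abc" \
--class users \
--batch-size 200 \
data/users.csv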
Managing concurrency
Concurrency is configured using the --concurrent-streams parameter.
A large number of concurrent streams can cause timeouts. When timeouts happen, the import tool will retry the failing requests, applying exponential backoff to each one.
Default: the number of available processors * 2
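For example, a sketch that lowers concurrency if you are seeing frequent timeouts (the value 4 is illustrative):
./bin/faunadb-importer \
import-file \
--secret "abc" \
--class users \
--concurrent-streams 4 \
data/users.csv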
Backoff configuration
Exponential backoff is a combination of the following parameters:
- network-errors-backoff-time: The number of seconds to delay new requests when the network is unstable. Default: 1 second.
- network-errors-backoff-factor: The number to multiply network-errors-backoff-time by per network issue detected; not to exceed max-network-errors-backoff-time. Default: 2.
- max-network-errors-backoff-time: The maximum number of seconds to delay new requests when applying exponential backoff. Default: 60 seconds.
- max-network-errors: The maximum number of network errors tolerated within the configured timeframe. Default: 50 errors.
- reset-network-errors-period: The number of seconds the import tool will wait for a new network error before resetting the error count. Default: 120 seconds.
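As a sketch, and assuming these settings are passed as regular --option values in the same form as the other parameters above, a more tolerant backoff configuration might look like this (all values are illustrative):
./bin/faunadb-importer \
import-file \
--secret "abc" \
--class users \
--network-errors-backoff-time 2 \
--network-errors-backoff-factor 3 \
--max-network-errors-backoff-time 120 \
data/users.csv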
License
All projects in this repository are licensed under the Mozilla Public License