18 March 2024

Importing Data Using CSV Files

A comma-separated values (CSV) file is a popular way to represent simple text data and exchange it between systems. Fields in each data record are separated by a comma (or another delimiter). GigaSpaces provides a CsvReader class to import data from CSV files into the data grid. The CsvReader is a relatively simple way to read moderate amounts of CSV data, which is useful when first getting started with GigaSpaces products or when prototyping scenarios.

Usage

The CsvReader class reads CSV files and returns a stream of Space entries. The first line of the CSV file must be metadata, listing the field names (and optionally the field types). You can read the CSV file into the data grid using either POJOs or documents.
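To make the metadata line concrete, here is a minimal sketch of how a header such as `id:int,name:string,birthday` could be split into field names and optional types. It is an illustration only, not the CsvReader implementation: the class name `CsvHeaderSketch`, the method `parseHeader`, and the type tokens `int` and `string` are assumptions made for this example. The colon is used because it is the default metadata separator described later in this topic.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of the GigaSpaces API).
final class CsvHeaderSketch {

    // Splits a metadata line into field name -> type token.
    // A field without a type token maps to null.
    static Map<String, String> parseHeader(String line, String valuesSeparator, String metadataSeparator) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String column : line.split(Pattern.quote(valuesSeparator))) {
            // Limit of 2 keeps everything after the first separator as the type token.
            String[] parts = column.trim().split(Pattern.quote(metadataSeparator), 2);
            fields.put(parts[0].trim(), parts.length > 1 ? parts[1].trim() : null);
        }
        return fields;
    }

    public static void main(String[] args) {
        // The type tokens below are placeholders; check the product documentation
        // for the type names that the CsvReader actually accepts.
        System.out.println(parseHeader("id:int,name:string,birthday", ",", ":"));
    }
}
```

In this sketch the `birthday` column carries no type, which corresponds to the case where the field types are omitted from the metadata.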

Loading POJOs to the Data Grid

Many developers prefer to work with POJOs because they provide strong typing. To load POJOs from a CSV file, you need to provide the file path and the POJO type. For example, the following code streams Person entries from persons.csv and writes them to a Space:

GigaSpace gigaSpace = ...;
new CsvReader().read(Paths.get("persons.csv"), Person.class).forEach(gigaSpace::write);

The data in the CSV file must be compatible with the POJO class definition (property names and types). The CSV file may contain a subset of the properties, and the missing properties will maintain their default values (usually null or 0).
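As a rough illustration of that compatibility rule, the sketch below shows a hypothetical Person class for a CSV file whose metadata lists `id`, `name` and `age`. The property names and types are assumptions chosen for this example, and Space-specific annotations (such as an ID annotation) are left out. It also shows why missing properties end up as null or 0: those are the Java defaults for uninitialized fields.

```java
// Hypothetical POJO, for illustration only. In a real project the class would
// normally be declared public and live in its own Person.java file.
class Person {
    private Integer id;   // object reference: stays null if the CSV has no "id" column
    private String name;  // object reference: stays null if the CSV has no "name" column
    private int age;      // primitive: stays 0 if the CSV has no "age" column

    // A no-argument constructor lets the object be created before properties are set.
    public Person() { }

    public Integer getId() { return id; }
    public void setId(Integer id) { this.id = id; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}
```

If persons.csv contained only the `name` column, each loaded Person would be expected to have `id` equal to null and `age` equal to 0.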

Loading Documents to the Data Grid

You need a type descriptor in order to load documents from a CSV file to the data grid.

Registered Data Type

If the type descriptor is already registered in the Space, you can load the data using one of the following methods.

Retrieve the type descriptor and use it to read the documents, as in the following code example:

GigaSpace gigaSpace = ...;
SpaceTypeDescriptor typeDesc = gigaSpace.getTypeManager().getTypeDescriptor("person");
new CsvReader().read(Paths.get("persons.csv"), typeDesc).forEach(gigaSpace::write);

If the CSV file metadata contains the optional field types, you can specify the type name and the CsvReader will automatically generate a compatible type descriptor and read the documents, as in the following code example:
```
GigaSpace gigaSpace = ...;
new CsvReader().read(Paths.get("persons.csv"), "person").forEach(gigaSpace::write);
```

Unregistered Data Type

If the type is not registered in the Space, first register it using one of these methods.

If the CSV file contains metadata with field types, you can use the readSchema method to read it into a type descriptor builder and register it, as in the following code example:

SpaceTypeDescriptorBuilder typeDescriptorBuilder = new CsvReader().readSchema(
       Paths.get("persons.csv"), "person");
// Optional: enhance type descriptor if needed, e.g. space id or index
gigaSpace.getTypeManager().registerTypeDescriptor(typeDescriptorBuilder.create());

If the CSV file metadata doesn't contain field types, create the type descriptor with the SpaceTypeDescriptorBuilder, configure it with the property names and types, and register it. After doing this, you can retrieve the type descriptor to read the entries from the CSV file, as described above.

Customized Parsing

The CsvReader supports customization for individual scenarios using an inner helper class called Builder. See the following cases with examples.

Customizing the Value Separator

In this example, the CsvReader is configured to separate the values using a space instead of a comma:

CsvReader csvReader = CsvReader.builder().valuesSeparator(" ").build();

Customizing the Metadata Separator

In this example, the metadata field names and types are separated using a hash sign (#) instead of a colon:

CsvReader csvReader = CsvReader.builder().metadataSeparator("#").build();

Customizing the Type Parsing

The CsvReader handles parsing for common types using system standards, but can be customized to handle custom types or formats using the addParser method. For example, if your dates are stored as "year.month.day" (such as 2024.03.18), customize the CsvReader as follows:

DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy.MM.dd");
CsvReader reader = CsvReader.builder()
        .addParser(LocalDate.class.getName(), LocalDate.class, s -> LocalDate.parse(s, formatter))
        .build();
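The parsing function itself can be tried on its own with only the JDK. The sketch below uses the same `yyyy.MM.dd` pattern and shows why a custom parser is needed: the default `LocalDate.parse` expects the ISO form (2024-03-18) and rejects the dotted form. The class name `DotDateParserSketch` is hypothetical, and typing the lambda as `java.util.function.Function` is an assumption made here; the parameter type that addParser actually expects may differ.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.function.Function;

// Hypothetical stand-alone check of the date-parsing lambda (JDK only).
final class DotDateParserSketch {

    static final DateTimeFormatter DOT_FORMAT = DateTimeFormatter.ofPattern("yyyy.MM.dd");

    // Same logic as the lambda passed to addParser in the example above.
    static final Function<String, LocalDate> PARSER = s -> LocalDate.parse(s, DOT_FORMAT);

    public static void main(String[] args) {
        // Parses a dotted date with the custom formatter.
        System.out.println(PARSER.apply("2024.03.18"));

        // The default ISO parser does not accept the dotted form.
        try {
            LocalDate.parse("2024.03.18");
        } catch (DateTimeParseException e) {
            System.out.println("Default ISO parsing rejects the dotted form");
        }
    }
}
```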

Performance

If you intend to use the CsvReader to read large amounts of data to the data grid, there are some issues that should be taken into consideration.

Writing to the Space

The examples shown in this topic simply write each entry to the Space. If your CSV file contains a large amount of data and you want to improve performance, you can group the entries and write them in batches using the GigaSpace.writeMultiple method.
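One possible way to group the entries is sketched below using only the JDK. The helper `forEachBatch` and the class `BatchingSketch` are hypothetical names introduced for this example; the helper drains a stream sequentially and hands each full batch (plus a final partial batch) to a callback.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Stream;

// Hypothetical helper, for illustration only (not part of the GigaSpaces API).
final class BatchingSketch {

    // Reads the stream one element at a time and passes groups of up to
    // batchSize elements to the sink, so only one batch is held in memory.
    static <T> void forEachBatch(Stream<T> entries, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        Iterator<T> it = entries.iterator();
        while (it.hasNext()) {
            batch.add(it.next());
            if (batch.size() == batchSize) {
                sink.accept(batch);
                batch = new ArrayList<>(batchSize); // start a fresh batch
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch); // final partial batch
        }
    }

    public static void main(String[] args) {
        // Five elements in batches of two: two full batches and one partial batch.
        forEachBatch(Stream.of(1, 2, 3, 4, 5), 2, batch -> System.out.println(batch));
    }
}
```

With a Space, the sink could be something along the lines of `batch -> gigaSpace.writeMultiple(batch.toArray())`, with the stream coming from `CsvReader.read`. Treat that call shape as an assumption to verify against the API reference, and tune the batch size to your entry size and network conditions.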

For more information, see the Operations topic in the Developer guide.

Parallelism

The CsvReader class uses the Files.lines method to read the file lazily as a Java 8 stream. Instead of reading all the lines into memory and then processing them, lines are read and parsed one at a time, producing a stream of Space entries. While this technique reads large files without consuming excessive system resources, it is still limited to a single core at a time. If you need to further improve performance by reading in parallel, use one of the following methods:

Split the work across multiple threads (such as a thread per file if you have multiple files).
(For GigaSpaces users) Leverage Spark to parallelize the work across multiple workers.
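The thread-per-file option can be sketched with a standard fixed-size thread pool. The class `ParallelLoadSketch` and the method `loadAll` are hypothetical names for this example; the `loader` callback stands in for whatever reads one file and writes its entries to the Space.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical helper, for illustration only (not part of the GigaSpaces API).
final class ParallelLoadSketch {

    // Submits one task per file to a fixed-size pool and waits for all of them.
    static void loadAll(List<Path> files, int threads, Consumer<Path> loader)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Path file : files) {
                pending.add(pool.submit(() -> loader.accept(file)));
            }
            for (Future<?> task : pending) {
                task.get(); // a failure in the loader surfaces here as ExecutionException
            }
        } finally {
            pool.shutdown(); // stop accepting new tasks once the work is done
        }
    }
}
```

A loader might look like `file -> new CsvReader().read(file, Person.class).forEach(gigaSpace::write)`, assuming the GigaSpace instance can be shared across threads in your setup and that any exception thrown by the read call is handled inside the lambda.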