Ndarray Data Language
- Scope
- Status of this Document
- Syntax
- File
- Datatype
5.1. String
5.2. Integer
5.3. Floating-point
5.4. Opaque
5.5. Enum
5.6. Object Reference
5.7. Region Reference
5.8. Compound
5.9. Vlen
5.10. Array
- Shape
- Attribute
- DimCoord
- Ndarray
- Group
- Storage Directives
Scope
This document specifies the syntax of the Ndarray Data Language. The language provides a text format for describing the content of files with multidimensional array (ndarray, data cube) data in many storage formats. However, the language is not intended to be another storage format and is best suited for creation and exchange of file content information.
Status of this Document
The current content is a work in progress. For those who prefer semantic version identifiers, let’s say it is at version 0.6.1.
Each new version of the document supersedes the previous one.
Syntax
Ndarray Data Language is based on YAML (YAML Ain’t Markup Language), a data serialization language. YAML was chosen because it is mature, easy for humans to read and write, supported in many programming languages, and flexible enough to represent ndarray file content information. The Ndarray Data Language’s syntax is not tied to a specific YAML version because no future YAML version is expected to introduce backward-incompatible features.
Ndarray Data Language describes file content using seven entities: File, Group, Ndarray, DimCoord, Attribute, Datatype and Shape. Their explanation and YAML syntax are detailed below.
File
The File entity represents content information of one ndarray file and is encoded as one YAML document. If this document is saved to a file, that file should have the same name as the ndarray file but with the extension `yaml` or `yml`. For example, if the ndarray file is named `example.fmt`, the YAML document’s file should be `example.yaml` or `example.yml`.
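The naming rule above can be sketched in a few lines of Python. The helper name `description_filename` is hypothetical and not part of the language, and the sketch assumes that only the last extension of the ndarray file name is replaced.

```python
from pathlib import PurePosixPath


def description_filename(ndarray_file, extension="yaml"):
    """Return the suggested file name of the YAML document (hypothetical helper)."""
    if extension not in ("yaml", "yml"):
        raise ValueError("extension must be 'yaml' or 'yml'")
    # Swap the final extension; a name without an extension simply gains one.
    return str(PurePosixPath(ndarray_file).with_suffix("." + extension))


print(description_filename("example.fmt"))         # example.yaml
print(description_filename("example.fmt", "yml"))  # example.yml
```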
Datatype
The Datatype entity declares the type of data that other entities can hold. This information is given in the `type` key. The currently supported datatype categories and their declaration syntax are described in the following subsections.
String
The String datatype is indicated with:
type: string
Since YAML is a Unicode-based data language, the `string` datatype keyword always represents a sequence of Unicode characters (code points) regardless of the actual storage in an ndarray file.
Integer
The accepted integer datatype keywords and the value sets they represent are listed in the table below:
Keyword | Type | Value Set |
---|---|---|
`int8` | 8-bit integer | -128 to 127 |
`uint8` | unsigned 8-bit integer | 0 to 255 |
`int16` | 16-bit integer | -32768 to 32767 |
`uint16` | unsigned 16-bit integer | 0 to 65535 |
`int32` | 32-bit integer | -2147483648 to 2147483647 |
`uint32` | unsigned 32-bit integer | 0 to 4294967295 |
`int64` | 64-bit integer | -9223372036854775808 to 9223372036854775807 |
`uint64` | unsigned 64-bit integer | 0 to 18446744073709551615 |
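The value sets in the table follow directly from the bit widths, so a processing tool does not need to hard-code them. The sketch below derives them in Python; the names `INTEGER_VALUE_SETS` and `fits` are illustrative assumptions, not part of the language.

```python
# Derive each keyword's (minimum, maximum) from its bit width.
INTEGER_VALUE_SETS = {}
for bits in (8, 16, 32, 64):
    INTEGER_VALUE_SETS[f"int{bits}"] = (-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    INTEGER_VALUE_SETS[f"uint{bits}"] = (0, 2 ** bits - 1)


def fits(keyword, value):
    """Return True if value belongs to the value set of the given keyword."""
    low, high = INTEGER_VALUE_SETS[keyword]
    return low <= value <= high


print(INTEGER_VALUE_SETS["int16"])  # (-32768, 32767)
print(fits("uint8", 255))           # True
print(fits("int8", 128))            # False
```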
Floating-point
The supported floating-point datatype keywords are:
Keyword | Description |
---|---|
`float32` | IEEE 754 single-precision (32 bit) floating-point |
`float64` | IEEE 754 double-precision (64 bit) floating-point |
Opaque
The Opaque datatype represents a fixed-length sequence of bytes without specific interpretation. This datatype allows storing what is known as a binary large object (BLOB). An optional text-valued tag provides a description. Below is an example for storing images in the PNG format. The maximum image size is 64,000 bytes, and the tag holds the MIME type of this image format to help other applications interpret the bytes correctly.
type:
  opaque:
    size: 64000
    tag: image/png
Enum
The Enum datatype describes a set of named integer constants. Names of the constants can be any Unicode string. The integer datatype information is optional although it is recommended to include it. If missing, an integer datatype with the smallest value set still capable of representing all the constants will be assumed.
The example below shows an Enum datatype with three named constants, `UP`, `DOWN`, and `CENTER`, and their values. The `base` key holds the integer datatype.
type:
  enum:
    base: int8
    members:
      UP: 0
      DOWN: 25
      CENTER: -120
This example illustrates an Enum datatype without the `base` key:
type:
  enum:
    members:
      OFF: 0
      ON: 1
      UNDEFINED: 255
When this datatype description is processed, `uint8` will be assumed since its value set is sufficient to store all three constant values.
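One way a tool might pick the integer datatype when `base` is missing is sketched below in Python. The function name `infer_enum_base` is hypothetical. The candidate order (narrower widths first, and the signed type before the unsigned one at the same width) is an assumption, since the text above only asks for the smallest value set that can hold all constants.

```python
def infer_enum_base(members):
    """Pick an integer datatype keyword for an Enum without a base key.

    members maps constant names to their integer values.
    """
    if not members:
        raise ValueError("an Enum needs at least one member")
    low, high = min(members.values()), max(members.values())
    for bits in (8, 16, 32, 64):
        # Assumption: at equal width the signed type is tried first.
        if -(2 ** (bits - 1)) <= low and high <= 2 ** (bits - 1) - 1:
            return f"int{bits}"
        if low >= 0 and high <= 2 ** bits - 1:
            return f"uint{bits}"
    raise ValueError("no supported integer datatype can hold all members")


print(infer_enum_base({"OFF": 0, "ON": 1, "UNDEFINED": 255}))  # uint8
print(infer_enum_base({"UP": 0, "DOWN": 25, "CENTER": -120}))  # int8
```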
Object Reference
This datatype is similar to pointers in some programming languages. Values point to other objects in the same file. Its declaration is:
type: objref
Region Reference
A value of the region reference datatype points to a selection of elements of one ndarray in the same file. Ndarray elements can be selected in two ways: `block` and `element`.
`block` selections are contiguous subsets of an ndarray’s elements, of the same rank as the ndarray. Each block is described by the two ndarray elements at its diagonally opposite corners. Each corner element is given as a tuple of its dimension indices.
`element` selections are collections of individual ndarray elements. Each ndarray element is specified as a tuple of its dimension indices.
The two syntax forms for this datatype are shown below:
type:
  regref:
    selection: block

type:
  regref:
    selection: element
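The relationship between the two selection kinds can be illustrated by expanding a block into the individual elements it covers. This Python sketch assumes that both corner elements belong to the block (inclusive bounds) and that the corners may be given in either order; neither point is stated above. The name `block_elements` is hypothetical.

```python
from itertools import product


def block_elements(corner_a, corner_b):
    """List the index tuples of all elements inside a block selection."""
    if len(corner_a) != len(corner_b):
        raise ValueError("both corners must have the same rank")
    # Assumption: corners are inclusive and may come in any order.
    axes = [range(min(a, b), max(a, b) + 1) for a, b in zip(corner_a, corner_b)]
    return list(product(*axes))


print(block_elements((0, 0), (1, 2)))
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```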
Compound
This datatype represents a sequence of named members of other datatypes. For example:
type:
  compound:
    - x: float32
    - y: int32
    - z: float64
describes a compound datatype with three members, named `x`, `y`, and `z`, and their datatypes.
Vlen
The Vlen datatype represents a variable-length sequence of elements of another datatype. Because the number of elements is not fixed, the datatype’s syntax involves only the element datatype in the `base` key:
type:
  vlen:
    base: uint8
The above declares a variable-length sequence of `uint8` values.
Array
The Array datatype represents an element value as another array of fixed rank and extent where all elements are of some other datatype. This datatype is described with two keys, `base` and `shape`:
type:
  array:
    base: float32
    shape: [3, 3]
In this example the Array datatype is a two-dimensional 3-by-3 array of `float32` values.
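The declarations in this section take one of two forms: a plain keyword (for example `string` or `int32`) or a map with a single key naming the category (for example `vlen`). A minimal Python sketch of how a tool might tell the categories apart is given below. It assumes the YAML has already been loaded into plain Python values, and the function name `datatype_category` is hypothetical.

```python
INTEGER_KEYWORDS = {f"{prefix}{bits}" for prefix in ("int", "uint") for bits in (8, 16, 32, 64)}
FLOAT_KEYWORDS = {"float32", "float64"}
MAP_KEYWORDS = {"opaque", "enum", "regref", "compound", "vlen", "array"}


def datatype_category(type_value):
    """Classify the value of a type key (already parsed from YAML)."""
    if isinstance(type_value, str):
        if type_value in ("string", "objref"):
            return type_value
        if type_value in INTEGER_KEYWORDS:
            return "integer"
        if type_value in FLOAT_KEYWORDS:
            return "floating-point"
    elif isinstance(type_value, dict) and len(type_value) == 1:
        (keyword,) = type_value  # the single key of the map
        if keyword in MAP_KEYWORDS:
            return keyword
    raise ValueError(f"unrecognized datatype: {type_value!r}")


print(datatype_category("uint16"))                                     # integer
print(datatype_category({"array": {"base": "float32", "shape": [3, 3]}}))  # array
```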
Shape
The Shape entity defines the rank (number of dimensions) and extent (the size of each dimension) of an ndarray. This information is provided in the `shape` key:
shape: [10, 20, 30]
or alternatively:
shape:
  - 10
  - 20
  - 30
Both of these examples define a three-dimensional ndarray with the size of its dimensions: 10, 20, and 30, respectively.
Unlimited dimension size is declared as `null`:
shape: [null, 20, 30]
The above example indicates that the first dimension is of unlimited size.
A scalar (zero-dimension) ndarray is declared with:
shape: []
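The three cases above (fixed, unlimited, and scalar) can be summarized with a short Python sketch. It assumes the `shape` value has been loaded from YAML so that `null` became `None`, and it handles only integer or `null` entries. The function name `describe_shape` and the keys of its result are hypothetical. Treating a size of zero as valid is also an assumption.

```python
import math


def describe_shape(shape):
    """Summarize a shape list whose entries are integers or None (null)."""
    for size in shape:
        if size is None:
            continue
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(size, bool) or not isinstance(size, int) or size < 0:
            raise ValueError(f"invalid dimension size: {size!r}")
    unlimited = [axis for axis, size in enumerate(shape) if size is None]
    return {
        "rank": len(shape),
        "unlimited_axes": unlimited,
        # The element count is unknown when any dimension is unlimited.
        "elements": None if unlimited else math.prod(shape),
    }


print(describe_shape([10, 20, 30]))
# {'rank': 3, 'unlimited_axes': [], 'elements': 6000}
print(describe_shape([None, 20, 30]))
# {'rank': 3, 'unlimited_axes': [0], 'elements': None}
print(describe_shape([]))
# {'rank': 0, 'unlimited_axes': [], 'elements': 1}
```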
Attribute
Some ndarray file formats support assigning properties to other objects in the same file. These properties usually provide contextual information, known as metadata, for the rest of the file’s data. They are represented with the Attribute entity which consists of name (Unicode string), shape, datatype, and value.
The required key `attributes` holds a nested map with one or more attributes:
attributes:
  a:
    shape: []
    type: string
    value: Ηελλο ωορλδ
  b:
    shape: []
    type: int32
    value: 10
  same_as_a: Ηελλο ωορλδ
  same_as_b: 10
  state:
    shape: [3]
    type: string
    value: [power on, power off, error]
The above example includes five attributes named `a`, `b`, `same_as_a`, `same_as_b`, and `state`. The `a` and `b` attributes show the complete syntax. Typically several attributes are assigned to the same file object, so there is also a simpler syntax, as shown by the `same_as_a` and `same_as_b` attributes. Omitting the `shape` and `type` keys keeps the attribute description succinct. In such cases the attribute’s value is treated as a scalar of the best-matching datatype. The `state` attribute is an example where the short form cannot apply because it is not scalar-valued.
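The short form can be expanded into the complete form mechanically. The Python sketch below shows one possible way, assuming the attributes map has already been loaded from YAML. The "best match" rules used here (text becomes `string`, floating-point numbers become `float64`, integers become `int32` when they fit and a wider type otherwise) are assumptions chosen to agree with the `same_as_a` and `same_as_b` example; the text above does not define them. The function names are hypothetical.

```python
def guess_scalar_type(value):
    """Guess a datatype keyword for a short-form attribute value (assumed rules)."""
    if isinstance(value, bool):
        # No boolean datatype is described in this document.
        raise TypeError("boolean values have no matching datatype")
    if isinstance(value, str):
        return "string"
    if isinstance(value, float):
        return "float64"
    if isinstance(value, int):
        if -(2 ** 31) <= value <= 2 ** 31 - 1:
            return "int32"
        if -(2 ** 63) <= value <= 2 ** 63 - 1:
            return "int64"
        if 0 <= value <= 2 ** 64 - 1:
            return "uint64"
    raise TypeError(f"no matching datatype for {value!r}")


def normalize_attributes(attributes):
    """Return a copy of the attributes map with every entry in complete form."""
    complete = {}
    for name, entry in attributes.items():
        if isinstance(entry, dict) and "value" in entry:
            complete[name] = entry  # already in the complete form
        else:
            complete[name] = {"shape": [], "type": guess_scalar_type(entry), "value": entry}
    return complete


print(normalize_attributes({"same_as_b": 10}))
# {'same_as_b': {'shape': [], 'type': 'int32', 'value': 10}}
```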
DimCoord
The DimCoord entity describes dimension coordinates, one-dimensional ndarrays whose elements are mapped to the indices of another ndarray’s dimension. A dimension coordinate is defined by: name (Unicode string), size (a positive integer or `null`), datatype, and, optionally, its values.
dimcoords:
  x:
    size: null
    type: float64
    attributes:
      what: x coordinate
  y:
    size: 6
    type: float32
    value: [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
    attributes:
      what: y coordinate
The above example illustrates how the DimCoord entity is applied. The required key `dimcoords` contains a nested map with the description of one or more dimension coordinates. Here, it contains two dimension coordinates named `x` and `y`. The size of `x` is `null`, which indicates an unlimited size, whereas `y` is of size 6. The `y` dimension coordinate also includes its values in the optional `value` key.
Dimension coordinates can have zero or more attributes, declared with the `attributes` key.
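A tool reading such a description may want to check that a dimension coordinate is self-consistent. The Python sketch below assumes the YAML has been loaded into plain dictionaries and that, when both a fixed `size` and a `value` list are present, the number of values must equal the size. That last rule is an assumption suggested by the `y` example, not a stated requirement. The function name `check_dimcoord` is hypothetical.

```python
def check_dimcoord(name, dimcoord):
    """Raise ValueError if a dimension coordinate description is inconsistent."""
    size = dimcoord.get("size")
    values = dimcoord.get("value")
    if size is not None:
        if isinstance(size, bool) or not isinstance(size, int) or size <= 0:
            raise ValueError(f"{name}: size must be a positive integer or null")
        # Assumption: a fixed size must match the number of given values.
        if values is not None and len(values) != size:
            raise ValueError(f"{name}: expected {size} values, got {len(values)}")


dimcoords = {
    "x": {"size": None, "type": "float64"},
    "y": {"size": 6, "type": "float32", "value": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]},
}
for name, dimcoord in dimcoords.items():
    check_dimcoord(name, dimcoord)  # passes silently for both
```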
Ndarray
The Ndarray entity applies to any ndarray in a file that holds data. Its syntax consists of name (Unicode string), shape (rank and extent information), datatype, and, optionally, value. Ndarrays can have zero or more attributes. The example below illustrates all this:
ndarrays:
  z:
    shape: [10, 20]
    type: float64
    attributes:
      description: Values of z
  vector:
    shape:
      - 50
      - 60
      - 40
    type:
      compound:
        - x: float32
        - y: float32
        - z: float32
    attributes:
      description: velocity
The `ndarrays` key holds a nested map with ndarray descriptions. In this example there are two ndarrays named `z` and `vector`. `z` is a two-dimensional, 10-by-20 ndarray of `float64` elements. `vector` is a three-dimensional, 50-by-60-by-40 ndarray of a compound datatype with three members: `x`, `y`, and `z`.
Dimension coordinates can be used when describing dimension sizes (extent) of ndarrays. In such cases the size of the ndarray’s dimension will be equal to the size of the dimension coordinate and, also, the dimension coordinate’s values are to be interpreted as the coordinates along that ndarray’s dimension. Each index of that ndarray’s dimension is mapped to one dimension coordinate value.
Group
The Group entity allows hierarchical organization of other file content. Its use is optional because not all ndarray file formats have this capability. One Group can contain zero or more Groups and there is no limit on the grouping depth.
Since the Group entity’s functionality is similar to the directory/folder in a file system, the same semantic naming scheme is adopted in the Ndarray Data Language. The example below describes a file with content in three groups:
/:
  attributes:
    a: This is / group attribute
  dimcoords:
    d1:
      size: 150
      type: float32
  ndarrays:
    n:
      shape: [1967, 45]
      type: float64
/group1:
  attributes:
    a: This is /group1 attribute
  dimcoords:
    d2:
      size: null
      type: float32
/group2/subgroup1:
  attributes:
    c: This is /group2/subgroup1 attribute
  ndarrays:
    nd:
      shape:
        - /group1/d2
        - /d1
Each group is identified with a path name. The group with the path name `/` is called the root group and is the start of the group hierarchy. Every other group’s path name must begin with the `/` character. The order of the groups is not important. For non-hierarchical file formats there will be no groups, and all content will be treated as being in the root group without the requirement to state this explicitly.
Each group can have zero or more attributes, dimension coordinates, or ndarrays. Similar to group path names, every dimension coordinate and ndarray in a group has a path name, which is constructed from the group’s path name and the dimcoord/ndarray name, delimited by the `/` character. In the above example, the dimension coordinate path names are `/d1` and `/group1/d2`; the ndarray path names are `/n` and `/group2/subgroup1/nd`.
The shape of the `/group2/subgroup1/nd` ndarray is described using the dimension coordinates `/group1/d2` and `/d1`. This means that the ndarray is two-dimensional with the extent [`null`, 150]. Also, this association signals that the indices of the ndarray’s first dimension are mapped to the values of `/group1/d2`, and the indices of the ndarray’s second dimension are mapped to the values of `/d1`.
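The path-name construction and the shape lookup described above can be sketched in Python as follows. The sketch assumes a description that uses groups and has already been loaded from YAML into nested dictionaries, with `null` loaded as `None`. The function names `collect_dimcoords` and `resolve_shape` are hypothetical.

```python
def collect_dimcoords(file_doc):
    """Map every dimension coordinate's path name to its description."""
    table = {}
    for group_path, group in file_doc.items():
        for name, dimcoord in (group.get("dimcoords") or {}).items():
            # "/" + name for the root group, group_path + "/" + name otherwise.
            table[group_path.rstrip("/") + "/" + name] = dimcoord
    return table


def resolve_shape(shape, dimcoords):
    """Replace dimension coordinate path names in a shape with their sizes."""
    return [dimcoords[dim]["size"] if isinstance(dim, str) else dim for dim in shape]


file_doc = {
    "/": {
        "dimcoords": {"d1": {"size": 150, "type": "float32"}},
        "ndarrays": {"n": {"shape": [1967, 45], "type": "float64"}},
    },
    "/group1": {"dimcoords": {"d2": {"size": None, "type": "float32"}}},
    "/group2/subgroup1": {"ndarrays": {"nd": {"shape": ["/group1/d2", "/d1"]}}},
}

dimcoords = collect_dimcoords(file_doc)
nd = file_doc["/group2/subgroup1"]["ndarrays"]["nd"]
print(sorted(dimcoords))                      # ['/d1', '/group1/d2']
print(resolve_shape(nd["shape"], dimcoords))  # [None, 150]
```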
Storage Directives
These directives provide specific storage details of dimension coordinate, ndarray, or attribute values. They are optional because support varies among file formats. No directive has a default value when it is not given. All directives are located in a nested map under the `storage` key.
Currently supported directives are explained below.
- `shape`: Only allowed for ndarrays. A list that specifies the actual extent of the stored values. The number of its elements must equal the ndarray’s rank, and each element’s value must be less than or equal to the ndarray’s corresponding dimension size.
- `size`: Only allowed for dimension coordinates. The actual size (number) of stored values.
- `chunk`: Some file formats support storing ndarray values in a number of chunks (byte streams), also called tiles. Chunk size is defined with a list of the same length as the ndarray’s rank, where each value declares the number of ndarray elements along the corresponding dimension.
- `filter`: Describes the data processing pipeline as a list where each element is one process. The processes apply sequentially to outbound (write) data and in reverse order to inbound (read) data.
- `endian`: Byte order for numbers. Two valid values: `little`, `big`.
- `charset`: String character set.
- `fillvalue`: The default value of ndarray or dimension coordinate elements in the absence of actual data.
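The constraints stated for the `shape`, `chunk`, and `endian` directives lend themselves to a simple check. The Python sketch below assumes the ndarray’s shape has already been resolved to integers or `None` (for unlimited dimensions) and that the `storage` map has been loaded from YAML. Accepting any stored size along an unlimited dimension is an assumption. The function name `check_storage` is hypothetical.

```python
def check_storage(ndarray_shape, storage):
    """Raise ValueError if storage directives contradict the ndarray's shape."""
    rank = len(ndarray_shape)

    stored = storage.get("shape")
    if stored is not None:
        if len(stored) != rank:
            raise ValueError("storage shape must have one element per dimension")
        for stored_size, size in zip(stored, ndarray_shape):
            # Assumption: an unlimited (None) dimension accepts any stored size.
            if size is not None and stored_size > size:
                raise ValueError("stored extent exceeds the ndarray's extent")

    chunk = storage.get("chunk")
    if chunk is not None and len(chunk) != rank:
        raise ValueError("chunk must have one element per dimension")

    endian = storage.get("endian")
    if endian is not None and endian not in ("little", "big"):
        raise ValueError("endian must be 'little' or 'big'")


# Passes silently: 5 x 20 values stored in a 10 x 20 ndarray, chunked 5 x 10.
check_storage([10, 20], {"shape": [5, 20], "chunk": [5, 10], "endian": "little"})
```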