Metadata of datasets
A central aspect of the model data explorer is the metadata of the datasets. We do not want to mimic a metadata portal such as GeoNetwork, but we nevertheless need information to
let the user know what he or she is looking at
link datasets to groups
link datasets to people
Each author and group has a dedicated site where all the related datasets are listed. So we need to find a way to uniquely identify authors and associate the datasets with them.
We will implement a manual workflow where you can select the authors and edit the metadata through a web interface, but it should also be possible to interpret metadata standards automatically.
In the following sections, we will describe what metadata we implement in the model data explorer, and how.
This document only covers global metadata of a dataset. Variable related metadata (units, standard name, etc.) shall be handled in a different document.
Document variable related metadata
Datasets in the model data explorer must define the following metadata attributes:
A one-line description of the dataset
Optional but recommended metadata attributes are:
A list of authors that have some role related to the dataset. They participated in the generation, are responsible for providing the data, etc.
The institutions that are responsible for the dataset
related projects that provided funding for the generation of the dataset
The bounding box of the geographic region of the data
A short description of the dataset
datacite relation types (see Relations between Users, Groups and datasets and https://support.datacite.org/docs/relationtype_for_citation)
The temporal window that is covered by the dataset
document geographic and temporal resolution as well?
Add descriptive spatial extent (such as global, continental, etc.)
add creation, publication and revision date
Interpretation of standards
The items mentioned in the previous sections are encoded in the metadata standards that we support, namely the netCDF header and the INSPIRE ISO standard. Our aim is to develop readers for each standard that transform the corresponding conventions into the metadata scheme of the model data explorer (see next section, Implementation details).
The exact database structure that allows this interpretation is however part of a different user story, namely #20.
For netCDF headers (and NcML, a special markup language used by THREDDS) we want to develop guidelines based on the Binding Regulations for Storing Data as netCDF Files. For this purpose, we will transform the guidelines into a web-based format and enhance them with templates to make them easier to apply.
The guidelines are based on the CF-Conventions and extended by further attributes that are mainly motivated by the Conversion methodology to INSPIRE developed at Geomar.
UnidataDD2MI.xsl methodology needs to be elaborated further.
ISO-conformant XML files will be read using the owslib Python library. We will base the format on the UnidataDD2MI.xsl file that has been developed by Franziska Weng (Geomar) and Andrea Pörsch (GFZ) (currently still work in progress).
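To illustrate what such a reader has to extract, here is a minimal sketch that pulls the title and abstract out of an ISO 19139 document. Note that it deliberately uses Python's standard library (xml.etree.ElementTree) instead of owslib, so that it runs without extra dependencies; the function name read_title_and_abstract and the returned dictionary layout are assumptions made for this example, not part of the planned implementation.

```python
import xml.etree.ElementTree as ET

# Namespaces used by ISO 19139 metadata documents.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}
IDENT = "gmd:identificationInfo/gmd:MD_DataIdentification"


def read_title_and_abstract(xml_text):
    """Return the title and abstract of an ISO 19139 record (hypothetical helper)."""
    root = ET.fromstring(xml_text)

    def text_at(path):
        # Free-text values are wrapped in a gco:CharacterString element.
        node = root.find(path + "/gco:CharacterString", NS)
        return node.text.strip() if node is not None and node.text else None

    return {
        "title": text_at(IDENT + "/gmd:citation/gmd:CI_Citation/gmd:title"),
        "abstract": text_at(IDENT + "/gmd:abstract"),
    }
```

Missing elements yield None, which matches the nullable attributes of the Dataset model.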
from django.db import models


class Author(models.Model):
    name = models.CharField(max_length=100)
    email = models.EmailField(max_length=100)


class Dataset(models.Model):
    title = models.CharField(max_length=50)
    abstract = models.CharField(max_length=50, null=True, blank=True)
    # Relation: authors are linked objects, not a string attribute.
    contacts = models.ManyToManyField(Author)
Metadata items described above are represented in the model data explorer as properties of Django objects that in turn translate into connections and attributes in a relational database. For this document, however, we will keep it simple and distinguish two metadata types: attributes and relations.
Attributes are simple string properties of a dataset, a title for instance. Relations describe how the dataset is connected to other items in the database. A dataset won’t have an authors string property, for instance, but it will define a connection to author objects, where each author holds, say, a first name and a last name attribute.
An example is shown in the graph on the right, Object attribute vs. object relations.
A Dataset defines three simple attributes, title, abstract and bounding box (bbox), see the graph about Attributes of a dataset.
from django.db import models


class Dataset(models.Model):
    title = models.CharField(max_length=50)
    abstract = models.TextField(max_length=10000, null=True, blank=True)
    bbox = models.JSONField(null=True, blank=True)
    # Temporal extent as timestamps ...
    start = models.DateTimeField(null=True, blank=True)
    end = models.DateTimeField(null=True, blank=True)
    # ... and as plain text, for dates that cannot be stored as timestamps.
    start_s = models.CharField(max_length=50, null=True, blank=True)
    end_s = models.CharField(max_length=50, null=True, blank=True)
The title is a short, human-readable string that describes the purpose of the dataset in one sentence.
The CF-Conventions define a
title netCDF attribute that will be
We are using the
<gmd:title> tag of the
The abstract is a longer human-readable description of the dataset that describes the content, purpose and methodology in a bit more detail.
We will look for a global summary or abstract attribute.
We are using the
The bbox is a JSONField (or optionally we can also make it a georeferenced polygon) that defines the region where this dataset can be applied.
We will look for the global geospatial_lon_min, geospatial_lat_min, geospatial_lon_max and geospatial_lat_max attributes, as well as a Bbox attribute.
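As a rough sketch of this lookup, the function below builds a JSON-serializable bounding box from the four geospatial_* attributes. It assumes the global attributes have already been read into a plain dictionary; the function name bbox_from_attrs and the west/south/east/north layout of the result are assumptions for illustration, and the Bbox attribute is not handled here because its format is not specified in this document.

```python
from typing import Mapping, Optional

# The four global attributes named in the text above, in west/south/east/north order.
BBOX_ATTRS = (
    "geospatial_lon_min",
    "geospatial_lat_min",
    "geospatial_lon_max",
    "geospatial_lat_max",
)


def bbox_from_attrs(attrs: Mapping[str, object]) -> Optional[dict]:
    """Return a bbox dict, or None if an attribute is missing or not numeric."""
    try:
        west, south, east, north = (float(attrs[name]) for name in BBOX_ATTRS)
    except (KeyError, TypeError, ValueError):
        return None
    # Assumed layout of the JSONField content (not defined by the source).
    return {"west": west, "south": south, "east": east, "north": north}
```

Returning None for incomplete input keeps the bbox optional, in line with the nullable JSONField.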
We are using the
EX_GeographicBoundingBox element in the
<gmd:geographicElement> tag, namely
The temporal extent is stored in two DateTimeField attributes that define the start and end of a time window. We expect two ISO-formatted timestamps here, one for the start and one for the end of the coverage. This might not always be possible, as Python does not support paleo dates. So we will also add the attributes start_s and end_s that accept plain text.
We will look for the global StartTime, StopTime, time_coverage_start and time_coverage_end attributes.
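The fallback from timestamps to plain text could look roughly like the sketch below. It assumes the global attributes are available as a dictionary; the helper names first_present and parse_time, as well as the priority order of the attribute names, are assumptions for this example.

```python
from datetime import datetime
from typing import Mapping, Optional, Sequence, Tuple

# Attribute names from the text above; the lookup order is an assumption.
START_ATTRS = ("time_coverage_start", "StartTime")
END_ATTRS = ("time_coverage_end", "StopTime")


def first_present(attrs: Mapping[str, object], names: Sequence[str]) -> Optional[str]:
    """Return the first non-empty attribute value among names, as a string."""
    for name in names:
        value = attrs.get(name)
        if value not in (None, ""):
            return str(value)
    return None


def parse_time(raw: Optional[str]) -> Tuple[Optional[datetime], Optional[str]]:
    """Return (timestamp, None) for ISO input, else (None, raw) as plain text."""
    if raw is None:
        return None, None
    try:
        return datetime.fromisoformat(raw), None
    except ValueError:
        # E.g. paleo dates: keep the text for the start_s / end_s attributes.
        return None, raw
```

The first element of the result would go into start or end, the second into start_s or end_s. Which ISO variants datetime.fromisoformat accepts depends on the Python version, so a production reader may need a more tolerant parser.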
We are using the
EX_TemporalExtent element in the
<gmd:temporalElement> tag, namely
This relation can also be equipped with permissions, namely can_edit, can_view and can_list (see Datasets and data groups). These permissions need to be approved by both sides: the data group (project) owner and the dataset.
A relation can also be made visible or invisible, which will determine whether the group is listed explicitly on the detail page of the dataset or not.
from django.db import models


class DataGroup(models.Model):
    name = models.CharField(max_length=100)


class RelationPermission(models.Model):
    name = models.CharField(max_length=20)
    # A permission only takes effect once both sides have approved it.
    left_approved = models.BooleanField(default=False)
    right_approved = models.BooleanField(default=False)


class DatasetDataGroupRelation(models.Model):
    data_group = models.ForeignKey(DataGroup, on_delete=models.CASCADE)
    dataset = models.ForeignKey("Dataset", on_delete=models.CASCADE)
    permissions = models.ManyToManyField(RelationPermission)
    # Whether the group is listed on the detail page of the dataset.
    visible = models.BooleanField(default=True)


class Dataset(models.Model):
    title = models.CharField(max_length=50)
    data_groups = models.ManyToManyField(
        DataGroup, through=DatasetDataGroupRelation
    )
netCDF files can define a project, program, projects or project_name attribute. We will then search for matching names in the data groups that define a DS_InitiativeTypeCode kind of project and suggest them to the data submitter. This will also be documented in the netCDF guidelines (see CF-Conventions).
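One simple way to produce such suggestions is a case-insensitive name comparison, sketched below. The function name suggest_data_groups is hypothetical, the data group names are passed in as plain strings rather than queried from the database, and splitting attribute values on commas is an assumption about how multiple projects are written.

```python
from typing import Iterable, List, Mapping

# Attribute names from the text above.
PROJECT_ATTRS = ("project", "program", "projects", "project_name")


def suggest_data_groups(
    attrs: Mapping[str, object], group_names: Iterable[str]
) -> List[str]:
    """Return the data group names that match a project attribute of the file."""
    candidates = set()
    for attr in PROJECT_ATTRS:
        value = attrs.get(attr)
        if isinstance(value, str):
            # Assumption: several projects are separated by commas.
            for part in value.split(","):
                if part.strip():
                    candidates.add(part.strip().casefold())
    return [name for name in group_names if name.casefold() in candidates]
```

The result is only a list of suggestions; the data submitter still confirms the relation.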
We will look for MD_AggregateInformation entries that define largerWorkCitation and match MD_Identifier against the available data groups.
Add ROR ID
netCDF files can define an institution or creator_institution attribute, together with a corresponding institution_references attribute. They will then be matched against available names of institutions in the database to make suggestions to the data submitter.
Institutions will be identified from the organisationName in a CI_ResponsibleParty (see Authors and Contact Persons above).
Other relations are references to internal or external resources, such as related studies or datasets. They are commonly described by datacite related identifiers, see https://support.datacite.org/docs/relationtype_for_citation.
However, neither the CF-Conventions nor INSPIRE defines such a relation type. Both allow supplementary studies to be added (see below), and we will simply store this information as a DatasetReference object (see the Graph tab).
If the URI corresponds to a handle in the model data explorer, however, we can also directly translate it into a relation between datasets (see Graph tab) and suggest that Dataset A is a supplement to Dataset B (see Datasets).
from django.db import models


class Dataset(models.Model):
    title = models.CharField(max_length=50)


class DatasetReference(models.Model):
    # Reference to an external resource, identified by its URI.
    dataset = models.ForeignKey(Dataset, on_delete=models.CASCADE)
    description = models.CharField(max_length=400)
    uri = models.URLField(max_length=300)


class DatasetRelation(models.Model):
    # Relation between two datasets within the model data explorer.
    left = models.ForeignKey(
        Dataset, on_delete=models.CASCADE, related_name="left_relation"
    )
    right = models.ForeignKey(
        Dataset, on_delete=models.CASCADE, related_name="right_relation"
    )
    relation_type = models.CharField(max_length=30)
netCDF files can define global references and doi attributes. We will check here for common DOI patterns and use this to extract the uri for the DatasetReference (see the Graph tab above).
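A possible shape for this DOI check is sketched below. The regular expression is a commonly used approximation of the DOI syntax, not an exhaustive definition, and the function name doi_uris is an assumption for this example.

```python
import re
from typing import List

# Approximate DOI syntax: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"\b(10\.\d{4,9}/[^\s\"<>]+)")


def doi_uris(text: str) -> List[str]:
    """Return resolver URLs for all DOIs found in text, without duplicates."""
    uris: List[str] = []
    for match in DOI_PATTERN.finditer(text):
        # Drop trailing punctuation from the surrounding sentence. This would
        # also cut a DOI that legitimately ends in one of these characters.
        doi = match.group(1).rstrip(".,;)")
        uri = "https://doi.org/" + doi
        if uri not in uris:
            uris.append(uri)
    return uris
```

Each returned URL could then be used as the uri of a DatasetReference, with the original attribute text as its description.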
INSPIRE encodes references as
MD_AggregateInformation with a
we will just use the
MD_Identifier of these tags. If the MD_Identifier is listed as gmd:code, we will assume it’s a DOI and transform it to the corresponding URL; otherwise we take it as the description of the DatasetReference and try to find a URL in it.
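The decision described above can be sketched as a small function. It assumes the identifier text has already been extracted from the XML and that the caller knows whether it came from a gmd:code element; the name reference_from_identifier, the is_code flag and the returned (description, uri) pair are assumptions for illustration.

```python
import re
from typing import Optional, Tuple

URL_PATTERN = re.compile(r"https?://[^\s\"<>]+")
# Prefixes that are often written in front of a bare DOI.
DOI_PREFIX = re.compile(r"^(doi:|https?://(dx\.)?doi\.org/)", re.IGNORECASE)


def reference_from_identifier(text: str, is_code: bool) -> Tuple[str, Optional[str]]:
    """Return (description, uri) for a DatasetReference."""
    text = text.strip()
    if is_code:
        # Treat the identifier as a DOI and build the resolver URL.
        return text, "https://doi.org/" + DOI_PREFIX.sub("", text)
    # Otherwise keep the text as description and search it for a URL.
    match = URL_PATTERN.search(text)
    return text, (match.group(0).rstrip(".,;") if match else None)
```

A uri of None signals that no link was found, so the submitter would have to add one manually.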