Management and Analysis Phase
The core components of the data management workflow centered around the CMIP data pool are indicated in the picture below with the numbers 1 to 4:
1. Data quality assurance
All data from national modeling groups is checked with the quality assurance tool developed at DKRZ before ingestion into the CMIP data pool starts. A web-based service has been established so that modeling groups can spot-check individual data files early.
.. details like references and documentation to be added ..
2. ESGF data publication
The data is published to the DKRZ ESGF portal through the ESGF data node installations at DKRZ. The portal lets users register, search for data, and access data, which makes the data visible and accessible to the national as well as the worldwide research community.
The ESGF data publication also includes steps to make the data referenceable and citable: all ESGF-published datasets are assigned persistent identifiers (PIDs) and are linked to citation information. In this context DKRZ also acts as a central provider of PID and data citation services in the worldwide ESGF data federation.
All ESGF-published data is ingested and managed as part of the overall DKRZ CMIP data pool (see 4).
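The search functionality of the portal described above can also be used from a script. The following is a minimal sketch that only builds a query URL for the ESGF search REST interface and does not contact any server. The host name, the facet names (project, variable_id, experiment_id) and the helper name build_search_url are assumptions made for illustration; check them against the portal you actually use.

```python
from urllib.parse import urlencode

# Assumed endpoint of a DKRZ ESGF index node; replace with the node you use.
SEARCH_ENDPOINT = "https://esgf-data.dkrz.de/esg-search/search"


def build_search_url(endpoint=SEARCH_ENDPOINT, limit=10, **facets):
    """Return a dataset search URL for the given facet constraints.

    The facet names passed as keyword arguments are not validated here;
    which facets exist depends on the project being searched.
    """
    params = {
        "type": "Dataset",                  # search dataset records, not files
        "format": "application/solr+json",  # ask for a JSON response
        "limit": limit,                     # maximum number of records returned
    }
    params.update(facets)
    # Sort the parameters so the same query always yields the same URL.
    return endpoint + "?" + urlencode(sorted(params.items()))


if __name__ == "__main__":
    # Illustrative facet values only.
    print(build_search_url(project="CMIP6", variable_id="tas",
                           experiment_id="historical"))
```

The resulting URL can be opened in a browser or passed to any HTTP client. Keeping URL construction separate from the request makes the query easy to inspect before it is sent.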
3. ESGF data replication
Climate data analysis activities often involve very large data collections, including data from modeling centers around the world. Downloading and maintaining these high-volume datasets at the home institutes of climate researchers is time-consuming and inefficient. DKRZ therefore supports climate researchers with central data replication activities and services. Important and frequently needed data collections are replicated automatically from remote sites and made locally accessible as part of the ESGF data pool. The replication support team at DKRZ can be reached via the email data_replication /at/ dkrz.de.
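A replicated file is only useful if it matches the original at the remote site. The following is a minimal sketch of one common way to check this, by comparing a SHA-256 checksum of the local copy with a checksum supplied by the source. It is not a description of the DKRZ replication tooling; the function names and the assumption that a SHA-256 checksum is available for each file are illustrative.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 hex digest of a file, reading it in chunks.

    Chunked reading keeps memory use constant even for very large files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replica_is_intact(path, expected_checksum):
    """Return True if the local file matches the expected SHA-256 checksum."""
    return sha256_of_file(path) == expected_checksum.strip().lower()
```

A replication script could call replica_is_intact after each transfer and schedule a new download for any file that fails the check.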
4. CMIP data pool: access and management
The CMIP data pool provides the storage resources to make high-volume data collections efficiently accessible for the national climate community. Overall, 5 petabytes are reserved and made available as part of the DKRZ HPC Lustre storage resources. These 5 petabytes need to be managed consistently, based on the different user needs as well as on international agreements with respect to the provisioning of replicas.
Priorities with respect to data storage are decided by a review board.
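To illustrate how a fixed storage budget and a priority ranking interact, the following is a minimal sketch that fills a 5 petabyte pool in priority order. It is not the review board's actual procedure: the collection names, sizes, priority scale and the greedy strategy are all hypothetical, and 1 PB is taken as 1000 TB.

```python
from dataclasses import dataclass

POOL_CAPACITY_TB = 5000  # 5 petabytes, assuming 1 PB = 1000 TB


@dataclass
class Collection:
    name: str
    size_tb: float
    priority: int  # hypothetical scale: lower number = higher priority


def plan_storage(collections, capacity_tb=POOL_CAPACITY_TB):
    """Greedily accept collections in priority order until the pool is full.

    Collections with equal priority are considered smallest first.
    Returns (accepted names, deferred names, remaining capacity in TB).
    """
    accepted, deferred = [], []
    remaining = capacity_tb
    for item in sorted(collections, key=lambda c: (c.priority, c.size_tb)):
        if item.size_tb <= remaining:
            accepted.append(item.name)
            remaining -= item.size_tb
        else:
            deferred.append(item.name)
    return accepted, deferred, remaining


if __name__ == "__main__":
    # Purely illustrative numbers.
    requests = [
        Collection("national-runs", 1800, 1),
        Collection("core-replicas", 2500, 2),
        Collection("extended-replicas", 1200, 3),
        Collection("small-extras", 400, 3),
    ]
    accepted, deferred, remaining = plan_storage(requests)
    print(accepted)   # ['national-runs', 'core-replicas', 'small-extras']
    print(deferred)   # ['extended-replicas']
    print(remaining)  # 300
```

A greedy pass like this is easy to reason about but does not always use the capacity optimally. In practice such a plan would at most serve as input to the board's decision.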