Suggestions
Proper HTTP Headers
XML files and DODS/OPeNDAP responses should carry headers that make them
properly cacheable. At a minimum this means the
Last-Modified header, but it could also require Cache-Control:
public lines and Vary: specifications depending on the
behavior of the server. We, for example, include Vary:
Authorization lines on pages that are derived from
password-protected datasets.
Content-Length also helps increase
reliability: caches will not cache responses that do not match their
Content-Length specification.
Most HTTP servers serve ordinary files with Last-Modified headers; those
same servers require CGI scripts to set the header lines themselves if the
generated pages are to be cacheable.
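As a minimal sketch of what such a script might emit, the Python function below builds the header lines discussed above. The function name and its parameters are illustrative assumptions, not part of any server mentioned here, and whether Cache-Control: public is appropriate depends on the behavior of the server.

```python
from email.utils import formatdate


def cache_headers(body, mtime, protected=False):
    """Build (name, value) header pairs that let HTTP caches store a page.

    body      -- the response body as bytes
    mtime     -- last-modified time of the underlying data, seconds since epoch
    protected -- True if the page is derived from a password-protected dataset
    """
    headers = [
        # HTTP dates are always expressed in GMT.
        ("Last-Modified", formatdate(mtime, usegmt=True)),
        ("Cache-Control", "public"),
        # Caches can check the body they received against this length.
        ("Content-Length", str(len(body))),
    ]
    if protected:
        # The response depends on the credentials sent with the request.
        headers.append(("Vary", "Authorization"))
    return headers
```

A script would write each pair as a `Name: value` line, then a blank line, then the body.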
Ingrid does display the last-modified information, which can
be helpful in checking a given collection or dataset.
- Example with last-modified: a THREDDS catalog with last-modified tags
pointing to a DODS server with last-modified tags. For pages derived from
datasets with last-modified tags, Ingrid gives that time as the last updated
time at the bottom of the page. In this case all the pages from the
THREDDS catalog on down have last updated times.
- THREDDS page: currently the top THREDDS page is served without
last-modified times. Some servers that it points to have last-modified
times, so some of the subpages do have Last updated lines at the bottom.
- DODS page without last-modified times: another example of how a DODS
dataset without last-modified times appears in Ingrid.
There are, of course, other ways of looking at the HTTP headers to
make sure that one's servers are delivering last-modified tags on a
given WWW response.
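One such way, sketched here in Python with only the standard library, is to issue a HEAD request and list which caching headers are absent. The function names and the default list of wanted headers are assumptions made for illustration; some servers may also answer HEAD differently from GET, so treat the result as a hint.

```python
import urllib.request


def fetch_headers(url, timeout=10.0):
    """Issue a HEAD request and return the response headers as a dict."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())


def missing_cache_headers(headers, wanted=("Last-Modified", "Content-Length")):
    """Return the wanted header names that a response lacks.

    Header names are compared case-insensitively, as HTTP requires.
    """
    present = {name.lower() for name in headers}
    return [name for name in wanted if name.lower() not in present]
```

Calling `missing_cache_headers(fetch_headers(url))` on a catalog or dataset URL would then return an empty list for a response that carries both headers.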
DODS Request Size Negotiation
One way to get good transfer rates is to ask for larger pieces: this
is a pure win for servers that stream, and even for servers that
process in a single chunk, the ideal size could very well be larger
than a single lat/lon slice. Short of the client trying a whole bunch
of sizes and keeping track of the results, there is no good way to
figure out the optimal size. And it is easy to end up with a server
that has ill-defined behavior when the request is too large, the
classic response being an error message inserted in the data stream.
The classic C behavior, where the client asks for as much as it wants
and the server returns as much as it can, has a certain grace and
style; it is not entirely clear whether we can achieve the same.
Given gigabit Ethernet, can we really stick with a 2 GB limit on the
size of a single request?
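To make the client's side of the problem concrete, here is a minimal Python sketch that splits one large request along its first dimension so that each piece stays under a byte budget. The budget itself is exactly the number the client has no good way to learn, so it is an assumed input here; the function name and the variable name `sst` are hypothetical. The output uses the DODS `[start:stride:stop]` hyperslab form.

```python
def plan_requests(var, shape, bytes_per_value, max_bytes):
    """Split a request along the first dimension into pieces under max_bytes.

    Returns a list of DODS constraint expressions, one per piece, each
    covering whole slices of the trailing dimensions.
    """
    if not shape or any(n <= 0 for n in shape):
        raise ValueError("shape must contain positive dimension sizes")

    # Size in bytes of one slice along the first dimension.
    slice_values = 1
    for n in shape[1:]:
        slice_values *= n
    slice_bytes = slice_values * bytes_per_value

    # Always ask for at least one slice, even if it exceeds the budget.
    step = max(1, max_bytes // slice_bytes)

    # The trailing dimensions are requested in full in every piece.
    trailing = "".join("[0:1:{}]".format(n - 1) for n in shape[1:])

    requests = []
    for start in range(0, shape[0], step):
        stop = min(start + step, shape[0]) - 1  # inclusive upper index
        requests.append("{}[{}:1:{}]{}".format(var, start, stop, trailing))
    return requests
```

For example, `plan_requests("sst", (10, 2, 3), 4, 100)` yields three pieces covering indices 0 to 3, 4 to 7, and 8 to 9 of the first dimension. A client still has to guess `max_bytes`, which is the negotiation gap described above.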
Global aliases
THREDDS has an interesting ability to have multiple DODS servers for a
particular dataset. This means that a server that is re-serving a
dataset could make that particularly clear by also marking the dataset
with the original DODS server. There are a few cases where one might
want to carry this further:
- If one has picked out one variable from a much larger dataset
(e.g. the best-estimate from a dataset which also includes
number-of-observations, std-dev, smoothed, and unsmoothed versions), it
would be nice if that relationship could be indicated as well.
- If only some of the metadata has changed, it would be nice if the
client could figure out that the data itself does not need to be
recopied.
Literature references in XML
Frequently (one hopes) the dataset metadata includes literature
references. There must be one or more XML standards for transmitting
such information: it would be great if we could pick and support one.
Visualization metadata
Some visualization metadata should get transmitted with the dataset,
particularly preferred colorscales. At the moment, we have a list of
named colorscales and carry the name across, but we would prefer to be
able to describe an arbitrary colorscale. My preference would be to
transmit this as a specialized DODS dataset, with the independent
variable corresponding to the data values and the dependent
variable(s) giving the color values. This would be one example of an
attribute being a reference to (another) DODS dataset/variable.
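A minimal Python sketch of how a client might use such a colorscale dataset follows. The independent variable is a list of data values and the dependent variables are red, green, and blue lists of the same length; linear interpolation between entries and clamping outside the range are my assumptions for illustration, not something the proposal fixes.

```python
from bisect import bisect_right


def color_for(value, levels, red, green, blue):
    """Look up an (r, g, b) color for a data value.

    levels           -- strictly increasing data values (independent variable)
    red, green, blue -- color components at each level (dependent variables)
    Values outside the range of levels are clamped to the end colors.
    """
    if value <= levels[0]:
        return (red[0], green[0], blue[0])
    if value >= levels[-1]:
        return (red[-1], green[-1], blue[-1])

    # Find the pair of levels that brackets the value.
    hi = bisect_right(levels, value)
    lo = hi - 1
    t = (value - levels[lo]) / (levels[hi] - levels[lo])

    # Interpolate each color component between the bracketing entries.
    return tuple(
        round(c[lo] + t * (c[hi] - c[lo])) for c in (red, green, blue)
    )
```

With levels `[0.0, 10.0, 20.0]`, red `[0, 200, 200]`, green `[0, 200, 0]`, and blue `[200, 200, 0]`, a data value of 5.0 maps to `(100, 100, 200)`.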
Short and long names for datasets
Language-based clients can make good use of short as well as more
complete descriptions of datasets. THREDDS should facilitate that.
For example, the CDC dataset that I used as an earlier example is
represented in Ingrid as
THREDDS
(Public Climate Data from the NOAA-CIRES Climate Diagnostics Center) @@
.CPC_.25x.25_Daily_US_UNIFIED_Precipitation
(Monthly Accumulated Precipitation) @@
and the dataset that I read via THREDDS from the Data Library is
THREDDS
(IRI/LDEO Climate Data Library) @@
.NOAA .NCEP .EMC .CMB .GLOBAL .Reyn_SmithOIv2 .weekly .ssta
While the long names are good for display, it is very useful to have
short unique names that can be used to concisely refer to the datasets on a
server or a server in a collection of servers. We all concoct these
short names for internal use: THREDDS should let us share them with
each other. Something as simple as having both name and
long_name in the standard with only name required would suffice
to allow data providers to share their short names.
Arrays of bytes vs. strings
Some servers translate netCDF arrays of byte data into DODS arrays of
one-character strings. It is easier for client writers if you leave them as
byte arrays or, better yet, translate them to multi-character strings.
If you do not translate to multi-character strings, the client has to
figure out that it needs to do so, and the
client does not even know that the data came from netCDF files in the
first place, i.e. it has an even harder problem than the
netCDF-to-DODS server did.
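For illustration, the translation itself is small once one knows it is needed. The Python sketch below collapses each row of single bytes (or one-character strings) into one multi-character string. The function names are hypothetical, and stripping trailing NUL and space padding is an assumption about how the fixed-length rows were filled.

```python
def _to_byte(item):
    """Normalize one array element to a bytes object."""
    if isinstance(item, int):
        return bytes([item])          # a raw byte value
    if isinstance(item, str):
        return item.encode("latin-1")  # a one-character string
    return bytes(item)                # already bytes-like


def join_char_rows(rows):
    """Collapse rows of one-byte elements into one string per row.

    Trailing NUL and space padding is removed from each row.
    """
    strings = []
    for row in rows:
        raw = b"".join(_to_byte(item) for item in row)
        strings.append(raw.rstrip(b"\x00 ").decode("latin-1"))
    return strings
```

A server that already knows the variable is character data can do this once; a client has to guess from the shape and type alone.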