
Sorting for time values before aggregation

Open kthyng opened this issue 8 years ago • 11 comments

Hi all. I just figured out something that had been plaguing me for a week. We have ROMS model output files aggregated by thredds here, for example: http://barataria.tamu.edu:8080/thredds/dodsC/NcML/oof_archive_agg.

The output was coming out all jumbled and weird with some time indices working and others not working. It turns out that the model output files had time stamps that were out of order. "Touch"ing each file in the correct chronological order fixed the problem.
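The thread does not say exactly how the files were touched, so the following is only a minimal sketch of the workaround. It assumes the file names sort chronologically; the function name `touch_in_name_order` and its parameters are made up for illustration.

```python
import os
import time
from pathlib import Path


def touch_in_name_order(directory, pattern="*.nc", step_seconds=1.0):
    """Give files strictly increasing modification times, in file-name order.

    Assumes that sorting by file name is the same as sorting by date.
    """
    paths = sorted(Path(directory).glob(pattern))
    # Start far enough in the past that the last file ends up at about "now".
    start = time.time() - step_seconds * len(paths)
    for i, path in enumerate(paths):
        stamp = start + i * step_seconds
        os.utime(path, (stamp, stamp))  # (access time, modification time)
    return paths
```

A narrower pattern such as `roms_his_*.nc` would leave any other netCDF files in the same directory untouched.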

So my question is: would it be possible to have a "sort" step over the time dimension before the aggregation step?

Thanks.

kthyng avatar Aug 22 '17 18:08 kthyng

Dear @kthyng,

How are you aggregating the files?

What does your NcML file look like?

Regards

cofinoa avatar Aug 23 '17 09:08 cofinoa

Roping in @skbaum since I'm a user but he set it up!

kthyng avatar Aug 23 '17 16:08 kthyng

The filenames are of the form:

roms_his_201611.nc
roms_his_201612.nc
roms_his_201701.nc
roms_his_201702.nc
roms_his_201703.nc
roms_his_201704.nc
roms_his_201705.nc
roms_his_201706.nc
roms_his_201707.nc
roms_his_201708.nc

and the NcML is:

<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation dimName="ocean_time" type="joinExisting" recheckEvery="6 hour">
    <scan location="/atch/raid2/dj/oof_latest/oof/oof/outputs/ncfiles/archives/" regExp="roms_his.*\.nc"/>
  </aggregation>
</netcdf>

Upon reading the aggregation page at:

https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/ncml/Aggregation.html

I find the following: "By default, the files are ordered by sorting on the filename." This makes me think that what happened shouldn't have happened, and that the time stamps shouldn't have had to be modified. Perhaps it's a subtle bug.
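As a quick sanity check on that reasoning, the sketch below compares a plain file-name sort with a sort on the year and month parsed from each name. It uses only the file names listed in this thread; the helper `year_month` is made up for illustration and says nothing about how the aggregation code itself orders files.

```python
import re

# File names copied from the listing in this thread.
FILES = [
    "roms_his_201611.nc", "roms_his_201612.nc", "roms_his_201701.nc",
    "roms_his_201702.nc", "roms_his_201703.nc", "roms_his_201704.nc",
    "roms_his_201705.nc", "roms_his_201706.nc", "roms_his_201707.nc",
    "roms_his_201708.nc",
]


def year_month(name):
    """Return (year, month) parsed from a roms_his_YYYYMM.nc file name."""
    match = re.fullmatch(r"roms_his_(\d{4})(\d{2})\.nc", name)
    if match is None:
        raise ValueError(f"unexpected file name: {name}")
    return int(match.group(1)), int(match.group(2))


by_name = sorted(FILES)                  # plain lexicographic sort
by_date = sorted(FILES, key=year_month)  # sort on the parsed date
print(by_name == by_date)
```

Because the date part is zero-padded and fixed-width, the two orderings agree for these names, which supports the suspicion that the file-name ordering was not what went wrong.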

I also realize that the ordering can be forced by listing each filename explicitly within the NcML, but that would require editing the catalog.xml file every time a file is added.

Steve

skbaum avatar Aug 23 '17 16:08 skbaum

I wonder if this is an issue with the use of the regExp attribute or, perhaps, caching. Would it be possible to do the aggregation without regExp?

lesserwhirls avatar Aug 23 '17 19:08 lesserwhirls

Do you mean listing out the files? If so, that would work, but then it would have to be continually updated since this is an operational system that updates in time. So, that would not be ideal.

kthyng avatar Aug 24 '17 15:08 kthyng

For example, would it be possible to do this:

<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation dimName="ocean_time" type="joinExisting" recheckEvery="6 hour">
    <scan location="/atch/raid2/dj/oof_latest/oof/oof/outputs/ncfiles/archives/" suffix=".nc" /> 
  </aggregation>
</netcdf>

That is, is the regExp needed because there are files other than roms_his.* in the /atch/raid2/dj/oof_latest/oof/oof/outputs/ncfiles/archives/ directory?
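To make the difference between the two filters concrete, here is a small sketch in Python. It only mimics the intent of `regExp` and `suffix` on bare file names; how the TDS actually applies `regExp` (for example, against the full path) may differ. The last two names are hypothetical stand-ins for other files in the directory.

```python
import re

# The first two names come from this thread; the last two are hypothetical
# examples of other .nc files that might share the directory.
names = [
    "roms_his_201611.nc",
    "roms_his_201612.nc",
    "roms_avg_201611.nc",
    "grid.nc",
]

pattern = re.compile(r"roms_his.*\.nc")

# Filter the way the regExp attribute is intended to: history files only.
by_regexp = [n for n in names if pattern.fullmatch(n)]

# Filter the way suffix=".nc" would: every netCDF file in the directory.
by_suffix = [n for n in names if n.endswith(".nc")]

print(by_regexp)
print(by_suffix)
```

With a suffix-only scan, any unrelated `.nc` file would be pulled into the aggregation, which is why the answer to this question matters.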

lesserwhirls avatar Aug 24 '17 15:08 lesserwhirls

Oh I see. Yes, there are other *.nc files in the directory.

kthyng avatar Aug 24 '17 16:08 kthyng

Ah, ok. Another question - how many time steps are in each file?

lesserwhirls avatar Aug 24 '17 16:08 lesserwhirls

Every hour for the month, so about 30*24=720 depending on how long the month is.
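For reference, the per-file count can be worked out from the YYYYMM stamp in each file name. This sketch assumes exactly one record per hour and a complete month, which may not hold for the most recent file; `hourly_steps` is a made-up helper name.

```python
import calendar


def hourly_steps(yyyymm):
    """Number of hourly records in a full month given as 'YYYYMM'."""
    year, month = int(yyyymm[:4]), int(yyyymm[4:6])
    days_in_month = calendar.monthrange(year, month)[1]
    return days_in_month * 24


for stamp in ("201611", "201612", "201702"):
    print(stamp, hourly_steps(stamp))
# 201611 720
# 201612 744
# 201702 672
```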

kthyng avatar Aug 24 '17 16:08 kthyng

So having dug into some of our aggregation code, I can see that touching the files on disk caused a rescan of the collection (the code looks at the last modified time on disk to determine if a file was changed), which is probably why it caused things to work. But, the code is pretty complicated under the hood, unfortunately.
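The idea of using the last modified time to detect changes can be sketched as follows. This is not the TDS aggregation code, only an illustration of the general technique; the function names `snapshot` and `changed_since` are made up.

```python
import os


def snapshot(paths):
    """Record the last modified time of each file."""
    return {p: os.path.getmtime(p) for p in paths}


def changed_since(previous, paths):
    """Return the files whose modified time differs from the earlier snapshot.

    Files missing from the earlier snapshot count as changed.
    """
    current = snapshot(paths)
    return sorted(p for p in paths if previous.get(p) != current[p])
```

Under a scheme like this, touching a file changes its modified time, so the next recheck treats it as changed even though its contents are the same.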

Just so I can understand a bit better here, it looks like you store data in daily netCDF files, and those files are rechecked every 6 hours.

  • Are the netCDF files netcdf-3 or netcdf-4?
  • Are data added to the daily files throughout the day? If so, how is the update to the files on disk done?
  • What OS is controlling the raid, and what filesystem are you using?
  • What OS and version of Java is the TDS running under?

Sorry for all the questions. The code that the standard Java runtime library uses to check the last modified time is OS dependent, and I've seen reports where certain combinations end up returning the wrong last modified date. Also, depending on how files are being updated (if they are being updated throughout the day), the last modified time may not actually be updated (for example, if the file is held in an open state as data are added).

lesserwhirls avatar Aug 26 '17 13:08 lesserwhirls

Thanks for the detailed response. I'll ping @skbaum again for help on this.

kthyng avatar Sep 10 '17 19:09 kthyng