Download-only option

Open j-woz opened this issue 6 years ago • 7 comments

Allow the user to invoke a Benchmark in download-only mode, which simply downloads the input data if it does not already exist. This is necessary on supercomputers. This mode should not import keras or any other module that is not required for the data download.

j-woz avatar Oct 31 '18 18:10 j-woz

See, for example, p3b1. The line fpath = fetch_data(gParameters) is basically what you want to run separately from the 'run' command. We could also modify fetch_file to accept a different base Data directory location, which would address your other issue.

def fetch_data(gParameters):
    """Download and decompress the data if it is not available locally.

    The training data depends on the model definition, so it is not loaded
    here; only the local path where the raw data resides is returned.
    """
    # `candle` is the CANDLE utility module, imported elsewhere in the benchmark.
    path = gParameters['data_url']
    fpath = candle.fetch_file(path + gParameters['train_data'], 'Pilot3', untar=True)

    return fpath

jmohdyusof avatar Oct 31 '18 20:10 jmohdyusof

That sounds good.

j-woz avatar Oct 31 '18 20:10 j-woz

So probably a command like this should work for both tickets:

python benchmark --dl_only --basedir='/scratch/candle/'

jmohdyusof avatar Oct 31 '18 20:10 jmohdyusof

They will read that as "Deep Learning only" :). How about --data-dir? Will that be a standard flag for all Benchmark invocations? The default would be the current behavior (data directory == Benchmarks/Data).

j-woz avatar Oct 31 '18 21:10 j-woz

Whatever we choose for keywords, we can make them part of the standard parser, so just decide on ones that don't conflict with other standard (keras/neon/etc.) keywords.

--data_dir is fine (we currently use underscores, not dashes, to separate words).

Is --get_data_only clear enough without being too long?
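
For concreteness, here is a rough sketch of how the two flags could look in an argparse-based parser. This is not the actual CANDLE parser: build_parser is a hypothetical name, and the only things taken from this thread are the flag names and the default data directory (Benchmarks/Data).

```python
import argparse


def build_parser():
    """Hypothetical parser sketch; not the actual CANDLE standard parser."""
    parser = argparse.ArgumentParser(description="Benchmark launcher (sketch)")
    # Default mirrors the current behavior mentioned in this thread.
    parser.add_argument(
        "--data_dir",
        default="Benchmarks/Data",
        help="base directory where input data is stored or downloaded",
    )
    # Boolean switch: present means download the data and stop.
    parser.add_argument(
        "--get_data_only",
        action="store_true",
        help="download the input data if missing, then exit without running",
    )
    return parser


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["--get_data_only", "--data_dir=/scratch/candle/"])
print(args.data_dir, args.get_data_only)
```

With no flags given, data_dir falls back to Benchmarks/Data and get_data_only is False, so existing invocations would behave as they do today.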

jmohdyusof avatar Oct 31 '18 22:10 jmohdyusof

Yes, those are fine.

j-woz avatar Nov 01 '18 16:11 j-woz

How strict is the 'don't import Keras' restriction? We need to be able to read the default_model file to get the data locations, as well as import the command line parser, so this implies some sort of split between the initialize_parameters stage, the data load, and the actual run. I think it makes sense to make initialize_parameters a standalone function as well.
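
A minimal sketch of that three-way split, assuming the framework import can be deferred into the run stage. All names here are illustrative: the downloader argument is a hypothetical stand-in for candle.fetch_file, reading the default_model file is omitted, and run is left unimplemented.

```python
import argparse


def initialize_parameters(argv=None):
    """Stage 1: parse arguments only; no deep learning framework is imported."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", default="Benchmarks/Data")
    parser.add_argument("--get_data_only", action="store_true")
    return vars(parser.parse_args(argv))


def fetch_data(params, downloader):
    """Stage 2: obtain the data and return its local path.

    `downloader` is a hypothetical callable standing in for candle.fetch_file.
    """
    return downloader(params["data_dir"])


def run(params, fpath):
    """Stage 3: the only stage that needs the framework."""
    import keras  # noqa: F401  (deferred so stages 1 and 2 never load it)

    raise NotImplementedError("model construction and training are out of scope here")


def main(argv, downloader):
    params = initialize_parameters(argv)
    fpath = fetch_data(params, downloader)
    if params["get_data_only"]:
        # Download-only mode: stop before anything imports keras.
        return fpath
    return run(params, fpath)
```

Because the keras import lives inside run, calling main with --get_data_only never reaches it, which is one way to satisfy the restriction without a separate script.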

jmohdyusof avatar Nov 01 '18 17:11 jmohdyusof