Autokeras slow to start for large dataset?

Open gautambak opened this issue 4 years ago • 11 comments

Hi there,

I'm playing with autokeras and tried to apply the tutorial to my dataset. It's just not running for a large dataset.

The shape of my initial sample is (100000, 112).

Then I run the block of code from the tutorial (changing 'price' to 'value'):

# Initialize the structured data regressor.
reg = ak.StructuredDataRegressor(
    overwrite=True,
    max_trials=300) # It tries up to 300 different models.
# Feed the structured data regressor with training data.
reg.fit(
    # The path to the train.csv file.
    train_file_path,
    # The name of the label column.
    'value',
    epochs=10)
# Predict with the best model.
predicted_y = reg.predict(test_file_path)
# Evaluate the best model with testing data.
print(reg.evaluate(test_file_path, 'value'))

I've waited over an hour and nothing has happened. When I try this on the tutorial dataset, it runs fairly quickly. I've also tried CPU, GPU and TPU to no avail.

Is there anything I can do? This dataset is a sample, my actual dataframe shape is over 1M rows and 250 columns.

gautambak avatar Aug 21 '20 21:08 gautambak

The issue may not be related to autokeras. When I try a much smaller dataset, I get this message:

/usr/local/lib/python3.6/dist-packages/kerastuner/engine/metrics_tracking.py:92: RuntimeWarning: All-NaN axis encountered
  return np.nanmin(values)

But when I use a larger dataset, that message doesn't even appear.

gautambak avatar Aug 22 '20 16:08 gautambak

I have never seen this error before. We just changed the search space for structured data tasks. You may try again with 1.0.8. Let me know if it still doesn't work.

haifeng-jin avatar Aug 27 '20 01:08 haifeng-jin

Hi Haifeng, Thank you for the reply. I am using 1.0.8.

import autokeras as ak
ak.__version__
1.0.8

Not sure if it makes a difference, but my data is large (800 MB uncompressed), has mixed categorical and continuous columns, and has quite a few NaN values.

So far, I've encoded the data, scaled it, and run both the structured data regressor and the classifier. Both show the same behaviour.

gautambak avatar Aug 27 '20 20:08 gautambak

If it helps, this is my code (after I normalize and encode my data):

df.to_csv("./encoded.csv")
autokerasFile = "./encoded.csv"
import autokeras as ak
# try using autokeras for imputation
# Initialize the structured data classifier.
clf = ak.StructuredDataClassifier(
    overwrite=True,
    max_trials=3) # It tries 3 different models.
# Feed the structured data classifier with training data.
clf.fit(
    # The path to the train.csv file.
    autokerasFile,
    # The name of the label column.
    'Exchange',
    epochs=100,
    verbose=True)
# Predict with the best model.
predicted_y = clf.predict(autokerasFile)
# Evaluate the best model with testing data.
print(clf.evaluate(autokerasFile, 'Exchange'))

At this point it pretty much just hangs. I can see memory usage changing but no output.

Let me know if there is any other information I can provide - Thank you.

Edit - My keras model ran within a few seconds of starting it. The AK model has yet to start and it's been over 15 hours. When testing with other datasets, it seemed to work. I suspect it's something with my dataset, but I can't figure out what because I'm not getting any messages or errors.

gautambak avatar Aug 27 '20 20:08 gautambak

I thought the many different types of data in my dataset might be the issue, so I created a dict of column_types for the fit function, but no luck.

Also, if it helps, I'm running this on a GPU and a TPU. Both have the same effect, except that the GPU run crashes within a few minutes because it runs out of memory, while the TPU run just keeps going.
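
For anyone trying the same thing, here is a minimal sketch of building such a dict from a pandas DataFrame. It assumes the AutoKeras 1.0.x convention that column_types maps column names to "numerical" or "categorical", and that the dict goes to the StructuredDataClassifier / StructuredDataRegressor constructor rather than to fit(); please check the docs for your version. The helper name infer_column_types is made up for illustration.

```python
import pandas as pd


def infer_column_types(df: pd.DataFrame, label: str) -> dict:
    """Map every feature column to "numerical" or "categorical".

    Hypothetical helper: the rule used here (numeric dtype means
    numerical, anything else means categorical) is an assumption.
    """
    types = {}
    for col in df.columns:
        if col == label:
            # The label column is not a feature, so leave it out.
            continue
        is_number = pd.api.types.is_numeric_dtype(df[col])
        is_bool = pd.api.types.is_bool_dtype(df[col])
        types[col] = "numerical" if is_number and not is_bool else "categorical"
    return types
```

Something like column_types = infer_column_types(df, "Exchange") would then give the dict. Integer-encoded categories will come out as "numerical" with this rule, so those entries may need to be overridden by hand.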

gautambak avatar Aug 28 '20 21:08 gautambak

For fun, I tried taking my df and doing .fillna(0) - now autokeras starts within a few seconds. Is there a way to do classification with NaN values in a dataset?
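
A slightly gentler variant of the .fillna(0) workaround, as a rough sketch only: fill numeric columns with the column median and everything else with a placeholder string before writing the CSV. The helper name impute_missing and the "missing" placeholder are made up for illustration, and nothing in this thread confirms that AutoKeras prefers one fill strategy over another.

```python
import pandas as pd


def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with no NaN values left.

    Numeric columns get the column median; all other columns get the
    placeholder string "missing". Both choices are assumptions made
    for illustration.
    """
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            median = out[col].median()
            # An all-NaN column has a NaN median, so fall back to 0.
            out[col] = out[col].fillna(0 if pd.isna(median) else median)
        else:
            # Cast to object first so the placeholder is always accepted.
            out[col] = out[col].astype(object).fillna("missing")
    return out
```

After impute_missing(df), df.isna().sum() on the result should be all zeros, and the cleaned frame can be written out with to_csv as before.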

gautambak avatar Aug 28 '20 21:08 gautambak

Hi there,

Is there anything I can do? I've been testing this and it's really strange: a model works in keras when I pass it the dataframes, but when I take those same dataframes and pass them to autokeras, it just hangs.

gautambak avatar Sep 02 '20 13:09 gautambak

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Nov 01 '20 15:11 stale[bot]

I'm experiencing the same issue.

@gautambak did you find a solution?

mereldawu avatar Nov 25 '21 17:11 mereldawu

This issue is blocking me too.

harrypotter90 avatar Oct 20 '22 15:10 harrypotter90

It is mainly because AutoKeras iterates through the dataset once to collect some information before it starts. Does it hang forever? I will need a snippet that reproduces it on Colab to debug it.
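
Since the original file is too large to share, one way to get a Colab-sized repro might be to cut a random subset of rows out of the CSV. Below is a minimal sketch using only the standard library (reservoir sampling, so the file is read once and never held fully in memory). The function name sample_csv and the paths are hypothetical, and whether a sample still triggers the hang is not something this thread confirms.

```python
import csv
import random


def sample_csv(src_path, dst_path, n_rows, seed=0):
    """Write the header plus up to n_rows randomly chosen rows to dst_path.

    Uses reservoir sampling (Algorithm R): one pass over the source,
    keeping at most n_rows rows in memory. Returns the rows written.
    """
    rng = random.Random(seed)
    reservoir = []
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        for i, row in enumerate(reader):
            if i < n_rows:
                # Fill the reservoir with the first n_rows rows.
                reservoir.append(row)
            else:
                # Replace a kept row with probability n_rows / (i + 1).
                j = rng.randint(0, i)
                if j < n_rows:
                    reservoir[j] = row
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(reservoir)
    return len(reservoir)
```

For example, sample_csv("./encoded.csv", "./encoded_sample.csv", 10000) would produce a smaller file with the same columns. Empty cells are copied through unchanged, so missing values survive the sampling.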

haifeng-jin avatar Oct 24 '22 20:10 haifeng-jin