
expected dense_10 shape (None, <dense_size>) but got array shape (<batch_size>, 1)

Open phobrain opened this issue 7 years ago • 44 comments

Batch size is 2048. Getting the training data from the q, I see

x_train, y_train = train_q.get()
print('X.. ' + str(x_train.shape) + ' Y.. ' + str(y_train.shape))

X.. (2048, 2980) Y.. (2048,)

as expected, but

ValueError: Error when checking target: expected dense_10 to have shape (None, 175) but got array with shape (2048, 1)

Update: made the Dense GInftly's different sizes, so now I know it's happening on the 2nd one.

...
model += Dense(dense_size/4)
model += Dense(dense_size/2)
model += GInftlyLayer(
        'dfc1',
        w_regularizer=(c_l2, 1e-3),
        f_regularizer=(c_l2, f_reg),
        reweight_regularizer=False,
        f_layer=[ 
            lambda reg: Dense(dense_size/2),  ....

ValueError: Error when checking target: expected dense_10 to have shape (None, 87) but got array with shape (2048, 1)

Making the size-adjusting Dense(dense_size/2, trainable=False) didn't help.

phobrain avatar Dec 26 '17 06:12 phobrain

This seems to be an issue with the used target/loss-function. Could you post the last layer of your model and the used loss-function?

Do you use a binary-crossentropy? If you use categorical-crossentropy, you probably have to one-hot encode your data.
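For illustration, a minimal sketch of the one-hot encoding (the integer labels here are made up):

import numpy as np
from keras.utils.np_utils import to_categorical

# Hedged sketch: one-hot encode integer class labels for categorical_crossentropy.
y_int = np.array([0, 2, 1, 2])            # hypothetical integer labels
num_classes = len(np.unique(y_int))       # or set the class count explicitly
y_onehot = to_categorical(y_int, num_classes)
print(y_onehot.shape)                     # (4, 3) -> (samples, classes)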

kutoga avatar Dec 26 '17 15:12 kutoga

Aha, I just copied the loss function from your mnist example, as I earlier copied a siamese mnist example, so I have

model += GInftlyLayer(
        'dfc1',
        w_regularizer=(c_l2, 1e-3),
        f_regularizer=(c_l2, f_reg),
        reweight_regularizer=False,
        f_layer=[
            lambda reg: Dense(dense_size/2),
            lambda reg: Dropout(0.1),
            lambda reg: GammaRegularizedBatchNorm(reg, max_free_gamma=0.),
        ], h_step=[
            lambda reg: Activation('relu'),
        ],
        w_step=w_step,
)
model.init(
        optimizer='adadelta', # TODO: adadelta needs to store the state; that is quite tricky, I think...
        loss='categorical_crossentropy',
        metrics=['categorical_accuracy']
)

My data are normalized 1D image histograms concatenated with a binary keyword vector. Predictions I got from the siamese model are apparently euclidean_distance() values wherein 0 would be identity, and the loss function is

def contrastive_loss(y_true, y_pred):
    '''Contrastive loss from Hadsell-et-al.'06
    http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf
    '''
    margin = 1
    return K.mean(y_true * K.square(y_pred) +
              (1 - y_true) * K.square(K.maximum(margin - y_pred, 0)))
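For context, a minimal sketch of how this loss typically plugs into a siamese Keras model, following the standard Keras siamese example; the 2980-wide input and the 128-unit embedding are placeholders, not the actual model:

from keras.models import Model
from keras.layers import Input, Dense, Lambda
from keras import backend as K

def euclidean_distance(vects):
    # distance between the two twin embeddings; 0 means identical
    x, y = vects
    return K.sqrt(K.maximum(K.sum(K.square(x - y), axis=1, keepdims=True), K.epsilon()))

base_in = Input(shape=(2980,))
base_out = Dense(128, activation='relu')(base_in)
base_network = Model(base_in, base_out)      # weights shared by both twins

input_a = Input(shape=(2980,))
input_b = Input(shape=(2980,))
distance = Lambda(euclidean_distance)([base_network(input_a), base_network(input_b)])

siamese = Model([input_a, input_b], distance)
siamese.compile(loss=contrastive_loss, optimizer='adadelta')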

I realized from your question that I'm not condensing my last layer to one number that could be used as a distance/weight in my weighted selection of photos, if that's possible, so looking into how to do that.

Here are the first results I got from siamese models:

http://phobrain.com/pr/home/siagal.html

phobrain avatar Dec 26 '17 23:12 phobrain

If you really just want to output a single number, then you could fix your network in this way:

model += GInftlyLayer(
        'dfc1',
        w_regularizer=(c_l2, 1e-3),
        f_regularizer=(c_l2, f_reg),
        reweight_regularizer=False,
        f_layer=[
            lambda reg: Dense(dense_size/2),
            lambda reg: Dropout(0.1),
            lambda reg: GammaRegularizedBatchNorm(reg, max_free_gamma=0.),
        ], h_step=[
            lambda reg: Activation('relu'),
        ],
        w_step=w_step,
)
model += Dense(1, activation='sigmoid')
model.init(
        optimizer='adadelta', # TODO: adadelta needs to store the state; that is quite tricky, I think...
        loss='categorical_crossentropy',
        metrics=['categorical_accuracy']
)

I used a sigmoid-activation to get an output value between 0 and 1, maybe you require another output-activation.

Btw: I played around with your website. As far as I can see, you try to "cluster" image pairs that have some kind of similarity? There is interesting work on the topic of "Deep Metric Learning" and on (un)supervised clustering with neural networks. Some of the models used are quite complex, but they could produce interesting image pairs.

kutoga avatar Dec 27 '17 00:12 kutoga

Great! One more problem skated over with a slight jump, then the next blocker:

ValueError: You are passing a target array of shape (2048, 1) while using as loss 'categorical_crossentropy'. 'categorical_crossentropy' expects targets to be binary matrices (1s and 0s) of shape (samples, classes). If your targets are integer classes, you can convert them to the expected format via:

from keras.utils.np_utils import to_categorical
y_binary = to_categorical(y_int)

Alternatively, you can use the loss function sparse_categorical_crossentropy instead, which does expect integer targets.

-> Went with sparse_categorical_crossentropy; I'm an adult script kiddie, as we used to call them.

Now maybe it's a python thing with adapting the mnist example again:

        x_test, y_test = validation_data
        p = model._model.predict(x_test, batch_size=batch_size)
        total = len(y_test)
        ok = np.sum(np.argmax(y_test, axis=1) == np.argmax(p, axis=1))

    ok = np.sum(np.argmax(y_test, axis=1) == np.argmax(p, axis=1))

  File "/home/phobrain/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 963, in argmax
    return _wrapfunc(a, 'argmax', axis=axis, out=out)
  File "/home/phobrain/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)

numpy.core._internal.AxisError: axis 1 is out of bounds for array of dimension 1

Re Btw: If you imagine each pair as a 2-point vector in the dimensions of the mind, with the magnitude of the vector matching the reaction of a given person in the moment, analogous to the vector in Hinton's capsules, I am in effect mapping my mind as a prototype / first step towards creating a new initially-mimetic life form that can interact with people in a variety of ways and hopefully provide some kind of mass therapy for humanity. A corollary is that it could also be a universal ID system, and if only because of that might need to be sharded to everyone's browser on a vast scale to avoid any one human or group getting control.

Most of that thinking has gelled since I went from 1 pic to 2 a year ago; here's a little paperlet/proposal that inspired the 2 pic version:

http://phobrain.com/pr/home/SCMETA_Ross.pdf

For another mode of interaction I have in mind, see 'knuckle snake' here:

https://www.reddit.com/r/MachineLearning/comments/7blhax/p_visualization_of_high_dimensional_dataset_for/

Can you divulge what you are working on?

phobrain avatar Dec 27 '17 00:12 phobrain

I never used sparse_categorical_crossentropy, but I think it should just work with your data. The alternative is to use to_categorical. If you use to_categorical, you can just use the MNIST-example code. In the MNIST example exactly this is done:

https://github.com/kutoga/going_deeper/blob/master/experiment_01_mnist.py#L29

...
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
...

You can reuse the code to detect the number of classes (line 19) or just define the num_classes-variable with the correct class count.

If you do this, you can just copy the code of the MNIST example for the training. Unfortunately, the code for the experiments is not extremely well documented, but it doesn't do any fancy things.

Let me know if this doesn't help.

Re Re Btw: Your ideas in the linked document and on the website sound quite interesting, but I don't get everything so far. Are there some more documents to read? Also maybe some more about the 10 algorithms / distance functions currently used? Which algorithms are used, and why? I think it is quite an interesting task to solve your problem with neural networks. At least it should be possible to train a good distance function for the images. Are you currently only working on this project, or also on some other projects?

Can you divulge what you are working on?

Currently, I'm almost only working on neural networks that can cluster data in an end-to-end fashion. I do this for my master's thesis (and because it is fun to do). The input of the neural network is a set of objects (e.g. images, audio, point data, ...) and the output is a set of clusters. This topic is quite interesting:) Like almost any other student, I also do some small "side-experiments" all the time, and sometimes they result in something like this git repository.

kutoga avatar Dec 27 '17 03:12 kutoga

to_categorical - so I'd get two weights, for yes/no? I'd rather get weights or distance than a 0/1 choice, so I can sort the matches to a photo.

Rerere:

The distance functions for color distance are just rgb cartesian and a bunch of histograms like mentioned on the Calculations link off the results, e.g. Hue*Sat 48x48 is my workhorse going into nets. I started with Cartesian distance for the histograms, used another for a while, and am now using Poincare spherical. But that's all water under the dam, since it yielded no better than 20% good pairs, vs. at least 50% when I feed my siamese results back into my pair generator. All the docs are off of:

http://phobrain.com/pr/home/explain.html

The real secret sauce is in how I map user mouse movement/clicks into weighted photo selection, distantly inspired by cookie-based bids for ad selection on web pages, but e.g. without cookies. Aside from satisfying my personal taste (enabling me to look at ~300K pairs to develop the training data), I don't think there is any magic in it, and the experience certainly hasn't gotten other people's interest - all the more reason to abandon it for new ideas I'm germinating now, touched on a little above.

This is my main project, and with the political situation and humanity's overall situation, I'm turning it more and more toward saving the world, with little effect on day to day operations as a retired person. One fun thing I'd like to try if someone threw me the resources would be to train with one pic as the input and the other as the result, to see what I'd get. A side effort is just thinking vaguely about high dimensional things and how we can perceive them.

On clustering, at some point I will want to map people (and photos and pairs) into clusters both retrospectively in bulk and dynamically for individuals, in order to generate hypotheses in some black box way for the personality type they are, what they are feeling, and how to respond, just as we size up another person or animal we are interacting with. Since I'd have to pay people to look at the site as it stands if I wanted that sort of data, I'm concentrating on my other ideas for developing the experience, to see if that will turn the tide, maybe attract others more capable.

The reason I like this GInftly project is that it automates just the thing I was thinking of automating as a model grid search by python. I wonder if pytorch would lend itself more than keras to the dynamic insertion of nodes.

Back to data prep roulette or pinball.. converting 0's (accepted NN suggestions) and 4's (keyword matches) into 1's (training 1's) or 2's (training 0's). Note the different 1-to-2 ratios for vertical and horizontal, which is the data-side impetus for fooling with another type of net, since the conclusion is that symmetry must be broken for vertical/portrait-oriented pairs:

 count  | status | vertical
--------+--------+----------
    976 |      0 | t
   3288 |      0 | f
  94084 |      4 | t
 121195 |      4 | f
  31618 |      1 | t
 183563 |      1 | f
  39041 |      2 | t
  24647 |      2 | f
     21 |      3 | t

So javascript tools for slicing and dicing lists of pairs for bulk approvals is the word of the day, since clicking Y/N on 200K pairs is tedious no matter how they are arranged. Another vague hope is that people will get together to develop a generalized image tagging workbench, ideally with bulk object recognition built in, wherein I'd contribute to the human factors, since I've designed a variety of workbenches, including for molecular modeling and news editing/analysis for a news agency.

phobrain avatar Dec 27 '17 04:12 phobrain

to_categorical - so I'd get two weights, for yes/no? I'd rather get weights or distance than a 0/1 choice, so I can sort the matches to a photo.

If you just want a distance as output, then to_categorical probably doesn't make sense. It depends on whether your distances are discrete values (=classes) or just a real number.

If they are a real number, then I would suggest using the squared error as the loss and letting the network return just a single value.

If you have classes, then to_categorical makes sense. But it seems you want to have (positive) real numbers as a result?
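A minimal sketch of that regression setup in plain Keras (layer sizes are placeholders; relu on the output is just one way to keep it non-negative):

from keras.models import Sequential
from keras.layers import Dense

# Hedged sketch: predict a single non-negative real distance with squared error.
reg_model = Sequential()
reg_model.add(Dense(128, activation='relu', input_shape=(2980,)))
reg_model.add(Dense(1, activation='relu'))    # one (positive) real-valued output
reg_model.compile(optimizer='adadelta', loss='mean_squared_error')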

(Re)^4:

Thanks for the very detailed explanations:)! Also thanks for the link with even more explanations.

Your project is really interesting. Probably you should think about buying a Titan X or maybe some cheaper GPU to do some tests. Then you can train a network where you directly input the complete image, instead of some feature representations. If you have time, it is in general quite interesting and great fun to play around with deep learning things (especially in combination with a powerful GPU).

The reason I like this GInftly project is that it automates just the thing I was thinking of automating as a model grid search by python. I wonder if pytorch would lend itself more than keras to the dynamic insertion of nodes.

The idea of this GInfty-thing is really to create more general models. If it works, this could probably be helpful in many scenarios. Unfortunately, only very few tests have been done, which makes it quite hard to determine if it is helpful for your project. I would probably focus on a more standard architecture and just test it if you have enough time. You are probably right that PyTorch would be the better choice for an implementation. PyTorch should allow implementing this layer in a much more natural way. Unfortunately, I've never used PyTorch (this point is still on my TODO list), but I'm sure it would make for a better implementation.

Back to data prep roulette or pinball.. converting 0's (accepted NN suggestions) and 4's (keyword matches) into 1's (training 1's) or 2's (training 0's). Note the different 1-to-2 ratios for vertical and horizontal, which is the data-side impetus for fooling with another type of net, since the conclusion is that symmetry must be broken for vertical/portrait-oriented pairs:

Hm, so you have 5 discrete values for the status (0, 1, 2, 3 and 4)? But for the neural network based solution you just want to get more or less a "binary value" that suggests a pair or not? As you show, your data is imbalanced? This makes the training a bit trickier, but often a weighted loss function already helps quite well.
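One simple form of loss weighting in Keras is the class_weight argument of fit(); a hedged sketch, reusing the thread's x_train/y_train and deriving weights from inverse class frequency (keras_model stands for the underlying plain Keras model, and the weighting scheme is only illustrative):

import numpy as np

# Hedged sketch: weight classes by inverse frequency to counter the imbalance,
# then pass the dict to the Keras fit() call via class_weight.
counts = np.bincount(y_train.astype(int))
total = float(len(y_train))
class_weight = {c: total / (len(counts) * counts[c]) for c in range(len(counts))}
keras_model.fit(x_train, y_train, batch_size=2048, epochs=10, class_weight=class_weight)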

So javascript tools for slicing and dicing lists of pairs for bulk approvals is the word of the day, since clicking Y/N on 200K pairs is tedious no matter how they are arranged. Another vague hope is that people will get together to develop a generalized image tagging workbench, ideally with bulk object recognition built in, wherein I'd contribute to the human factors, since I've designed a variety of workbenches, including for molecular modeling and news editing/analysis for a news agency.

Image tagging should probably already be possible with a few networks. E.g. if you use something like what is described in the following post:

https://www.linkedin.com/pulse/how-make-neural-networks-describe-images-s-k-reddy/

But again, as soon as you start to play around with such huge networks and if you like to re-train some of them, I highly recommend using GPUs. Maybe GPUs in the Amazon cloud, but I prefer a local one:)... Without GPUs, it is not that much fun.

I am happy that you work on your project and you seem to have fun with it. And it is not only a just-for-fun project, but it can be really useful:)! But I see that it is not that easy to get the labels for the 200K image pairs. Maybe if you use a pre-trained network that generates good tags or can be used for very general object detection tasks, it is easier to match two images.

kutoga avatar Dec 28 '17 14:12 kutoga

Nothing like 7K decisions on data to sharpen one's mind for programming. Starting from where I left off, I was going from a GINftly Dense of dense_size/2 directly to my sigmoid Dense(1, activation='sigmoid'). Adding a normal Dense(dense_size/2) after the GInftly before the sigmoid gets me the below, showing the histogram and prediction values - note the prediction values are 1==match [db status 1], 0==no match [db status 2, plus another table of seen and rejected for further consideration].

Likely this is significant: "loss: nan - categorical_accuracy: 1.0000". Changing to binary_accuracy gives e.g. "loss: 0.2511 - binary_accuracy: 0.4819".

2017-12-30 00:41:24.577499: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.645
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 10.63GiB
2017-12-30 00:41:24.577515: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1055] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
Rebuild model...
Model parameter count: 580319
Additional unused parameters: -39504
Iteration 0
vshape (array([[  0.00000000e+00,   1.63825838e-05,   1.63825838e-05, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  3.49127032e-05,   1.34413907e-03,   2.22393919e-02, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  0.00000000e+00,   0.00000000e+00,   9.06587573e-06, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       ..., 
       [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  3.82531372e-02,   5.44943602e-02,   5.90752180e-02, ...,
          0.00000000e+00,   0.00000000e+00,   1.00000000e+00],
       [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00]]), array([ 0.,  0.,  1., ...,  1.,  0.,  1.]))
 |----------------------------------------------------------------------------------Epoch 1/1---------| 0.0% 
2048/2048 [==============================] - 0s - loss: 0.2507 - binary_accuracy: 0.5078
2048/2048 [==============================] - 0s
dfc0.w = 1.0
dfc1.w = 1.0
Rebuild model...
Model parameter count: 580319
Additional unused parameters: -39504
 |####################################################################################################| 100.0% 
TOTALLLLLL  2048 p (2048, 1)
Rebuild model...
Model parameter count: 580319
Additional unused parameters: -39504
Iteration 0
vshape (array([[  1.86794007e-04,   1.74341073e-03,   1.55039025e-02, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  9.75562893e-01,   1.49892808e-01,   6.50946888e-02, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  1.01818974e-02,   7.03830613e-03,   8.70753919e-01, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       ..., 
       [  0.00000000e+00,   0.00000000e+00,   0.00000000e+00, ...,
          1.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  9.59749874e-01,   7.72895513e-02,   4.80023198e-02, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00],
       [  1.58532803e-01,   3.97479845e-02,   2.09562393e-02, ...,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00]]), array([ 1.,  0.,  1., ...,  0.,  0.,  0.]))
 |----------------------------------------------------------------------------------Epoch 1/1---------| 0.0% 
2048/2048 [==============================] - 0s - loss: 0.2511 - binary_accuracy: 0.4819
2048/2048 [==============================] - 0s
dfc0.w = 1.0
dfc1.w = 1.0
Rebuild model...
Model parameter count: 580319
Additional unused parameters: -39504
 |####################################################################################################| 100.0% 
TOTALLLLLL  2048 p (2048, 1)
Traceback (most recent call last):
  File "asym.py", line 869, in <module>
    doit()
  File "asym.py", line 791, in doit
    ok = np.sum(np.argmax(y_test, axis=1) == np.argmax(p, axis=1))
  File "/home/phobrain/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 963, in argmax
    return _wrapfunc(a, 'argmax', axis=axis, out=out)
  File "/home/phobrain/anaconda2/lib/python2.7/site-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)
numpy.core._internal.AxisError: axis 1 is out of bounds for array of dimension 1

Where

    p = model._model.predict(x_test, batch_size=batch_size)
    total = len(y_test)
    print('TOTALLLLLL  ' + str(total) + ' p ' + str(p.shape))
    # TOTALLLLLL  2048 p (2048, 1)
    ok = np.sum(np.argmax(y_test, axis=1) == np.argmax(p, axis=1))  ## line 791
    nok = total - ok
    print("test: {}/{} ok, p={}".format(ok, total, ok / total))

Maybe that'll be obvious enough, so I won't dig for now; note it went to a second epoch 1/1 without that "test: {}/{}..." line of yours printing, but my TOTALLL line printing twice - I don't get it. Here's the end/init of my model:

    model += Dense(dense_size/2)
    model += Dense(1, activation='sigmoid')
    model.init(
            optimizer='adadelta', 
            #loss='kullback_leibler_divergence',
            loss='mean_squared_error',
            metrics=['binary_accuracy']
    )

On to the loss. Yes, 0=identity, up to whatever limit the system wants, but in light of that maybe I should switch my 1 for match and 0 for no match.

What about kullback_leibler_divergence? Sounds portentous, must be good for something.. might work:

https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

It doesn't change the above result. cosine_proximity could be worth a try based on a past lifetime. That and your suggested plain and practical mean_squared_error wind up with the same error, so I'll see what you see in the above.

Re^N:

Then you can train a network where you directly input the 
complete image, instead of some feature-representations. 
If you have time, it is in general quite interesting and a great 
fun to play around with deep learning things (especially in 
combination with a powerful GPU).

I started with inceptionv3, and if I remember right I had to use a batch size of ~5 on my 1080ti, since each case has 2 pics, and it seemed kind of slow. Since I got pretty good mileage with histograms, rather than try with slower-to-process images, my next inspiration was to do a massive data shaping, since I figure 'anyone' can do the deep learning, and who knows what will emerge in the field while I'm shaping the data, but I am the only person knowing/caring how to answer these quasi-rorschach puzzles. Once I have that data together, I hope to retire from that eye-numbing pastime and work full-time on nets (using full pics too).

Maybe I'll be able to attract a grant to go full-scale on cloud servers, and maybe try to train nets to synthesize one scene from another. Either way, I hope to replace myself at the generation task, hopefully in the short run no longer using curated pairs for the main view.html view, and I might try importing labeled non-artistic photos from some public source and see how it works on them. And at that point beginning to work on feeding user behavior into some recurrent net to modify the responses like my current chewing-gum-and-dental-floss AI does but better, then on to synthesized meaning, whatever that is.

Image tagging probably should already be possible with a few networks.

I tried retraining inceptionv3 per Tensorflow for Poets, and found it didn't like the sorts of categories I use:

http://phobrain.com/pr/home/tf_poet.html

So not having the resources to do full-on image net training (1 card), I wound up shelving that, but having 5K new images at the moment, of which I'll select and write 1-8 keywords for probably 3K, makes auto-tagging more appealing. I'd like to just download a workbench, point it at a dir full of photos, specify my output format, and then step through the photos, clicking 'tag' on each one, seeing suggestions pop up, being able to edit/add/delete tags, and save them to file. Ideally you'd be able to manipulate the image too, e.g. brighten and crop, and have it save to dimensions your net expects.

As you show, your data is imbalanced? This makes the training a bit trickier, 
but often a weighted loss-function already helps quite well.

I trained on 50/50 positive/negative cases since it seems to work better, and tested on 20/80 positive/negative cases to match the shape of my workbench-suggested data before I added the predictions back in and separately started keyword-driven bulk addition. I haven't figured out how to shape my test data for the second round, I'm still maybe as little as a sixteenth of the way onto shaping the training data. The tradeoff is between Disney-obvious pairs in huge numbers from the bulk keyword matches, like flowers and towers, and more unique and interesting cases from the workbench. E.g. the 32K-dimensional Golden Angle derived pairs provided a whole different type of sampling for a block of training cases.

phobrain avatar Dec 30 '17 08:12 phobrain

I added a new view with just photos containing these keywords:

http://phobrain.com/pr/home/view.html

Phob->Photogs/Subjects->Select (5K pics)

flower tower bridge spire dome face faces 
sculpture graffiti downtown juxtapose_align 
juxtapose juxtapose_old_new juxtapose_pattern 
juxtapose_color juxtapose_concept juxtapose_size 
juxtapose_texture juxtapose_opposite juxtapose_ontop

Then if you

Phob->Search Mode: AI

The selection forces the correlation rate high enough so that I think e.g. {Sigma-x for random selection of one of the other Sigma options} to form a spontaneous match is perhaps more interesting than my default curated experience on the whole set of photos, which can be seen for the current Photog/Subject via the 'c' option when in Search Mode. Plus most interesting to compare to random, '|' which is somewhat interesting due to the refined selection it operates on, but the nonexistent sequencing soon seems obvious to me.

I'm thinking of switching some variant of that to my default experience for the new year.

phobrain avatar Dec 30 '17 11:12 phobrain

Sorry for the late answer, but I was on holiday.

Likely this is significant: "loss: nan - categorical_accuracy: 1.0000" Changing to binary_accuracy, which gives e.g. "loss: 0.2511 - binary_accuracy: 0.4819".

Ok. At the moment I do not see why there are NaNs, but this is something that has to be fixed. Thanks for the error report.

    model += Dense(dense_size/2)
    model += Dense(1, activation='sigmoid')
    model.init(
            optimizer='adadelta', 
            #loss='kullback_leibler_divergence',
            loss='mean_squared_error',
            metrics=['binary_accuracy']
    )

If binary_accuracy is used, then probably binary_crossentropy should be the better choice for the loss-function (if it does not produce NaN-values; but this should not be the case).
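A minimal sketch of that change, mirroring the model.init call above (the GInfty-style API is taken from this thread):

model += Dense(1, activation='sigmoid')
model.init(
        optimizer='adadelta',
        loss='binary_crossentropy',
        metrics=['binary_accuracy']
)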

What about kullback_leibler_divergence? Sounds portentous, must be good for something.. might work:

https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

It doesn't change the above result. cosine_proximity could be worth a try based on a past lifetime. That and your suggested plain and practical mean_squared_error wind up with the same error, so I'll see what you see in the above.

The KL-divergence is quite powerful. I used it in some other projects in a similar way to Lukic et al.: https://pd.zhaw.ch/publikation/upload/212963.pdf It may be helpful for you if you generate embeddings like in the linked paper; otherwise I do not directly see how it can be used for your use-case (but I did not spend much time thinking about this; maybe it could be).

I started with inceptionv3, and if I remember right I had to use a batch size of ~5 on my 1080ti, since each case has 2 pics, and it seemed kind of slow. Since I got pretty good mileage with histograms, rather than try with slower-to-process images, my next inspiration was to do a massive data shaping, since I figure 'anyone' can do the deep learning, and who knows what will emerge in the field while I'm shaping the data, but I am the only person knowing/caring how to answer these quasi-rorschach puzzles. Once I have that data together, I hope to retire from that eye-numbing pastime and work full-time on nets (using full pics too).

Did you also try to use downscaled images? I think if you use a resolution of ~256x256, you should already get good results, and probably the network would then be faster? Almost for sure not as fast as with histograms, but still quite fast.

I added a new view with just photos containing these keywords:

http://phobrain.com/pr/home/view.html

Phob->Photogs/Subjects->Select (5K pics)

flower tower bridge spire dome face faces 
sculpture graffiti downtown juxtapose_align 
juxtapose juxtapose_old_new juxtapose_pattern 
juxtapose_color juxtapose_concept juxtapose_size 
juxtapose_texture juxtapose_opposite juxtapose_ontop

Then if you

Phob->Search Mode: AI

The selection forces the correlation rate high enough so that I think e.g. {Sigma-x for random selection of one of the other Sigma options} to form a spontaneous match is perhaps more interesting than my default curated experience on the whole set of photos, which can be seen for the current Photog/Subject via the 'c' option when in Search Mode. Plus most interesting to compare to random, '|' which is somewhat interesting due to the refined selection it operates on, but the nonexistent sequencing soon seems obvious to me.

I'm thinking of switching some variant of that to my default experience for the new year.

Thanks for the new view and the additional description:)!

kutoga avatar Jan 07 '18 23:01 kutoga

Revised: solved by changing my label dimension. My y_test.shape was (2048,) while p is (2048, 1). I had

    read_file(pair_dir + '/' + pr_type + '.pos', pairs, labels, 1.) # 1. == label
        ... labels += [label]

and changing that to += [[label]] gets things to run. Chug, chug, GPU-Util usually 0%, with a flash of 2 or 3.
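An alternative sketch of the same fix in numpy: reshape the labels to a column vector so they line up with the (batch, 1) sigmoid predictions, and score with a 0.5 threshold instead of argmax (argmax over axis=1 is meaningless for a single output):

import numpy as np

y_test_col = np.asarray(y_test).reshape(-1, 1)      # (2048,) -> (2048, 1)
ok = np.sum((p > 0.5).astype(int) == y_test_col)
print("test: {}/{} ok".format(ok, len(y_test_col)))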

If binary_accuracy is used, then probably binary_crossentropy should be the better choice for the loss-function (if it does not produce NaN-values; but this should not be the case).

Using that, here is the plot:

http://phobrain.com/pr/home/gallery/w_5e-05_1e-08.png

KL-divergence ... Lukic et al.

Looks like they are training with labels of 1=match, 0=no_match, and getting distances of 0=identity, which is exactly what I want to do, not sure if I am; so I wish I could see their code to map. But they do symmetric KL divergence, whereas it seems like I'd want it asymmetric, and not sure how keras does it? Read the code, Luke; this looks asymmetric to me:

def kullback_leibler_divergence(y_true, y_pred):
  y_true = K.clip(y_true, K.epsilon(), 1)
  y_pred = K.clip(y_pred, K.epsilon(), 1)
  return K.sum(y_true * K.log(y_true / y_pred), axis=-1)

Leading to the hope I can use the option off the shelf for my purposes, other things being equal. Drop-in result:

http://phobrain.com/pr/home/gallery/w_5e-05_1e-08_kl.png

In both cases, it seems wrong that one, maybe both of the fc layers are at 1? Not sure what those lines represent.

Did you also try to use downscaled images?

Inceptionv3 uses 299x299.

Thanks for the new view and the additional description:)!

I have been recreationally selecting matches from my nets' aggregated votes in that area, and it is big enough that I forget it's just a corner of the space. Less recreationally, there are ~70K bulk keyword vertical pairs to chew through to produce a first cut of my asymmetric training data.

While you were out having a good time :-), I made a pitch to the blockchain community to help me save the world by protecting and augmenting one's psychological and economic identities (not sure if the two can be tied, but hoping each facet can guarantee the other):

https://www.reddit.com/r/BlockChain/comments/7o093g/decentralized_identity_verification_via_behavior/

As you can see in the discussion, plenty of openings remain in the org chart.

A few pairs:

https://forums.craigslist.org/?ID=287554777

phobrain avatar Jan 08 '18 08:01 phobrain

Using that, here is the plot:

http://phobrain.com/pr/home/gallery/w_5e-05_1e-08.png

It doesn't look great, but at least the loss is going down. Do you use much data (& validation data)?

Looks like they are training with labels of 1=match, 0=no_match, and getting distances of 0=identity, which is exactly what I want to do, not sure if I am; so I wish I could see their code to map.

Yes, that is how they are doing it. They do it in a symmetric way, because in their problem statement this is more natural: If you compare two speakers, the result should not depend on the order of the speakers. I know their code and it is not really magic, they just use the relatively simple formula which they describe in the paper.

But they do symmetric KL divergence, whereas it seems like I'd want it asymmetric, and not sure how keras does it? Read the code, Luke; this looks asymmetric to me:

def kullback_leibler_divergence(y_true, y_pred):
  y_true = K.clip(y_true, K.epsilon(), 1)
  y_pred = K.clip(y_pred, K.epsilon(), 1)
  return K.sum(y_true * K.log(y_true / y_pred), axis=-1)

Yes, the KL-divergence is in general asymmetric and, therefore, it is implemented like this in Keras.

Leading to the hope I can use the option off the shelf for my purposes, other things being equal. Drop-in result:

http://phobrain.com/pr/home/gallery/w_5e-05_1e-08_kl.png

In both cases, it seems wrong that one, maybe both of the fc layers are at 1? Not sure what those lines represent.

The KL-divergence should not be used to compare a single output value, but to compare distributions; this means you should not only change the loss-function. Lukic et al. do this by using a softmax-layer (without predefined fixed classes) and then using the symmetric version of the KL-divergence to compare a pair of network outputs.
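A hedged sketch of such a symmetric KL-divergence between two softmax outputs, written with the Keras backend (this is only the loss term, not Lukic et al.'s full setup):

from keras import backend as K

def symmetric_kl(p, q):
    # sum of KL(p||q) and KL(q||p) over the two softmax distributions
    p = K.clip(p, K.epsilon(), 1)
    q = K.clip(q, K.epsilon(), 1)
    kl_pq = K.sum(p * K.log(p / q), axis=-1)
    kl_qp = K.sum(q * K.log(q / p), axis=-1)
    return kl_pq + kl_qp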

Did you also try to use downscaled images?

Inceptionv3 uses 299x299.

Ok:)

While you were out having a good time :-), I made a pitch to the blockchain community to help me save the world by protecting and augmenting one's psychological and economic identities (not sure if the two can be tied, but hoping each facet can guarantee the other):

https://www.reddit.com/r/BlockChain/comments/7o093g/decentralized_identity_verification_via_behavior/

As you can see in the discussion, plenty of openings remain in the org chart.

A few pairs:

https://forums.craigslist.org/?ID=287554777

Thanks for the interesting discussion and links:)!

kutoga avatar Jan 08 '18 21:01 kutoga

Do you use much data (& validation data)?

25K/25K pos/neg pairs to train; 2.8K/14K to test, which reflects the ratio in the wild. I could bump the width from greyscale_128+keyword_vector_1200ish to hue_sat_2304+kwd_vector to improve the quality of the data.

My newbie question is, if I'm providing 1 and 0 as pos/neg labels, wouldn't the final Dense(1, sigmoid) train to produce 1==pos, vs. 0==neg, vs. 0==close..some_X==far? E.g. simplified to Input/Dense(32)/Dense(1, sigmoid). If I had to guess, the loss function would be what determines it, unless it requires a different sort of net.

phobrain avatar Jan 09 '18 01:01 phobrain

I think the amount of pairs should be sufficient.

Yes, if you define 1 as the positive and 0 as the negative label, then the network does exactly this. As you mentioned, a final layer with sigmoid makes sense. The higher the value (maximum 1), the more probable it is a positive result. The lower the value (minimum 0), the more probable it is that the images do not match.

The loss function is very important. The most natural choice for such a binary classification problem is the binary crossentropy. If the network output is interpreted as a probability (which can easily be done, because the value range is [0,1]), the log likelihood can be maximized; this directly leads to the binary crossentropy (e.g. see https://www.quora.com/What-are-the-differences-between-maximum-likelihood-and-cross-entropy-as-a-loss-function/answer/Jonathan-Gordon-23?srid=hTDfW )
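A small illustration of that relation in plain numpy (the numbers are made up):

import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # negative log-likelihood of a Bernoulli output, averaged over the batch
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(binary_crossentropy(np.array([1., 0., 1.]), np.array([0.9, 0.2, 0.7])))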

Of course, other loss functions like the mean squared error also work, but probably less effectively, because they are less natural for the given problem.

As soon as everything works, it might also be interesting to test the network output for confidence. This can easily be done with forward dropout: https://arxiv.org/pdf/1506.02142.pdf
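A hedged sketch of that idea with the Keras backend, keeping dropout active at prediction time via the learning-phase flag (model._model and x_test are the thread's variables; the 20 passes are arbitrary):

from keras import backend as K
import numpy as np

# build a stochastic predict function with the learning phase forced to 1,
# so Dropout layers stay active during prediction
predict_stochastic = K.function(
    [model._model.input, K.learning_phase()],
    [model._model.output])

# repeated stochastic forward passes; the spread is a rough confidence estimate
samples = np.stack([predict_stochastic([x_test, 1])[0] for _ in range(20)])
mean_pred = samples.mean(axis=0)
std_pred = samples.std(axis=0)    # higher std -> less confident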

kutoga avatar Jan 09 '18 09:01 kutoga

Btw: The code for the symmetric KL-divergence is now available online: https://github.com/stdm/ZHAW_deep_voice/tree/master/networks/pairwise_kldiv

kutoga avatar Jan 10 '18 07:01 kutoga

symmetric KL code

Thanks!

I set labels to 0.=match, 1.=no_match, switched to vertical pairs (a smaller sample, but the one I'm interested in), and added Hue*Sat 48x48 to greyscale and keywords for an input width of 7588 and dense_size of 1046, plus relu's on the first, second-to-last, and a middle non-GInfty Dense. Using binary crossentropy I got

http://phobrain.com/pr/home/gallery/t1_w_5e-05_1e-08.png

which took about 7 minutes, so I tried setting up to run 1000 epochs, and at epoch 102 got

2018-01-09 23:33:38.330500: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 30.28MiB.
...
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[7588,1046]
	 [[Node: training_102/Adadelta/gradients/dense_1_103/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, _class=["loc:@dense_1_103/MatMul"], transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_input_1_0_2/_6765, training_102/Adadelta/gradients/dense_1_103/Relu_grad/ReluGrad)]]

Switched adadelta to RMSProp, got to Iteration 155:

Rebuild model...
Model parameter count: 9998453
Additional unused parameters: -1375490
...
Iteration 155
 |------------------------------------------------------------------------------Epoch 1/1-------------| 0.0% 
2018-01-10 00:02:46.242492: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 8.0KiB.
...
2] Stats: 
Limit:                 10916963943
InUse:                 10916963840
MaxInUse:              10916963840
NumAllocs:                   26037
MaxAllocSize:            113913600

2018-01-10 00:04:27.179113: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ****************************************************************************************************
2018-01-10 00:04:27.179131: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[1]
  File "asym.py", line 791, in doit
    batch_size=batch_size, debug_print=True
  File "/hdd/phobrain/keras/hp2/ginfty.py", line 426, in train_batch
    **kwargs
  File "/hdd/phobrain/keras/hp2/ginfty.py", line 375, in train_step
    res = self._model.fit(x, y, batch_size=batch_size, **kwargs)
  File "/home/phobrain/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 1598, in fit
    validation_steps=validation_steps)
  File "/home/phobrain/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 1183, in _fit_loop
    outs = f(ins_batch)
  File "/home/phobrain/anaconda2/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2273, in __call__
    **self.session_kwargs)
  File "/home/phobrain/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/phobrain/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/phobrain/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/home/phobrain/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Dst tensor is not initialized.
	 [[Node: _arg_dense_5_sample_weights_155_0_0/_10261 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_644__arg_dense_5_sample_weights_155_0_0", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
	 [[Node: loss_155/mul/_10303 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_653_loss_155/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

phobrain avatar Jan 10 '18 08:01 phobrain

What happens if you decrease the batch-size? The problem is, if layers are added and the batch-size is already quite high, then there might be some memory issues. This could be the case here.

kutoga avatar Jan 10 '18 09:01 kutoga

Batch size 2048->512: OOM at same Iteration 155, slightly different stats:

Limit:                 10916963943
InUse:                 10899258624
MaxInUse:              10916435968
NumAllocs:                   26261
MaxAllocSize:             55796736

phobrain avatar Jan 10 '18 11:01 phobrain

Keeping batch 512, going back to just greyscale+kwds, input=2980, dense_size=315:

Model parameter count: 1127264
Additional unused parameters: -126234

http://phobrain.com/pr/home/gallery/t2_w_5e-05_1e-08.png

phobrain avatar Jan 10 '18 12:01 phobrain

Going back to my GInfty-removed debug version (grey+keywords), batch_size 2048, it runs 10 epochs per Iteration, and winds up after 100 Iterations at loss: 0.0018 - binary_accuracy: 1.0000 - val_loss: 1.7073 - val_binary_accuracy: 0.6963. Validation accuracy sounds good for a start, 73% being the best I saw with siamese nets using wider data, odd that validation loss is so high.

Throwing in hs48:

Batch 2048:
loss: 2.3588e-04 - binary_accuracy: 1.0000 - val_loss: 2.3872 - val_binary_accuracy: 0.6240. 
Batch 4096:
loss: 3.9485e-04 - binary_accuracy: 1.0000 - val_loss: 2.0337 - val_binary_accuracy: 0.6638
Plus 5 extra Dense's at end for fun:
loss: 1.4810e-04 - binary_accuracy: 1.0000 - val_loss: 2.7888 - val_binary_accuracy: 0.6587

phobrain avatar Jan 10 '18 21:01 phobrain

It seems to overfit. Do you use Dropout?
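A minimal sketch of what that could look like in the model-building style used above (0.5 is just a common starting rate):

from keras.layers import Dense, Dropout

# Hedged sketch: interleave Dropout with the plain Dense layers to fight overfitting.
model += Dense(dense_size/2)
model += Dropout(0.5)
model += Dense(dense_size/2)
model += Dropout(0.5)
model += Dense(1, activation='sigmoid')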

kutoga avatar Jan 10 '18 22:01 kutoga

I noticed the overfit.. it looks like validation accuracy just bounces around after the first few Iterations, like in the graphs above, but I haven't graphed yet. Adding BatchNorm+Dropout(.1) after first Dense like in my best grey+hs48+kwd siamese model, dropping the terminal extra Dense's:

loss: 5.8700e-04 - binary_accuracy: 1.0000 - val_loss: 2.7829 - val_binary_accuracy: 0.6531

I'm guessing it might be running twice as fast as the siamese models, since both pics' data is fed in together, rather than (perhaps) serially as laid out in the code, else "they"'d have to be crafting a way to share weights while running both twins at once. GPU use gets up to 22%, vs. 2-3% for the bursty GINfty version.

Bringing in the fit_generator and predict_generator framework from my siamese models, so I can do bulk predictions on all of train/test after fitting, and running multiple times; top GPU %'s are impressionistic. In this case, match = prediction<0.5.

-- epochs=20, top GPU 22%

Train %: 98.64 98.80 98.16 98.87 56.03
Test  %: 59.95 64.10 57.49 58.51 20.59
N: 5 Avg Accuracy, Train: 90.10% Test: 52.13%

-- epochs=30, real 5m52.563s

Train %: 99.72 99.80 99.73 99.73 99.56
Test  %: 66.56 66.22 65.70 66.91 65.41
N: 5 Avg Accuracy, Train: 99.71% Test: 66.16%

-- again/epochs=30, real 5m36.480s, top GPU 38%

Train %: 99.62 99.72 99.72 99.70 99.61
Test  %: 71.22 67.84 68.95 61.33 58.91
N: 5 Avg Accuracy, Train: 99.67% Test: 65.65%

-- epochs=30, real 5m33.440s, top GPU 29% ### RMSProp

Train %: 99.80 99.59 99.76 99.74 98.23
Test  %: 68.67 65.19 66.64 64.10 56.87
N: 5 Avg Accuracy, Train: 99.42% Test: 64.30%

-- again/epochs=30, real 5m42.392s, top GPU 34% ### RMSProp

Train %: 99.81 99.20 99.82 99.79 97.49
Test  %: 65.83 56.79 66.68 64.65 52.91
N: 5 Avg Accuracy, Train: 99.22% Test: 61.37%

-- epochs=30, real 5m41.086s, ### Adamax

Train %: 99.72 99.88 99.79 99.91 99.86
Test  %: 58.20 64.40 60.42 63.92 63.12
N: 5 Avg Accuracy, Train: 99.83% Test: 62.01%

-- epochs=40, real 7m7.615s, top GPU 30%

Train %: 99.86 99.83 99.87 99.82 99.85
Test  %: 67.75 67.38 64.78 62.75 64.03
N: 5 Avg Accuracy, Train: 99.85% Test: 65.34%

-- epochs=40, real 7m3.943s, top GPU 37% ### RMSProp

Train %: 99.84 99.78 99.75 99.79 99.79
Test  %: 64.08 66.30 66.76 65.01 65.72
N: 5 Avg Accuracy, Train: 99.79% Test: 65.57%

phobrain avatar Jan 10 '18 22:01 phobrain

Dropping the BatchNorm, top accuracy is 82%, so I'd run 100 times and use the top 20 models, and without playing with the model much, I've boosted top accuracy from ~75% on siamese to ~85%. Maybe this is why the siamese model was removed from the canonical keras examples and deleted from the repository.

-- epochs=30, real 5m47.303s, top GPU 37%

Train %: 68.54 71.20 64.42 71.19 70.05
Test  %: 78.15 59.20 81.90 60.59 56.84
N: 5 Avg Accuracy, Train: 69.08% Test: 67.33%

Later.. adding a tail of 6 Dense's per another dynamic model's generated case, getting test results up to 84%.

-- epochs=30, terminal: 5 dense 256's w/ final 128

Train %: 60.93 52.28 57.56 56.00 58.41
Test  %: 51.47 83.44 82.67 83.45 67.99
N: 5 Avg Accuracy, Train: 57.04% Test: 73.80%

-- last two 128

Train %: 54.23 57.94 50.47 59.59 53.02
Test  %: 83.53 82.85 83.41 80.49 83.61
N: 5 Avg Accuracy, Train: 55.05% Test: 82.78%

Train %: 59.81 60.51 61.08 60.18 63.14
Test  %: 48.17 54.29 51.09 78.04 74.33
N: 5 Avg Accuracy, Train: 60.94% Test: 61.18%

Train %: 52.75 51.90 63.76 61.97 50.04
Test  %: 83.47 83.42 64.76 57.83 83.36
N: 5 Avg Accuracy, Train: 56.08% Test: 74.57%


Note the bad training fit with the good prediction, which I haven't seen before. The loss hardly changes, e.g. starting at 0.6932 and ending at 0.6611.

phobrain avatar Jan 11 '18 06:01 phobrain

84% accuracy is already not that bad, but it should be possible to improve that.

Usually, the loss changes much more. It seems the network cannot improve that much. Are there any inconsistencies in the data? Is it always obvious that two images are a pair (at least for a human)? If it is not that clear given the training data, the neural network probably won't improve much.

kutoga avatar Jan 11 '18 08:01 kutoga

I think the loss was changing more before I added the 6 Dense's at the end, or maybe it was the Dropout; from above:

 loss: 2.3588e-04 - binary_accuracy: 1.0000 - val_loss: 2.3872 - val_binary_accuracy: 0.6240. 

Paring back to a few Dense layers still leaves the loss relatively stuck, though the other numbers change. Another diff is I'm using glorot_normal for kernel_initializer.

Dropping batch_size to 512 gets loss<0.1, training accuracy >98%, but avg test accuracy drops to ~63% at batch=1024.

It's feeling like 84% is a pretty hard limit. Update: added latest mid-shaping-process data, using epochs=5, batch=2048:

Train %: 73.67 70.86 67.65 72.47 71.46 71.19 69.19 70.17 67.36 72.55 69.22 71.87
Test  %: 79.35 84.50 87.80 80.29 63.24 84.35 86.40 83.20 87.92 78.54 87.14 81.97
N: 12 Avg Accuracy, Train: 70.64% Test: 82.06%

Getting back to KL, I'm considering using histogram A as input and histo B as label, with KL for the loss, then taking the predicted histogram and finding the nearest pic neighbors in order by (just for the fun of it) Poincare spherical distance. Though given multiple B's (and types) for an A, keyword space would also need to be leveraged somehow.

phobrain avatar Jan 11 '18 10:01 phobrain

Ok, it really could be that this 84% is a hard limit. Maybe more data could improve it. Do you use some kind of data augmentation?
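Since your inputs are histogram/keyword vectors rather than raw images, one very simple augmentation sketch is to jitter the training inputs with a little noise each round; whether this helps here is an open question, it is only meant to illustrate the idea:

import numpy as np

def jitter(x, scale=0.01):
    # add small gaussian noise and keep histogram bins non-negative
    noisy = x + np.random.normal(0.0, scale, size=x.shape)
    return np.clip(noisy, 0.0, None)

x_train_aug = jitter(x_train)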

Using the (symmetric?) KL-divergence loss on top of a fully connected network should work just fine with histograms:)

kutoga avatar Jan 12 '18 09:01 kutoga

88% is the highest yet. I'm grinding out predictions from my top 100 models, which will take 2 days.

Written: vert_v/m_model_28_2048_5_69_86.pairs (1482800) in 0:06:55.677205

I might be getting such good results because I use the same set of 17K pics for all the matches, so likely they are being learned individually rather than via generalization, which is fine, but I'm eager to project onto pics from new sources to test that theory, once I'm done with this phase. Using nets trained on vertical pairs on horizontals didn't work well, and vice-versa, though that was before I added keywords. It confirms my original expectation that histograms wouldn't have enough info for generalized mapping. Augmentation might force generalization, if it's possible.

Haven't tried augmenting yet, instead was thinking I'd try with images again once I'm 'done' with generating data, in a 1-3 month time frame, other than random ideas like training to output a histogram, which now has my curiosity aroused, partly because I want to play with tradeoffs in color histograms vs. keyword vector weights. I'd use the asymmetric KL, to connote that I'm matching from left to right. But I'd try reversing the sense for the B->C part of the inter-pair transition [AB]->[CD].

phobrain avatar Jan 12 '18 12:01 phobrain

You get new pictures? Labelled training images?

Maybe it already helps if you add more features than just the histogram and the keywords? Anyway, I think the keywords are much more powerful than the histogram.

Anyway:) It is quite interesting.

kutoga avatar Jan 13 '18 12:01 kutoga

You can try keywords and histograms directly, with the green and yellow '+' options in Search Mode: AI, and pure histogram nets in different groupings in the Sigma-[0|theta|0theta] options - act now, since I blew away the underlying weights, so they (the Sigmas) will disappear in my next upgrade, hopefully in a week or so.

I decided to go for accuracy on positive predictions, and have gotten as high as 97%. I'm in the middle of calcs on horizontal pairs x 99 models, which were trained with loads more data, but still uneven. Then I'll redo vertical pairs with the new focus, and if it looks as good as the numbers suggest, I'll make different groups of nets fire when you click different parts of one or both photos in the default view.

Then I'll see if the new nets predict well on the ~3K photos I'm keywording for addition, to see if any generalization is occurring.

phobrain avatar Jan 30 '18 06:01 phobrain

I call it a crime against nature that I can't run my 95MB models on a 1030 Gtx or two, and let them chug away on predictions full-tilt, vs. wasting 85% of the 1080 ti. I'm wondering if a non-tensorflow keras backend would let me run parallel jobs on the 1080. It doesn't seem like an issue others have considered.

Vertical pairs turned out to be a very different animal, since the data has hardly any bulk keyword pairs. They will get different kinds of mappings in the UI, this time grouping nets by the train/test samples and accuracies. So far, there are 5 zones on each pic, center and 4 corners, but if I could run a bunch more models, I'd make each pic like a retina, with maybe 256 zones to map to different sorts of nets.

E.g. here's a pair that purely-histogram-trained nets just chose (Sigma0; me on left):

http://phobrain.com/pr/home/gallery/pair_me_sb_camo_woman_whiteface_stripes.jpg

phobrain avatar Feb 06 '18 08:02 phobrain