darts Doubt about the effectiveness of the method

Open twangnh opened this issue 4 years ago • 28 comments

Hi, thanks for sharing the work. We performed a rigorous evaluation of the framework in the CIFAR-10 setting with this code base. For each setting we ran the experiments independently 8 times and averaged the results (the accuracy of the searched architectures retrained on the whole training set).

"first-order" is the simple baseline that alternately optimizes the weight parameters and the architecture parameters; "lookahead" performs one step of fake weight update before calculating the gradients of the architecture parameters; "Random" is the random baseline without search; the curve "orig arch diff epochs" is the default second-order method run for different numbers of epochs; "paper" is the result reported in the paper.

However, two things confuse us. 1, simple alternating optimization (the first-order baseline) gets results similar to second order with a much smaller model size, and it actually runs more than two times faster than the second-order method; however, first order theoretically deviates too much from the formulation of the searching objective.
So maybe we are not even doing architecture search, but just eliminating some bad design choices that are easy to get rid of? 2, increasing training epochs leads to worse results after some point, which is really counter-intuitive, as searching longer should get results at least as good as searching for fewer epochs. Could you please give some advice on the issue? Thank you very much.
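
For reference, the searching objective referred to here is presumably the bilevel problem from the DARTS paper (an assumption, since the formula itself is not shown in the text), with \alpha the architecture parameters and w the network weights:

```latex
\min_{\alpha} \; \mathcal{L}_{val}\bigl(w^{*}(\alpha), \alpha\bigr)
\quad \text{s.t.} \quad
w^{*}(\alpha) = \arg\min_{w} \; \mathcal{L}_{train}(w, \alpha)
```

Read this way, the first-order baseline replaces w^{*}(\alpha) with the current w and ignores how w depends on \alpha.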

Jul 17 '19 08:07 twangnh

Hi, @twangnh !

Thanks for your questions. For the first question,

1, simple alternating optimization of first order baseline gets similar results to second order with much smaller model size and it actually performs more than two times faster than second order method, however first order is theoretically too much deviated from the formulation of searching objective:

I'm not entirely sure about it. But according to my understanding, the second order should be slower than the first order, because they make different assumptions about the gradient. As you said, second order assumes that the weights are determined by alpha. But for the first-order one, weights and alpha are treated as independent variables. The relationship is more like L(w, alpha), compared with L(w(alpha), alpha) (second order).
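
The difference between L(w, alpha) and L(w(alpha), alpha) can be made concrete on a one-dimensional toy problem. This is only a sketch: the two quadratic losses, the step size xi and all function names are made up for illustration and are not taken from the DARTS code.

```python
def train_grad_w(w, a):
    """d/dw of the toy training loss L_train(w, a) = (w - a)**2."""
    return 2.0 * (w - a)


def val_loss(w, a):
    """Toy validation loss L_val(w, a) = (w - 1)**2 + 0.1 * a**2."""
    return (w - 1.0) ** 2 + 0.1 * a ** 2


def virtual_step(w, a, xi):
    """One virtual gradient step on the training loss: w' = w - xi * dL_train/dw."""
    return w - xi * train_grad_w(w, a)


def first_order_grad(w, a):
    """Gradient of L_val w.r.t. a with w treated as a constant, i.e. L(w, alpha)."""
    return 0.2 * a


def unrolled_grad(w, a, xi):
    """Gradient of L_val(w'(a), a) w.r.t. a, i.e. L(w(alpha), alpha)."""
    w_new = virtual_step(w, a, xi)
    dw_new_da = 2.0 * xi  # derivative of w' with respect to a
    return 0.2 * a + 2.0 * (w_new - 1.0) * dw_new_da


w, a, xi = 0.0, 0.5, 0.1
g_first = first_order_grad(w, a)
g_unrolled = unrolled_grad(w, a, xi)
```

With these made-up numbers the first-order gradient is 0.1 while the unrolled one is about -0.26, so the two can even point in opposite directions.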

For the second question, I do know something about it:

2, increasing training epochs leading to worse results after some points, which is really counter-intuitive as searching longer should at least gets as good results as less epochs.

I assume your "training epochs" means "the iterations used during model searching", which I would call "searching epochs". If so, based on my experience with DARTS, searching for too many epochs tends to cause "over-searching". That is to say, the searched model ends up with too many operations without learnable parameters (i.e. skip-connects and poolings) when searching too long. Clearly, this phenomenon limits the performance of the searched model. Actually, Progressive DARTS (P-DARTS) noticed this issue and proposed a "search space regularization" to reduce the effect of "searching for too many skip-connects".
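
One rough way to spot over-searching is to count the parameter-free operations in a derived cell. The sketch below assumes the cell is given as a list of (operation, input index) pairs, the format that model.genotype() prints; the helper name and the example cell are hypothetical.

```python
# Name prefixes of operations without learnable weights (assumed naming).
PARAM_FREE_PREFIXES = ("skip_connect", "max_pool", "avg_pool")


def count_param_free(cell):
    """Count edges whose operation has no learnable weights."""
    return sum(1 for op, _ in cell if op.startswith(PARAM_FREE_PREFIXES))


# Illustrative cell, not a real search result.
example_cell = [
    ("skip_connect", 0), ("sep_conv_3x3", 1),
    ("skip_connect", 0), ("max_pool_3x3", 1),
    ("skip_connect", 2), ("dil_conv_3x3", 0),
    ("avg_pool_3x3", 1), ("skip_connect", 3),
]
n_free = count_param_free(example_cell)
```

Tracking this count per searching epoch shows whether the cell is drifting toward skip-connects and poolings.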

Let me know if you have some new findings!

GL,

Jul 18 '19 01:07 Catosine

Hi @Catosine, I have got the below genotypes from DARTS for an ultrasound dataset after training for 10 epochs. I was wondering why genotypes were similar during epochs 1 to 10, while accuracy was increasing? I thought the search was supposed to find the best-learned cell.

Also, how should I interpret this normal learned cell? why it does have two input nodes for CNN?

Could you please point me in the right direction. Thanks a lot.

Jul 31 '19 15:07 NdaAzr

@NdaAzr Hi! Thank you for sharing your progress!

For the first question:

I was wondering why genotypes were similar during epochs 1 to 10, while accuracy was increasing?

During the searching stage, DARTS optimizes alpha and the weights alternately. So the increase in accuracy should be seen as the joint effect of both alpha and the weights. That is to say, DARTS can still reach good performance even with minor changes in architecture. But please note that none of the accuracies you get during the search is necessarily representative of the retrained model. You can only evaluate your model after retraining.

For the second question:

Also, how should I interpret this normal learned cell? why it does have two input nodes for CNN?

As you may have noticed, there are two kinds of cells in DARTS: one is the normal cell, which keeps the input and output at exactly the same shape (i.e. channels, height, width); the other is the reduction cell, which is used to down-sample the input. Both kinds of cells have two inputs: one receives the output of the previous cell; the other is similar to a bypass and receives the output of the cell before the previous one. These inputs make the whole network similar to ResNet.

GL,

Aug 01 '19 01:08 Catosine

Hi @Catosine,

Thanks a lot for explanation. Regarding the second question, I still struggle to understand it.

What I understood is that, in the normal cell, the green box c_{k-1} is the output of the previous cell, and c_{k-2} is the output of the cell two layers back. Moreover, c_{k} is the output of the learned cell.

Also, for c_{k-1}, there is dil_conv_3x3 which connects to node 0, i.e. the feature map produced by dil_conv_3x3; c_{k-1} also has max_pool_3x3 which produces node 1, and max_pool_3x3 could produce node 3. Is that correct? if it's correct, why c_{k-1} could have 3 decision to

Also, each of these black lines, such as dil_conv_3x3, consists of a block of operations, which in this case is ReLU, conv, conv, batchnorm. Could you please correct me if I said something wrong?

Many thanks, Neda

Aug 01 '19 13:08 NdaAzr

Hi! @NdaAzr

I think you are correct about most things. But there is one more thing to note:

if it's correct, why c_{k-1} could have 3 decision to

Actually, DARTS does not limit the number of outgoing edges of each node or input. According to the code, each node picks exactly 2 inputs. But c_{k} is a special case: it takes the outputs of all intermediate nodes and concatenates them as the output of the cell.

GL,

Aug 02 '19 04:08 Catosine

Hi @Catosine,

Thank you for the reply. As you said, normal cell would keep the input and output have exactly the same shape. So, I'm wondering why I have max_pool and avg_pool in normal cell? Also, why I have sep_conv in reduction cell?

second question is, how for the first epoch, could have input c_{k-2}? I assumed the first epoch should have only one input which is an input image.

and last question is based on above learned cell, if c_{k-2} would connect to node 0 with sep_conv_5x5 and has 16 feature map, and also, c_{k-1} would connect to node 0 as well with dil_conv_3x3 and has 16 feature map, then, second stage, which node 0 would connect to node 1 with avg_pool_3x3, is node 1 includes 32 feature maps or 16 feature maps will go through avg_pool_3x3 and 16 feature maps will produce output which is c_{k}.

Thank you very much in advance

Aug 04 '19 17:08 NdaAzr

Hello, @NdaAzr !

Thanks again for your questions! Let's solve them one by one:)

So, I'm wondering why I have max_pool and avg_pool in normal cell? Also, why I have sep_conv in reduction cell?

Well, that's probably a question your algorithm can answer better than I can. In my experience, DARTS is very sensitive to the data you feed in, even to the train/valid split ratio you use. So pooling in the normal cell and sep_conv in the reduction cell are simply the best options found for your setup. One more thing to add: DARTS uses poolings with padding, so with stride 1 (as in a normal cell) they do not change the shape of your data. LOL
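
The shape argument can be checked with the usual output-size formula for a sliding window. A minimal sketch, assuming a 3x3 window with padding 1 and floor rounding; the function name is hypothetical.

```python
def pool_output_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a pooling window (floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


same = pool_output_size(32)              # 3x3 pooling, stride 1, padding 1
halved = pool_output_size(32, stride=2)  # same window with stride 2
```

With stride 1 a 32x32 feature map stays 32x32; only a stride of 2 halves it to 16x16.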

second question is, how for the first epoch, could have input c_{k-2}? I assumed the first epoch should have only one input which is an input image.

For this question, I would recommend checking the code, especially model_search.py. In that file, lines #73-#76 define a part named stem, which is used to process the input data. In the forward function (#103-#113), the output of stem is used as both c_{k-1} and c_{k-2}. You can check that at line #104.
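
The wiring described here can be traced with plain strings instead of tensors. This sketch assumes the forward function follows the pattern s0, s1 = s1, cell(s0, s1); the function and the cell names are hypothetical.

```python
def wire_cells(num_cells):
    """Return (cell, c_{k-2}, c_{k-1}) triples for a stack of cells."""
    s0 = s1 = "stem"  # the stem output feeds both inputs of the first cell
    wiring = []
    for k in range(num_cells):
        name = f"cell_{k}"
        wiring.append((name, s0, s1))
        s0, s1 = s1, name  # shift: the previous output becomes c_{k-2} next time
    return wiring


wiring = wire_cells(3)
```

The first cell receives the stem output twice, the second cell receives the stem output and cell_0, and from the third cell on both inputs come from earlier cells.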

and last question is based on above learned cell, if c_{k-2} would connect to node 0 with sep_conv_5x5 and has 16 feature map, and also, c_{k-1} would connect to node 0 as well with dil_conv_3x3 and has 16 feature map, then, second stage, which node 0 would connect to node 1 with avg_pool_3x3, is node 1 includes 32 feature maps or 16 feature maps will go through avg_pool_3x3 and 16 feature maps will produce output which is c_{k}.

Sorry, I can't quite follow you. But if you are asking what happens when two streams of data, i.e. one from c_{k-1} and the other from c_{k-2}, meet at a node, i.e. node 0, I can tell you that they are simply added together element-wise. This is defined at line #58 of model.py.
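
In terms of channel counts, element-wise addition keeps the number of feature maps, while the concatenation at the cell output adds them up. A small sketch of that arithmetic; the helper names are hypothetical, and the 16 channels and 4 concatenated nodes are taken from the question and from normal_concat=range(2, 6).

```python
def summed_node_channels(incoming_channels):
    """Element-wise addition needs equal channel counts and keeps that count."""
    if len(set(incoming_channels)) != 1:
        raise ValueError("element-wise addition needs matching channel counts")
    return incoming_channels[0]


def concat_output_channels(node_channels):
    """Concatenation along the channel axis sums the channel counts."""
    return sum(node_channels)


node0 = summed_node_channels([16, 16])               # two 16-channel streams
cell_out = concat_output_channels([16, 16, 16, 16])  # four intermediate nodes
```

So a node fed by two 16-channel edges still has 16 feature maps, not 32, and the cell output has 64.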

GL,

Aug 05 '19 02:08 Catosine

Hello, I was wondering what is the average time to get a good policy for creating classification models using Darts on a Google colab server (which has a 16 GB tesla T4 gpu) ?

Thank you :)

Aug 07 '19 10:08 karanchahal

Hi @karanchahal, I assume it depends very much on the dataset size. In my case it was about 10 hours on Colab for 600 images (training and validation).

Aug 07 '19 15:08 NdaAzr

Hi @Catosine, Thanks a lot for answering the question. I have few more questions, I really appreciate if you could help to clarify these.

Regarding DARTS paper figure 1.b, is this a cell or stack of cells? if it's cell why there are three branches per node? and at the end in figure1.d why just it kept one branch however, based on the code in the cell that DARTS produce each node should pick 2 inputs.
How many nodes per cell we should expect? is it always return 4 nodes except output node?
When we retrain the best cell, should we train it with the best cell learned found from first stage? or does it need to retrain from stack of the cells? for example if it found the genotype like below should I retrain it with the same genotype or stacked of genotypes? genotype = Genotype(normal=[('sep_conv_5x5', 0), ('max_pool_3x3', 1), ('dil_conv_3x3', 0), ('avg_pool_3x3', 1), ('dil_conv_3x3', 3), ('sep_conv_3x3', 0), ('dil_conv_5x5', 0), ('avg_pool_3x3', 3)], normal_concat=range(2, 6), reduce=[('skip_connect', 0), ('sep_conv_3x3', 1), ('skip_connect', 2), ('max_pool_3x3', 0), ('sep_conv_5x5', 0), ('sep_conv_5x5', 1), ('dil_conv_5x5', 1), ('skip_connect', 2)], reduce_concat=range(2, 6))
I can see in model_search.py line #63 stem_multiplier=3 which increase the number of initial layer from 16 to 48? Do you think is there any specific reason for that? and what step = 4 and multiplier=4 does in the code?
I am still not sure why DARTS return same genotype during all epochs for my dataset? It makes me worry that I have a mistake in the code! Is it happened in your case to get same genotype during all epochs? how first epoch with 13% acc could be same as epoch 10 with 30% accuracy? I can see that in this Github, they produce a gif animation that displays the cell is changing during all epochs. If I change the dataset size or number of dataset class would produce different genotype. However genotypes are similar during all epochs?

Many thanks, Neda

Aug 07 '19 16:08 NdaAzr

Hello again! @NdaAzr Let's look at the questions directly!

Regarding DARTS paper figure 1.b, is this a cell or stack of cells? if it's cell why there are three branches per node? and at the end in figure1.d why just it kept one branch however, based on the code in the cell that DARTS produce each node should pick 2 inputs.

It's definitely a cell, but it is NOT THE SAME as cells used to search. And there is one more thing to notice: DARTS does not keep two operations on each edge; it keeps two incoming edges for each node, with one operation per edge.

How many nodes per cell we should expect? is it always return 4 nodes except output node?

Both the normal and the reduction cell have 6 nodes. No. 0-1 are the input nodes, which are c_{k-2} and c_{k-1} as you mentioned in a previous reply. The rest, No. 2-5, are the nodes used during the search. The main difference is that input nodes CANNOT decide which previous cell to connect to, while the rest can decide which other nodes to connect to (keeping two edges per node) and in what way (operations).
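
The node numbering can be read directly off a printed genotype. The sketch below assumes that consecutive pairs in the list belong to nodes 2, 3, 4, 5 in order and that the second element of each pair is the index of the node the edge comes from; the helper name is hypothetical, and the cell is the normal cell of the genotype quoted above.

```python
def decode_cell(cell, num_inputs=2, edges_per_node=2):
    """Group (operation, source node) pairs by the node they feed."""
    nodes = {}
    for i in range(len(cell) // edges_per_node):
        start = i * edges_per_node
        nodes[num_inputs + i] = cell[start:start + edges_per_node]
    return nodes


normal = [('sep_conv_5x5', 0), ('max_pool_3x3', 1),
          ('dil_conv_3x3', 0), ('avg_pool_3x3', 1),
          ('dil_conv_3x3', 3), ('sep_conv_3x3', 0),
          ('dil_conv_5x5', 0), ('avg_pool_3x3', 3)]
nodes = decode_cell(normal)
```

For example, node 4 is fed by dil_conv_3x3 from node 3 and sep_conv_3x3 from input node 0.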

When we retrain the best cell, should we train it with the best cell learned found from first stage? or does it need to retrain from stack of the cells? for example if it found the genotype like below should I retrain it with the same genotype or stacked of genotypes? genotype = Genotype(normal=[('sep_conv_5x5', 0), ('max_pool_3x3', 1), ('dil_conv_3x3', 0), ('avg_pool_3x3', 1), ('dil_conv_3x3', 3), ('sep_conv_3x3', 0), ('dil_conv_5x5', 0), ('avg_pool_3x3', 3)], normal_concat=range(2, 6), reduce=[('skip_connect', 0), ('sep_conv_3x3', 1), ('skip_connect', 2), ('max_pool_3x3', 0), ('sep_conv_5x5', 0), ('sep_conv_5x5', 1), ('dil_conv_5x5', 1), ('skip_connect', 2)], reduce_concat=range(2, 6))

According to the paper, you should retrain a network built from a stack of the cells. To decide how many cells to use, you can check the layers parameter in train_imagenet.py. Also, you can check lines 184-195 of model.py to see how it decides which cells are reduction cells and which are normal cells.
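
A minimal sketch of how such a stack could be laid out, assuming the rule is that reduction cells sit at one third and two thirds of the depth; this is an assumption to verify against the model.py lines mentioned above, and the function names and example depths are only illustrative.

```python
def reduction_positions(layers):
    """Indices of the reduction cells, assuming the 1/3 and 2/3 rule."""
    return [layers // 3, 2 * layers // 3]


def cell_types(layers):
    """Cell type for every position in a stack of `layers` cells."""
    reductions = reduction_positions(layers)
    return ["reduction" if i in reductions else "normal" for i in range(layers)]


shallow = cell_types(8)
deep = cell_types(20)
```

Under this rule the number of reduction cells stays at two regardless of depth; only the number of normal cells grows.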

I can see in model_search.py line #63 stem_multiplier=3 which increase the number of initial layer from 16 to 48? Do you think is there any specific reason for that? and what step = 4 and multiplier=4 does in the code?

I've asked someone who works in the AutoML/NAS field. He told me that most of the hyperparameters used in DARTS come from past experience, i.e. stem_multiplier=3 is used simply because it was found to work well in previous CNNs applied to CIFAR/ImageNet. The same goes for the other hyperparameters you mentioned. He also gave me a suggestion for applying DARTS to other datasets: try to build a couple of baselines, and use those hyperparameters as your search space to explore new structures.

I am still not sure why DARTS return same genotype during all epochs for my dataset? It makes me worry that I have a mistake in the code! Is it happened in your case to get same genotype during all epochs? how first epoch with 13% acc could be same as epoch 10 with 30% accuracy? I can see that in this Github, they produce a gif animation that displays the cell is changing during all epochs. If I change the dataset size or number of dataset class would produce different genotype. However genotypes are similar during all epochs?

Don't worry! It is only used to keep track of how your structure changes. Also don't worry about the accuracy during the search: it tells you very little about the final model. I think you can stop searching once your structure no longer changes.
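
The stopping rule "stop once the structure no longer changes" could be sketched as follows. The helper is hypothetical and not part of DARTS; it assumes you keep one genotype per epoch in a list (for instance the string form of model.genotype()).

```python
def stable_epochs(history):
    """Number of most recent epochs whose genotype equals the latest one."""
    if not history:
        return 0
    last = history[-1]
    count = 0
    for genotype in reversed(history):
        if genotype != last:
            break
        count += 1
    return count


# Placeholder labels stand in for printed genotypes.
history = ["A", "A", "B", "B", "B"]
unchanged_for = stable_epochs(history)
```

A search loop could then stop when this count exceeds a patience value of your choosing.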

I have some last words: don't worry too much about the structure generated by DARTS. If you read some of the follow-up work, many found that DARTS is very efficient at generating a model, but the model is never guaranteed to be optimal. Remember the guy I mentioned above? He told me that this is an unsolved problem currently faced by NAS: no matter which approach you use, RL or gradient-based (GB), none of them can guarantee an optimal structure. So, practically, I would suggest that you 1) repeat your search a couple of times and pick the best model by retraining; 2) look at the follow-up works of DARTS, some of which are better than DARTS. E.g. I used PC-DARTS to search for a structure on a subset of celeb (1k people), and I got valid accuracy of 98.2

GL,

Aug 08 '19 01:08 Catosine

Thanks a lot for your explanation. All your explanations are really helpful. Greatly appreciated.

about question 1:

but it is NOT THE SAME as cells used to search.

Based on the code, there are 8 cells such as none, max_pool_3x3, avg_pool_3x3, skip_connect, etc., so does it mean each of these cells is similar to the cell in figure 1.b? Or is it a cell similar to figure 1.b with 8 edges?

What confuses me is that what is displayed in figure 1 is different from the best learned cell. For example, if this is the best learned cell, why are the edges labelled sep_conv_5x5 or dil_conv_5x5, each of which is a cell itself in the code?

try to build couple of baselines, and use those hyperparameter as your search space to explore new structures.

I am not sure if I understand it correctly. Do you mean trying established models i.e. AlexNet, ResnNet18, and see what kind of hyper-parameters gives the best accuracy, and then apply those hyper-parameter to the DARTS?

About retraining: shall I use train_imagenet.py or train.py for my custom dataset?

I'll investigate the PC-DARTS as well.

Thanks again,

Aug 08 '19 15:08 NdaAzr

@NdaAzr Hi there!

Based on the code, there are 8 cells such as none, max_pool_3x3, avg_pool_3x3, skip_connect, etc.

Okay, I see. I think we have to make sure we are talking about the same things. For me:
operations: conv, poolings, skip-connect, and so on;
layers, cells: a directed acyclic graph as you showed to me;
search space: the cell structure before any search;
edges: the arrows in the cell graph, each of which stands for one operation or a mixture of operations;
nodes: where one or multiple edges meet, i.e. 0 in your pic;
network: a stack of layers/cells.

So, it is not 8 cells but 8 operations on each edge to be searched.
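
To make "8 operations on each edge" concrete, here is a small sketch that counts the edges of a search cell and picks the strongest operation on one edge from its architecture weights. It assumes 4 intermediate nodes, 2 input nodes, and that the final operation is the highest-weighted one other than none; the operation list is written from memory of the DARTS search space, and the alpha values are invented.

```python
import math

# Assumed list of the 8 candidate operations on every edge.
PRIMITIVES = ["none", "max_pool_3x3", "avg_pool_3x3", "skip_connect",
              "sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3", "dil_conv_5x5"]


def count_edges(steps=4, num_inputs=2):
    """Each intermediate node has an edge from every earlier node."""
    return sum(num_inputs + i for i in range(steps))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def strongest_op(alpha_row):
    """Operation with the largest softmax weight, ignoring 'none'."""
    weights = softmax(alpha_row)
    candidates = [i for i, name in enumerate(PRIMITIVES) if name != "none"]
    return PRIMITIVES[max(candidates, key=lambda i: weights[i])]


num_edges = count_edges()
early = strongest_op([0.0, 0.1, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0])
late = strongest_op([0.0, 0.2, 0.0, 0.0, 1.5, 0.0, 0.1, 0.0])
```

Both invented rows select sep_conv_3x3 even though the weights differ a lot, which is one way a printed genotype can stay the same while alpha keeps changing.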

I am not sure if I understand it correctly. Do you mean trying established models i.e. AlexNet, ResnNet18, and see what kind of hyper-parameters gives the best accuracy, and then apply those hyper-parameter to the DARTS?

Yes. I am encouraging you to use those classic networks. To be more specific, I would recommend ResNet, since DARTS is more or less similar to the Res-structure. If DARTS beats your baseline, then you can conclude that the searched structure is valid for follow-up work, i.e. fine-tuning.

GL,

Aug 09 '19 06:08 Catosine

Hello, @Catosine, I am also training the code on my ultrasound dataset. The (32,32) input is too small for me, so I set it to (224,224). When I train with the ImageNet code, the model is just 32MB. Can you give me some advice on increasing the model size as well as the performance? E.g. add more cells? I noticed that there are just two reduce cells. Maybe I need to add more?

Aug 23 '19 12:08 ray-lee-94

Hello ! @VCBE123

I'm glad to hear from you! In my experience, you should not change the search space before running comprehensive experiments. That is to say, you should first try searching a couple of models with different hyperparameters, i.e. the learning rate for the architecture (but not the number of cells, the kinds of operations, and so on). After that, you can tell the upper bound of the performance of the search space, and then start to modify the search space based on your experience.

As for the size, I think it does not matter as long as the model beats the best human-designed model you have. In my experience, my searched model beat ResNet-18 with 1/3 of the size on a 1k-people face ID task.

If you have already done all of the above, I would suggest simply increasing the number of cells for retraining. But there is no guarantee that the search result for a small network (i.e. the 8 cells used by DARTS during search) is the optimal structure for a large network (i.e. the 20 cells used by DARTS for retraining). There are some follow-up works on DARTS, such as PDARTS and PCDARTS, whose authors put their code on GitHub. According to the authors, these works are very good and beat DARTS. They address the gap between searching and retraining, so I think they may be helpful for you. And since they are built on DARTS, they won't be very difficult to understand, especially when you have some experience with DARTS.

GL,

Aug 23 '19 13:08 Catosine

Hi @Catosine

Thank you for all your explanation so far. I have some questions as follows:

1. I was worried about why model.genotype() in train_search.py is not evolving and stays the same during all epochs for my ultrasound dataset, so I decided to try it on CIFAR10 and see whether the model evolves, as in the gif animation on GitHub. So, I didn't change the hyperparameters and ran DARTS on CIFAR10 with only 300 indices for training and 300 indices for validation, with a batch size of 1, as I get a GPU memory error if I try a batch size of more than 1. I could see that during 25 epochs it's not evolving at all, and model.genotype() stays the same while acc is increasing. What do you think? Could you please give me some advice about it?
2. I assume the best-learned cell is what model.genotype() prints in train_search.py. Is that correct? The log for 25 epochs is attached here.
3. What kind of GPU did you use to run DARTS on CIFAR10? And at which epoch did the cell start to evolve?
4. Why are all scripts saved in different folders during the search? Shouldn't it save only the logs?

Many thanks, Neda

Aug 27 '19 10:08 NdaAzr

@NdaAzr Hi there!

1. Would you mind showing me the genotypes? That would be great! I have no idea about your case right now. It might be that the algorithm thinks the initial structure is the best one. On the other hand, it is possible that the search space simply does not contain a good model for your task. Also please check the data. Make sure there is no problem with the data loading process.
2. Sorry, I'm currently unable to access your Dropbox link. Don't worry, I will be able to access it in a couple of days.
3. I use one Nvidia V100 to search for model and 8 of them to do retrain. But I think this does not matter.
4. That is to take a "screenshot" of your code in case you change it later. It makes the run easier for others to reproduce.

GL, PF

Aug 28 '19 14:08 Catosine

@Catosine Thank you for your advice and explanation. PDARTS fits well on my dataset. I have a question about the update of the arch_parameters when unrolled=False.

def _backward_step(self, input_valid, target_valid):
    # First-order path (unrolled=False): backpropagate the validation loss directly.
    loss = self.model._loss(input_valid, target_valid)
    loss.backward()

Aug 29 '19 07:08 ray-lee-94

Hi @Catosine

Thank you for your reply.

Would you mind show me the genotypes?

Here is the genotype found during 25 epochs for the CIFAR10 dataset (i.e. model.genotype() was the same for all 25 epochs).

Genotype(normal=[('dil_conv_5x5', 0), ('sep_conv_5x5', 1), ('dil_conv_5x5', 0), ('avg_pool_3x3', 2), ('max_pool_3x3', 2), ('sep_conv_3x3', 3), ('sep_conv_3x3', 1), ('avg_pool_3x3', 0)], normal_concat=range(2, 6), reduce=[('max_pool_3x3', 1), ('skip_connect', 0), ('sep_conv_3x3', 0), ('dil_conv_5x5', 2), ('dil_conv_5x5', 2), ('dil_conv_5x5', 3), ('sep_conv_5x5', 3), ('max_pool_3x3', 1)], reduce_concat=range(2, 6))

3. I use one Nvidia V100 to search for model and 8 of them to do retrain. But I think this does not matter.

what was your dataset size and image size as well?

Also, if I set --unrolled = True, I will get this error RuntimeError: One of the differentiated Tensors does not require grad , could you please let me know how did you solve this error?

full error is here:

  File "G:\NAS\DARTS\DARTS_echoview_classification\cnn_cifar10\train_search.py", line 140, in train
    architect.step(t_image, target, input_search, target_search, lr, optimizer, unrolled=args.unrolled)
  File "G:\NAS\DARTS\DARTS_echoview_classification\cnn_cifar10\architect.py", line 34, in step
    self._backward_step_unrolled(input_train, target_train, input_valid, target_valid, eta, network_optimizer)
  File "G:\NAS\DARTS\DARTS_echoview_classification\cnn_cifar10\architect.py", line 50, in _backward_step_unrolled
    implicit_grads = self._hessian_vector_product(vector, input_train, target_train)
  File "G:\NAS\DARTS\DARTS_echoview_classification\cnn_cifar10\architect.py", line 82, in _hessian_vector_product
    grads_p = torch.autograd.grad(loss, self.model.arch_parameters())
  File "C:\Users\User\Anaconda3\lib\site-packages\torch\autograd\__init__.py", line 149, in grad
    inputs, allow_unused)
RuntimeError: One of the differentiated Tensors does not require grad

Many thanks, Neda

Aug 29 '19 12:08 NdaAzr

@VCBE123 Hi there! Sorry for reply late.

Here is the thing: DARTS proposes two ways to approximate the gradient of the architecture parameters, aka "alpha". The first way (called the "first order way" in the paper) computes the gradients of the weights and the alphas independently, that is, it treats the weights and the alphas as independent variables. However, that is not really true: the weights depend on alpha, because alpha decides which kinds of operations are selected. So the authors argue that it is more accurate to let the gradient of alpha account for how the weights react to alpha, which requires second-order derivatives. That is what the unrolled path of the Architect class does (the "second order way"). However, computing the second-order derivatives exactly is very time-consuming, so DARTS approximates them with a finite difference. In general, according to the paper, the second order way can find models with higher accuracy than the first order way, at the cost of a longer search time.

And as you pointed out, the parameter unrolled decides which of the two ways is used to solve the optimization problem. If it is False, the first order way is used.
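
For reference, the two ways can be written side by side. This is a restatement of the approximate architecture gradient as it is usually given for DARTS (Section 2.3 of the paper is the authoritative version); \xi is the learning rate of the virtual weight step and \epsilon a small scalar:

```latex
\nabla_{\alpha} \mathcal{L}_{val}(w', \alpha)
  - \xi \, \nabla^{2}_{\alpha, w} \mathcal{L}_{train}(w, \alpha) \, \nabla_{w'} \mathcal{L}_{val}(w', \alpha),
\qquad w' = w - \xi \, \nabla_{w} \mathcal{L}_{train}(w, \alpha)

\nabla^{2}_{\alpha, w} \mathcal{L}_{train}(w, \alpha) \, \nabla_{w'} \mathcal{L}_{val}(w', \alpha)
  \approx \frac{\nabla_{\alpha} \mathcal{L}_{train}(w^{+}, \alpha) - \nabla_{\alpha} \mathcal{L}_{train}(w^{-}, \alpha)}{2 \epsilon},
\qquad w^{\pm} = w \pm \epsilon \, \nabla_{w'} \mathcal{L}_{val}(w', \alpha)
```

Setting \xi = 0 drops the second term and leaves the first order way, which corresponds to unrolled=False.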

All of the above is based on my limited understanding. Some details may be incorrect; sorry about that. You may check Section 2.3 of the DARTS paper.

Feel free to communicate with me!

GL, PF

Aug 31 '19 09:08 Catosine

@NdaAzr Hi! Sorry for replying late.

Your genotype looks reasonable to me. I have no idea whether it is good or bad; the only way to tell is to retrain. But please be aware that search results vary greatly. Please repeat the search with the same settings and compare all the models.

For the second question, I used a cleaned version of the Microsoft 1M celeb data (faces centered, 112x112, three channels). However, I did this with PDARTS/PCDARTS, and my best result is from PDARTS, with a search batch size of 32 and a retrain batch size of 128, using SGD with a learning rate of 0.1. I'm not sure if this works on your data. But if you still cannot reproduce the result on CIFAR-10, I would suggest double-checking the code. Make sure you understand it in detail and that there is no bug.

For your last question, I cannot tell what exactly is wrong. Please check all variables, (weights, alphas, input data, input label) and make sure they all have requires_grad = True.
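
Judging from the traceback, the failing call is torch.autograd.grad(loss, self.model.arch_parameters()), so the tensors worth checking first are the ones returned by arch_parameters(). A minimal sketch of such a check follows; the helper is hypothetical, it only reads the requires_grad attribute, and the stand-in objects and their names are illustrative so that the snippet runs without PyTorch.

```python
from types import SimpleNamespace


def find_frozen(named_tensors):
    """Return the names of tensors whose requires_grad flag is not True."""
    return [name for name, tensor in named_tensors
            if not getattr(tensor, "requires_grad", False)]


# Stand-ins for real tensors; only the requires_grad attribute matters here.
alphas_normal = SimpleNamespace(requires_grad=True)
alphas_reduce = SimpleNamespace(requires_grad=False)

frozen = find_frozen([("alphas_normal", alphas_normal),
                      ("alphas_reduce", alphas_reduce)])
```

In a real run you would pass the tensors from model.arch_parameters() together with labels of your choosing; any name that comes back points at a tensor that would trigger this kind of error.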

GL,

Aug 31 '19 09:08 Catosine

Hi @Catosine Thanks a lot for the explanation. model.genotype() is evolving now for my dataset when --unrolled = true; otherwise it gives me the same genotypes. How can I explain that? What could be the reason? At the moment it is running on my dataset. Hopefully, it can perform better than ResNet1 or Vgg16.

make sure they all have requires_grad = True.

Should input and label both have requires_grad = True? Then, do you know why on GitHub they set requires_grad = False?

Many thanks, Neda

Sep 03 '19 14:09 NdaAzr

@NdaAzr Hi!

Thanks a lot for the explanation. genotype.model() is evolving now for my dataset when --unrolled = true otherwise will give me the same genotypes. How can I justify that? what could be the reason?

That sounds weird. Please double check the code. There might be some bugs.

Is input and label both should have requires_grad = True?

I think that is only for input.

GL, PF

Sep 03 '19 15:09 Catosine

Hi @Catosine ,

Could you please give a brief explanation of what unrolled does here? should it be always False?

Sep 03 '19 15:09 NdaAzr

Hi @NdaAzr ,

Looking more deeply into the code, I realized that unrolled=False means the architecture parameters are optimized with the first-order approximation. This is already mentioned in the paper under First-order Optimization.

Oct 31 '19 14:10 giangtranml

Hi @NdaAzr, Have you solved the problem 3, "RuntimeError: One of the differentiated Tensors does not require grad"?

Dec 09 '19 21:12 PipiZong

@PipiZong Hello, thanks for your reply.

The model looks fine to me. Did you try to retrain it on ImageNet? How is it going? I cannot remember the exact dataset size, as I did this a couple of months ago on a private dataset.

GL,

Dec 11 '19 00:12 Catosine

Hellow, @Catosine , I also train the code on my ultrasound dataset. My question is that the input of (32,32) is too small for me. So i set to (224,224). Then i train with the imageNet code, the model is just 32MB. So can you give my some advice to increase the model as well as the performance? E.g add more cells? I noticed that there are just two reduce cells. Maybe i need to add more?

hello,

Did you successfully run DARTS with images of size (224,224)? Could you please tell me your running environment, like the PyTorch version or GPU memory?

I ran out of GPU memory with a single TITAN X, whose memory is 24GB. I tried wrapping the validation part in torch.no_grad() as mentioned in another issue, but it does not help. I also decreased the batch_size to 1, but I still get the error.

Jan 18 '21 01:01 rrryan2016
