
Need help developing a Word2Vec explanation in TensorFlow

Open prakritidev opened this issue 7 years ago • 3 comments

I'm not able to understand the generate_batch function in word2vec. If someone knows how this function works, please let me know or explain that function in the script.

Thanks

prakritidev avatar Jul 06 '17 15:07 prakritidev

I will try to get some free time to explain this for you, but right now I am very busy.

mamonraab avatar Jul 09 '17 10:07 mamonraab

@prakritidev : Try this piece of code. If you understand skip-gram, this can be understood. I feel the examples in TensorFlow are badly documented. I added comments to connect it with the skip-gram process. Hope it helps.
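Before the annotated function, here is a minimal self-contained sketch (the toy sentence is made up for illustration, not from the TensorFlow script) of the (center, context) pairs that skip-gram extracts from a window:

```python
# Toy illustration of skip-gram: list every (center, context) pair
# produced by sliding a window of one word per side over a tiny corpus.
words = ["the", "quick", "brown", "fox", "jumps"]
skip_window = 1  # look one word to each side of the center

pairs = []
for i, center in enumerate(words):
    lo = max(0, i - skip_window)
    hi = min(len(words), i + skip_window + 1)
    for j in range(lo, hi):
        if j != i:  # the center word is never its own context
            pairs.append((center, words[j]))

print(pairs)
# "quick" yields two pairs: ("quick", "the") and ("quick", "brown")
```

generate_batch below does the same thing, except that it picks the context words from each window at random and packs the pairs into fixed-size numpy arrays.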

import collections
import random

import numpy as np

# `data` (the corpus as a list of word ids) is built earlier in the script

num_skips = 2    # number of (input, output) pairs to draw from each window
skip_window = 1  # how many words we look at on each side of the center word
batch_size = 16

data_index = 0   # global circular counter over the data

def generate_batch(batch_size, num_skips, skip_window):
    # skip window is the amount of words we're looking at from each side of a given word
    # creates a single batch
    
    global data_index

    # num_skips = number of context words picked per window,
    # so the batch must hold a whole number of windows
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window  # cannot pick more context words than the window holds

    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    
    # e.g if skip_window = 2 then span = 5
    # span is the length of the whole frame we are considering for a single word (left + word + right)
    # skip_window is the length of one side
    
    span = 2 * skip_window + 1 # [ span defines the whole window, which is 2 * skip_window + the word itself ]
    
    # deque with maxlen=span: appending to a full deque drops the oldest element
    buffer = collections.deque(maxlen=span)
    
    #print "span = %d" %span
    
    #get words starting from index 0 to span
    for _ in range(span):
        #print "_ = %d" %_
        #print "data_index = %d" %data_index
        buffer.append(data[data_index]) # fill the buffer with elements in window
        data_index = (data_index + 1) % len(data)  #this is just to circle at the end of text corpus


    # num_skips = number of (input, output) pairs generated from a single window
    # [skip_window words | center | skip_window words], so it limits how many
    # context words we use as output words per center word.
    # e.g. batch_size = 8 with num_skips = 2 -> 4 windows; with num_skips = 1 -> 8 windows
    #
    # We iterate: for each window i
    #                 for each pick j in that window
    #                     store the pair at index i * num_skips + j in the batch
    #
    # Each window contributes num_skips pairs, so to fill a batch we need
    # batch_size // num_skips windows.
    num_of_windows = batch_size // num_skips 
    
    for i in range(num_of_windows):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [ skip_window ] # we only need to know the words around a given word, not the word itself
        
        for j in range(num_skips):
            while target in targets_to_avoid:
                # keep sampling until we find a context position that is not
                # the center word and has not been used yet
                target = random.randint(0, span - 1)
                
            # add selected target to avoid_list for next time
            targets_to_avoid.append(target)
            
            # e.g. i=0, j=0 => 0; i=0,j=1 => 1; i=1,j=0 => 2
            batch[i * num_skips + j] = buffer[skip_window] # [skip_window] => middle element
            labels[i * num_skips + j, 0] = buffer[target]
            
        # slide the window one element to the right by appending the next word
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
        
    return batch, labels
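For a concrete check, here is a hedged, self-contained run on a toy corpus of word ids (the corpus and sizes are made up; the function body is a condensed copy of the one above so the snippet runs standalone). With skip_window = 1 and num_skips = 2, each center word appears twice in the batch, and its two labels are its left and right neighbours in some random order:

```python
import collections
import random

import numpy as np

data = list(range(20))  # toy corpus of word ids, made up for this demo
data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size,), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels

batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
print(batch.tolist())  # centers: [1, 1, 2, 2, 3, 3, 4, 4]
for k in range(0, 8, 2):
    # the two labels for each center are its left and right neighbours
    print(batch[k], sorted(labels[k:k + 2, 0].tolist()))
```

Note that the centers are deterministic (the word ids 1 through 4, each repeated num_skips times); only the order of the two labels per center is random.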
  

Do let me know if you face any issues.

anujgupta82 avatar Jul 09 '17 18:07 anujgupta82

I'm still confused, but it's much clearer now.

TorosFanny avatar Nov 15 '17 09:11 TorosFanny