internetarchive-downloader

Add suppression of duplicate files

Open rdavis59 opened this issue 1 year ago • 0 comments

I had a problem on some downloads where it would grab multiple copies of the same file in different formats. I wrote the following and added it so that only one file is downloaded for each base filename. The selection is hierarchical: extensions are checked in the order of the list you pass in, and the first extension that matches a filename excludes every other format of that file. You can add this code to yours and modify it to suit your formats and preferences. I just find it useful not to have five or six copies of the same file. You could also add a switch to display the file usage as a preview, or even to display each file to be downloaded.
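As a rough illustration of the priority rule described above (this is a simplified sketch, not the code that was added to the downloader), the idea can be shown on a list of plain `(filename, size)` tuples. The function name `pick_one_per_name` and the sample data are made up for the example.

```python
def pick_one_per_name(files, exts):
    """Keep one file per base name; earlier entries in exts win.

    files - list of (filename, size) tuples (hypothetical layout for this sketch)
    exts  - extensions in priority order, e.g. ['.pdf', '.txt']
    """
    seen = set()      # base names that already have a chosen file
    chosen = []
    for ext in exts:  # highest-priority extension first
        for name, size in files:
            if not name.endswith(ext):
                continue                  # not this extension
            stem = name[:-len(ext)]       # filename without the extension
            if stem in seen:
                continue                  # a higher-priority format was already picked
            seen.add(stem)
            chosen.append((name, size))
    return chosen


# Made-up sample data: three formats of "book", one "notes", one unlisted type.
files = [
    ("book.pdf", 500),
    ("book.epub", 200),
    ("book.txt", 50),
    ("notes.txt", 10),
    ("cover.jpg", 80),
]
print(pick_one_per_name(files, [".pdf", ".txt", ".epub"]))
```

With `.pdf` listed first, `book.pdf` is kept and the other `book` formats are skipped; `notes.txt` is kept because no higher-priority format of it exists; `cover.jpg` is dropped because its extension is not in the list.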

def add_file_names(download_queue, filenames, allf, ext, sizes):
    """ This code added by Rusty Davis

    add_file_names  - takes a list <download_queue>, a list of previously found filenames <filenames>
                      and a file extension <ext>. Checks each entry in <download_queue> to see whether
                      its filename ends with <ext> and whether its base name is already in the list,
                      adding those that are not. Returns the list of all currently known base names
                      and the list of full filenames selected so far.

    download_queue  - list of file details including the filename
    filenames       - list of base filenames (extension removed) seen so far
    allf            - list of full filenames (with extension) selected so far
    ext             - extension to search for
    sizes           - contains the total size for each extension

    flist           - list of filenames with <ext> NOTE: this is not used anymore

    returns         - list of non duplicate base names, the list of selected filenames with
                      extensions, the number of files added (count) and the updated <sizes>
    """
    extLen = len(ext)
    count = 0
    sizes[ext] = 0
    for i in download_queue:                    # for each element
        name = i[1]                             # get the file name
        # endswith, not "in": '.doc' must not match 'report.docx'
        if not name.endswith(ext): continue     # skip it if it is not this file type
        fname = name[:-extLen]                  # remove the extension from the file name
        if fname in filenames: continue         # skip names already listed
        count += 1
        filenames.append(fname)                 # if not already in the list, add it
        allf.append(name)                       # add full filename including extension
        sizes[ext] += i[2]                      # add this file's size to the total for <ext>

    return filenames, allf, count, sizes

def removeIADups(download_queue, exts):
    """ This code added by Rusty Davis

    removeIADups - removes any dups that exist in the download queue that match the extensions;
    checks from first <exts> to end in order

    <download_queue> is a list in the form

    [element,element,...]
    element = (IA_search,filename,size,checksum,xx,save directory,...)

    <exts> is a list of valid file extensions to allow.  Each extension takes priority based on
    order;  First being highest.  Files whose extension is not in <exts> are dropped from the queue.

    """

    allf = []                                   # a list of all filenames we are to process
    filenames = []                              # a list of all non duplicate base filenames
    dupcount = []                               # number of files kept for each extension
    sizes = {}                                  # total size of the kept files for each extension
    for i in exts:                              # highest priority extension first
        filenames, allf, count, sizes = add_file_names(download_queue, filenames, allf, i, sizes)
        dupcount.append(count)

    new_queue = []                              # the filtered queue we are building
    quefilename = 1                             # index of the filename in a queue element
    quefilesize = 2                             # index of the file size in a queue element
    fileTotal = 0
    for ext, count in zip(exts, dupcount):      # per-extension summary: count and total size
        print(ext, count, '\t', f'{sizes[ext]:,}')
    for i in download_queue:                    # for each file to download
        name = i[quefilename]                   # get the file name
        if name in allf:                        # if this is one of the selected
            fileTotal += i[quefilesize]         # add its size to the overall total
            new_queue.append(i)                 # add to the new queue we are building

    print("New", len(new_queue), '\t', f'{fileTotal:,}')

    input("EXT LIST")                           # pause so the summary can be read; press Enter to continue
    return new_queue
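The function above prints its summary unconditionally and then waits at the `input` prompt. One possible way to turn that into the optional preview mentioned earlier is to move the counting into a helper and only print when a flag is set. This is only a sketch: `summarize_queue`, `show_preview` and the `preview` flag are hypothetical names, not part of internetarchive-downloader, and the sample queue is made up to follow the element layout from the docstring (filename at index 1, size at index 2).

```python
def summarize_queue(download_queue, exts):
    """Return {ext: (file_count, total_size)} for queue entries whose filename ends with ext."""
    summary = {}
    for ext in exts:
        matches = [item for item in download_queue if item[1].endswith(ext)]
        summary[ext] = (len(matches), sum(item[2] for item in matches))
    return summary


def show_preview(download_queue, exts, preview=False):
    """Print the per-extension summary only when the (hypothetical) preview flag is set."""
    if not preview:
        return
    for ext, (count, size) in summarize_queue(download_queue, exts).items():
        print(ext, count, '\t', f'{size:,}')


# Made-up queue elements: (IA_search, filename, size, checksum)
queue = [
    ("some_item", "book.pdf", 500, "md5-a"),
    ("some_item", "book.txt", 50, "md5-b"),
    ("some_item", "notes.txt", 10, "md5-c"),
]
show_preview(queue, [".pdf", ".txt", ".doc"], preview=True)
```

Unlike the summary printed by `removeIADups`, this counts every file with a given extension before any duplicates are removed, so it shows what is available rather than what will be kept.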

The following is inserted right above the file filters section of the download routine:

exts = ['.doc','.docx','.pdf','.txt','.rtf','.djvu','.epub','.mobi','.cbz']
download_queue = removeIADups(download_queue,exts)

if file_filters is not None:

rdavis59 Jun 29 '23 15:06