internetarchive-downloader
Add suppression of duplicate files
I had a problem on some downloads where it would grab multiple copies of the same file in different formats. I wrote the following code and added it so that only one copy is downloaded for each filename. It is hierarchical: the order of the extension list you pass sets the priority, so the first extension that matches a given filename wins and the copies in later extensions are skipped. Files whose extension is not in the list are dropped from the queue. You can add this code to yours and modify it to suit your formats and preferences. I just find it useful not to have five or six copies of the same file. You could also add a switch to display the file usage as a preview, or even to display each file to be downloaded.
def add_file_names(download_queue, filenames, allf, ext, sizes):
    """ This code added by Rusty Davis
    add_file_names - takes a list <download_queue>, a list of previously found
    filenames <filenames> and a file extension <ext>. For each queue entry whose
    filename ends with <ext>, adds it unless the same base name is already listed.
    download_queue - list of file details including the filename
    filenames - list of base filenames (extension removed) found so far
    allf - list of full filenames (with extension) selected so far
    ext - extension to search for
    sizes - dict containing the total size for each extension
    returns - (filenames, allf, count, sizes);
    count is the number of files added for this <ext>
    """
    extLen = len(ext)
    count = 0
    sizes[ext] = 0
    for i in download_queue:            # for each element
        name = i[1]                     # get the file name
        # Match on the suffix only: a plain "ext in name" test would also
        # treat 'report.docx' as a '.doc' file.
        if not name.endswith(ext):
            continue                    # skip files of a different type
        fname = name[:-extLen]          # remove the extension from the file name
        if fname in filenames:
            continue                    # skip names already listed
        count += 1
        filenames.append(fname)         # if not already in the list, add it
        allf.append(name)               # add full filename including extension
        sizes[ext] += i[2]              # running size total for this extension
    return filenames, allf, count, sizes
def removeIADups(download_queue, exts):
    """ This code added by Rusty Davis
    removeIADups - removes any dups that exist in the download queue that match the extensions;
    checks from first <exts> to end in order
    <download_queue> is a list in the form
        [element,element,...]
        element = (IA_search,filename,size,checksum,xx,save directory,...)
    <exts> is a list of valid file extensions to allow. Each extension takes priority based on
    order; first being highest
    """
    allf = []        # full filenames (with extension) selected for download
    filenames = []   # base filenames (no extension) already claimed
    dupcount = []    # number of files kept for each extension
    sizes = {}       # total size for each extension
    for i in exts:   # highest-priority extension first
        filenames, allf, count, sizes = add_file_names(download_queue, filenames, allf, i, sizes)
        dupcount.append(count)
    new_queue = []
    quefilename = 1  # index of the filename within a queue element
    quefilesize = 2  # index of the file size within a queue element
    fileTotal = 0
    # Summary: extension, number of files kept, total size
    for ext, count in zip(exts, dupcount):
        print(ext, count, '\t', f'{sizes[ext]:,}')
    for i in download_queue:             # for each file to download
        name = i[quefilename]            # get the file name
        if name in allf:                 # if this is one of the selected
            fileTotal += i[quefilesize]  # add its size to the grand total
            new_queue.append(i)          # add to the new queue we are building
    print("New", len(new_queue), '\t', f'{fileTotal:,}')
    input("EXT LIST")                    # pause so the summary can be read; press Enter to continue
    return new_queue
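To make the priority rule concrete, here is a compact, self-contained sketch of the same idea run on a made-up queue. The function name `pick_preferred` and the sample entries are mine, for illustration only; it assumes, as the code above does, that each queue element has the filename at index 1. It is not a replacement for the functions above, just a quick way to see which files survive.

```python
def pick_preferred(queue, exts):
    """Keep one entry per base filename, preferring earlier extensions in exts."""
    seen = set()   # base names (extension removed) already claimed
    keep = set()   # full filenames selected for download
    for ext in exts:                  # earlier extensions claim base names first
        for entry in queue:
            name = entry[1]           # filename is assumed to be at index 1
            if not name.endswith(ext):
                continue
            base = name[:-len(ext)]
            if base in seen:
                continue              # a higher-priority format already won
            seen.add(base)
            keep.add(name)
    # Preserve the original queue order in the result
    return [entry for entry in queue if entry[1] in keep]


# Hypothetical sample data: (identifier, filename, size)
queue = [
    ("item1", "book.epub", 300),
    ("item1", "book.pdf", 500),
    ("item1", "notes.txt", 10),
    ("item1", "book.txt", 40),
    ("item1", "cover.jpg", 80),
]
result = pick_preferred(queue, [".pdf", ".txt", ".epub"])
print([entry[1] for entry in result])  # ['book.pdf', 'notes.txt']
```

Here `book.pdf` wins over `book.txt` and `book.epub` because `.pdf` comes first in the list, `notes.txt` is kept because no higher-priority copy exists, and `cover.jpg` is dropped because `.jpg` is not in the list at all.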
The following is inserted right above the file filters section of the download routine:
exts = ['.doc','.docx','.pdf','.txt','.rtf','.djvu','.epub','.mobi','.cbz']
download_queue = removeIADups(download_queue,exts)
if file_filters is not None: