Blended Technologies » Blog Archive » Removing Duplicate MP3’s with Python - a Naive Yet Fuzzy Approach

Removing Duplicate MP3’s with Python - a Naive Yet Fuzzy Approach

Update: I just discovered this little gem of a program that does everything I do here and more, and has a nice GUI.

Here’s a short Python script I put together to find potentially duplicate MP3 files and move them to a separate folder for further inspection. Go ahead and give it a try: paste the code below into a new text file, save it with a .py extension, and run it. (You’ll need Python installed, of course.)

# A program to find duplicate MP3 files by comparing filenames (Python 3)
import difflib
import os
import shutil
#-SETTINGS--------------------------------------------------------------------
SearchFolder = r'C:\My Music'  # change this to your own music folder
#-----------------------------------------------------------------------------

DuplicatesFolder = os.path.join(SearchFolder, 'Suspected Duplicates')
# Create the destination folder; it is fine if it already exists
os.makedirs(DuplicatesFolder, exist_ok=True)

# Format and filter names: group the real filenames by a normalized key
# (extension dropped, whitespace trimmed, lowercased) so that files are
# later moved under their original names
NamesByKey = {}
for AFileName in os.listdir(SearchFolder):
    base, ext = os.path.splitext(AFileName)
    if ext.lower() == '.mp3':
        NamesByKey.setdefault(base.strip().lower(), []).append(AFileName)

# Compare every name to every other, yes this is O(some big thing).
# Work through a separate list so nothing is removed from a list while
# it is being iterated over.
Remaining = list(NamesByKey)
Examined = 0
while Remaining:
    if Examined % 50 == 0:
        print('examining file %d, %d left' % (Examined, len(Remaining)))
    Examined += 1
    Key = Remaining.pop(0)
    CloseMatches = difflib.get_close_matches(word=Key,
        possibilities=Remaining, n=20, cutoff=0.8)
    for Match in CloseMatches:
        Remaining.remove(Match)
    # Collect the real filenames behind this name and its close matches
    Group = [name for k in [Key] + CloseMatches for name in NamesByKey[k]]
    if len(Group) > 1:
        # Move these files to the new folder
        for filename in Group:
            orig_path = os.path.join(SearchFolder, filename)
            new_path = os.path.join(DuplicatesFolder, filename)
            try:
                shutil.move(orig_path, new_path)
            except OSError as why:
                print('couldna move', orig_path, why)
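
To get a feel for what the cutoff=0.8 passed to difflib.get_close_matches means, here is a small sketch you can run on its own. The filenames are made up for illustration, and the similarity helper is not part of the script above; it just exposes the ratio that difflib computes internally.

```python
import difflib

# Hypothetical filenames, already lowercased and stripped of ".mp3"
names = [
    "beatles - hey jude",
    "beatles - hey jude (1)",
    "the beatles - hey jude",
    "queen - bohemian rhapsody",
]

def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0, as used by get_close_matches."""
    return difflib.SequenceMatcher(None, a, b).ratio()

target = names[0]
others = names[1:]

# Print the score of every candidate against the target
for other in others:
    print("%.2f  %s" % (similarity(target, other), other))

# Only candidates scoring at least 0.8 are returned as close matches
matches = difflib.get_close_matches(target, others, n=20, cutoff=0.8)
print(matches)
```

Raising the cutoff makes the script pickier; lowering it catches more variants but also flags more unrelated songs.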

It is naive because it only looks at similarities in file names. I find this approach good enough, but you could always add code to compare file sizes, MD5 sums, or even ID3 tags. It is fuzzy because it doesn’t require an exact match between filenames: any two names that difflib scores with a similarity ratio of at least 0.8 (the cutoff setting) are treated as suspected duplicates. Most of my duplicates have similar yet slightly different names, and this catches a lot of them.
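
If you want to try the file size and MD5 idea, a minimal sketch might look like the following. The function names file_md5 and same_content are my own invention for illustration and are not part of the script. Keep in mind that this only confirms byte-for-byte identical copies: two rips of the same song, or the same file with different ID3 tags, will have different sums.

```python
import hashlib
import os

def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as handle:
        # Read up to chunk_size bytes at a time until the file is exhausted
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def same_content(path_a, path_b):
    """True if both files have the same size and the same MD5 sum."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False  # cheap check first: different sizes cannot be identical
    return file_md5(path_a) == file_md5(path_b)
```

One way to use it would be to call same_content on each pair of suspected duplicates and only treat the pair as safe to delete when it returns True, leaving the rest for manual inspection.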


2 Responses to “Removing Duplicate MP3’s with Python - a Naive Yet Fuzzy Approach”

  1. Tim Almond Says:

    I guess this could be modified to take in more file types like images, and add in MD5 sums. Would you mind if maybe I did this sometime? I’ll gladly share the changes back with you.

    Also, maybe I could put a GUI on it?

    Tim

  2. Greg - CEO/Founder Says:

    Tim,

    You’re more than welcome to improve it any way you see fit. Another fellow on the Python mailing list is also getting involved. I’m tempted to set up a Trac page/wiki for working on it. It’s definitely worth doing if you are going to add a GUI. Let me know if you think that would be useful.

    -Greg