Discogs-dump-parser How can we adapt the code to find info related to an artist name or album?

Oct 13 '20 18:10 adelchi91

@mjb2010 By the way, thanks a lot for the code. I reckon it could be even better if we could provide code that gives access to all the data by inputting an artist name or album. Perhaps we could then transform the data into a pandas DataFrame. What do you think?

Oct 13 '20 18:10 adelchi91

The main thing I think people want to do with data dumps is import them into a database. My code could be used as a foundation for that. For example, as each release element is completely parsed, a callback function is triggered so that something (whatever you want) can be done with that chunk of data, such as further parsing it to dig into the info, and loading each value into the appropriate tables, pandas DataFrames, whatever.
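As a rough illustration of that callback pattern, here is a minimal sketch using only Python's standard-library `xml.etree.ElementTree.iterparse`. It is not discogs-dump-parser's actual API: the names `sweep` and `collect`, and the simplified sample XML layout, are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET


def sweep(source, tag, callback):
    """Stream through an XML source, calling callback(element) for each
    completely parsed element named `tag`."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            callback(elem)
            root.clear()  # discard processed children so memory use stays flat


# Tiny stand-in for a releases dump (assumed, simplified layout).
SAMPLE = b"""<releases>
<release id="1"><artists><artist><id>10</id><name>Artist A</name></artist></artists><title>First Album</title></release>
<release id="2"><artists><artist><id>11</id><name>Artist B</name></artist></artists><title>Second Album</title></release>
</releases>"""

rows = []


def collect(release):
    # Dig into one release and keep only the values of interest.
    rows.append({
        "id": release.get("id"),
        "title": release.findtext("title"),
        "artists": [a.findtext("name") for a in release.iterfind("artists/artist")],
    })


sweep(io.BytesIO(SAMPLE), "release", collect)
```

The list of dicts in `rows` could then be handed to `pandas.DataFrame(rows)` or written to database tables; for a full dump, inserting inside the callback rather than accumulating everything in memory would likely be the safer choice.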

What I use the dumps for is slightly different; I just occasionally want to get a list of IDs of releases which have some particular characteristic, e.g. "all releases with a poorly formatted date". So my callback function would just look for that characteristic, and if found, spit out the release ID and date.
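A check of that kind might look roughly like the sketch below. What counts as "poorly formatted" is an assumption here: the pattern accepts `YYYY`, `YYYY-MM`, or `YYYY-MM-DD` and flags everything else. The function name `badly_dated`, the `<released>` element name, and the sample data are also illustrative, not taken from the actual parser.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumed definition of a well-formed date: YYYY, YYYY-MM, or YYYY-MM-DD.
WELL_FORMED = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")


def badly_dated(source):
    """Yield (release_id, date) for each release whose <released> text
    does not match the assumed pattern."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "release":
            continue
        date = elem.findtext("released")
        if date is not None and not WELL_FORMED.fullmatch(date):
            yield elem.get("id"), date
        elem.clear()  # free this release's children once it has been checked


SAMPLE = b"""<releases>
<release id="1"><title>First Album</title><released>1999-03-01</released></release>
<release id="2"><title>Second Album</title><released>March 1999</released></release>
<release id="3"><title>Third Album</title><released>1999</released></release>
</releases>"""

found = list(badly_dated(io.BytesIO(SAMPLE)))
print(found)  # [('2', 'March 1999')]
```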

In other words, discogs-dump-parser is best for making one complete sweep through a data dump (however long that takes), picking out whatever the user has defined as an XML element of interest, and then doing something with each one, such as testing it for further relevance and outputting something in response.

What you are asking for could be done this way, but it would be rather painful. For example, you could blindly check each release to see if it has the artist or album title you want. Or you could start in the artists XML and look for the artist you want, and that could give you a list of releases (or master releases), and then you'd have to do another scan of the release XML and master release XML to get the info about each of those releases. Either way, it would be terribly inefficient, as compared to querying a real database, which would have indexes and would not waste time scanning through each and every item.

My advice is to think about what kind of queries you want to make, and then what kind of database you'd like to be able to run those queries against (pandas DataFrames?), and only use my code to initially populate that database, not to run the actual queries.
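To make that split concrete, here is a minimal sketch that does one sweep to populate a SQLite database and then answers an artist query from an index instead of rescanning the XML. SQLite is just one possible choice of database, and the table names, column names, function names, and sample XML are all assumptions for illustration.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def load_releases(source, conn):
    """One pass over the release XML: fill two tables, then index artist names."""
    conn.execute("CREATE TABLE releases (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE release_artists (release_id INTEGER, name TEXT)")
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "release":
            continue
        rid = int(elem.get("id"))
        conn.execute("INSERT INTO releases VALUES (?, ?)",
                     (rid, elem.findtext("title")))
        conn.executemany(
            "INSERT INTO release_artists VALUES (?, ?)",
            [(rid, a.findtext("name")) for a in elem.iterfind("artists/artist")],
        )
        elem.clear()  # free this release's children once it is stored
    # The index lets later lookups by artist name avoid a full table scan.
    conn.execute("CREATE INDEX idx_artist_name ON release_artists (name)")
    conn.commit()


def titles_by_artist(conn, name):
    """Return the titles of all releases credited to `name`, ordered by release ID."""
    cursor = conn.execute(
        "SELECT r.title FROM releases r "
        "JOIN release_artists ra ON ra.release_id = r.id "
        "WHERE ra.name = ? ORDER BY r.id",
        (name,),
    )
    return [title for (title,) in cursor]


SAMPLE = b"""<releases>
<release id="1"><artists><artist><name>Artist A</name></artist></artists><title>First Album</title></release>
<release id="2"><artists><artist><name>Artist B</name></artist></artists><title>Second Album</title></release>
<release id="3"><artists><artist><name>Artist A</name></artist></artists><title>Third Album</title></release>
</releases>"""

conn = sqlite3.connect(":memory:")
load_releases(io.BytesIO(SAMPLE), conn)
print(titles_by_artist(conn, "Artist A"))  # ['First Album', 'Third Album']
```

The slow sweep happens once in `load_releases`; after that, each query is an indexed lookup. The result of a query could also be read into a pandas DataFrame if that is the preferred working format.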

Oct 14 '20 22:10 mjb2010