I was panic-stricken. Finally, the script I’d written to rewrite the entirety of Grooveshark’s music catalog had completed after days of fix-it-and-try-again intervention. But now there were thousands of missing songs, songs which were associated with the wrong artists, empty albums, and other such problems and I didn’t know what had gone wrong. The only thing I could do was put my nose to the grindstone and fix the mess I’d made because there wasn’t time to do anything else.
It was 2008, and Grooveshark was a peer-to-peer (P2P), paid music download service. The music shown on the site was the aggregation of the MP3s located on our users’ computers. Using our desktop app, Sharkbyte, users made their MP3 collections available for sale on the service. When a user purchased and downloaded a song, a portion of the revenue from the sale was credited to the user who provided the MP3 file. When Sharkbyte scanned a user’s MP3s to add them to the site, it extracted the song, artist, album, genre, and other information, called metadata, and saved it to the Grooveshark servers. The effect of this was that the spelling, capitalization, punctuation, and the information itself, correct or not, of our music catalog, was completely dependent on the quality of the data on our users’ computers.
At this time, Grooveshark’s database made no distinction between a song, in the sense of a recorded work by a musical artist, and an MP3 copy of it. 128, 192, and 320 kbps versions of the same song were treated as three different recorded works. This is because we used an algorithm to create a short, text fingerprint of each MP3 file, known as its checksum, and used this as each song’s unique identifier in our database. MP3s which were encoded at different bitrates, or the same bitrate by different CD rippers, would be recognizable as the same song to the human ear but would be represented as separate “songs” in Grooveshark’s database. In practice this meant that the same “song” was represented on Grooveshark hundreds or thousands of times.
Searching Grooveshark for …Baby One More Time would yield scores of slightly dissimilar results. Users’ individual libraries would show incorrect information too, because Grooveshark only compared checksums when building user libraries and would ignore most users’ own metadata, in favor of the metadata of the first person to have scanned a song. If an earlier Grooveshark user thought that your copy of A Horse With No Name was recorded by Neil Young, not America (a common mistake), it would appear this way, misattributed, in your Grooveshark library.
From a technical perspective, this “metadata problem,” as we would come to call it, was understandable. But it made for a terrible user experience. At first we attempted different filtering strategies to deduplicate song listings on the site but ultimately we knew that we needed to change the situation in a fundamental way.
We decided to restructure our database to store file and song information separately by creating a one-to-many relationship between a song and it’s multiple MP3 files. Each existing “song” record would be converted a “file”, and a new, parent song would be created to which all of these new file records would be linked. We called this project “File/Song.” Grooveshark Lite, which was early in development at this time, would be based on the File/Song database and would benefit from these organizational improvements.
One of the strategies that I used to find duplicate versions of songs was to normalize every song’s title by converting it to lowercase and removing all whitespace and punctuation. This would allow me to compare …Baby One More Time and Baby One More Time, and along with normalized artist and album names, recognize that these two files were actually the same song. In addition, I used a well-known text-comparison algorithm to calculate the similarity of pieces of metadata. This would let me compare Britney Spears and Brittney Spears, for example, and know that these were potentially the same artist. I also compared our data to a high-quality open source music database, and used it to identify which of our records were accurate.
With these strategies in hand, the first step to deduplicating Grooveshark’s songs was to deduplicate all artists and albums. After that, all “song” records were converted to “file” records, and new song records were created from the deduplicated file metadata. Once this was done, I updated all user libraries, playlists, and favorite song lists, to point to the new records. I also used this opportunity to restructure other aspects of Grooveshark’s database: File/Song would give us the ability to have songs appear on multiple albums, for albums to have multiple collaborating artists, and other such changes that better mapped our database to the real-world of the music industry.
After designing the new database, I wrote hundreds of SQL queries in a script which would convert our existing database to the new format, along with several helper programs. We created a separate version of the Grooveshark website and rewrote much of its backend to be compatible with the new design. The song, artist, and album pages, user libraries, search features, and many other things had to be changed because of the changes that I was making to the database. To keep from having to take Grooveshark offline while this conversion process was running, we created a new, nearly read-only mode for the website. When enabled, users could browse Grooveshark’s music catalog, preview and even purchase songs, but they wouldn’t be allowed to add new music to their libraries while the process was running.
The project was progressing but I was way off my original estimate for its completion date. As the File/Song project slipped further and further behind schedule, Grooveshark Lite was nearing closer to release. Since Grooveshark Lite had been written to use this new database, the pressure was on and the clock was ticking. Working long, odd hours wasn’t uncommon at Grooveshark, but I worked especially long and odd hours, including weekends and overnight, to complete the project.
… Take a look through the AC filesong buglist and fix anything you
can–I’m sure there are still a few open bugs in there. We have to
absolutely make sure that the signup/account request/invite/request
manager (app/reqmgr) process works without error.
Oh, if you can, move the filesong database from 172.16.1.7 (box under
my desk) to RHL011 (dev server) so that the and Lite guys
can test their stuff.
I know this is all small stuff, and I’d ask you to look into the data
issues on the [new database], but all that stuff is jumbled up in my head and
would take an hour or so to type up exactly what went wrong (what I
think went wrong) and what’s left to be done (not much I think)–I
think it’s under control, and if it is, we’re good, and damn it sucks
if it isn’t.
Excerpt from an email sent to Travis on Apr 4, 2008 at 6:40 AM.
Finally, after several months of development and testing, I was ready to start the conversion process and build the new database. We put Grooveshark into read-only mode and created a copy of our production database server that my scripts would operate on. When this process was complete, we’d point Grooveshark to this new server and take it out of read-only mode. I connected to it and kicked everything off. The first few scripts completed quickly as expected, but the script which contained all the SQL queries, which was the bulk of the process, would take several days to complete. Every few hours I’d check-in on the server to to see if was still running. Many times, it wasn’t.
Frequently, the script would crash with a new error that I’d never seen before. I’d have to write new queries to work around the problem, comment-out everything in the script which had already run, and try again. This went on for several days, all while Grooveshark was still in read-only mode. One evening, I connected to the server to check on its status and saw that it had stopped and that the server had been restarted. I came in to the office to see what had happened to discover that Colin had accidentally restarted it. In retrospect, it’s hard to believe that we were okay with doing this work in this way, but this was 2008 and we were all young and inexperienced, and we didn’t know that a different approach was even possible.
Understandably, Josh was growing nervous and impatient about when the conversion would be complete. I began sending daily status emails to the whole company, describing the progress so far, the problems I’d encountered, and an estimated time when it would be finished and Grooveshark would be fully back online. Most of these emails concluded with the promise that “File/Song will be done tomorrow,” which quickly became a painful joke.
Finally, the script finished. Relieved and excited, I ran some queries by hand to sanity-check the quality of the data. Most things were correct, but I soon found that there were thousands, possibly tens of thousands of missing or wrong database records. I ran more queries and found more problems. My first thought was that I’d have to try it all over again, but the process took days to complete and Grooveshark Lite was scheduled to release in less time than I’d need to spend debugging it, not to mention the thought of having to keep the site in read-only mode for several more days, slowing our growth when we needed it the most.
I couldn’t panic. I had to fix the problem and fast. Thankfully, it was the weekend and the office was empty so I had nothing to distract me. I sat at my desk and wrote scores of SQL queries to identify and repair the data. I don’t remember how long I was awake working, but it was a long time. At one point I took a break and sat on one of the big, ugly, orange couches that we had in the middle of the office. Located on an ottoman was The Dictionary of Dreams and Their Meanings. I was exhausted and all I wanted to do was sleep. I snapped a picture with my iPhone, uploaded it to Flickr, and captioned it “Irony.”
Finally, after a weekend of around-the-clock work, I fixed all of the data in the new File/Song database. I never discovered what had gone wrong but I fixed it anyway. Everything checked out, the crisis was averted, and just in the nick of time. The next day, we deployed the new version of Grooveshark which had been rewritten to work with the new database, and I went home to sleep. A few days later, on April 15, 2008, we released Grooveshark Lite.
Some time later, Josh had t-shirts printed to commemorate the completion of File/Song and the launch of Grooveshark Lite. The Grooveshark Lite shirt read “I Survived the Launch of Grooveshark Lite” and on the File/Song shirt was printed “File/Song will be done Tomorrow.” It was a good-natured jab.
Ultimately, the File/Song database was a success, as it organized all of our data in a way that allowed us to release all of the features that we would build as time went on, but it didn’t solve all of our metadata problems. These would continue to plague Grooveshark for the remainder of its existence and we would attempt various strategies, to various degrees of success, over the years to correct it.