
Migration of old ludumdare.com content

Open Cerno-b opened this issue 2 years ago • 11 comments

I would like to check out what the current plans are for migrating the content of the old website to the current one.

As I understand, the old website has been taken off but archived, so the backend data is still available offline somewhere.

Since this has not been done for a long time now I assume there is a roadblock somewhere that would prevent the migration of the old data.

Can anyone get me up to date on what the reasons are that this has not been attempted?

If the reasons are of a technical nature, I would like to offer to spend some time on the issue since I would really like to see my (and other people's) old stuff as part of the current website, to showcase the long history Ludum Dare has.

So if the main problem is to align the old database content with the new database structure, I would be able to offer some background in scripting and SQL, as well as some experience with handling large amounts of data (I'm working in machine learning and have a hand in organizing our data)

I assume there would be some things that need to be decided, like how to merge accounts with different usernames and other open points.

So if anyone with knowledge on the issue could let me know if this is still on the roadmap and possibly point out a few problems and roadblocks that need be removed, I'd love to dive into this and see whether I can make some headway.

Cerno-b avatar Aug 04 '22 08:08 Cerno-b

Hi @Cerno-b. Like most things, the main roadblocks are time, my need to prioritize other things first, and that there hasn't been much interest from others in tackling this issue. That said, this is something I've penciled in for the winter (assuming I can afford to work on it).

Here's a further breakdown of some of the sticky points if you or anyone else are interested in helping out.

  • User and post data is stored in an older "long unsupported" version of a WordPress database
  • Game data is stored in 2 different formats (https://github.com/LudumDare/ludumdare-2008)
    • as blog posts with some special tags (identifying that it's a game but nothing about it)
    • in a proprietary format with actual game details
  • Blog and game page data is written in HTML, but the current page and game data is written in Markdown (not 100% CommonMark compliant, but the goal is to make it compliant in time)
    • We will not support HTML on the new website, and the formatting must be converted to CommonMark (or discarded)
  • YouTube embeds worked differently across websites
  • External image embeds are no longer supported (for security and performance reasons)
  • Other embeds (iframes) are not supported
  • New website needs to support "other" events (right now it only supports Ludum Dare events as they're currently run, no classic LD or MiniLD style events)
  • Because not all data has a matching account, we're going to need dummy/proxy accounts to own much of the data, and a procedure to claim games that belong to you
  • GDPR removal requests

I'll update this if I think of anything else, but that's the gist of things.

If I were to pick the biggest sticking point for me, it would be the HTML -> CommonMark conversion. There are some existing PHP libraries that support this, but they don't do a great job in my testing. The process of converting the data over, as I see it, will be to write some new code that parses the source Post and Game data, spits out tags that it doesn't know what to do with, and then we "rinse and repeat", adding support for some of the funny things people did until most of the posts are handled.

For example, YouTube embeds I believe were done with <iframe>s, so we can use some clever logic to extract those. Other embeds should just be turned into normal links. Some users also went <div>-happy or used HTML tables, and it would be nice to see those handled in some reasonable way. Sadly, some people did custom formatting with style tags, and IMO those should just be discarded.
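
For illustration, here is a minimal sketch of that extraction in Python (the language on offer downthread), assuming a parser like BeautifulSoup were used; the KNOWN_TAGS whitelist and the replacement rules are placeholders, not agreed-upon conversion rules:

```python
# Sketch: turn YouTube <iframe> embeds into bare links, turn other iframes
# into plain links, and collect tags we don't know how to handle yet.
# Assumes beautifulsoup4 is installed; KNOWN_TAGS is a placeholder set.
import re
from bs4 import BeautifulSoup

KNOWN_TAGS = {"p", "a", "strong", "em", "ul", "ol", "li", "br", "img"}

def convert_post(html: str) -> tuple[str, set[str]]:
    soup = BeautifulSoup(html, "html.parser")
    for iframe in soup.find_all("iframe"):
        src = iframe.get("src", "")
        match = re.search(r"youtube(?:-nocookie)?\.com/embed/([\w-]+)", src)
        if match:
            # Replace the embed with a bare YouTube link; the new site can
            # re-embed from the URL if it wants to.
            iframe.replace_with(f"https://www.youtube.com/watch?v={match.group(1)}")
        elif src:
            iframe.replace_with(src)  # other embeds become normal links
        else:
            iframe.decompose()
    unknown = {tag.name for tag in soup.find_all(True) if tag.name not in KNOWN_TAGS}
    return str(soup), unknown
```

The `unknown` set is the "rinse and repeat" part: run it over the dump, see what comes back, add handling for the most common tags, and repeat.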

The only data we're bringing over is data belonging to users that submitted games: their games and their posts. I've been leaning away from migrating the comments, but that's TBD (also, there are multiple comment systems: WordPress's, and the proprietary comments on game pages). Again, this would only be comments from people that submitted games.

mikekasprzak avatar Aug 08 '22 18:08 mikekasprzak

@mikekasprzak For me, the first order of business is to homogenize the different sources of the old data and bring them into a common meta-format that we can later convert into the target CommonMark format.

I am pretty decent with Python and have some experience scripting these kinds of transformations. I don't know a lot about the format requirements of the current website yet, but I could offer to work on homogenizing the existing data into the meta-format. I am thinking schema-based JSON or SQLite. It doesn't really matter, since it's only a staging format meant to give us a well-defined starting point for the final transformation. Anything that makes checking for missing data easy would work. In my experience, easy and visual is the best paradigm when it comes to cleaning up large amounts of data: it separates the data from the design and makes it simpler to spot outliers and missing fields. We can then do the final transformation once the raw data is nice and shiny.
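
To make the staging idea concrete, here is a hypothetical sketch of what one meta-record could look like; every field name is my own invention and would need to be agreed on before any real conversion:

```python
# Sketch of one possible staging record. All field names are hypothetical.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class StagedItem:
    source: str                          # e.g. "wordpress" or "ld-proprietary"
    source_id: str                       # primary key in the original table
    author: str                          # username only, no personal data
    kind: str                            # "post", "game", or "comment"
    title: str = ""
    body_html: str = ""                  # raw body, converted to CommonMark later
    missing_fields: list[str] = field(default_factory=list)

item = StagedItem(source="wordpress", source_id="12345",
                  author="Cerno-b", kind="post", title="My LD entry")
print(json.dumps(asdict(item), indent=2))   # easy to eyeball for gaps
```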

The formatting itself is also something that can be homogenized. I can search for any potential control sequences (markup tags), split them into convertible and non-convertible, and script a conversion to CommonMark or nil them, as you mentioned. The goal would be to have all the free-text fields contain only CommonMark-compatible markup, which should be straightforward to ensure in the script.
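
A first pass at that convertible/non-convertible split could be a simple tag inventory over the whole dump, something like this sketch (the CONVERTIBLE set below is a placeholder, not a decision):

```python
# Sketch: inventory every tag that appears across all free-text fields, so we
# can decide by frequency which tags get a CommonMark conversion and which
# get nilled. CONVERTIBLE is illustrative only.
import re
from collections import Counter

TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)")
CONVERTIBLE = {"p", "a", "b", "i", "strong", "em", "ul", "ol", "li", "br", "img"}

def tag_report(bodies):
    """`bodies` is any iterable of raw HTML strings from the dump."""
    counts = Counter(t.lower() for body in bodies for t in TAG_RE.findall(body))
    for tag, n in counts.most_common():
        status = "convert" if tag in CONVERTIBLE else "decide or nil"
        print(f"{tag:>10}  {n:>7}  {status}")
```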

Would it be possible to give me access to the whole dataset that you want to transform, including basically everything? Even binaries like screenshots (if they are part of your backup and not externally hosted), etc.? If necessary, stripped of any personal/sensitive information like password hashes, unless they should be part of the transformation. I would also try to scrape the images from external sources so we are no longer dependent on their uptime, and you can decide later whether you want to host them yourself (not sure what your plans are in that regard concerning server space).
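
For the external images, the scraping could be as simple as the following sketch (using requests; the storage layout is illustrative, and dead links would be logged for a second pass):

```python
# Sketch: mirror externally hosted images referenced in post bodies so the
# migrated data no longer depends on third-party uptime.
import hashlib
import pathlib
import requests

MIRROR_DIR = pathlib.Path("mirrored_images")

def mirror_image(url: str):
    MIRROR_DIR.mkdir(exist_ok=True)
    # Hash the URL so the local name is stable across repeated runs.
    name = hashlib.sha256(url.encode()).hexdigest()[:16]
    target = MIRROR_DIR / name
    if target.exists():
        return target                    # already fetched on an earlier run
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
    except requests.RequestException:
        return None                      # dead link, revisit separately
    target.write_bytes(resp.content)
    return target
```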

At this point, it doesn't matter if it's a messy heap of different formats, but it should be as complete as possible so I can finalize the script without having to fiddle around with stragglers that appear at a later point. If it's well-defined and text-based, proprietary or not, I can probably parse it, using custom regexes for custom formats if necessary. My goal would be to extract as much as I possibly can, so if migrating the comments is something you would like to do, we should try. I'm not sure if there are non-technical limitations, but if it can be done conceptually, I would like to at least attempt it. I hate losing data.

So if you are on board with this process, I would start looking at the big pile and see what I can do in terms of homogenization. I might need to get back to you with some questions about what fields are actually needed during the transform and what to do with corner cases. But then I'd be able to start getting everything into the meta format.

I could also do the CommonMark conversion later if I get some info on how exactly it should look. Once we have a clean meta-data definition of the raw data, that should be pretty straightforward, I think. But first things first.

So, let me know what you think about this. I'm on vacation until the end of the week, so I would prefer to get the biggest chunk out of the way before then if I can.

So if you can show me where I can get the big pile of everything, that would be neat.

Cerno-b avatar Aug 08 '22 20:08 Cerno-b

@mikekasprzak Is this still something you want to push forward? I think I can make some good progress based on your outline, but I would need data to work on.

Would it take significant effort on your side to collect the data? I understand you're very busy, so maybe we can do this step by step? Whatever data you can provide would put me in a position to begin scripting.

Cerno-b avatar Aug 28 '22 19:08 Cerno-b

Hey Cerno-B,

My apologies for the late response.

I've been struggling to figure out what to do about sharing data. Obviously I can't make everything public, since it contains emails, credentials, passwords, etc. I'm also not sure how to present it. To me it's several database tables, which isn't the most useful format to give anyone. For GDPR reasons I don't think I can dump "some data" and put it on GitHub either. 🤔

Also if I use something, I need to be able to maintain it and add to it. I don't know Python. 😅

I need to ponder this a bit more.

mikekasprzak avatar Oct 10 '22 07:10 mikekasprzak

@mikekasprzak Thanks for coming back to me. A few points of clarification:

On the technical side:

I don't mind being handed a database dump, I can work with a lot of formats as long as I can somehow bring them to plain text instead of figuring out how to read binary blobs. An SQL database dump would be no problem for me.

You would not need Python at all, since I will only be using the scripts to convert your different sources into a single format that you can work with. That could be JSON, or it could be a plain CSV file. So if you want to maintain anything, it would be that file format; but if I understood you correctly, you wanted to convert the data to Markdown anyway, so you would only need to maintain the result. The idea wasn't to have a script that does the conversion every now and then, but instead to collect all the data, freeze it, convert it once, and then never look back. Of course, that would mean keeping as much information intact as we can.

Basically the workflow would be: You hand me the dirty data and get clean data back in a format that you can work with. You can get the scripts as well if you want but that would not be necessary at all.

Now about the legal side:

I agree that making the whole dumps public will not work, so I would offer to strip the data of all personalized information except usernames or other IDs you would need to integrate the old stuff into the website. Afterwards you can publish the data.

Do you need the personalized data for the matching process, or is the goal to simply strip it? I could either just remove it or give you a separate dataset that maps each username to any info you want to keep.
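
As a sketch of that separation, assuming CSV exports and guessing column names from a stock WordPress schema (wp_users: ID, user_login, user_email, user_pass), which may not match the actual dump:

```python
# Sketch: split a users table into a public part (attribution only) and a
# private part that never leaves your machine. Column names are guesses.
import csv

def split_users(dump_csv, public_csv, private_csv):
    with open(dump_csv, newline="", encoding="utf-8") as src, \
         open(public_csv, "w", newline="", encoding="utf-8") as pub, \
         open(private_csv, "w", newline="", encoding="utf-8") as priv:
        reader = csv.DictReader(src)
        public = csv.DictWriter(pub, fieldnames=["ID", "user_login"])
        private = csv.DictWriter(priv, fieldnames=["user_login", "user_email"])
        public.writeheader()
        private.writeheader()
        for row in reader:
            # Public side: only what the migration needs to attribute content.
            public.writerow({"ID": row["ID"], "user_login": row["user_login"]})
            # Private side: the mapping you may want to keep for matching.
            private.writerow({"user_login": row["user_login"],
                              "user_email": row["user_email"]})
            # user_pass and any other sensitive columns are never written out.
```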

Of course that would mean you would need to trust me to handle the user data carefully on my side and delete everything after passing it back, which may be a big ask.

One option would be a bit more trouble to set up: if you could give me access to a remote PC where all the data is stored, I would be able to do the work without ever holding a copy of the data myself, as I would only log into the machine to do the conversion.

Another alternative would be that you strip the user data on your side before handing it over. Since I don't know the exact specs of all the data formats, it's hard to tell whether I can help you with that remotely, but maybe we could have a Zoom call and you could walk me through the different formats. If it's really all SQL, I think we should be able to create a dump without any critical data.

Let me know if any of those ideas sound workable to you.

Cerno-b avatar Oct 10 '22 09:10 Cerno-b

I personally think it would be nice to also have the original page as a read-only site if the migration to ldjam.com will not be complete (it probably can't be). Rendering the old WordPress pages out as static HTML, hosting that on a simple server, and linking the migrated ldjam entries to the corresponding legacy pages seems feasible.

pschichtel avatar Oct 15 '22 22:10 pschichtel

@pschichtel I think that was the case with the old ludumdare.com website, which was still active a few years ago. It was taken down for a reason; see

https://ludumdare.com/resources/archive/compo/

I think migrating will be the cleaner option, but I need Mike's support to get the old data.

There is an alternative, but I don't know where this data came from; possibly it was scraped off the archive:

https://ludumdata.openfu.com/

Cerno-b avatar Oct 16 '22 18:10 Cerno-b

Hi @mikekasprzak

I'm ready to give this another try.

I am still convinced that the plan I outlined, sanitizing the old data with a script and then bringing it into a form that is compatible with the current website's data structures, is a good approach.

The open issue is that you feel hesitant about uploading the data somewhere for password security reasons, which I understand.

Here is my proposal: Would you be able to send me a copy of just the data from my account? You wouldn't have to worry about security for that dataset.

With that set, I could write a Python script that strips all sensitive information from the data. I could then commit the script so someone you trust can verify that the code is not doing anything dangerous. Once it's cleared, you would run it on the whole dataset on your side, which would yield a clean version free of any passwords or other sensitive information.

Then you could send this safe data dump to me or upload it somewhere and I can start with the full conversion.

Would this be something you'd feel comfortable with?

Cerno-b avatar Apr 24 '23 21:04 Cerno-b

And to clarify: you won't need to know Python, because the script won't have to be maintained. It's a one-time transfer from the old format to whatever format you feel comfortable working with. The script will run once and won't be needed afterwards.

Cerno-b avatar Apr 24 '23 21:04 Cerno-b