It must be nice having access to the database this forum uses (I am assuming you got the list directly from the DB).
I wish that were so. I actually just did some gentle web scraping (i.e. used a Chrome extension that navigated to every page on every sub-forum and copied the links it found on each page, with a 2 second delay between requests). It was about 180 page loads, and took 6 or so minutes to complete.
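That kind of "gentle scraping" loop is simple enough to sketch in a few lines of Python. This is just an illustration of the idea, not what the extension actually does internally; the `topic=` URL pattern is my assumption about SMF-style thread links, and the real markup will differ.

```python
from html.parser import HTMLParser

class ThreadLinkParser(HTMLParser):
    """Collects hrefs from <a> tags that look like forum thread links."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        if "topic=" in href:            # SMF-style thread URLs contain "topic="
            self.links.append(href)

def extract_thread_links(html):
    parser = ThreadLinkParser()
    parser.feed(html)
    return parser.links

# Demo on a scrap of inline HTML:
sample = '<a href="index.php?topic=42.0">A thread</a> <a href="index.php?board=3.0">A board</a>'
print(extract_thread_links(sample))     # -> ['index.php?topic=42.0']

# A real run would loop over the ~180 board index pages, fetch each one
# with urllib.request.urlopen(url), feed the HTML through
# extract_thread_links, and time.sleep(2) between requests -- the same
# polite delay the extension used.
```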
As much as I'd like to do it, out of respect for our servers I won't try and copy every page from every thread.
Our forum stats say there are over 700,000 posts, but I think quite a lot have been deleted. If it were 500,000 posts, at 10 posts per page, that'd be 50,000 page requests... and if you gave the extension say 10 seconds per page to load (just to be safe, since some pages are very image/gif heavy), that'd be something like 140 hours to grab it all. And I'd have doubts about the extension's ability to handle anywhere near that amount of data without crashing midway through the process or when trying to export it to a CSV. I'm not sure how it stores the data while processing; I wasn't watching my memory usage while it worked. The CSV it spit out for me was only 3MB, but that was only 180 page loads' worth of data, and it was extracting only thread titles (averaging maybe 30 characters in length) and the corresponding links.
If the average post is a couple of sentences (a complete guess, but let's say 300 characters), I think that would work out to something like 150 MB of raw post text for the extension to keep in memory (and the full HTML pages would weigh far more than that). I dunno. Not going to try that anyway, haha. At least not on our forum.
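The back-of-envelope math is easy to sanity-check with the same guessed figures:

```python
# Back-of-envelope check, using the guessed figures from above.
posts = 500_000
posts_per_page = 10
seconds_per_page = 10          # generous, for image/gif-heavy pages
avg_post_chars = 300           # "a couple of sentences"

pages = posts // posts_per_page                  # page requests needed
hours = pages * seconds_per_page / 3600          # total crawl time
text_mb = posts * avg_post_chars / 1_000_000     # raw post text, in MB

print(pages, round(hours), round(text_mb))       # -> 50000 139 150
```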
It was a 6 minute job but took two days to work out how to do it. I looked into, and tried, so many different methods before landing on something that worked for me.
-archive.org/the Wayback Machine has some ability to archive sites, but if you want full archiving it's quite an expensive service. Pros: the result would be fully browsable. Cons: slow as crap, expensive, and I couldn't figure out if it would even work for this.
-some web crawler called Heritrix. Like many of the potential solutions, I think this one fell apart for me because I couldn't make it work right from the command line. So many of these tools are built for, and documented around, Linux, and I've never been good with Linux. I went down a rabbit hole of enabling the Linux subsystem on Windows and installing Ubuntu, but trying to craft the right commands to do the exact thing I wanted seemed impossible, or would create so much garbage data I'd have to filter.
-HTTrack, an ancient-seeming piece of web scraping software, but at least it had a GUI. I could never get the settings to do what I wanted, though.
-wget was the most recommended tool for the job, but it's command-line only, and it still suffered from the problem of over-collection and making too much waste. There are a lot of GUI front-ends built on wget, but I couldn't get good results with any of them.
-eventually I gave up on command-line and standalone software and went looking for browser extensions, and tried several. SingleFileZ and WebScrapBook were powerful tools, but again failed to quite hit the spot.
-I bounced around between extensions and command-line solutions for a while. It got to the point where all my Google searches were returning two pages' worth of Stack Exchange links I'd already visited, and then a bunch of irrelevant results after that.
-I tried playing around with a macro tool I've used in the past called Pulover's Macro Creator. It's a great and powerful free program with a pretty intuitive interface... I recently used it to help translate the entire first chapter of Harry Potter from German to English, one sentence at a time, using DeepL (which does a much better job of translating to natural English than Google Translate). But the catch is you'd usually have to pay a subscription fee to translate something that long, or to access their API, so I made a macro to just copy and paste each sentence onto the DeepL site and grab the results.
Anyway, it worked great for that, but it became a programming problem trying to get it to do what I wanted here.
-I hit on the idea of trying to find a spreadsheet solution, since that's the data-handling environment I'm most comfortable in and about as close as I get to anything one might consider "programming". Google Sheets has a great function called IMPORTXML that you can point at particular tags in a site's source code to grab very specific data. This would be perfect for the forum, since links to sub-forums and threads are all contained in unique tags, and once you work out the right expressions to isolate them, the rest would be easy. But for whatever reason this forum seemed to reject any request I made using this method. The requests come from a range of IP addresses belonging to Google. My requests kept throwing errors, but worked fine on every other site I tried, even other forums running SMF. The only thing I could conclude is that the IPs were blacklisted or filtered out by some firewall on the forum servers. Weirdly, one time after taking a break for a while I came back and some data had actually come through. So then I thought, oh, maybe it's a timeout issue or something. But there's really no way to control that stuff within the IMPORTXML function.
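For what it's worth, the kind of formula involved looks something like this. The URL and the XPath expression here are made-up placeholders; the right XPath depends on the forum's actual markup:

```
=IMPORTXML("https://forum.example.com/index.php?board=1.0",
           "//td[contains(@class,'subject')]//a/@href")
```

The first argument is the page to fetch, the second is an XPath query over its HTML, and the matched values spill down the column.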
-then I tried to write a custom script through Google Apps Script using the UrlFetchApp service, but ran into similar problems as with IMPORTXML.
-then I tried a custom add-on for Google Sheets called importFromWeb, which actually COULD access data from the forum, but there was so little documentation for the add-on, and almost no community to ask for help. I couldn't figure out how to write a suitable formula to make it work. It showed some promise, but ended in frustration.
-I even played around with some packet sniffers and tried to see if I could just filter and grab the data that way. I believe the data is all there; the problem was automation. It would only capture the data of pages I actually accessed, and I didn't want to manually click through every page.
-I thought there must be an extension that at least auto-saves the source code of every page one visits. SingleFile and WebScrapBook could do this in theory, and I spent quite a while trying to set both up to do just that, but kept running into issues. And again, there was no automation. One neat thing, though: one of the auto-saving extensions, I think maybe WebScrapBook, had some kind of blockchain-style distributed ability. In theory, multiple people could run the extension, every page they visited on the forum would be automatically saved to a server, and the forum would get backed up a bit at a time (and continually refreshed) without actually costing the servers any additional load. At least, that was my understanding... if a person were clever enough to get all the settings just right.
-eventually I came across the Chrome extension Web Scraper. For whatever reason I never came across it sooner; it must not be as well known, or there's just less discussion around it.
Great extension. So easy and powerful... also fantastic tutorials to work from. You don't even have to write any expressions or formulas. You identify the type of elements you want to extract from a page by clicking on them, and the extension figures out what makes them unique and auto-finds them on any page you bring up. Then you give it a range of pages to navigate, and the basic hierarchy of those pages, and it does the rest. You can also adjust settings so you don't clobber your target site with page requests. When it's done running, you can just export the data as a CSV and do as you like with it. It just worked. And created no unnecessary data. Only the stuff I wanted and nothing else. So clean and perfectly formatted.
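As I understand it, the extension stores that point-and-click setup as a JSON "sitemap" you can export and import. A rough sketch of what one might look like, from memory (the selector details, the forum URL, and the [start-end:step] page-range syntax are illustrative and may not match the real format exactly):

```
{
  "_id": "forum-threads",
  "startUrl": ["https://forum.example.com/index.php?board=1.[0-580:20]"],
  "selectors": [
    {
      "id": "thread-link",
      "type": "SelectorLink",
      "parentSelectors": ["_root"],
      "selector": "td.subject a",
      "multiple": true
    }
  ]
}
```

The startUrl range is what lets it walk every page of a board without you listing the URLs by hand.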