
Author Topic: meta  (Read 116367 times)

DarkeningHumour

  • Objectively Awesome
  • ******
  • Posts: 10453
  • When not sure if sarcasm look at username.
    • Pretentiously Yours
Re: meta
« Reply #760 on: June 14, 2018, 08:37:53 AM »
Whereas I'm an old soul.
« Society is dumb. Art is everything. » - Junior

https://pretensiouslyyours.wordpress.com/

pixote

  • Administrator
  • Objectively Awesome
  • ******
  • Posts: 34237
  • Up with generosity!
    • yet more inanities!
Re: meta
« Reply #761 on: October 23, 2019, 12:02:15 PM »


I am twelve years old.

pixote
Great  |  Near Great  |  Very Good  |  Good  |  Fair  |  Mixed  |  Middling  |  Bad

Junior

  • Bert Macklin, FBI
  • Global Moderator
  • Objectively Awesome
  • ******
  • Posts: 28709
  • What's the rumpus?
    • Benefits of a Classical Education
Re: meta
« Reply #762 on: October 23, 2019, 12:50:09 PM »
Happs!
Check out my blog of many topics

“I’m not a quitter, Kimmy! I watched Interstellar all the way to the end!”

smirnoff

  • Objectively Awesome
  • ******
  • Posts: 26251
    • smirnoff's Top 100
Re: meta
« Reply #763 on: October 23, 2019, 10:19:28 PM »
Next year it's your Rite of Ascension ceremony, pixote. With painstiks! :))

smirnoff

  • Objectively Awesome
  • ******
  • Posts: 26251
    • smirnoff's Top 100
Re: meta
« Reply #764 on: April 21, 2021, 08:24:57 AM »
List of all 9,364 threads on the forum.

All hyperlinked and searchable (ctrl+f).

Sometimes I know a thread exists somewhere but for whatever reason using Google or the integrated forum search tool simply fails. It's also too slow and cumbersome to search for minor iterations of your query (not to mention, taxing on the server). This is faster and guaranteed comprehensive.

Might be worth bookmarking.

1SO

  • FAB
  • Objectively Awesome
  • ******
  • Posts: 36128
  • Marathon Man
Re: meta
« Reply #765 on: April 21, 2021, 10:08:03 AM »
I was thinking about the nearly 200 redundant Director threads that are not in the Index and you've made it easier for me to fold them in.

I also like seeing all 865 Spoiler Threads alphabetized on one page.

Would love to delete the 450 Frame Game polls, especially with most of those images now dead links.

smirnoff

  • Objectively Awesome
  • ******
  • Posts: 26251
    • smirnoff's Top 100
Re: meta
« Reply #766 on: April 21, 2021, 10:43:54 AM »
Yea, those threads could certainly be purged. :)

Dave the Necrobumper

  • Objectively Awesome
  • ******
  • Posts: 12730
  • If I keep digging maybe I will get out of this hol
Re: meta
« Reply #767 on: April 21, 2021, 05:40:40 PM »
It must be nice having access to the database this forum uses (I am assuming you got the list directly from the DB). :)

Eric/E.T.

  • Elite Member
  • ****
  • Posts: 3830
Re: meta
« Reply #768 on: April 21, 2021, 08:52:44 PM »


Quote from: pixote
I am twelve years old.

Three years, we throw you a quinceañera!
A witty saying proves nothing. - Voltaire

smirnoff

  • Objectively Awesome
  • ******
  • Posts: 26251
    • smirnoff's Top 100
Re: meta
« Reply #769 on: April 22, 2021, 04:21:27 AM »
Quote from: Dave the Necrobumper
It must be nice having access to the database this forum uses (I am assuming you got the list directly from the DB). :)

I wish that were so. I actually just did some gentle web scraping (i.e. used a Chrome extension that navigated to every page of every sub-forum and copied the links found on each page, with a 2-second delay between requests). It was about 180 page loads and took 6 or so minutes to complete.
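For illustration, a rough Python equivalent of that kind of polite link scrape might look like the sketch below. The extension itself wasn't Python, and the forum URL, board id, page count, and the "topic=" link pattern here are all placeholder assumptions, not the actual setup:

```python
import csv
import time

import requests
from bs4 import BeautifulSoup

BASE = "https://example-forum.com/index.php"  # placeholder forum URL
BOARD = 1        # placeholder board id
PAGES = 18       # placeholder: number of index pages on this board
PER_PAGE = 20    # assumed topics listed per index page

rows = []
for page in range(PAGES):
    # SMF board index pages paginate as ?board=<id>.<offset>
    url = f"{BASE}?board={BOARD}.{page * PER_PAGE}"
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    # Collect thread links: SMF topic URLs contain "topic=" in the href.
    for a in soup.find_all("a", href=True):
        title = a.get_text(strip=True)
        if "topic=" in a["href"] and "msg" not in a["href"] and title:
            rows.append((title, a["href"]))  # skip "last post" style links
    time.sleep(2)  # be polite: 2-second delay between page requests

# Export the collected (title, link) pairs, like the extension's csv export.
with open("threads.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```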

As much as I'd like to do it, out of respect for our servers I won't try and copy every page from every thread. :))

Our forum stats say there are over 700,000 posts, but I think quite a lot have been deleted. If it were 500,000 posts, at 10 posts per page, that'd be 50,000 page requests... and if you gave the extension say 10 seconds per page to load (just to be safe, since some pages are very image/gif heavy), that'd be something like 140 hours to grab it all. And I would have doubts about the extension's ability to handle anywhere near that amount of data without crashing midway through the process or when trying to export it to a csv. I'm not sure how it stores the data while processing; I wasn't watching my memory usage while it worked. The csv it spit out for me was only 3MB, but that was only 180 page loads' worth of data, and it was extracting only thread titles (averaging maybe 30 characters in length) and the corresponding links.

If the average post is a couple of sentences (a complete guess, but let's say 300 characters), the post text itself would only come to around 150 MB, but the full HTML of 50,000 pages would likely run to a few gigabytes for the extension to keep in memory. I dunno. Not going to try that anyway, haha. At least not on our forum.
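For the back-of-the-envelope math behind those numbers (the 10 seconds per page and the ~100 KB of HTML per page are just guesses):

```python
# Rough scrape-cost estimate for the whole forum (all inputs are guesses).
posts = 500_000            # assumed surviving posts
posts_per_page = 10
seconds_per_page = 10      # generous, since some pages are image/gif heavy
chars_per_post = 300       # "a couple of sentences"
html_per_page_kb = 100     # rough size of one rendered forum page

pages = posts // posts_per_page                  # 50,000 page requests
hours = pages * seconds_per_page / 3600          # ~139 hours of crawling
text_mb = posts * chars_per_post / 1_000_000     # ~150 MB of post text
html_gb = pages * html_per_page_kb / 1_000_000   # ~5 GB of raw HTML

print(f"{pages:,} pages, ~{hours:.0f} h, ~{text_mb:.0f} MB text, ~{html_gb:.0f} GB HTML")
```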



It was a 6 minute job but took two days to work out how to do it. I looked into, and tried, so many different methods before landing on something that worked for me.

-archive.org / the Wayback Machine has some ability to archive sites, but if you want full archiving it's quite an expensive service. Pros: it would be fully browsable. Cons: slow as crap, expensive, and I couldn't figure out if it would even work.

-some web crawler called Heritrix. Like many of the potential solutions, I think this one fell apart for me because I couldn't make it work right from the command line. So many of these tools are built, and documented, for running on Linux, and I've never been good with Linux. I went down a rabbit hole of enabling the Linux subsystem on Windows and installing Ubuntu, but trying to craft the right commands to do the exact thing I wanted seemed impossible, or would create so much garbage data I'd have to filter.

-HTTrack, some ancient-seeming piece of web scraping software, but at least it had a GUI. I could never get the settings to do what I wanted though.

-wget was the most recommended tool for the job, but it's command-line only, and it still suffered from the problem of overcollection and producing too much waste. There are a lot of GUI front-ends built on wget, but I couldn't get good results with any of them.

-eventually I gave up on command-line and standalone software and went looking for browser extensions; I tried several. SingleFileZ and WebScrapBook were powerful tools, but again failed to quite hit the spot.

-I bounced around between extensions and command-line solutions for a while. It got to the point where all my Google searches were returning two pages' worth of Stack Exchange links I'd already visited, and then a bunch of irrelevant results after that.

-I tried playing around with a macro tool I've used in the past called Pulover's Macro Creator. It's a great and powerful free piece of software with a pretty intuitive interface... I recently used it to help translate the entire first chapter of Harry Potter from German to English, one sentence at a time, using DeepL (which does a much better job of translating into natural English than Google Translate). The catch is that you'd usually have to pay a subscription fee to translate something that long, or to access their API, so I made a macro to just copy and paste each sentence onto the DeepL site and grab the results.
Anyway, it worked great for that, but it turned into a programming problem when I tried to adapt it to what I wanted here.

-I hit on the idea of trying to find a spreadsheet solution, since that's the data-handling environment I'm most comfortable in and about as close as I get to anything one might consider "programming". Google Sheets has a great function called importxml that you can use to target particular tags in a site's source code and grab very specific data. This would be perfect for the forum, since links to sub-forums and threads are all contained in unique tags, and once you work out the right expressions to isolate them (there's a rough sketch of that idea at the end of this post), the rest would be easy. But for whatever reason this forum seemed to reject any request I made using this method. The requests come from a range of IP addresses belonging to Google. My requests kept throwing errors, but worked fine on every other site I tried, even other forums running SMF. The only thing I could conclude was that those IPs were blacklisted or filtered out by some firewall on the forum servers. Weirdly, one time after taking a break for a while I came back and some data had actually come through. So then I thought maybe it's a timeout issue or something, but there's really no way to control that stuff within the importxml function.

-then I tried writing a custom Google Apps Script using the UrlFetchApp service, but ran into similar problems as with importxml.

-then I tried a custom add-on for Google Sheets called importFromWeb, which actually COULD access data from the forum, but there was so little documentation for the add-on, and almost no community to ask for help, that I couldn't figure out how to write a suitable formula to make it work. It showed some promise, but ended in frustration.

-I even played around with some packet sniffers to see if I could just filter and grab the data that way. I believe the data is all there; the problem was automation. It would only capture the data for pages I actually visited, and I didn't want to manually click through every page.

-I thought there must be an extension that at least auto-saves the source code of every page one visits. SingleFile and WebScrapBook can do this in theory, and I spent quite a while trying to set both up to do just that, but kept running into issues. Also, there was no automation process. One neat thing though: one of the auto-saving extensions, I think maybe WebScrapBook, had some kind of block-chain-style capability. In theory, multiple people could run the extension, which would auto-save every page they visited on the forum to a server, and the forum would get backed up a bit at a time (and continually refreshed) without actually costing the servers any additional load. At least that was my understanding... if a person were clever enough to get all the settings just right.

-eventually I came across the Chrome extension Web Scraper. For whatever reason I never ran into it sooner. It must not be as well known, or there's just less discussion around it.

Great extension. So easy and powerful... also fantastic tutorials to work from. You don't even have to write any expressions or formulas. You identify the type of element you want to extract from a page by clicking on it, and the extension figures out what makes it unique and auto-finds it on any page you bring up. Then you give it a range of pages to navigate, and the basic hierarchy of those pages, and it does the rest. You can also adjust the settings so you don't clobber your target site with page requests. When it's done running you can just export the data as a csv and do as you like with it. It just worked. And created no unnecessary data. Only the stuff I wanted and nothing else. So clean and perfectly formatted. :))
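For the curious, here's roughly the kind of expression I meant in the importxml bullet above. It's only a sketch: the URL is a placeholder, the XPath is a generic guess that relies on SMF topic links containing "topic=" in the href, and it's shown here with Python's lxml (which was not part of my actual process; in Sheets the same XPath would go into importxml's second argument):

```python
import requests
from lxml import html

# Placeholder board index URL; not the real forum address.
url = "https://example-forum.com/index.php?board=1.0"

tree = html.fromstring(requests.get(url, timeout=30).content)

# Generic XPath guess: SMF thread URLs contain "topic=".
# In Google Sheets this would be roughly:
#   =IMPORTXML("<board url>", "//a[contains(@href,'topic=')]/@href")
for a in tree.xpath('//a[contains(@href, "topic=")]'):
    title = a.text_content().strip()
    if title:  # skip icon-only links with no visible text
        print(title, a.get("href"))
```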
« Last Edit: April 22, 2021, 04:47:04 AM by smirnoff »

 
