I have been trying for hours to figure this out. From following a tutorial on building a scraper to just trying to find prebuilt ones, I can’t seem to make it click.

For context, I am trying to scrape books that I can’t seem to find elsewhere, so I can use them and post them for others.

The scraper tutorial

Hackernoon tutorial by Ethan Jarrell

I initially tried to follow this, but I kept getting a “couldn’t find module” error. Since I had never touched Python before this, I don’t know how to fix it, and the help links are not exactly helpful. If someone could guide me through this tutorial, that would be great.

Selenium

Selenium Homepage

I don’t really get what this is, but I think it’s some sort of Python package. It tells me to install it using the pip command, but that doesn’t seem to work (syntax error). I don’t know how to add it manually because, again, I have little idea of what I’m doing.

Scrapy

Scrapy Homepage

This one seemed like it’d be an out-of-the-box deal, but not only does it need the pip command to install, it also has about 5 other dependencies it needs to function, which complicates things more for me.

I am not criticizing these wares; I am just asking for help. If someone could help simplify it all, or maybe even point me to an easier method, that would be amazing!


Updates

  • Figured out that I am supposed to run the pip command in the Command Prompt on my computer, not in the Python interpreter: py -m pip install followed by the package name

  • Got the Ethan Jarrell tutorial to work and managed to add in Selenium, which made me realize that Selenium isn’t really helpful for the project. rip xP

  • Spent a bunch of time trying to rework the basic scraper to handle dynamic sites, without success

  • Online self-help doesn’t go into as much depth as I would like, probably due to the legal grey area
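
For anyone hitting the same “couldn’t find module” error, here is a minimal Python sketch that checks which modules can be imported and builds the install command to run in Command Prompt. The module list is an assumption (requests, BeautifulSoup, and Selenium are guesses at what the tutorial and this thread need), and the helper names are made up for illustration.

```python
import importlib.util

# Import name -> name to give pip. These differ for BeautifulSoup.
# Assumption: these are the modules the scraper needs; adjust as required.
PACKAGES = {
    "requests": "requests",
    "bs4": "beautifulsoup4",
    "selenium": "selenium",
}


def missing_packages(packages=PACKAGES):
    """Return the pip names of every module that cannot be imported."""
    return [
        pip_name
        for module, pip_name in packages.items()
        if importlib.util.find_spec(module) is None
    ]


def install_hint(missing):
    """Build the command to type in Command Prompt (not inside Python)."""
    if not missing:
        return "All modules found."
    return "py -m pip install " + " ".join(missing)


if __name__ == "__main__":
    print(install_hint(missing_packages()))
```

Save it as a .py file and run it with py followed by the file name. Typing the printed command at the Python >>> prompt is what produces the syntax error; it belongs in Command Prompt.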


  • chicken@lemmy.dbzer0.com
    2 hours ago

    The reason to use Selenium is if the website you want to scrape uses JavaScript in a way that prevents getting the content without a full browser environment. BeautifulSoup is just a parser; it can’t solve that problem.
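
A rough illustration of the parser limitation described above, using Python’s built-in html.parser in place of BeautifulSoup so it runs without installing anything. The sample page and the "book" id are invented; the point is only that a parser sees the HTML as downloaded and never runs the script that would fill in the text.

```python
from html.parser import HTMLParser

# Invented example of what a "dynamic" page can look like when downloaded:
# the div is empty and a script would fill it in later, inside a browser.
STATIC_HTML = """
<html><body>
  <div id="book"></div>
  <script>
    document.getElementById("book").textContent = "Chapter 1 text";
  </script>
</body></html>
"""


class TextInDiv(HTMLParser):
    """Collect the text inside <div id=...>. Does not handle nested divs."""

    def __init__(self, div_id):
        super().__init__()
        self.div_id = div_id
        self.inside = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == self.div_id:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text.append(data)


def div_text(html, div_id):
    parser = TextInDiv(div_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.text).strip()


if __name__ == "__main__":
    # Prints an empty string: the script was never executed.
    print(repr(div_text(STATIC_HTML, "book")))
```

BeautifulSoup would give the same empty result on this page, since it also only parses the markup it is handed.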

    • aMockTie@beehaw.org
      56 minutes ago

      In my experience, this scenario typically means that there is some sort of API (very likely undocumented) that is being used on the backend. That requires a bit more investigation and testing with browser developer tools, the JS Console, and often trial and error. But once you overcome that (admittedly very complex and technical) hurdle, you can almost always get away with just using the requests library at that point.

      I’ve had to do that kind of thing more times than I’d like to admit, but the juice is almost always worth the squeeze.
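
A minimal sketch of that approach, assuming you have already spotted a JSON endpoint in the Network tab of the browser developer tools. The URL and the shape of the JSON ("chapters" with "title" fields) are hypothetical placeholders; a real site will use its own.

```python
def fetch_json(url, params=None):
    """Request the endpoint directly, the same way the page's JavaScript does."""
    # Third-party library (py -m pip install requests). Imported here so the
    # parsing helper below can be tried without it installed.
    import requests

    response = requests.get(
        url,
        params=params,
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()  # stop early on 403, 404 and similar
    return response.json()


def chapter_titles(payload):
    """Pull titles out of the hypothetical {"chapters": [{"title": ...}]} shape."""
    return [chapter["title"] for chapter in payload.get("chapters", [])]


if __name__ == "__main__":
    # Hypothetical endpoint, copied from the Network tab in practice.
    data = fetch_json("https://example.com/api/books/123/chapters")
    for title in chapter_titles(data):
        print(title)
```

Some endpoints also expect cookies or extra headers that the browser sends; those can usually be copied from the same Network tab entry.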

    • Noah@lemmy.dbzer0.com (OP)
      1 hour ago

      This was the original plan but it doesn’t work as well for this on ‘dynamic’ websites

      • chicken@lemmy.dbzer0.com
        51 minutes ago

        IIRC it should be able to be made to work, since it does everything a browser does. I found this search result, though it has been a while since I used it myself at all. Another thing you might try that has worked for me is iMacros; it’s a little simpler and more basic than Selenium but should work for what you say you want to do.
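
One common reason Selenium seems not to work on dynamic sites is reading the page before the JavaScript has finished filling it in. Below is a rough sketch of waiting for the content first, assuming Firefox is installed and the text lives in an element you can target with a CSS selector. The URL and selector are placeholders, and the polling loop is hand-rolled for clarity; Selenium’s own WebDriverWait does the same job.

```python
import time


def wait_for_text(driver, css_selector, timeout=15.0, poll=0.5):
    """Poll the page until a matching element has text, then return that text."""
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string value behind Selenium's By.CSS_SELECTOR.
        for element in driver.find_elements("css selector", css_selector):
            if element.text.strip():
                return element.text
        if time.monotonic() >= deadline:
            raise TimeoutError(f"nothing rendered for {css_selector!r}")
        time.sleep(poll)


def scrape(url, css_selector):
    """Open the page in a real browser and return the rendered text."""
    # Third-party library (py -m pip install selenium).
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox; webdriver.Chrome() also exists
    try:
        driver.get(url)
        return wait_for_text(driver, css_selector)
    finally:
        driver.quit()  # always close the browser window


if __name__ == "__main__":
    # Placeholder URL and selector; replace with the real page and element.
    print(scrape("https://example.com/book/123", "#book"))
```

If the text only appears after scrolling or clicking, those actions would need to happen before the wait.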