I have been trying for hours to figure this out. From a building tutorial to just trying to find prebuilt ones, I can’t seem to make it click.

For context I am trying to scrape books myself that I can’t seem to find elsewhere so I can use and post them for others.

The scraper tutorial

Hackernoon tutorial by Ethan Jarell

I initially tried to follow this, but I kept getting a “couldn’t find module” error. Since I had never touched Python before this, I don’t know how to fix it, and the help links are not exactly helpful. If someone could guide me through this tutorial, that would be great.

Selenium

Selenium Homepage

I don’t really get what this is, but I think it’s some sort of Python package. It tells me to install it using the pip command, but that doesn’t seem to work (syntax error). I don’t know how to add it manually because, again, I have little idea of what I’m doing.

Scrapy

Scrapy Homepage

This one seemed like it’d be an out-of-the-box deal, but not only does it need the pip command to install, it also has about 5 other dependencies it needs to function, which complicates it more for me.

I am not criticizing these wares, I am just asking for help and if someone could help with the simplification of it all or maybe even point me to an easier method that would be amazing!


Updates

  • Figured out that I am supposed to run the pip command in the command prompt on my computer, not in the Python interpreter: py -m followed by the pip request (for example, py -m pip install selenium).
  • umami_wasabi@lemmy.ml
    4 hours ago

    There is no simplification of the kind you’re looking for. It seems you don’t have a programming background. If you really need to scrape something, you need to learn a programming language, HTTP, HTML, and maybe JavaScript. AFAIK, there is no easy way and no point-and-click scraper-building tool. You will need to invest time and learn. Don’t worry, you should be able to get it done in 2-3 months if you put the time in.

    • Noah@lemmy.dbzer0.com (OP)
      4 hours ago

      I don’t want a point-and-click scraper, just a guide that doesn’t assume I have a background and uses simple terms for easier reading. Thanks for believing in me to be able to build the basic skills necessary! Much appreciated :3

      • umami_wasabi@lemmy.ml
        1 hour ago

        I don’t have a single guide for you, but I can lay out a road map.

        1. A programming language. I prefer Python.
        2. Basic HTML syntax and CSS selectors
        3. HTTP, specifically methods, status codes (no need to memorize them all cuz you can look them up), and cookies
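
As a small illustration of the "look it up" point in item 3, here is a minimal sketch that uses only Python's standard library. The helper name describe_status is made up for this example; HTTPStatus is part of the built-in http module.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a readable label such as '404 Not Found' for a status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know about.
        return f"{code} (not a standard status code)"
    return f"{status.value} {status.phrase}"


# A few codes a scraper is likely to run into.
for code in (200, 301, 403, 404, 500):
    print(describe_status(code))
```

Running it prints one line per code, so there is no need to memorize the table.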

        Once you have those foundations in place, you can go on and try to build a web scraper. I advise against using Scrapy, not because it is bad but because it is too overwhelming and abstracted for a beginner. I would instead advise you to use requests for HTTP and BeautifulSoup4 for HTML parsing. You will build a more solid foundation and can transition to Scrapy later, when you need its advanced functions.
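
A minimal sketch of what the requests plus BeautifulSoup4 combination can look like, assuming both packages are installed (py -m pip install requests beautifulsoup4). The function names, the sample HTML, and the example.com URL are placeholders invented for illustration, not taken from any tutorial mentioned in this thread.

```python
import requests
from bs4 import BeautifulSoup


def extract_links(html: str) -> list[tuple[str, str]]:
    """Return (text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    # "a[href]" is a CSS selector: anchor tags that carry an href attribute.
    return [(a.get_text(strip=True), a["href"]) for a in soup.select("a[href]")]


def fetch(url: str) -> str:
    """Download a page and return its HTML, raising on 4xx/5xx responses."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


if __name__ == "__main__":
    # Inline sample so the parsing step can be tried without network access.
    sample = (
        '<ul><li><a href="/book/1">First Book</a></li>'
        '<li><a href="/book/2">Second Book</a></li></ul>'
    )
    for text, href in extract_links(sample):
        print(text, "->", href)
    # To try a real page, swap in a URL (hypothetical placeholder below):
    # print(extract_links(fetch("https://example.com/")))
```

Keeping the download and the parsing in separate functions makes it easier to test the parsing on saved HTML before pointing it at a live site.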

        When you get stuck, don’t be afraid to pause your attempt and read the tutorials again. Head to the Python community on Discord to get interactive help. We welcome noobs, as we were once noobs too. Just don’t ever mention scraping there, as they can’t help if they suspect you’re trying to do something inappropriate, malicious, or illegal. They are notoriously against yt-dlp, which frustrates me a bit. Phrase it nicely and in a generic way. I will be there occasionally offering help.

  • obbeel@lemmy.eco.br
    4 hours ago

    Selenium needs a browser driver, and the web browser is typically run in headless mode. For Chrome, the driver is ChromeDriver. You can get it here.

    To make a script for it, I recommend talking to an LLM. I have asked one to build scrapers before, and it does the job.

    If you want a practical use of Selenium being demonstrated, you can see it in LucidWebSearch plugin for Oobabooga.

    • Noah@lemmy.dbzer0.com (OP)
      4 hours ago

      > I recommend talking to an LLM

      Any recommendations? Not ChatGPT

      Also thanks for the help so far!

      • obbeel@lemmy.eco.br
        4 hours ago

        Gemini, Perplexity, Poe. Creating a Selenium script isn’t that hard for them. You can try running your own, but it’s less likely to produce good results. The best coder LLM I’ve seen for self-hosting is Yi Coder 9B.

  • SpaceBishop@lemmy.zip
    3 hours ago

    I am no expert, but I have used Python in a professional environment and helped onboard a Python newbie to build out his first project.

    It would be helpful to know what your environment looks like (what OS you are running, your Python version, and your terminal interface: cmd, PowerShell, or Terminal) and which steps produce the reported error messages.

    Starting from the first time running Python on a Windows computer, the first steps should be:

    Launch PowerShell as admin and type in the following commands:

        set-executionpolicy remotesigned
        winget install python
        mkdir python
        cd python
        python -m venv scraper
        .\scraper\Scripts\activate

    Following that, you should be able to use pip to install more modules or packages. I have Visual Studio Code as my IDE, which means I can also run the code command from there to open the text editor and write whatever code I intend to run. Be sure to save it to C:\Users\youruseraccount\python. If your scripts are saved to that folder, you can run them from PowerShell by typing python followed by the filename. Any time you run scripts, open PowerShell, type cd python, and then .\scraper\Scripts\activate. Hit Enter, then type python and the name of the script you want to run.
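
To check whether the setup worked, a short script like the following can help track down "couldn’t find module" errors. It is a sketch that uses only the standard library; the helper names and the list of package names are assumptions chosen to match the tools discussed in this thread.

```python
import importlib.util
import sys


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


def is_installed(module_name: str) -> bool:
    """True when Python can locate the top-level module without importing it."""
    return importlib.util.find_spec(module_name) is not None


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
for name in ("requests", "bs4", "selenium"):
    print(f"{name}: {'installed' if is_installed(name) else 'missing'}")
```

If a package shows as missing even though pip reported success, pip and the script are probably using different Python installations. Running py -m pip install from the same activated environment usually lines them up.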

    This information dump is not the most detailed, but it should get you to the point that you can run your scripts.

  • fuckwit_mcbumcrumble@lemmy.dbzer0.com
    2 hours ago

    We use Node.js with Puppeteer for some of our web crawling at work. It’s pretty straightforward once you have a basic script to launch it. If you haven’t already, I’d highly suggest installing VS Code. You install Node.js, then use npm (Node Package Manager) to install Puppeteer and whatever other dependencies you might have. Someone out there probably has a basic JS file that will open Chrome, or just ask an LLM (I just use ChatGPT, they’re all the same shit). From there you just need to navigate to your pages, then use querySelector and .click() to click on your elements. It’s all JavaScript from there.

    Pro tip: write your queryselectors in your browser using the inspect element/console tab, then put it in your JS file. Nothing is worse than being 10 minutes into a crawl and you’ve got a queerselector.

  • undefined@lemmy.hogru.ch
    4 hours ago

    Selenium is a “driver” that controls browsers; you would need some type of software to actually drive it. If you have programming experience, it’s pretty easy to get going.

    Personally, I use it in Ruby on Rails development for unit testing but I also use it to log in to websites and perform some actions on behalf of a user (where the websites don’t offer an API).

    I don’t have experience with the others, but thought my comment may or may not be useful.

    • Noah@lemmy.dbzer0.com (OP)
      4 hours ago

      I don’t have programming experience. What sorts of software can “drive” the driver?

      • fuckwit_mcbumcrumble@lemmy.dbzer0.com
        3 hours ago

        You’re going to want to do a lot more reading ahead of time then. It’s not hard, but you really need to know some basics about JavaScript before you start.

      • undefined@lemmy.hogru.ch
        3 hours ago

        I probably can’t be of much help yet, unless for some reason you want to take up programming. I’m just not familiar with web scraping outside of programming.