GitHub - zzzgydi/webscraper: Scrape the webpage convert it into Markdown, and enhance AI search applications.

## Web Scraper

[Permalink: Web Scraper](#web-scraper)

Scrape the webpage, convert it into Markdown, and enhance AI search applications.

## Running

[Permalink: Running](#running)

To run this project, you need to create a `config/dev.yaml` configuration file. You can copy a template from `config/temp.yaml`.

Then, use the following command to run the project directly on your local machine, requiring Chrome to be installed:

Open the following URL in the browser:

```
http://127.0.0.1:4090?u=https://github.com/zzzgydi/webscraper

```

use HTTP GET mode:

```
http://127.0.0.1:4090?u=https://github.com/zzzgydi/webscraper&headless=false

```

use HTTP GET mode and disable readability:

```
http://127.0.0.1:4090?u=https://github.com/zzzgydi/webscraper&headless=false&readability=false

```

OR you can use it with curl:

#### POST /v1/scrape

[Permalink: POST /v1/scrape](#post-v1scrape)

Scrapes the webpage and returns the result in Markdown format.

Request Body

ParameterTypeDescriptionurl\_listarrayList of URLs to scrapeheadlessboolean(Optional) Whether to run in headless modereadabilityboolean(Optional) Whether to enhance readability of HTML

Example Request:

```
curl -X POST -H "Content-Type: application/json" \
     -d '{"url_list":["https://google.com"], "headless": false, "readability": false}' \
     http://127.0.0.1:4090/v1/scrape
```

This will start the server and output logs to the `output/log` directory.

## Deployment

[Permalink: Deployment](#deployment)

Create a `config/prod.yaml` file and set Chrome's `remote_url` to `ws://chromedp:9222` if you run the following command:

## Contributions

[Permalink: Contributions](#contributions)

Any form of contribution is welcome. If you have any questions or suggestions, please create an issue.

## Acknowledgments

[Permalink: Acknowledgments](#acknowledgments)

- github.com/chromedp/chromedp
- github.com/PuerkitoBio/goquery
- github.com/JohannesKaufmann/html-to-markdown
- and so on...

## License

[Permalink: License](#license)

This project is released under the MIT license. For details, please see the [LICENSE](https://github.com/zzzgydi/webscraper/blob/main/LICENSE) file.