Scrapy is a powerful web crawling and scraping framework for Python. It enables developers to extract data from websites and process it as needed. Integrating proxies into Scrapy helps bypass IP bans, access geo-restricted content, and maintain anonymity during scraping tasks.
Integrating ProxyJet with Scrapy allows users to leverage high-quality residential and ISP proxies, enhancing online anonymity, avoiding detection, and efficiently managing scraping tasks.
To get started, set up your ProxyJet account and generate a proxy string:
1. Visit ProxyJet: Go to the ProxyJet website and start the sign-up process.
2. Create Account: If you don't use Google sign-up, please make sure you verify your email.
3. Complete Profile: Fill in your profile details.
4. Pick a Proxy Type: Choose the type of proxy you need and click "Order Now".
5. Pick Your Bandwidth: Select the bandwidth you need and click "Buy".
6. Complete the Payment: Proceed with the payment process.
7. Access the Dashboard: After payment, you will be redirected to the main dashboard where you will see your active plan. Click on "Proxy Generator".
8. Switch Proxy Format: Click the toggle at the top right of the screen to switch the proxy format to `Username:Password@IP:Port`.
9. Generate Proxy String: Select the proxy properties you need and click the "+" button to generate the proxy string. You will get a string that looks something like this: `A1B2C3D4E5-resi_region-US_Arizona_Phoenix:F6G7H8I9J0@proxy-jet.io:1010`
10. Great Job: You have successfully generated your proxy!
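Before wiring the string into Scrapy, you can sanity-check it with a plain Python request. This quick test uses the `requests` library (not part of Scrapy) and httpbin.org, a public service that echoes the IP it sees; both are assumptions of this sketch, not part of ProxyJet's instructions:

```python
import requests

# Example proxy string generated in step 9
proxy = "http://A1B2C3D4E5-resi_region-US_Arizona_Phoenix:F6G7H8I9J0@proxy-jet.io:1010"

# Through a working proxy, the reported IP should be a ProxyJet
# address rather than your own machine's
response = requests.get("https://httpbin.org/ip",
                        proxies={"http": proxy, "https": proxy})
print(response.json())
```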
There are two common ways to use the proxy in Scrapy. The first is to pass the proxy details directly in the `meta` parameter of each `scrapy.Request`:
```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                # Route this request through the ProxyJet proxy
                meta={"proxy": "http://A1B2C3D4E5-resi_region-US_Arizona_Phoenix:F6G7H8I9J0@proxy-jet.io:1010"},
            )

    def parse(self, response):
        self.log(f'Title: {response.css("title::text").get()}')
```
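Because the proxy is chosen per request, the same pattern extends naturally to rotating several generated strings across requests. The sketch below cycles through a placeholder list; substitute strings you actually generate in the dashboard:

```python
import itertools

import scrapy


class RotatingProxySpider(scrapy.Spider):
    name = 'rotating_example'  # hypothetical name
    start_urls = ['http://example.com/page1', 'http://example.com/page2']

    # Placeholder strings; generate real ones in the ProxyJet dashboard
    proxies = itertools.cycle([
        'http://USER1:PASS1@proxy-jet.io:1010',
        'http://USER2:PASS2@proxy-jet.io:1010',
    ])

    def start_requests(self):
        for url in self.start_urls:
            # Each request takes the next proxy in the cycle
            yield scrapy.Request(url, callback=self.parse,
                                 meta={'proxy': next(self.proxies)})

    def parse(self, response):
        self.log(f'Title: {response.css("title::text").get()}')
```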
The second is a reusable downloader middleware that applies the proxy to every request. Open `middlewares.py` in your Scrapy project and add the following code:

```python
from w3lib.http import basic_auth_header


class ProxyMiddleware:
    def __init__(self, proxy_url, proxy_user, proxy_pass):
        self.proxy_url = proxy_url
        self.proxy_user = proxy_user
        self.proxy_pass = proxy_pass

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the proxy credentials from the project settings
        settings = crawler.settings
        return cls(
            proxy_url=settings.get('PROXY_URL'),
            proxy_user=settings.get('PROXY_USER'),
            proxy_pass=settings.get('PROXY_PASSWORD'),
        )

    def process_request(self, request, spider):
        # Route the request through the proxy and attach the
        # Basic auth header for it
        proxy = f"http://{self.proxy_user}:{self.proxy_pass}@{self.proxy_url}"
        request.meta['proxy'] = proxy
        request.headers['Proxy-Authorization'] = basic_auth_header(
            self.proxy_user, self.proxy_pass
        )
```
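Note that, as written, `process_request` overwrites any proxy set per request via `meta`. If you want requests that already carry their own proxy (as in the first approach) to be left alone, a small guard at the top of the method does it; this is an optional tweak, not part of the snippet above:

```python
    def process_request(self, request, spider):
        # Respect a proxy that was already chosen per request
        if 'proxy' in request.meta:
            return None
        proxy = f"http://{self.proxy_user}:{self.proxy_pass}@{self.proxy_url}"
        request.meta['proxy'] = proxy
        request.headers['Proxy-Authorization'] = basic_auth_header(
            self.proxy_user, self.proxy_pass
        )
```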
Then register the middleware and add your ProxyJet credentials in `settings.py` (note the port in `PROXY_URL`, matching the proxy string generated earlier):

```python
PROXY_URL = 'proxy-jet.io:1010'
PROXY_USER = 'A1B2C3D4E5-resi_region-US_Arizona_Phoenix'
PROXY_PASSWORD = 'F6G7H8I9J0'

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}
```
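If only some spiders should use the proxy, Scrapy's `custom_settings` lets a single spider enable the middleware and supply its own credentials instead of the project-wide values. Here is a minimal sketch; the spider name and start URL are placeholders:

```python
import scrapy


class ProxiedSpider(scrapy.Spider):
    name = 'proxied_example'  # hypothetical spider name
    start_urls = ['http://example.com']

    # Per-spider settings override settings.py for this spider only
    custom_settings = {
        'PROXY_URL': 'proxy-jet.io:1010',
        'PROXY_USER': 'A1B2C3D4E5-resi_region-US_Arizona_Phoenix',
        'PROXY_PASSWORD': 'F6G7H8I9J0',
        'DOWNLOADER_MIDDLEWARES': {
            'myproject.middlewares.ProxyMiddleware': 350,
            'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
        },
    }

    def parse(self, response):
        self.log(f'Fetched {response.url} through the proxy')
```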
By following these steps, you can integrate ProxyJet proxies with Scrapy to enhance your web scraping capabilities. This setup ensures that your requests are routed securely through ProxyJet’s high-quality proxies, making your data extraction tasks more reliable and less prone to blocking.