07 - Scraping & Automation Tutorial¶
This tutorial covers the complete workflow for scraping Amazon best sellers data and automating the content creation pipeline using trend-playwright_2.py (scraper) and tester.py (automation tool).
📋 Overview¶
The scraping and automation system consists of two main components:
- trend-playwright_2.py - Web scraper that collects Amazon best sellers data and downloads product images
- tester.py - Automation tool that processes scraped data through the API pipeline
🎯 Workflow Overview¶
Amazon Best Sellers → Scraper → JSON Data → Automation Tool → API Pipeline → YouTube Upload
- Amazon Best Sellers: live data
- Scraper: Playwright browser
- JSON Data: amazon_best_sellers.json
- Automation Tool: interactive selection
- API Pipeline: /batch → /generate-video
- YouTube Upload: YouTube channel
🔧 Part 1: Web Scraper (trend-playwright_2.py)¶
What It Does¶
The scraper performs three main functions:
- Scrapes Amazon Best Sellers - Extracts product information from Amazon's best sellers page
- Downloads Product Media - Automatically downloads product images/videos
- Saves Structured Data - Stores everything in JSON format for processing
Prerequisites¶
System Requirements¶
# Install Playwright browsers
pip install playwright
playwright install chromium
# Required packages
pip install requests beautifulsoup4 lxml
User Agent File¶
Create agent.txt with user agent strings:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
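A minimal sketch of how a file like agent.txt could be read and rotated, assuming one user agent string per line. The helper names `load_user_agents` and `pick_user_agent` are hypothetical and may not match what trend-playwright_2.py actually does.

```python
import random
from pathlib import Path


def load_user_agents(path="agent.txt"):
    """Return the non-empty lines of the user agent file, stripped of whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def pick_user_agent(agents, rng=random):
    """Pick one user agent at random; fail loudly if the list is empty."""
    if not agents:
        raise ValueError("agent.txt contains no user agent strings")
    return rng.choice(agents)
```

The chosen string could then be passed to Playwright, for example through the `user_agent` argument of `browser.new_context()`.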
Configuration¶
Key Settings¶
# Amazon referral ID (for affiliate links)
referral_id = "AGSM1323"
# Browser settings
headless=True # Run without GUI
# Image download settings
max_images_per_product = 5 # Download up to 5 images per product
# Rate limiting
sleep(1) # 1 second delay between requests
How It Works¶
1. Amazon Best Sellers Scraping¶
from playwright.sync_api import sync_playwright

def scrape_amazon_best_sellers():
    url = "https://www.amazon.com/gp/bestsellers"
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Wait for the product cards to load
        page.wait_for_selector(".p13n-sc-uncoverable-faceout")
        # Collect one element per product card
        items = page.query_selector_all(".p13n-sc-uncoverable-faceout")
2. Data Extraction¶
For each product, the scraper extracts:
- Rank - Position in best sellers list
- Title - Product name
- Link - Amazon product URL with affiliate tag
- Price - Current price
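As an illustration of the "Link" field, the sketch below appends an affiliate tag to a product URL and assembles one record in the shape of the JSON output. The function names `add_affiliate_tag` and `build_record` are hypothetical; the scraper may build its links differently.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_affiliate_tag(url, referral_id):
    """Return url with its `tag` query parameter set to referral_id."""
    parts = urlsplit(url)
    # Drop any existing tag so the link never carries two of them
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "tag"]
    query.append(("tag", referral_id))
    return urlunsplit(parts._replace(query=urlencode(query)))


def build_record(rank, title, url, price, referral_id):
    """Assemble one product record with the four fields listed above."""
    return {
        "rank": rank,
        "title": title.strip(),
        "link": add_affiliate_tag(url, referral_id),
        "price": price,
    }
```

For example, `add_affiliate_tag("https://www.amazon.com/dp/B0014BYHJE", "AGSM1323")` returns `https://www.amazon.com/dp/B0014BYHJE?tag=AGSM1323`.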
3. Image/Video Download¶
from pathlib import Path

def scrape_and_download_images(products):
    # Create the images directory if it does not exist yet
    Path("images").mkdir(parents=True, exist_ok=True)
    for item in products:
        # Sanitize the product title so it is safe to use as a filename
        safe_title = sanitize_filename(item["title"])
        # Download images/videos
        # Priority: Video > High-res images > Large images
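The download step itself is elided above. The following is a rough sketch of what a capped, rate-limited download loop might look like, not the scraper's actual code. `save_images` and its `fetch` parameter (any callable that takes a URL and returns bytes) are assumptions made so the loop can be shown without a specific HTTP client.

```python
import time
from pathlib import Path


def save_images(urls, safe_title, fetch, out_dir="images", max_images=5, delay=1.0):
    """Save up to max_images files named <safe_title><n>.jpg and return their paths."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        if len(saved) >= max_images:
            break
        try:
            content = fetch(url)
        except OSError:
            continue  # skip this image and try the next URL
        path = target / f"{safe_title}{len(saved) + 1}.jpg"
        path.write_bytes(content)
        saved.append(path)
        time.sleep(delay)  # pause between requests
    return saved
```

With `requests` installed, `fetch` could be as simple as `lambda url: requests.get(url, timeout=30).content`.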
Output Files¶
amazon_best_sellers.json¶
[
{
"rank": 1,
"title": "Crocs Unisex Adult Classic Clog",
"link": "https://www.amazon.com/dp/B0014BYHJE?tag=AGSM1323",
"price": "$39.95"
}
]
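Before handing this file to the automation tool, it can help to confirm that every record carries the four fields shown above. The sketch below is one way to do that; `load_products` and `REQUIRED_KEYS` are hypothetical names, not part of either script.

```python
import json

REQUIRED_KEYS = ("rank", "title", "link", "price")


def load_products(path="amazon_best_sellers.json"):
    """Load the scraper output and check that each record has the expected fields."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of products")
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"product {index} is not a JSON object")
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            raise ValueError(f"product {index} is missing {missing}")
    return data
```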
Downloaded Images¶
images/
├── Crocs_Unisex_Adult_Classic_Clog1.jpg
├── Crocs_Unisex_Adult_Classic_Clog2.jpg
├── Hanes_Mens_Beefy-t_T-Shirt1.jpg
└── ...
Running the Scraper¶
Basic Usage¶
python trend-playwright_2.py
Expected Output¶
Scraping Amazon Best Sellers with Playwright...
Rank .1 : Crocs_Unisex_Adult_Classic_Clog1.jpg
Rank .1 : Crocs_Unisex_Adult_Classic_Clog2.jpg
Rank .2 : Hanes_Mens_Beefy-t_T-Shirt1.jpg
...
Data berhasil disimpan ke 'amazon_best_sellers.json'
(The final line is Indonesian for "Data successfully saved to 'amazon_best_sellers.json'".)
Troubleshooting Scraper Issues¶
Common Issues¶
Playwright Not Installed:
pip install playwright
playwright install chromium
Amazon Blocking Requests:
# Add more user agents to agent.txt
# Increase delay between requests
sleep(2) # Increase from 1 to 2 seconds
Images Not Downloading:
JSON Parsing Errors:
# The scraper handles multiple JSON formats:
# 1. jQuery.parseJSON format
# 2. var data = {} format
# 3. Fallback mechanisms for malformed JSON
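The fallback idea in the comments above can be sketched as a list of patterns tried in order. The regular expressions here are illustrative guesses at the two formats named, not the scraper's actual patterns, and `extract_embedded_json` is a hypothetical helper.

```python
import json
import re

# Tried in order; each pattern captures the JSON text in group 1
PATTERNS = [
    re.compile(r"jQuery\.parseJSON\('(.*?)'\)", re.DOTALL),
    re.compile(r"var\s+data\s*=\s*(\{.*?\})\s*;", re.DOTALL),
]


def extract_embedded_json(html):
    """Return the first embedded JSON object that parses, or None if none does."""
    for pattern in PATTERNS:
        match = pattern.search(html)
        if not match:
            continue
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # malformed payload: fall through to the next pattern
    return None
```

Returning `None` instead of raising lets the caller skip a product whose page could not be parsed and carry on with the rest.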
🤖 Part 2: Automation Tool (tester.py)¶
What It Does¶
The automation tool provides an interactive interface to:
- Select Products - Choose from scraped Amazon data
- Generate Content - Create summaries and audio using the API
- Create Videos - Generate video content from images/audio
- Upload to YouTube - Automatically upload videos to YouTube
Prerequisites¶
API Server Running¶
Required Files¶
- amazon_best_sellers.json (from scraper)
- Product images in images/ directory
- API server running on configured URL
Configuration¶
API Endpoint Configuration¶
# Production server
URL = 'http://35.239.71.238:8000'
# Local development
# URL = 'http://localhost:8000'
Request Headers¶
How It Works¶
1. Product Selection¶
def list_products():
    print('List Of Products :')
    for i, product in enumerate(DATA):
        # Show only the first 30 characters of each title
        print(f" {i}. {product.get('title')[0:30]}...(see more)")
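The number typed at the `choose number >` prompt has to be a valid index into the product list. A small sketch of that check follows; `parse_choice` is a hypothetical helper, and tester.py may validate input differently or not at all.

```python
def parse_choice(raw, count):
    """Return the selected index, or None if raw is not a valid product number."""
    try:
        index = int(raw.strip())
    except ValueError:
        return None
    return index if 0 <= index < count else None
```

A caller could loop on `input('choose number > ')` until `parse_choice` returns something other than `None`.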
2. Batch Processing (/batch endpoint)¶
def batch(choice):
    # Generate summary + audio for selected product
    json_data = {
        'products': [{
            'rank': DATA[choice].get('rank'),
            'title': DATA[choice].get('title'),
            'link': DATA[choice].get('link'),
            'price': DATA[choice].get('price'),
        }],
        'bucket_name': 'merchant-ary',
        'folder_path': 'sound-output',
        'expiration': 3600,
    }
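The excerpt above stops before the request is sent. As a sketch only, the payload could be built and posted as below. This version uses the standard library's `urllib` so it runs without extra packages; the real tool may well use a different HTTP client, and `build_batch_payload` and `post_json` are hypothetical names.

```python
import json
from urllib import request


def build_batch_payload(product, bucket_name="merchant-ary",
                        folder_path="sound-output", expiration=3600):
    """Build the /batch request body for a single scraped product."""
    fields = ("rank", "title", "link", "price")
    return {
        "products": [{key: product.get(key) for key in fields}],
        "bucket_name": bucket_name,
        "folder_path": folder_path,
        "expiration": expiration,
    }


def post_json(url, payload, timeout=120):
    """POST payload as JSON and return the decoded JSON response."""
    body = json.dumps(payload).encode("utf-8")
    req = request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call would then look like `post_json(URL + '/batch', build_batch_payload(DATA[choice]))`, assuming the endpoint answers with JSON.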
3. Video Generation (/generate-video endpoint)¶
def generate_video(data, format):
    # Create video from images and audio
    json_data = {
        'title': data[0]['title'],
        'price': data[0]['price'],
        'summary': data[0]['summary'],
        'link': data[0]['link'],
        'rank': data[0]['rank'],
        'video_format': format,  # 'regular' or 'shorts'
        'upload_to_youtube': False,
        'youtube_public': False,
        'expiration': 3600,
    }
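Since `video_format` only accepts two values, the answer to the format prompt can be checked before the request is built. This is a sketch under that assumption; `normalize_video_format` is a hypothetical helper.

```python
VIDEO_FORMATS = ("regular", "shorts")


def normalize_video_format(raw):
    """Return raw lowercased and stripped, or raise if it is not a known format."""
    value = raw.strip().lower()
    if value not in VIDEO_FORMATS:
        raise ValueError(f"video format must be one of {VIDEO_FORMATS}, got {raw!r}")
    return value
```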
4. YouTube Upload (/upload-youtube endpoint)¶
def upload_youtube(data, format):
    # Upload video to YouTube
    # `title` and `visible` hold the answers to the interactive title and visibility prompts
    json_data = {
        'video_path': f'/home/supradmin/apiv2/{data["local_video_path"]}',
        'product_data': {
            'rank': data['product_data']['rank'],
            'title': title,
            'link': data['product_data']['link'],
            'price': data['product_data']['price']
        },
        'youtube_public': visible == "public"
    }
Running the Automation Tool¶
Interactive Mode¶
python tester.py
Expected Interaction¶
List Of Products :
0. Crocs Unisex Adult Classic...(see more)
1. Hanes Men's Beefy-t T-Shi...(see more)
2. Hanes Men's Boxer Briefs,...(see more)
choose number > 0
[*] alright. you choose number 0.
[+] here i give u the details abt the product :
{'rank': 1, 'title': 'Crocs Unisex Adult Classic Clog', 'link': '...', 'price': '$39.95'}
[*] ok. lets process this product at /batch first to get the audio + summary
[*] processing at /batch
[+] successfully generated audio + summary !
[*] time to generate the video at /generate-video
[*] set the video's format (regular/shorts) > shorts
[*] processing at /generate-video
[+] video generated successfully !
[*] okay. lets upload this video to youtube
[*] set the video's title > Amazing Crocs Deal!
[*] set the visibility (public / private) ? > public
[*] video uploaded successfully ! here's the url : https://www.youtube.com/shorts/VIDEO_ID
Output Files Generated¶
Audio Files¶
Video Files¶
Troubleshooting Automation Issues¶
Common Issues¶
API Server Not Running:
# Check if server is running
curl http://localhost:8000/health
# Start server if needed
python main.py
Missing Product Images:
# Ensure images directory exists
ls images/
# Run scraper first if images missing
python trend-playwright_2.py
YouTube Upload Fails:
# Check YouTube credentials
ls sa/youtube_token.pickle
# Re-authenticate if needed
curl http://localhost:8000/youtube-auth-url
Video Generation Fails:
# Check FFmpeg installation
ffmpeg -version
# Verify audio file exists
ls audio/Crocs_Unisex_Adult_Classic_Clog.mp3
🔄 Complete Workflow¶
Step-by-Step Process¶
1. Scrape Amazon Data¶
Result: amazon_best_sellers.json + product images
2. Start API Server¶
Result: API server running on port 8000
3. Run Automation Tool¶
Result: Interactive product selection and processing
4. Select Product¶
- Choose product number from list
- Tool displays product details
5. Generate Content¶
- Batch Processing: Creates summary + audio
- Video Generation: Combines images + audio into video
- YouTube Upload: Uploads video to YouTube
File Dependencies¶
amazon_best_sellers.json → tester.py
images/*.jpg → /generate-video endpoint
audio/*.mp3 → /generate-video endpoint
videos_output/*.mp4 → /upload-youtube endpoint
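The first two dependencies can be checked before tester.py is started. The sketch below reports which scraper outputs are missing; `missing_inputs` is a hypothetical helper, and it assumes the file and directory names listed above.

```python
from pathlib import Path


def missing_inputs(base_dir="."):
    """Return the scraper outputs that are absent from base_dir."""
    base = Path(base_dir)
    problems = []
    if not (base / "amazon_best_sellers.json").is_file():
        problems.append("amazon_best_sellers.json")
    images = base / "images"
    if not (images.is_dir() and any(images.glob("*.jpg"))):
        problems.append("images/*.jpg")
    return problems
```

An empty list means the automation tool has what it needs; otherwise the scraper should be run first.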
Data Flow¶
graph TD
    A[Amazon Best Sellers] --> B[trend-playwright_2.py]
    B --> C[amazon_best_sellers.json]
    B --> D[images/*.jpg]
    C --> E[tester.py]
    E --> F["/batch endpoint"]
    F --> G[summary + audio]
    G --> H["/generate-video endpoint"]
    D --> H
    H --> I[video file]
    I --> J["/upload-youtube endpoint"]
    J --> K[YouTube video]
⚙️ Advanced Configuration¶
Customizing the Scraper¶
Change Scraping Target¶
# Modify URL for different Amazon pages
url = "https://www.amazon.com/gp/bestsellers/electronics"
# Or different countries
url = "https://www.amazon.co.uk/gp/bestsellers"
Adjust Image Download Settings¶
# Download more images per product
if saved > 10: break # Instead of 5
# Download different image sizes
url = data[key][index].get('thumb') # Smaller images
Modify Filename Sanitization¶
def sanitize_filename(product_title):
    # Custom sanitization logic: keep only the text before the first "|"
    product_title = product_title.split("|")[0].strip()
    # Keep letters, digits, spaces, "_" and "-"; turn spaces into "_"; cap at 30 characters (shorter limit)
    safe_title = "".join(c for c in product_title if c.isalnum() or c in [' ', '_', '-']).replace(' ', '_')[:30]
    return safe_title
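To see the effect of the length limit, here is a self-contained variant of the same logic with the limit exposed as a parameter. The `max_length` argument is an addition for illustration and is not part of the original function.

```python
def sanitize_filename(product_title, max_length=30):
    """Turn a product title into a filesystem-safe name of at most max_length characters."""
    # Keep only the text before the first "|"
    product_title = product_title.split("|")[0].strip()
    # Keep letters, digits, spaces, underscores and hyphens
    kept = "".join(c for c in product_title if c.isalnum() or c in " _-")
    return kept.replace(" ", "_")[:max_length]


print(sanitize_filename("Hanes Men's Beefy-t T-Shirt | Pack of 4"))
# prints: Hanes_Mens_Beefy-t_T-Shirt
print(sanitize_filename("Crocs Unisex Adult Classic Clog"))
# prints: Crocs_Unisex_Adult_Classic_Clo
```

The second title is 31 characters after sanitizing, so a 30-character limit drops the final letter. Changing the limit after a scrape changes the filenames, so images downloaded under the old limit would no longer be found under the new names.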
Customizing the Automation Tool¶
Change API Endpoints¶
Modify Video Settings¶
# Change default format
'video_format': 'regular', # Instead of asking user
# Always upload to YouTube
'upload_to_youtube': True,
Add Custom Processing¶
📊 Monitoring and Logging¶
Scraper Logs¶
The scraper provides real-time feedback:
Rank .1 : Crocs_Unisex_Adult_Classic_Clog1.jpg
Rank .1 : Crocs_Unisex_Adult_Classic_Clog2.jpg
Data berhasil disimpan ke 'amazon_best_sellers.json'
(The last line is Indonesian for "Data successfully saved to 'amazon_best_sellers.json'".)
Automation Tool Logs¶
Color-coded output with status indicators:
- 🟢 Green: Successful operations
- 🟡 Yellow: API responses
- 🔴 Red: Errors or warnings
Error Handling¶
Both tools include comprehensive error handling:
- Network timeouts
- API failures
- Missing files
- Invalid data formats
🔒 Security Considerations¶
API Keys¶
- Store API keys securely in environment variables
- Never commit credentials to version control
- Use different keys for development/production
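One way to follow these points is to read settings from the environment with safe defaults. The variable names `API_URL` and `AMAZON_REFERRAL_ID` below are assumptions chosen for illustration; neither script is known to read them.

```python
import os


def load_settings(env=os.environ):
    """Read configuration from environment variables, with development defaults."""
    return {
        # Falls back to the local development server
        "api_url": env.get("API_URL", "http://localhost:8000"),
        # Empty string means no affiliate tag is configured
        "referral_id": env.get("AMAZON_REFERRAL_ID", ""),
    }
```

Development and production can then differ only in their environment, with no credentials or server addresses committed to version control.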
Rate Limiting¶
- Built-in delays between requests
- Respect Amazon's terms of service
- Monitor API usage limits
Data Privacy¶
- Handle user data responsibly
- Comply with affiliate marketing guidelines
- Respect robots.txt and terms of service
🚀 Production Deployment¶
Automated Workflow¶
#!/bin/bash
# Daily scraping script
python trend-playwright_2.py
python tester.py --auto --format=shorts --upload
Scheduling¶
# Add to crontab for daily execution
crontab -e
# 0 9 * * * cd /path/to/project && /usr/bin/python3 trend-playwright_2.py
Monitoring¶
# Check if processes are running
ps aux | grep python
# Monitor disk usage
du -sh images/ audio/ videos_output/
# Check API server status
curl http://localhost:8000/health
📈 Performance Optimization¶
Scraper Optimization¶
- Use headless browser mode
- Implement request caching
- Parallel image downloads
- Smart retry mechanisms
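As a sketch of the parallel download idea, a thread pool can fetch several image URLs at once. `fetch_all` and its `fetch` parameter (a callable that takes a URL and returns bytes) are hypothetical; the current scraper is described as downloading with a delay between requests, so this would be a change in behavior rather than a description of it.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=4):
    """Fetch every URL concurrently and return {url: bytes, or None on failure}."""
    urls = list(urls)

    def safe_fetch(url):
        try:
            return fetch(url)
        except OSError:
            return None  # one failed download should not abort the others

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs
        results = list(pool.map(safe_fetch, urls))
    return dict(zip(urls, results))
```

Keeping `max_workers` small is a deliberate choice here: more concurrency means more requests per second, which works against the rate limiting described earlier.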
Automation Optimization¶
- Batch API requests
- Concurrent processing
- Memory-efficient file handling
- Progress indicators
System Resources¶
- Monitor CPU/memory usage
- Clean up temporary files
- Optimize storage usage
- Implement file rotation
🆘 Troubleshooting Guide¶
Scraper Issues¶
"No products found"
- Check internet connection
- Verify Amazon page structure hasn't changed
- Update selectors in the code
"Images not downloading"
- Check agent.txt file exists
- Verify user agents are current
- Check disk space and permissions
"JSON parsing errors"
- Amazon may have changed page structure
- Update regex patterns
- Add fallback parsing methods
Automation Issues¶
"API server not responding"
- Check if server is running: curl http://localhost:8000/health
- Verify correct URL in tester.py
- Check firewall settings
"Video generation fails"
- Ensure FFmpeg is installed
- Check audio/image files exist
- Verify file permissions
"YouTube upload fails"
- Check YouTube credentials are valid
- Verify OAuth token hasn't expired
- Check video file exists and is valid
📝 Best Practices¶
Code Organization¶
- Keep scraper and automation separate
- Use configuration files for settings
- Implement proper error handling
- Add logging for debugging
Data Management¶
- Regular backup of scraped data
- Clean up old files periodically
- Validate data integrity
- Handle duplicates gracefully
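For the last point, repeated scrapes can be merged without duplicates by keying each record on its product link. This is a minimal sketch; `dedupe_products` is a hypothetical helper, and using the link as the identity of a product is an assumption.

```python
def dedupe_products(products):
    """Return products with later duplicates removed, preserving the original order."""
    seen = set()
    unique = []
    for product in products:
        # Fall back to the title when a record has no link
        key = product.get("link") or product.get("title")
        if key in seen:
            continue
        seen.add(key)
        unique.append(product)
    return unique
```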
API Usage¶
- Respect rate limits
- Implement retry mechanisms
- Monitor API usage and costs
- Handle API changes gracefully
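A retry mechanism can be as small as the sketch below, which retries a call with exponential backoff. `call_with_retry` is a hypothetical helper; it retries on `OSError` by default because connection failures and timeouts raised by the standard library derive from it.

```python
import time


def call_with_retry(func, attempts=3, base_delay=1.0, sleep=time.sleep, retry_on=(OSError,)):
    """Call func(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter exists so the waiting can be replaced in tests; in normal use the default is fine.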
Security¶
- Use environment variables for secrets
- Implement proper authentication
- Regular security updates
- Monitor for vulnerabilities
🎯 Next Steps¶
- Run the Scraper: python trend-playwright_2.py
- Start API Server: python main.py
- Run Automation: python tester.py
- Monitor Results: Check generated content
- Customize: Modify settings for your needs
This tutorial provides everything needed to scrape Amazon data and automate the content creation pipeline from start to finish.