07 - Scraping & Automation Tutorial¶

This tutorial covers the complete workflow for scraping Amazon best sellers data and automating the content creation pipeline using trend-playwright_2.py (scraper) and tester.py (automation tool).

📋 Overview¶

The scraping and automation system consists of two main components:

trend-playwright_2.py - Web scraper that collects Amazon best sellers data and downloads product images
tester.py - Automation tool that processes scraped data through the API pipeline

🎯 Workflow Overview¶

Amazon Best Sellers → Scraper → JSON Data → Automation Tool → API Pipeline → YouTube Upload
         ↓               ↓          ↓              ↓               ↓               ↓
     Live Data       Playwright amazon_best_  Interactive    /batch →         YouTube
                     Browser    sellers.json  Selection      /generate-video  Channel

🔧 Part 1: Web Scraper (`trend-playwright_2.py`)¶

What It Does¶

The scraper performs three main functions:

Scrapes Amazon Best Sellers - Extracts product information from Amazon's best sellers page
Downloads Product Media - Automatically downloads product images/videos
Saves Structured Data - Stores everything in JSON format for processing

Prerequisites¶

System Requirements¶

# Install Playwright browsers
pip install playwright
playwright install chromium

# Required packages
pip install requests beautifulsoup4 lxml

User Agent File¶

Create agent.txt with user agent strings:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
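
A minimal sketch of how agent.txt could be consumed, assuming one user agent string per line. The helper names `load_user_agents` and `random_user_agent` are illustrative and are not taken from trend-playwright_2.py.

```python
import random
from pathlib import Path


def load_user_agents(path="agent.txt"):
    """Return the non-empty lines of the user agent file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def random_user_agent(agents, rng=random):
    """Pick one user agent at random; fail loudly if the list is empty."""
    if not agents:
        raise ValueError("agent.txt contains no user agent strings")
    return rng.choice(agents)
```

The selected string could then be passed to Playwright, for example through `browser.new_context(user_agent=...)`.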

Configuration¶

Key Settings¶

# Amazon referral ID (for affiliate links)
referral_id = "AGSM1323"

# Browser settings
headless=True  # Run without GUI

# Image download settings
max_images_per_product = 5  # Download up to 5 images per product

# Rate limiting
sleep(1)  # 1 second delay between requests

How It Works¶

1. Amazon Best Sellers Scraping¶

from playwright.sync_api import sync_playwright

def scrape_amazon_best_sellers():
    url = "https://www.amazon.com/gp/bestsellers"

    with sync_playwright() as p:
        # Launch a headless Chromium instance and open the best sellers page
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)

        # Wait for the product cards to load
        page.wait_for_selector(".p13n-sc-uncoverable-faceout")

        # Collect one element per product card for extraction
        items = page.query_selector_all(".p13n-sc-uncoverable-faceout")

2. Data Extraction¶

For each product, the scraper extracts:

Rank - Position in best sellers list
Title - Product name
Link - Amazon product URL with affiliate tag
Price - Current price
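
As an illustration of the "Link" field, here is a small sketch of how an affiliate tag could be appended to a product URL. The function name `add_affiliate_tag` is hypothetical; the scraper may build the link differently.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_affiliate_tag(url, referral_id):
    """Return url with its `tag` query parameter set to referral_id."""
    parts = urlsplit(url)
    # Drop any existing tag so the parameter is not duplicated.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "tag"]
    query.append(("tag", referral_id))
    return urlunsplit(parts._replace(query=urlencode(query)))


# One record with the four fields listed above.
product = {
    "rank": 1,
    "title": "Crocs Unisex Adult Classic Clog",
    "link": add_affiliate_tag("https://www.amazon.com/dp/B0014BYHJE", "AGSM1323"),
    "price": "$39.95",
}
```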

3. Image/Video Download¶

from pathlib import Path

def scrape_and_download_images(products):
    # Create the images directory if it does not exist yet
    Path("images").mkdir(parents=True, exist_ok=True)

    for item in products:
        # Sanitize the product title so it is safe to use as a filename
        safe_title = sanitize_filename(item["title"])

        # Download images/videos
        # Priority: Video > High-res images > Large images
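
The priority order in the comment above could be expressed as follows. This is only a sketch: the keys `video`, `hiRes`, and `large` are assumed names for the fields of one media entry, and both helper names are hypothetical.

```python
def pick_media_url(entry):
    """Return the best available URL: video first, then hi-res, then large."""
    for key in ("video", "hiRes", "large"):  # assumed key names
        url = entry.get(key)
        if url:
            return url
    return None


def image_filename(safe_title, index):
    """Build a numbered name such as Crocs_Unisex_Adult_Classic_Clog1.jpg."""
    return f"{safe_title}{index}.jpg"
```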

Output Files¶

`amazon_best_sellers.json`¶

[
    {
        "rank": 1,
        "title": "Crocs Unisex Adult Classic Clog",
        "link": "https://www.amazon.com/dp/B0014BYHJE?tag=AGSM1323",
        "price": "$39.95"
    }
]
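
Before handing this file to the automation tool, it can help to confirm that every record has the four fields shown above. The following is a sketch of such a check; `load_products` and `REQUIRED_KEYS` are illustrative names and are not part of tester.py.

```python
import json

REQUIRED_KEYS = ("rank", "title", "link", "price")


def load_products(path="amazon_best_sellers.json"):
    """Load the scraper output and verify each record has the expected keys."""
    with open(path, encoding="utf-8") as handle:
        products = json.load(handle)
    if not isinstance(products, list):
        raise ValueError("expected a JSON array of products")
    for index, product in enumerate(products):
        if not isinstance(product, dict):
            raise ValueError(f"product {index} is not a JSON object")
        missing = [key for key in REQUIRED_KEYS if key not in product]
        if missing:
            raise ValueError(f"product {index} is missing {missing}")
    return products
```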

Downloaded Images¶

images/
├── Crocs_Unisex_Adult_Classic_Clog1.jpg
├── Crocs_Unisex_Adult_Classic_Clog2.jpg
├── Hanes_Mens_Beefy-t_T-Shirt1.jpg
└── ...

Running the Scraper¶

Basic Usage¶

python trend-playwright_2.py

Expected Output¶

Scraping Amazon Best Sellers with Playwright...
Rank .1 : Crocs_Unisex_Adult_Classic_Clog1.jpg
Rank .1 : Crocs_Unisex_Adult_Classic_Clog2.jpg
Rank .2 : Hanes_Mens_Beefy-t_T-Shirt1.jpg
...
Data berhasil disimpan ke 'amazon_best_sellers.json'

(The last line is Indonesian for "Data successfully saved to 'amazon_best_sellers.json'".)

Troubleshooting Scraper Issues¶

Common Issues¶

Playwright Not Installed:

pip install playwright
playwright install chromium

Amazon Blocking Requests:

# Add more user agents to agent.txt
# Increase delay between requests
sleep(2)  # Increase from 1 to 2 seconds

Images Not Downloading:

# Check internet connection
# Verify images directory permissions
chmod 755 images/

JSON Parsing Errors:

# The scraper handles multiple JSON formats:
# 1. jQuery.parseJSON format
# 2. var data = {} format
# 3. Fallback mechanisms for malformed JSON
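
The fallback idea can be sketched roughly as below. The two regular expressions are assumptions made for illustration; the patterns actually used by trend-playwright_2.py are not shown in this tutorial, and real page data may not be strict JSON.

```python
import json
import re

# Hypothetical patterns for the two embedded formats named above.
PARSE_JSON_RE = re.compile(r"jQuery\.parseJSON\('(.*?)'\)", re.DOTALL)
VAR_DATA_RE = re.compile(r"var\s+data\s*=\s*(\{.*?\})\s*;", re.DOTALL)


def extract_embedded_json(html):
    """Try each pattern in turn; return the first payload that parses, else None."""
    for pattern in (PARSE_JSON_RE, VAR_DATA_RE):
        match = pattern.search(html)
        if not match:
            continue
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            # Malformed payload: fall through to the next pattern.
            continue
    return None
```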

🤖 Part 2: Automation Tool (`tester.py`)¶

What It Does¶

The automation tool provides an interactive interface to:

Select Products - Choose from scraped Amazon data
Generate Content - Create summaries and audio using the API
Create Videos - Generate video content from images/audio
Upload to YouTube - Automatically upload videos to YouTube

Prerequisites¶

API Server Running¶

# Start the API server
python main.py

# Or using Docker
docker-compose up mkdocs

Required Files¶

amazon_best_sellers.json (from scraper)
Product images in images/ directory
API server running on configured URL

Configuration¶

API Endpoint Configuration¶

# Production server
URL = 'http://35.239.71.238:8000'

# Local development
# URL = 'http://localhost:8000'

Request Headers¶

HEADER = {
    'accept': 'application/json',
    'Content-Type': 'application/json',
}
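
Rather than editing the URL in the source when switching between production and local development, the base URL could be read from the environment. This is a sketch only; the variable name `SCRAPER_API_URL` and both helpers are hypothetical.

```python
import os

DEFAULT_URL = "http://localhost:8000"


def api_url(env=os.environ):
    """Read the base URL from SCRAPER_API_URL (assumed name), without a trailing slash."""
    return env.get("SCRAPER_API_URL", DEFAULT_URL).rstrip("/")


def endpoint(path, env=os.environ):
    """Join the base URL and an endpoint path such as '/batch'."""
    return f"{api_url(env)}/{path.lstrip('/')}"
```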

How It Works¶

1. Product Selection¶

def list_products():
    print('List Of Products :')
    for i in range(len(DATA)):
        print(f" {i}. {DATA[i].get('title')[0:30]}...(see more)", end='\n')
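
The number typed at the prompt has to be checked before it is used as an index into DATA. A minimal sketch of that validation follows; `parse_choice` is an illustrative helper, not a function from tester.py.

```python
def parse_choice(raw, count):
    """Convert user input to a valid index in range(count), or return None."""
    try:
        choice = int(raw.strip())
    except ValueError:
        return None
    return choice if 0 <= choice < count else None
```

A caller could repeat the prompt until `parse_choice(input(' choose number > '), len(DATA))` returns something other than None.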

2. Batch Processing (`/batch` endpoint)¶

def batch(choice):
    # Generate summary + audio for selected product
    json_data = {
        'products': [{
            'rank': DATA[choice].get('rank'),
            'title': DATA[choice].get('title'),
            'link': DATA[choice].get('link'),
            'price': DATA[choice].get('price'),
        }],
        'bucket_name': 'merchant-ary',
        'folder_path': 'sound-output',
        'expiration': 3600,
    }
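
The excerpt above stops once the payload is built. One way the request itself could be sent is sketched below. It uses only the standard library so that it runs without extra packages (tester.py may well use requests instead), and it assumes the endpoint returns JSON. `build_batch_request` and `send` are hypothetical names.

```python
import json
import urllib.request

HEADER = {"accept": "application/json", "Content-Type": "application/json"}


def build_batch_request(base_url, product):
    """Build a POST request for /batch from one scraped product."""
    payload = {
        "products": [{key: product.get(key) for key in ("rank", "title", "link", "price")}],
        "bucket_name": "merchant-ary",
        "folder_path": "sound-output",
        "expiration": 3600,
    }
    return urllib.request.Request(
        f"{base_url}/batch",
        data=json.dumps(payload).encode("utf-8"),
        headers=HEADER,
        method="POST",
    )


def send(request, timeout=120):
    """Send the request and decode the response body as JSON (assumed format)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```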

3. Video Generation (`/generate-video` endpoint)¶

def generate_video(data, format):
    # Create video from images and audio
    json_data = {
        'title': data[0]['title'],
        'price': data[0]['price'],
        'summary': data[0]['summary'],
        'link': data[0]['link'],
        'rank': data[0]['rank'],
        'video_format': format,  # 'regular' or 'shorts'
        'upload_to_youtube': False,
        'youtube_public': False,
        'expiration': 3600,
    }
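
Because `video_format` accepts only 'regular' or 'shorts', it may be worth rejecting anything else before calling the endpoint. The sketch below shows one way to do that; `build_video_payload` and `VALID_FORMATS` are illustrative names.

```python
VALID_FORMATS = ("regular", "shorts")


def build_video_payload(item, video_format, expiration=3600):
    """Build the /generate-video payload, rejecting unknown formats early."""
    if video_format not in VALID_FORMATS:
        raise ValueError(f"video_format must be one of {VALID_FORMATS}")
    payload = {key: item[key] for key in ("title", "price", "summary", "link", "rank")}
    payload.update(
        video_format=video_format,
        upload_to_youtube=False,
        youtube_public=False,
        expiration=expiration,
    )
    return payload
```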

4. YouTube Upload (`/upload-youtube` endpoint)¶

def upload_youtube(data, format):
    # Upload video to YouTube
    # `title` and `visible` hold the answers to the title and visibility prompts
    json_data = {
        'video_path': f'/home/supradmin/apiv2/{data["local_video_path"]}',
        'product_data': {
            'rank': data['product_data']['rank'],
            'title': title,
            'link': data['product_data']['link'],
            'price': data['product_data']['price']
        },
        # True only when the user answered "public"
        'youtube_public': visible == "public"
    }

Running the Automation Tool¶

Interactive Mode¶

python tester.py

Expected Interaction¶

List Of Products :
 0. Crocs Unisex Adult Classic...(see more)
 1. Hanes Men's Beefy-t T-Shi...(see more)
 2. Hanes Men's Boxer Briefs,...(see more)

 choose number > 0

 [*] alright. you choose number 0.
 [+] here i give u the details abt the product :
 {'rank': 1, 'title': 'Crocs Unisex Adult Classic Clog', 'link': '...', 'price': '$39.95'}
 [*] ok. lets process this product at /batch first to get the audio + summary
 [*] processing at /batch
 [+] successfully generated audio + summary !
 [*] time to generate the video at /generate-video
 [*] set the video's format (regular/shorts) > shorts
 [*] processing at /generate-video
 [+] video generated successfully !
 [*] okay. lets upload this video to youtube
 [*] set the video's title > Amazing Crocs Deal!
 [*] set the visibility (public / private) ? > public
 [*] video uploaded successfully ! here's the url : https://www.youtube.com/shorts/VIDEO_ID

Output Files Generated¶

Audio Files¶

audio/
├── Crocs_Unisex_Adult_Classic_Clog.mp3
└── ...

Video Files¶

videos_output/
├── Crocs_Unisex_Adult_Classic_Clog.mp4
└── ...

Troubleshooting Automation Issues¶

Common Issues¶

API Server Not Running:

# Check if server is running
curl http://localhost:8000/health

# Start server if needed
python main.py

Missing Product Images:

# Ensure images directory exists
ls images/

# Run scraper first if images missing
python trend-playwright_2.py

YouTube Upload Fails:

# Check YouTube credentials
ls sa/youtube_token.pickle

# Re-authenticate if needed
curl http://localhost:8000/youtube-auth-url

Video Generation Fails:

# Check FFmpeg installation
ffmpeg -version

# Verify audio file exists
ls audio/Crocs_Unisex_Adult_Classic_Clog.mp3

🔄 Complete Workflow¶

Step-by-Step Process¶

1. Scrape Amazon Data¶

python trend-playwright_2.py

Result: amazon_best_sellers.json + product images

2. Start API Server¶

python main.py

Result: API server running on port 8000

3. Run Automation Tool¶

python tester.py

Result: Interactive product selection and processing

4. Select Product¶

Choose product number from list
Tool displays product details

5. Generate Content¶

Batch Processing: Creates summary + audio
Video Generation: Combines images + audio into video
YouTube Upload: Uploads video to YouTube

File Dependencies¶

amazon_best_sellers.json  →  tester.py
images/*.jpg             →  /generate-video endpoint
audio/*.mp3             →  /generate-video endpoint
videos_output/*.mp4     →  /upload-youtube endpoint
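
These dependencies can be checked before starting a run. The following sketch covers only the two inputs the scraper is expected to produce; `missing_inputs` is a hypothetical helper.

```python
from pathlib import Path


def missing_inputs(base_dir="."):
    """Return a list of problems with the scraper outputs under base_dir."""
    base = Path(base_dir)
    problems = []
    if not (base / "amazon_best_sellers.json").is_file():
        problems.append("amazon_best_sellers.json not found")
    images = base / "images"
    if not images.is_dir() or not any(images.glob("*.jpg")):
        problems.append("no .jpg files in images/")
    return problems
```

An empty list means tester.py has the files it needs from the scraper.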

Data Flow¶

graph TD
    A[Amazon Best Sellers] --> B[trend-playwright_2.py]
    B --> C[amazon_best_sellers.json]
    B --> D[images/*.jpg]
    C --> E[tester.py]
    E --> F["/batch endpoint"]
    F --> G[summary + audio]
    G --> H["/generate-video endpoint"]
    D --> H
    H --> I[video file]
    I --> J["/upload-youtube endpoint"]
    J --> K[YouTube video]

⚙️ Advanced Configuration¶

Customizing the Scraper¶

Change Scraping Target¶

# Modify URL for different Amazon pages
url = "https://www.amazon.com/gp/bestsellers/electronics"

# Or different countries
url = "https://www.amazon.co.uk/gp/bestsellers"

Adjust Image Download Settings¶

# Download more images per product
if saved > 10: break  # Instead of 5

# Download different image sizes
url = data[key][index].get('thumb')  # Smaller images

Modify Filename Sanitization¶

def sanitize_filename(product_title):
    # Custom sanitization logic
    product_title = product_title.split("|")[0].strip()
    safe_title = "".join(c for c in product_title if c.isalnum() or c in [' ', '_', '-']).replace(' ', '_')[:30]  # Shorter limit
    return safe_title
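
To see what the length limit does in practice, here is the same logic with the limit exposed as a parameter. The `max_length` argument is an addition made for illustration.

```python
def sanitize_filename(product_title, max_length=30):
    """Keep letters, digits, spaces, '_' and '-', then truncate."""
    product_title = product_title.split("|")[0].strip()
    kept = "".join(c for c in product_title if c.isalnum() or c in " _-")
    return kept.replace(" ", "_")[:max_length]


print(sanitize_filename("Hanes Men's Beefy-t T-Shirt"))      # Hanes_Mens_Beefy-t_T-Shirt
print(sanitize_filename("Crocs Unisex Adult Classic Clog"))  # Crocs_Unisex_Adult_Classic_Clo
```

With a 30-character limit the second title loses its last letter, and two products whose titles share the same first 30 characters would map to the same filename, so a shorter limit trades readability for a higher risk of collisions.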

Customizing the Automation Tool¶

Change API Endpoints¶

# Use different server
URL = 'http://your-custom-server.com:8000'

Modify Video Settings¶

# Change default format
'video_format': 'regular',  # Instead of asking user

# Always upload to YouTube
'upload_to_youtube': True,

Add Custom Processing¶

# Add custom API calls
def custom_processing(data):
    # Your custom logic here
    pass

📊 Monitoring and Logging¶

Scraper Logs¶

The scraper provides real-time feedback:

Rank .1 : Crocs_Unisex_Adult_Classic_Clog1.jpg
Rank .1 : Crocs_Unisex_Adult_Classic_Clog2.jpg
Data berhasil disimpan ke 'amazon_best_sellers.json'

(The last line is Indonesian for "Data successfully saved to 'amazon_best_sellers.json'".)

Automation Tool Logs¶

Color-coded output with status indicators:

🟢 Green: Successful operations
🟡 Yellow: API responses
🔴 Red: Errors or warnings

Error Handling¶

Both tools include error handling for:

Network timeouts
API failures
Missing files
Invalid data formats

🔒 Security Considerations¶

API Keys¶

Store API keys securely in environment variables
Never commit credentials to version control
Use different keys for development/production

Rate Limiting¶

Built-in delays between requests
Respect Amazon's terms of service
Monitor API usage limits
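
A fixed `sleep(1)` between requests produces a very regular pattern. One common variation, shown here only as a sketch, adds a small random jitter to each delay. `polite_sleep` is a hypothetical helper and is not part of the scraper.

```python
import random
import time


def polite_sleep(base_delay=1.0, jitter=0.5, sleep=time.sleep, rng=random):
    """Sleep for base_delay plus a random extra of up to `jitter` seconds."""
    delay = base_delay + rng.uniform(0, jitter)
    sleep(delay)
    return delay
```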

Data Privacy¶

Handle user data responsibly
Comply with affiliate marketing guidelines
Respect robots.txt and terms of service

🚀 Production Deployment¶

Automated Workflow¶

#!/bin/bash
# Daily scraping script
python trend-playwright_2.py
python tester.py --auto --format=shorts --upload

Scheduling¶

# Open the crontab for editing
crontab -e
# Then add the following line without the leading "#" to run the scraper daily at 09:00:
# 0 9 * * * cd /path/to/project && /usr/bin/python3 trend-playwright_2.py

Monitoring¶

# Check if processes are running
ps aux | grep python

# Monitor disk usage
du -sh images/ audio/ videos_output/

# Check API server status
curl http://localhost:8000/health

📈 Performance Optimization¶

Scraper Optimization¶

Use headless browser mode
Implement request caching
Parallel image downloads
Smart retry mechanisms
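
As one possible reading of "smart retry mechanisms", the sketch below retries a failing operation with exponential backoff. The `retry` helper and its parameters are assumptions and do not describe existing scraper code.

```python
import time


def retry(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # Re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A download call could be wrapped as `retry(lambda: download(url))`, where `download` stands for whatever function fetches a single image.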

Automation Optimization¶

Batch API requests
Concurrent processing
Memory-efficient file handling
Progress indicators

System Resources¶

Monitor CPU/memory usage
Clean up temporary files
Optimize storage usage
Implement file rotation

🆘 Troubleshooting Guide¶

Scraper Issues¶

"No products found":

Check internet connection
Verify Amazon page structure hasn't changed
Update selectors in the code

"Images not downloading":

Check agent.txt file exists
Verify user agents are current
Check disk space and permissions

"JSON parsing errors":

Amazon may have changed page structure
Update regex patterns
Add fallback parsing methods

Automation Issues¶

"API server not responding":

Check if server is running: curl http://localhost:8000/health
Verify correct URL in tester.py
Check firewall settings

"Video generation fails":

Ensure FFmpeg is installed
Check audio/image files exist
Verify file permissions

"YouTube upload fails":

Check YouTube credentials are valid
Verify OAuth token hasn't expired
Check video file exists and is valid

📝 Best Practices¶

Code Organization¶

Keep scraper and automation separate
Use configuration files for settings
Implement proper error handling
Add logging for debugging

Data Management¶

Regular backup of scraped data
Clean up old files periodically
Validate data integrity
Handle duplicates gracefully

API Usage¶

Respect rate limits
Implement retry mechanisms
Monitor API usage and costs
Handle API changes gracefully

Security¶

Use environment variables for secrets
Implement proper authentication
Regular security updates
Monitor for vulnerabilities

🎯 Next Steps¶

Run the Scraper: python trend-playwright_2.py
Start API Server: python main.py
Run Automation: python tester.py
Monitor Results: Check generated content
Customize: Modify settings for your needs

This tutorial provides everything needed to scrape Amazon data and automate the content creation pipeline from start to finish.
