07 - Scraping & Automation Tutorial

This tutorial covers the complete workflow for scraping Amazon best sellers data and automating the content creation pipeline using trend-playwright_2.py (scraper) and tester.py (automation tool).

📋 Overview

The scraping and automation system consists of two main components:

  1. trend-playwright_2.py - Web scraper that collects Amazon best sellers data and downloads product images
  2. tester.py - Automation tool that processes scraped data through the API pipeline

🎯 Workflow Overview

Amazon Best Sellers → Scraper → JSON Data → Automation Tool → API Pipeline → YouTube Upload
     ↓                ↓           ↓              ↓                ↓              ↓
  Live Data      Playwright    amazon_best_   Interactive     /batch → /gen-   YouTube
                 Browser       sellers.json    Selection      erate-video     Channel

🔧 Part 1: Web Scraper (trend-playwright_2.py)

What It Does

The scraper performs three main functions:

  1. Scrapes Amazon Best Sellers - Extracts product information from Amazon's best sellers page
  2. Downloads Product Media - Automatically downloads product images/videos
  3. Saves Structured Data - Stores everything in JSON format for processing

Prerequisites

System Requirements

# Install Playwright browsers
pip install playwright
playwright install chromium

# Required packages
pip install requests beautifulsoup4 lxml

User Agent File

Create agent.txt with user agent strings:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36
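One way the scraper could consume this file is to read it once and pick a random entry per browser session. The following is a minimal sketch, not the code in trend-playwright_2.py; the helper names `load_user_agents` and `pick_user_agent` are made up for illustration.

```python
import random
from pathlib import Path


def load_user_agents(path="agent.txt"):
    """Return the non-empty lines of the user agent file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def pick_user_agent(agents, rng=random):
    """Choose one user agent at random; fail loudly on an empty list."""
    if not agents:
        raise ValueError("agent.txt contains no user agent strings")
    return rng.choice(agents)
```

With Playwright, the chosen string can be passed as `browser.new_context(user_agent=...)` before opening a page.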

Configuration

Key Settings

# Amazon referral ID (for affiliate links)
referral_id = "AGSM1323"

# Browser settings
headless=True  # Run without GUI

# Image download settings
max_images_per_product = 5  # Download up to 5 images per product

# Rate limiting
sleep(1)  # 1 second delay between requests

How It Works

1. Amazon Best Sellers Scraping

from playwright.sync_api import sync_playwright

def scrape_amazon_best_sellers():
    url = "https://www.amazon.com/gp/bestsellers"

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)

        # Wait for products to load
        page.wait_for_selector(".p13n-sc-uncoverable-faceout")

        # Extract product information
        items = page.query_selector_all(".p13n-sc-uncoverable-faceout")

2. Data Extraction

For each product, the scraper extracts:

  • Rank - Position in best sellers list
  • Title - Product name
  • Link - Amazon product URL with affiliate tag
  • Price - Current price
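The link field in the sample output has the form /dp/<ASIN>?tag=<referral_id>. A hedged sketch of how such a link could be built from a raw product href follows; the function name, the regular expression, and the example href are assumptions, not taken from the scraper.

```python
import re

# Amazon product identifiers (ASINs) are 10 uppercase letters or digits.
ASIN_PATTERN = re.compile(r"/dp/([A-Z0-9]{10})")


def build_affiliate_link(href, referral_id, base="https://www.amazon.com"):
    """Reduce a product href to /dp/<ASIN> and append the affiliate tag."""
    match = ASIN_PATTERN.search(href)
    if match is None:
        return None  # not a product link
    return f"{base}/dp/{match.group(1)}?tag={referral_id}"


link = build_affiliate_link(
    "/Crocs-Unisex-Classic-Clog/dp/B0014BYHJE/ref=zg_bs_1", "AGSM1323"
)
print(link)  # https://www.amazon.com/dp/B0014BYHJE?tag=AGSM1323
```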

3. Image/Video Download

from pathlib import Path

def scrape_and_download_images(products):
    # Create the images directory if it does not exist yet
    Path("images").mkdir(parents=True, exist_ok=True)

    for item in products:
        # Build a filesystem-safe name from the product title
        safe_title = sanitize_filename(item["title"])

        # Download images/videos
        # Priority: Video > High-res images > Large images
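The priority rule in the comment above can be expressed as an ordered key lookup. This is only a sketch: the key names 'video', 'hiRes' and 'large' are assumed for illustration and may differ from the fields the scraper actually reads.

```python
def pick_media_url(entry):
    """Return (kind, url) for the best available asset, or None.

    The keys checked here are assumed names for this sketch.
    """
    for key in ("video", "hiRes", "large"):  # highest priority first
        url = entry.get(key)
        if url:
            return key, url
    return None


# A record without video or high-res image falls back to the large image.
choice = pick_media_url({"hiRes": None, "large": "https://example.com/large.jpg"})
```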

Output Files

amazon_best_sellers.json

[
    {
        "rank": 1,
        "title": "Crocs Unisex Adult Classic Clog",
        "link": "https://www.amazon.com/dp/B0014BYHJE?tag=AGSM1323",
        "price": "$39.95"
    }
]
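Before passing this file to the automation tool, it can help to confirm that every record carries the four fields shown above. A minimal validation sketch, assuming the file is a JSON array of objects (`load_products` is a hypothetical helper, not part of tester.py):

```python
import json

REQUIRED_FIELDS = ("rank", "title", "link", "price")


def load_products(path="amazon_best_sellers.json"):
    """Load the scraper output and reject records with missing fields."""
    with open(path, encoding="utf-8") as handle:
        products = json.load(handle)
    if not isinstance(products, list):
        raise ValueError("expected a JSON array of products")
    for index, product in enumerate(products):
        missing = [field for field in REQUIRED_FIELDS if field not in product]
        if missing:
            raise ValueError(f"product {index} is missing {missing}")
    return products
```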

Downloaded Images

images/
├── Crocs_Unisex_Adult_Classic_Clog1.jpg
├── Crocs_Unisex_Adult_Classic_Clog2.jpg
├── Hanes_Mens_Beefy-t_T-Shirt1.jpg
└── ...

Running the Scraper

Basic Usage

python trend-playwright_2.py

Expected Output

Scraping Amazon Best Sellers with Playwright...
Rank .1 : Crocs_Unisex_Adult_Classic_Clog1.jpg
Rank .1 : Crocs_Unisex_Adult_Classic_Clog2.jpg
Rank .2 : Hanes_Mens_Beefy-t_T-Shirt1.jpg
...
Data berhasil disimpan ke 'amazon_best_sellers.json'

The final line is Indonesian for "Data successfully saved to 'amazon_best_sellers.json'".

Troubleshooting Scraper Issues

Common Issues

Playwright Not Installed:

pip install playwright
playwright install chromium

Amazon Blocking Requests:

# Add more user agents to agent.txt
# Increase delay between requests
sleep(2)  # Increase from 1 to 2 seconds

Images Not Downloading:

# Check internet connection
# Verify images directory permissions
chmod 755 images/

JSON Parsing Errors:

# The scraper handles multiple JSON formats:
# 1. jQuery.parseJSON format
# 2. var data = {} format
# 3. Fallback mechanisms for malformed JSON
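To illustrate the fallback idea, the sketch below tries the two formats listed above in order and returns None when neither parses. It assumes the embedded payload is strict JSON, which real product pages do not always guarantee; the function name and patterns are illustrative, not copied from the scraper.

```python
import json
import re

PARSE_JSON_CALL = re.compile(r"jQuery\.parseJSON\('(.*?)'\)", re.DOTALL)
VAR_DATA = re.compile(r"var\s+data\s*=\s*")


def extract_embedded_json(source):
    """Try each known format in turn; return None if nothing parses."""
    match = PARSE_JSON_CALL.search(source)
    if match:
        try:
            # Undo the \' escaping used inside the single-quoted JS string
            return json.loads(match.group(1).replace("\\'", "'"))
        except json.JSONDecodeError:
            pass  # fall through to the next format
    match = VAR_DATA.search(source)
    if match:
        try:
            # raw_decode parses one JSON value and ignores trailing script text
            value, _ = json.JSONDecoder().raw_decode(source, match.end())
            return value
        except json.JSONDecodeError:
            pass
    return None
```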


🤖 Part 2: Automation Tool (tester.py)

What It Does

The automation tool provides an interactive interface to:

  1. Select Products - Choose from scraped Amazon data
  2. Generate Content - Create summaries and audio using the API
  3. Create Videos - Generate video content from images/audio
  4. Upload to YouTube - Automatically upload videos to YouTube

Prerequisites

API Server Running

# Start the API server
python main.py

# Or using Docker
docker-compose up mkdocs

Required Files

  • amazon_best_sellers.json (from scraper)
  • Product images in images/ directory
  • API server running on configured URL

Configuration

API Endpoint Configuration

# Production server
URL = 'http://35.239.71.238:8000'

# Local development
# URL = 'http://localhost:8000'

Request Headers

HEADER = {
    'accept': 'application/json',
    'Content-Type': 'application/json',
}

How It Works

1. Product Selection

def list_products():
    print('List Of Products :')
    for i, product in enumerate(DATA):
        print(f" {i}. {product.get('title')[:30]}...(see more)")

2. Batch Processing (/batch endpoint)

def batch(choice):
    # Generate summary + audio for selected product
    json_data = {
        'products': [{
            'rank': DATA[choice].get('rank'),
            'title': DATA[choice].get('title'),
            'link': DATA[choice].get('link'),
            'price': DATA[choice].get('price'),
        }],
        'bucket_name': 'merchant-ary',
        'folder_path': 'sound-output',
        'expiration': 3600,
    }
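The excerpt above stops after building the payload. As a rough sketch of the remaining step, the request could be prepared and sent as shown below. This version uses the standard library's urllib so it runs without extra packages; tester.py may well use requests instead, and the 300-second timeout is an assumption.

```python
import json
import urllib.request

URL = "http://localhost:8000"  # local development URL from the configuration
HEADER = {"accept": "application/json", "Content-Type": "application/json"}


def build_batch_request(product, base_url=URL):
    """Prepare (but do not send) the POST request for the /batch endpoint."""
    payload = {
        "products": [
            {key: product.get(key) for key in ("rank", "title", "link", "price")}
        ],
        "bucket_name": "merchant-ary",
        "folder_path": "sound-output",
        "expiration": 3600,
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/batch", data=body, headers=HEADER, method="POST"
    )


def send(request, timeout=300):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating construction from sending makes the payload easy to inspect or log before any network call is made.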

3. Video Generation (/generate-video endpoint)

def generate_video(data, format):
    # Create video from images and audio
    json_data = {
        'title': data[0]['title'],
        'price': data[0]['price'],
        'summary': data[0]['summary'],
        'link': data[0]['link'],
        'rank': data[0]['rank'],
        'video_format': format,  # 'regular' or 'shorts'
        'upload_to_youtube': False,
        'youtube_public': False,
        'expiration': 3600,
    }

4. YouTube Upload (/upload-youtube endpoint)

def upload_youtube(data, format):
    # Upload video to YouTube
    # `title` and `visible` hold the answers to the title and visibility prompts
    json_data = {
        'video_path': f'/home/supradmin/apiv2/{data["local_video_path"]}',
        'product_data': {
            'rank': data['product_data']['rank'],
            'title': title,
            'link': data['product_data']['link'],
            'price': data['product_data']['price']
        },
        'youtube_public': visible == "public"
    }

Running the Automation Tool

Interactive Mode

python tester.py

Expected Interaction

List Of Products :
 0. Crocs Unisex Adult Classic...(see more)
 1. Hanes Men's Beefy-t T-Shi...(see more)
 2. Hanes Men's Boxer Briefs,...(see more)

 choose number > 0

 [*] alright. you choose number 0.
 [+] here i give u the details abt the product :
 {'rank': 1, 'title': 'Crocs Unisex Adult Classic Clog', 'link': '...', 'price': '$39.95'}
 [*] ok. lets process this product at /batch first to get the audio + summary
 [*] processing at /batch
 [+] successfully generated audio + summary !
 [*] time to generate the video at /generate-video
 [*] set the video's format (regular/shorts) > shorts
 [*] processing at /generate-video
 [+] video generated successfully !
 [*] okay. lets upload this video to youtube
 [*] set the video's title > Amazing Crocs Deal!
 [*] set the visibility (public / private) ? > public
 [*] video uploaded successfully ! here's the url : https://www.youtube.com/shorts/VIDEO_ID
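Each prompt in this session accepts free text, so a mistyped answer can reach the API unchecked. A small validation sketch follows; the helper names are hypothetical and the source does not show whether tester.py performs such checks.

```python
def parse_choice(raw, product_count):
    """Convert the 'choose number' answer to a valid index, or raise ValueError."""
    try:
        index = int(raw.strip())
    except ValueError:
        raise ValueError(f"{raw!r} is not a number") from None
    if not 0 <= index < product_count:
        raise ValueError(f"choose a number between 0 and {product_count - 1}")
    return index


def parse_option(raw, allowed):
    """Normalize a free-text answer and check it against the allowed values."""
    value = raw.strip().lower()
    if value not in allowed:
        raise ValueError(f"expected one of {allowed}, got {raw!r}")
    return value


video_format = parse_option(" Shorts ", ("regular", "shorts"))
visibility = parse_option("public", ("public", "private"))
```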

Output Files Generated

Audio Files

audio/
├── Crocs_Unisex_Adult_Classic_Clog.mp3
└── ...

Video Files

videos_output/
├── Crocs_Unisex_Adult_Classic_Clog.mp4
└── ...

Troubleshooting Automation Issues

Common Issues

API Server Not Running:

# Check if server is running
curl http://localhost:8000/health

# Start server if needed
python main.py
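The same check can be done from Python before the tool starts sending work. A minimal sketch using only the standard library, assuming the /health endpoint answers with HTTP 200 when the server is ready:

```python
import urllib.request


def is_api_up(base_url="http://localhost:8000", timeout=3):
    """Return True if GET <base_url>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
            return response.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts and refused connections
        return False
```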

Missing Product Images:

# Ensure images directory exists
ls images/

# Run scraper first if images missing
python trend-playwright_2.py

YouTube Upload Fails:

# Check YouTube credentials
ls sa/youtube_token.pickle

# Re-authenticate if needed
curl http://localhost:8000/youtube-auth-url

Video Generation Fails:

# Check FFmpeg installation
ffmpeg -version

# Verify audio file exists
ls audio/Crocs_Unisex_Adult_Classic_Clog.mp3


🔄 Complete Workflow

Step-by-Step Process

1. Scrape Amazon Data

python trend-playwright_2.py
Result: amazon_best_sellers.json + product images

2. Start API Server

python main.py
Result: API server running on port 8000

3. Run Automation Tool

python tester.py
Result: Interactive product selection and processing

4. Select Product

  • Choose product number from list
  • Tool displays product details

5. Generate Content

  • Batch Processing: Creates summary + audio
  • Video Generation: Combines images + audio into video
  • YouTube Upload: Uploads video to YouTube

File Dependencies

amazon_best_sellers.json  →  tester.py
images/*.jpg             →  /generate-video endpoint
audio/*.mp3             →  /generate-video endpoint
videos_output/*.mp4     →  /upload-youtube endpoint
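The first two dependencies must exist before tester.py is started, since the audio and video files are produced later by the pipeline. A hedged preflight sketch (`missing_inputs` is a hypothetical helper):

```python
from pathlib import Path


def missing_inputs(base_dir="."):
    """List the scraper outputs that are absent under base_dir."""
    base = Path(base_dir)
    problems = []
    if not (base / "amazon_best_sellers.json").is_file():
        problems.append("amazon_best_sellers.json")
    images = base / "images"
    if not (images.is_dir() and any(images.glob("*.jpg"))):
        problems.append("images/*.jpg")
    return problems
```

An empty list means the scraper output is in place; anything else suggests running trend-playwright_2.py first.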

Data Flow

graph TD
    A[Amazon Best Sellers] --> B[trend-playwright_2.py]
    B --> C[amazon_best_sellers.json]
    B --> D[images/*.jpg]
    C --> E[tester.py]
    E --> F["/batch endpoint"]
    F --> G[summary + audio]
    G --> H["/generate-video endpoint"]
    D --> H
    H --> I[video file]
    I --> J["/upload-youtube endpoint"]
    J --> K[YouTube video]

⚙️ Advanced Configuration

Customizing the Scraper

Change Scraping Target

# Modify URL for different Amazon pages
url = "https://www.amazon.com/gp/bestsellers/electronics"

# Or different countries
url = "https://www.amazon.co.uk/gp/bestsellers"

Adjust Image Download Settings

# Download more images per product
if saved > 10: break  # Instead of 5

# Download different image sizes
url = data[key][index].get('thumb')  # Smaller images

Modify Filename Sanitization

def sanitize_filename(product_title):
    # Custom sanitization logic
    product_title = product_title.split("|")[0].strip()
    safe_title = "".join(c for c in product_title if c.isalnum() or c in [' ', '_', '-']).replace(' ', '_')[:30]  # Shorter limit
    return safe_title
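To see what the limit does in practice, here is the same logic with the length exposed as a parameter (the `max_length` argument is an addition for this sketch), applied to two titles:

```python
def sanitize_filename(product_title, max_length=30):
    """Same logic as above, with the length limit exposed as a parameter."""
    product_title = product_title.split("|")[0].strip()
    kept = "".join(c for c in product_title if c.isalnum() or c in " _-")
    return kept.replace(" ", "_")[:max_length]


print(sanitize_filename("Hanes Men's Beefy-t T-Shirt | Pack of 2"))
# Hanes_Mens_Beefy-t_T-Shirt
print(sanitize_filename("Crocs Unisex Adult Classic Clog"))
# Crocs_Unisex_Adult_Classic_Clo  (truncated at 30 characters)
```

A shorter limit can make two long titles collapse to the same prefix, so it is worth checking for collisions before lowering it further.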

Customizing the Automation Tool

Change API Endpoints

# Use different server
URL = 'http://your-custom-server.com:8000'

Modify Video Settings

# Change default format
'video_format': 'regular',  # Instead of asking user

# Always upload to YouTube
'upload_to_youtube': True,

Add Custom Processing

# Add custom API calls
def custom_processing(data):
    # Your custom logic here
    pass

📊 Monitoring and Logging

Scraper Logs

The scraper provides real-time feedback:

Rank .1 : Crocs_Unisex_Adult_Classic_Clog1.jpg
Rank .1 : Crocs_Unisex_Adult_Classic_Clog2.jpg
Data berhasil disimpan ke 'amazon_best_sellers.json'

Automation Tool Logs

Color-coded output with status indicators:

  • 🟢 Green: Successful operations
  • 🟡 Yellow: API responses
  • 🔴 Red: Errors or warnings

Error Handling

Both tools include comprehensive error handling:

  • Network timeouts
  • API failures
  • Missing files
  • Invalid data formats


🔒 Security Considerations

API Keys

  • Store API keys securely in environment variables
  • Never commit credentials to version control
  • Use different keys for development/production
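As a sketch of the environment-variable approach, settings can be read in one place with safe local defaults. The variable names API_URL and AMAZON_REFERRAL_ID are invented for this example; neither script is known to read them.

```python
import os


def load_settings(environ=os.environ):
    """Read settings from the environment, falling back to local defaults."""
    return {
        "api_url": environ.get("API_URL", "http://localhost:8000"),
        "referral_id": environ.get("AMAZON_REFERRAL_ID", ""),
    }
```

Passing a plain dict as `environ` makes it easy to use separate values for development and production in tests.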

Rate Limiting

  • Built-in delays between requests
  • Respect Amazon's terms of service
  • Monitor API usage limits

Data Privacy

  • Handle user data responsibly
  • Comply with affiliate marketing guidelines
  • Respect robots.txt and terms of service

🚀 Production Deployment

Automated Workflow

#!/bin/bash
# Daily scraping script
python trend-playwright_2.py
python tester.py --auto --format=shorts --upload

Scheduling

# Add to crontab for daily execution
crontab -e
# 0 9 * * * cd /path/to/project && /usr/bin/python3 trend-playwright_2.py

Monitoring

# Check if processes are running
ps aux | grep python

# Monitor disk usage
du -sh images/ audio/ videos_output/

# Check API server status
curl http://localhost:8000/health

📈 Performance Optimization

Scraper Optimization

  • Use headless browser mode
  • Implement request caching
  • Parallel image downloads
  • Smart retry mechanisms
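Parallel image downloads, for instance, could be sketched with a thread pool. The `download_one` callable is supplied by the caller and is an assumption here; it stands for whatever function fetches a single URL.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, download_one, max_workers=4):
    """Run download_one(url) concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download_one, urls))
```

Keeping `max_workers` small is a deliberate choice, since many simultaneous requests work against the rate limiting described earlier.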

Automation Optimization

  • Batch API requests
  • Concurrent processing
  • Memory-efficient file handling
  • Progress indicators

System Resources

  • Monitor CPU/memory usage
  • Clean up temporary files
  • Optimize storage usage
  • Implement file rotation
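Cleaning up temporary files can be as simple as deleting anything older than a chosen age. A minimal sketch, assuming files directly inside one directory and a retention period in days (both the helper name and its parameters are illustrative):

```python
import time
from pathlib import Path


def remove_old_files(directory, max_age_days, now=None):
    """Delete regular files older than max_age_days; return the removed names."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    removed = []
    for path in Path(directory).iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

For example, `remove_old_files("videos_output", 7)` would drop videos last modified more than a week ago.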

🆘 Troubleshooting Guide

Scraper Issues

"No products found"

  • Check internet connection
  • Verify Amazon page structure hasn't changed
  • Update selectors in the code

"Images not downloading"

  • Check agent.txt file exists
  • Verify user agents are current
  • Check disk space and permissions

"JSON parsing errors"

  • Amazon may have changed page structure
  • Update regex patterns
  • Add fallback parsing methods

Automation Issues

"API server not responding"

  • Check if server is running: curl http://localhost:8000/health
  • Verify correct URL in tester.py
  • Check firewall settings

"Video generation fails"

  • Ensure FFmpeg is installed
  • Check audio/image files exist
  • Verify file permissions

"YouTube upload fails"

  • Check YouTube credentials are valid
  • Verify OAuth token hasn't expired
  • Check video file exists and is valid


📝 Best Practices

Code Organization

  • Keep scraper and automation separate
  • Use configuration files for settings
  • Implement proper error handling
  • Add logging for debugging

Data Management

  • Regular backup of scraped data
  • Clean up old files periodically
  • Validate data integrity
  • Handle duplicates gracefully

API Usage

  • Respect rate limits
  • Implement retry mechanisms
  • Monitor API usage and costs
  • Handle API changes gracefully
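A retry mechanism can be sketched as a small wrapper with exponential backoff. This is a generic illustration, not code from either tool; the attempt count and base delay are arbitrary defaults.

```python
import time


def retry(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

Wrapping an API call as `retry(lambda: some_call())` keeps the retry policy in one place; the `sleep` parameter exists so the delays can be replaced in tests.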

Security

  • Use environment variables for secrets
  • Implement proper authentication
  • Regular security updates
  • Monitor for vulnerabilities

🎯 Next Steps

  1. Run the Scraper: python trend-playwright_2.py
  2. Start API Server: python main.py
  3. Run Automation: python tester.py
  4. Monitor Results: Check generated content
  5. Customize: Modify settings for your needs

This tutorial provides everything needed to scrape Amazon data and automate the content creation pipeline from start to finish.