This tutorial demonstrates how to use the intelligent URL morphing system to process files from URLs without hardcoded file type detection.
from attachments import attach, load, modify, present
What is URL Morphing?¶
URL morphing is a smart system that:
- Downloads content from URLs
- Intelligently detects the file type using multiple strategies
- Transforms the attachment so existing loaders can process it
This enables seamless processing of any file type from URLs without maintaining hardcoded lists.
Basic Morphing Pattern¶
The standard pattern is: url_to_response → morph_to_detected_type → specific_loader
# Download and morph a PDF from URL
pdf_attachment = (attach("https://github.com/MaximeRivest/attachments/raw/main/src/attachments/data/sample.pdf") |
load.url_to_response | # Step 1: Download content
modify.morph_to_detected_type | # Step 2: Detect file type intelligently
load.pdf_to_pdfplumber | # Step 3: Load with appropriate loader
present.text) # Step 4: Extract content
CropBox missing from /Page, defaulting to MediaBox
Let’s see what we got:
len(pdf_attachment.text)
458
Here’s the extracted PDF content:
print(pdf_attachment.text[:200])
PDF Document: sample.pdf
========================
[Page 1]
Hello PDF!
DOCUMENT ANALYSIS: This appears to be a scanned PDF with little to no extractable text.
- Pages processed: 1
- Pages with text
Perfect! The PDF was properly detected and loaded. Let’s examine how the detection worked:
pdf_attachment.path
'sample.pdf'
The original URL was transformed to a clean filename. Let’s see the detection metadata:
{
'detected_extension': pdf_attachment.metadata.get('detected_extension'),
'detection_method': pdf_attachment.metadata.get('detection_method'),
'content_type': pdf_attachment.metadata.get('response_content_type')
}
{'detected_extension': '.pdf',
'detection_method': 'enhanced_matcher_based',
'content_type': 'application/octet-stream'}
How Detection Works¶
The morphing system uses three strategies to detect file types:
- File extension from the URL path
- Content-Type header from the HTTP response
- Magic number signatures from the file content
This triple-check approach ensures reliable detection even when URLs don’t have clear extensions.
Working with Different File Types¶
Let’s try morphing with different file formats:
# PowerPoint presentation
pptx_result = (attach("https://github.com/MaximeRivest/attachments/raw/main/src/attachments/data/sample_multipage.pptx") |
load.url_to_response |
modify.morph_to_detected_type |
load.pptx_to_python_pptx |
present.text)
PowerPoint content length:
len(pptx_result.text)
1427
Detected file type:
pptx_result.path
'sample_multipage.pptx'
# Markdown file
md_result = (attach("https://raw.githubusercontent.com/MaximeRivest/attachments/main/README.md") |
load.url_to_response |
modify.morph_to_detected_type |
load.text_to_string |
present.text)
Markdown content length:
len(md_result.text)
29883
Detected file type:
md_result.path
'README.md'
Why Morphing is Essential¶
Without morphing, URL processing fails because loaders use file extensions to determine if they should handle a file. URLs don’t match these patterns.
Let’s see what happens without morphing:
# This will fail - PDF loader won't recognize the URL
failed_attempt = (attach("https://github.com/MaximeRivest/attachments/raw/main/src/attachments/data/sample.pdf") |
load.url_to_response | # Downloads content
load.pdf_to_pdfplumber | # But matcher fails - path is still a URL!
present.text)
Without morphing, the content isn’t processed correctly:
len(failed_attempt.text)
104
The result is just the response object representation:
failed_attempt.text
'https://github.com/MaximeRivest/attachments/raw/main/src/attachments/data/sample.pdf: <Response [200]>\n\n'
Automatic Morphing in Pipelines¶
The universal pipeline automatically includes morphing for URLs, so you often don’t need to specify it manually:
from attachments import Attachments
# This automatically uses morphing
auto_result = Attachments("https://github.com/MaximeRivest/attachments/raw/main/src/attachments/data/sample.pdf")
auto_text = str(auto_result)
CropBox missing from /Page, defaulting to MediaBox
Automatic processing result length:
len(auto_text)
798
First part of the automatically processed content:
print(auto_text[:200])
# PDF Document: /tmp/tmphnl64qqm.pdf
## Page 1
Hello PDF!
📄 **Document Analysis**: This appears to be a scanned PDF with little to no extractable text.
- **Pages processed**: 1
- **Pages with tex
Advanced: Understanding the Detection Process¶
Let’s trace through how morphing detects a PDF file:
# Create an attachment with PDF URL
pdf_url = attach("https://github.com/MaximeRivest/attachments/raw/main/src/attachments/data/sample.pdf")
# Step 1: Download
downloaded = pdf_url | load.url_to_response
print("After download:")
print(f" Path: {downloaded.path}")
print(f" Content-Type: {downloaded.metadata.get('content_type')}")
print(f" Object type: {type(downloaded._obj)}")
# Step 2: Morph
morphed = downloaded | modify.morph_to_detected_type
print("\nAfter morphing:")
print(f" Path: {morphed.path}")
print(f" Detected extension: {morphed.metadata.get('detected_extension')}")
print(f" Detection method: {morphed.metadata.get('detection_method')}")
After download:
Path: https://github.com/MaximeRivest/attachments/raw/main/src/attachments/data/sample.pdf
Content-Type: application/octet-stream
Object type: <class 'requests.models.Response'>
After morphing:
Path: sample.pdf
Detected extension: .pdf
Detection method: enhanced_matcher_based
Best Practices¶
1. Use the Standard Pattern¶
For manual pipelines with URLs, always use: url_to_response → morph_to_detected_type
2. Let Enhanced Matchers Do Their Work¶
The system automatically checks file extensions, Content-Type headers, and magic numbers. No need for manual detection.
3. Trust the Detection¶
The morphing system is designed to be robust. It will fall back gracefully if detection is uncertain.
4. For Simple Cases, Use Automatic Processing¶
Attachments()
or the universal pipeline handle morphing automatically.
Summary¶
URL morphing enables seamless processing of any file type from URLs by:
- Downloading content intelligently
- Detecting file types using multiple strategies
- Transforming attachments so existing loaders can process them
- Maintaining zero hardcoded file type lists
This creates an extensible system where adding new file types automatically enables URL support!