Data Sources
Overview
memories-dev includes a data acquisition system for creating and enriching memories from multiple data sources: satellite imagery, vector data, and environmental metrics.
Supported Data Sources
Satellite Data
Sentinel-2: High-resolution multispectral imagery via Planetary Computer
Landsat: Medium-resolution multispectral imagery
Earth Engine: Access to Google Earth Engine datasets
COG/STAC: Cloud-optimized GeoTIFF and SpatioTemporal Asset Catalog
Vector Data
OpenStreetMap: Comprehensive global mapping data
Overture Maps: Detailed vector mapping data
WFS Services: Web Feature Service endpoints
Data Manager
The DataManager class provides a unified interface for accessing all data sources:
from memories.data_acquisition.data_manager import DataManager
import asyncio
from datetime import datetime, timedelta
# Initialize data manager with cache directory
data_manager = DataManager(cache_dir="./data_cache")
# Define area of interest (San Francisco)
bbox = {
    'xmin': -122.4018,
    'ymin': 37.7914,
    'xmax': -122.3928,
    'ymax': 37.7994
}
# Define time range (last 30 days)
end_date = datetime.now()
start_date = end_date - timedelta(days=30)
async def get_comprehensive_data():
    # Get satellite data
    satellite_data = await data_manager.get_satellite_data(
        bbox_coords=bbox,
        start_date=start_date.strftime("%Y-%m-%d"),
        end_date=end_date.strftime("%Y-%m-%d")
    )
    # Get vector data
    vector_data = await data_manager.get_vector_data(
        bbox=bbox,
        layers=["buildings", "roads", "landuse"]
    )
    # Prepare comprehensive training data
    training_data = await data_manager.prepare_training_data(
        bbox=bbox,
        start_date=start_date.strftime("%Y-%m-%d"),
        end_date=end_date.strftime("%Y-%m-%d"),
        satellite_collections=["sentinel-2-l2a"],
        vector_layers=["buildings", "roads"],
        cloud_cover=20.0
    )
    return {
        "satellite": satellite_data,
        "vector": vector_data,
        "training": training_data
    }
# Run the async function
results = asyncio.run(get_comprehensive_data())
# Process results
print(f"Satellite data: {len(results['satellite']['scenes'])} scenes")
print(f"Vector data: {len(results['vector']['features'])} features")
print(f"Training data: {results['training']['status']}")
Example Output:
Satellite data: 3 scenes
Vector data: 1245 features
Training data: success
Advanced Usage: Multi-Source Data Fusion
The following example combines satellite and vector data, then derives summary metrics from the fused result:
from memories.data_acquisition.data_manager import DataManager
from memories.data_acquisition.processors import DataFusion
import asyncio
data_manager = DataManager(cache_dir="./data_cache")
fusion_processor = DataFusion()
# Define area of interest (San Francisco)
bbox = {
    'xmin': -122.4018,
    'ymin': 37.7914,
    'xmax': -122.3928,
    'ymax': 37.7994
}
async def perform_data_fusion():
    # Get data from multiple sources
    satellite_data = await data_manager.get_satellite_data(
        bbox_coords=bbox,
        start_date="2024-01-01",
        end_date="2024-02-01"
    )
    vector_data = await data_manager.get_vector_data(
        bbox=bbox,
        layers=["buildings", "roads"]
    )
    # Perform data fusion
    fused_data = fusion_processor.fuse(
        primary=satellite_data,
        secondary=vector_data,
        method="overlay",
        resolution=10  # meters per pixel
    )
    # Extract insights
    insights = fusion_processor.analyze(
        fused_data,
        metrics=["urban_density", "vegetation_health"]
    )
    return insights
# Run the fusion process
insights = asyncio.run(perform_data_fusion())
# Display insights
for metric, value in insights.items():
    print(f"{metric}: {value}")
Example Output:
urban_density: 78.3%
vegetation_health: Good (NDVI: 0.68)
Sentinel API
The SentinelAPI class provides direct access to Sentinel-2 data:
from memories.data_acquisition.sources.sentinel_api import SentinelAPI
from datetime import datetime, timedelta
import asyncio
import matplotlib.pyplot as plt
import numpy as np
# Initialize Sentinel API
api = SentinelAPI(data_dir="./sentinel_data")
async def download_and_visualize():
    # Initialize the client
    await api.initialize()
    # Define area of interest
    bbox = {
        'xmin': -122.4018,
        'ymin': 37.7914,
        'xmax': -122.3928,
        'ymax': 37.7994
    }
    # Define date range
    start_date = datetime.now() - timedelta(days=30)
    end_date = datetime.now()
    # Download specific bands with cloud cover filter
    result = await api.download_data(
        bbox=bbox,
        start_date=start_date,
        end_date=end_date,
        bands=["B04", "B08"],  # Red and NIR bands
        cloud_cover=10.0
    )
    # NDVI needs a successful download that contains both the red and NIR bands
    if result["status"] != "success":
        return result
    if "B04" not in result["bands"] or "B08" not in result["bands"]:
        return result
    # Load as float so integer pixel values neither overflow nor truncate
    red_band = np.load(f"./sentinel_data/{result['scene_id']}_B04.npy").astype(np.float32)
    nir_band = np.load(f"./sentinel_data/{result['scene_id']}_B08.npy").astype(np.float32)
    # Calculate NDVI; pixels with a zero denominator stay NaN
    denominator = nir_band + red_band
    ndvi = np.full_like(denominator, np.nan)
    np.divide(nir_band - red_band, denominator, out=ndvi, where=denominator != 0)
    # Plot NDVI
    plt.figure(figsize=(10, 8))
    plt.imshow(ndvi, cmap='RdYlGn', vmin=-1, vmax=1)
    plt.colorbar(label='NDVI')
    plt.title(f"NDVI - {result['metadata']['acquisition_date']}")
    plt.savefig("./sentinel_data/ndvi_visualization.png")
    plt.close()
    # Keep the download fields (bands, scene_id, cloud_cover) alongside the NDVI summary
    return {
        **result,
        "ndvi_mean": float(np.nanmean(ndvi)),
        "ndvi_max": float(np.nanmax(ndvi)),
        "visualization": "./sentinel_data/ndvi_visualization.png"
    }
# Run the download and visualization
result = asyncio.run(download_and_visualize())
# Display results
if result["status"] == "success":
    print(f"Downloaded bands: {result.get('bands', [])}")
    print(f"Scene ID: {result.get('scene_id', '')}")
    print(f"Cloud cover: {result.get('cloud_cover', 0)}%")
    if "ndvi_mean" in result:
        print(f"Mean NDVI: {result['ndvi_mean']:.2f}")
        print(f"Max NDVI: {result['ndvi_max']:.2f}")
        print(f"Visualization saved to: {result['visualization']}")
else:
    print(f"Error: {result.get('message', 'Unknown error')}")
Example Output:
Downloaded bands: ['B04', 'B08']
Scene ID: S2A_MSIL2A_20240215T184721_N0509_R113_T10SEG_20240215T221514
Cloud cover: 5.2%
Mean NDVI: 0.42
Max NDVI: 0.89
Visualization saved to: ./sentinel_data/ndvi_visualization.png
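The NDVI value reported above is the per-pixel ratio (NIR - Red) / (NIR + Red). The sketch below spells that formula out without NumPy so the zero-denominator case is explicit. The `ndvi` function and the choice to return 0.0 for empty pixels are illustrative assumptions, not part of memories-dev.

```python
from typing import List

Grid = List[List[float]]


def ndvi(red: Grid, nir: Grid) -> Grid:
    """Per-pixel NDVI = (NIR - Red) / (NIR + Red).

    Hypothetical helper: pixels where NIR + Red == 0 are set to 0.0
    (an assumed convention) instead of raising ZeroDivisionError.
    """
    out: Grid = []
    for red_row, nir_row in zip(red, nir):
        row = []
        for r, n in zip(red_row, nir_row):
            denom = n + r
            row.append((n - r) / denom if denom != 0 else 0.0)
        out.append(row)
    return out


# Tiny 2x2 example with made-up pixel values
red_band = [[1, 2], [0, 3]]
nir_band = [[3, 2], [0, 1]]
ndvi_grid = ndvi(red_band, nir_band)
print(ndvi_grid)
```

Values near +1 indicate dense vegetation, values near 0 indicate bare surfaces, and negative values typically indicate water or cloud.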
ENVIRONMENTAL METRICS:
-----------------
- Vegetation Index (NDVI): 0.68 (Healthy)
- Urban Density: 78% (High)
- Water Bodies: 22% of area
- Cloud Cover: 5%
DETECTED FEATURES:
------------------
- Buildings: 1,245 structures
- Roads: 87 km total length
- Water: San Francisco Bay
- Parks: Golden Gate Park, other green spaces
Change Detection Results
Vegetation: +2.3% increase
Urban Area: +0.8% increase
Key Insights
Healthy vegetation in park areas
Moderate urban development in northern sectors
Environmental Impact
Reduced heat island effect
Improved air quality
Enhanced ecosystem resilience
Recommendations
Expand green infrastructure
Optimize urban density
Implement climate adaptation measures
Error Handling
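Data acquisition calls can fail on invalid input or on network problems. The example below handles validation errors, connection errors, and any other exception separately: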
from memories.data_acquisition.data_manager import DataManager
import asyncio
data_manager = DataManager(cache_dir="./data_cache")
async def handle_data_errors():
    try:
        # Try with invalid bbox
        invalid_bbox = {
            'xmin': 200,  # Invalid longitude
            'ymin': 37.7914,
            'xmax': -122.3928,
            'ymax': 37.7994
        }
        result = await data_manager.get_satellite_data(
            bbox_coords=invalid_bbox,
            start_date="2024-01-01",
            end_date="2024-02-01"
        )
    except ValueError as e:
        print(f"Validation error: {e}")
    except ConnectionError as e:
        print(f"Connection error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
    finally:
        print("Error handling complete")
# Run the error handling example
asyncio.run(handle_data_errors())
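An invalid bounding box like the one above can also be rejected locally, before any request is sent. The following is a minimal sketch, assuming the `xmin`/`ymin`/`xmax`/`ymax` dictionary layout used on this page. `validate_bbox` is a hypothetical helper, not a memories-dev function, and it deliberately rejects boxes that cross the antimeridian.

```python
def validate_bbox(bbox: dict) -> None:
    """Raise ValueError if bbox is not a valid lon/lat box (hypothetical helper)."""
    required = ("xmin", "ymin", "xmax", "ymax")
    missing = [key for key in required if key not in bbox]
    if missing:
        raise ValueError(f"bbox is missing keys: {missing}")
    # Longitudes must lie in [-180, 180], latitudes in [-90, 90]
    if not (-180 <= bbox["xmin"] <= 180 and -180 <= bbox["xmax"] <= 180):
        raise ValueError("longitude must be within [-180, 180]")
    if not (-90 <= bbox["ymin"] <= 90 and -90 <= bbox["ymax"] <= 90):
        raise ValueError("latitude must be within [-90, 90]")
    # Assumption: min must be strictly below max, so antimeridian boxes are rejected
    if bbox["xmin"] >= bbox["xmax"] or bbox["ymin"] >= bbox["ymax"]:
        raise ValueError("min must be strictly less than max on both axes")


try:
    validate_bbox({"xmin": 200, "ymin": 37.7914, "xmax": -122.3928, "ymax": 37.7994})
except ValueError as e:
    print(f"Validation error: {e}")
```

Checking locally gives a fast, specific error message and avoids spending a network round trip on a request that cannot succeed.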
Best Practices
Efficient Data Acquisition:
- Use the smallest possible bounding box for your area of interest
- Request only the bands you need
- Set appropriate cloud cover thresholds (10-20% recommended)
- Use the cache system to avoid redundant downloads
Asynchronous Operations:
- All data acquisition methods are asynchronous
- Use asyncio.gather() for concurrent downloads
- Handle exceptions properly in asynchronous code
Resource Management:
- Clean up temporary files when no longer needed
- Monitor disk usage when downloading large datasets
- Consider using cloud storage for large-scale operations
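The asyncio.gather() recommendation above can be sketched as follows. `fetch_tile` is a hypothetical stand-in for a real download coroutine such as the data manager calls shown earlier. It only sleeps and optionally fails, so the example runs without network access.

```python
import asyncio


async def fetch_tile(name: str, delay: float, fail: bool = False) -> dict:
    """Hypothetical stand-in for a real download coroutine."""
    await asyncio.sleep(delay)
    if fail:
        raise ConnectionError(f"could not fetch {name}")
    return {"name": name, "status": "success"}


async def fetch_all() -> list:
    downloads = [
        fetch_tile("tile-a", 0.01),
        fetch_tile("tile-b", 0.01, fail=True),
        fetch_tile("tile-c", 0.01),
    ]
    # return_exceptions=True returns exceptions as results, in input order,
    # instead of raising the first one to the caller
    return await asyncio.gather(*downloads, return_exceptions=True)


results = asyncio.run(fetch_all())
for item in results:
    if isinstance(item, Exception):
        print(f"Failed: {item}")
    else:
        print(f"OK: {item['name']}")
```

With return_exceptions=True, one failed download does not hide the results of the others, and each failure can be logged or retried individually.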