Caching¶
Interacting with files on a cloud provider can mean a lot of waiting while files download and upload. cloudpathlib provides seamless, on-demand caching of cloud content that can be persistent across processes and sessions, to make sure you only download or upload when you need to.
Are we synced?¶
Before cloudpathlib, we spent a lot of time syncing our remote and local files, and there was no great solution. Maybe you just need one file, but the only script you have downloads the entire 800GB bucket (or worse, you can't remember exactly which files you need 🤮). Or, even worse, you have all the files synced to your local machine, but you suspect that some are up to date and some are stale. More often than I'd like to admit, the simplest answer was to blast the whole data directory and download everything all over again. Bandwidth doesn't grow on trees!
Cache me if you can¶
Part of what makes cloudpathlib so useful is that it takes care of all of that, leaving your precious mental resources free to do other things! It maintains a local cache and only downloads a file if the local and remote versions are out of sync. Every time you read or write a file, cloudpathlib goes through these steps (a rough sketch in code follows the list):
- Does the file exist in the cache already?
  - If no, download it to the cache.
  - If yes, does the cached version have the same modified time as the cloud version?
    - If the cached version is older, re-download the file and replace the old cached version with the updated version from the cloud.
    - If the local one is newer, something is up! We don't want to overwrite your local changes with the version from the cloud. If we see this scenario, we'll raise an error and offer some options to resolve the versions.
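To make that decision logic concrete, here is a minimal, hypothetical sketch of the same flow using only the standard library. It is not cloudpathlib's actual implementation; ensure_cached, download_to_cache, and cloud_mtime are made-up stand-ins for the client's internal download and metadata calls.
from pathlib import Path

def ensure_cached(cache_file: Path, cloud_mtime: float, download_to_cache) -> Path:
    """Return a fresh cached copy, following the steps above (illustrative only)."""
    if not cache_file.exists():
        # not in the cache yet: download it
        download_to_cache(cache_file)
    elif cache_file.stat().st_mtime < cloud_mtime:
        # cached copy is older than the cloud version: refresh it
        download_to_cache(cache_file)
    elif cache_file.stat().st_mtime > cloud_mtime:
        # local copy is newer: refuse to clobber local changes
        raise RuntimeError("Cached file is newer than the cloud version; resolve before continuing.")
    return cache_file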
Supporting reading and writing¶
In addition to reading, the cache logic also supports writing to cloud files seamlessly. We do this by tracking when a CloudPath is opened; when that file is closed, we upload the new version to the cloud if it has changed.
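Conceptually, the write path looks something like the hypothetical context manager below; the real client does this internally, and upload_from_cache stands in for its upload call.
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def open_with_upload(cache_file: Path, mode: str, upload_from_cache):
    # hand back a normal file handle on the cached copy
    with cache_file.open(mode) as f:
        yield f
    # once the handle is closed, push the new version to the cloud
    if any(flag in mode for flag in ("w", "a", "+")):
        upload_from_cache(cache_file)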
Warning: we don't upload files that weren't opened for writing by cloudpathlib. For example, if you manually edit a file in the cache with a text editor, cloudpathlib won't know to update that file on the cloud. If you want to write to a file in the cloud, use the open or write methods, for example:
with my_cloud_path.open("w") as f:
    f.write("My new text!")
This downloads the file, writes the text to the local copy in the cache, and then uploads the changed version to the cloud when the file is closed.
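If you don't need a file handle, the pathlib-style write methods behave the same way. For example (a quick sketch using write_text, which CloudPath mirrors from pathlib):
# writes to the cached copy, then uploads the changed file to the cloud
my_cloud_path.write_text("My new text!")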
As an example, let's look at the Low Altitude Disaster Imagery open dataset on S3. We'll view one of the images of a flooding incident available on S3.
from cloudpathlib import CloudPath
from itertools import islice
ladi = CloudPath("s3://ladi/Images/FEMA_CAP/2020/70349")
# list first 5 images for this incident
for p in islice(ladi.iterdir(), 5):
    print(p)
s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0001_5a63d42e-27c6-448a-84f1-bfc632125b8e.jpg
s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0002_a89f1b79-786f-4dac-9dcc-609fb1a977b1.jpg
s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0003_02c30af6-911e-4e01-8c24-7644da2b8672.jpg
s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0004_d37c02b9-01a8-4672-b06f-2690d70e5e6b.jpg
s3://ladi/Images/FEMA_CAP/2020/70349/DSC_0005_d05609ce-1c45-4de3-b0f1-401c2bb3412c.jpg
Just because we saw that these images are available doesn't mean we have downloaded any of this data yet.
# Nothing in the cache yet
!tree {ladi.fspath}
/var/folders/8g/v8lwvfhj6_l6ct_zd_rs84mw0000gn/T/tmpn5oh1rkm/ladi/Images/FEMA_CAP/2020/70349 [error opening dir]

0 directories, 0 files
Now let's look at just the first image from this dataset, confirming that the file exists on S3.
flood_image = ladi / "DSC_0001_5a63d42e-27c6-448a-84f1-bfc632125b8e.jpg"
flood_image.exists()
True
# Still nothing in the cache
!tree {ladi.fspath}
/var/folders/8g/v8lwvfhj6_l6ct_zd_rs84mw0000gn/T/tmpn5oh1rkm/ladi/Images/FEMA_CAP/2020/70349 [error opening dir]

0 directories, 0 files
Even though we refer to a specific file and make sure it exists in the cloud, we can still do all of that work without actually downloading the file.
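One way to convince yourself of this (a sketch, assuming the client's local_cache_dir option and the same S3 access used above) is to pin a client's cache to a directory you control and check that it stays empty after metadata-only calls:
from pathlib import Path
from cloudpathlib import S3Client

# pin the cache to a directory we can inspect
client = S3Client(local_cache_dir="cache_demo")
demo_dir = client.CloudPath("s3://ladi/Images/FEMA_CAP/2020/70349")

# listing and existence checks are metadata-only; nothing is downloaded
first = next(demo_dir.iterdir())
print(first.exists())
print(list(Path("cache_demo").rglob("*")))  # expect no cached files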
In order to read the file, we do have to download the data. Let's actually display the image:
%%time
%matplotlib inline
import matplotlib.pyplot as plt
from PIL import Image
with flood_image.open("rb") as f:
    i = Image.open(f)
    plt.imshow(i)
CPU times: user 1.35 s, sys: 435 ms, total: 1.78 s
Wall time: 1.76 s
# Downloaded image file in the cache
!tree {ladi.fspath}
/var/folders/8g/v8lwvfhj6_l6ct_zd_rs84mw0000gn/T/tmpn5oh1rkm/ladi/Images/FEMA_CAP/2020/70349
└── DSC_0001_5a63d42e-27c6-448a-84f1-bfc632125b8e.jpg

0 directories, 1 file
Just by using open, we've downloaded the file to the cache in the background. Now that it is local, we won't re-download that file unless it changes on the server. We can confirm this by checking whether the file is faster to read a second time.
%%time
with flood_image.open("rb") as f:
    i = Image.open(f)
    plt.imshow(i)
CPU times: user 233 ms, sys: 69.7 ms, total: 303 ms
Wall time: 491 ms
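You can also inspect the cached copy directly: a CloudPath's fspath property resolves to the local file in the cache (and would trigger the download if the file weren't already there). For example:
from pathlib import Path

# the file is already cached, so this doesn't hit the network again
cached_copy = Path(flood_image.fspath)
print(cached_copy.exists(), cached_copy.stat().st_size)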