Storage: Bucket List Contents
Hi,
I am trying to index the contents of a bucket, as I am using buckets to mimic a file system for our application. The structure looks like this:
011 > G001_1 > G001_2 > G001_3 > G001_4
I have some helper functions that build the index after this, but here is the part that gets the paths as strings in order to build the PHP data structure:
<?php
// $bucket initially holds the bucket name, then the Bucket instance.
$bucket = $storage->bucket($bucket);
$objects = $bucket->objects([
    'fields' => 'items',
    'prefix' => ''
]);
$arr = [];
foreach ($objects as $object) {
    array_push($arr, $object->name());
}
return $arr;
?>
However, in my case $arr contains only the paths in a single "folder", G001_1. Running this several times, it will sometimes list the contents of the other folders, but sometimes not. I was wondering if perhaps there is some request size limit that I am missing, or perhaps I am misinterpreting what the structure of objects is. Originally I was getting the mixed results while uploading, and I thought perhaps it was due to some sort of file lock on the folders, but now that the upload is complete that does not seem to be the case.
Also, if there is a better way to do this so that I do not have to build out the whole structure, but can just get the immediate directories and perhaps their subdirectories, I would love a suggestion!
Hey @1tron1! The objects list documentation sheds some light on how you can use prefix and delimiter to mimic a file system.
A quick example using the library:
Given the following structure in a single bucket:
- directoryA
  - fileA
  - fileB
  - directoryA
    - fileA
- directoryB
  - fileA
- fileA
// First let's take a look at the root
$objects = $bucket->objects([
    'prefix' => '',
    'delimiter' => '/'
]);

foreach ($objects as $object) {
    var_dump($object->name()); // Outputs names: "fileA"
}
var_dump($objects->prefixes()); // Outputs 'subdirectories': "directoryA/", "directoryB/"
// And now let's take a peek at what is in "directoryA"
$objects = $bucket->objects([
    'prefix' => 'directoryA/',
    'delimiter' => '/'
]);

foreach ($objects as $object) {
    var_dump($object->name()); // Outputs names: "directoryA/", "directoryA/fileA", "directoryA/fileB"
}
var_dump($objects->prefixes()); // Outputs 'subdirectories': "directoryA/directoryA/"
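If you want more than one level, the same pattern can be applied recursively, one request per "directory". A quick sketch (walk() is just an illustration built on the calls above, not a library helper):

// Hypothetical recursive listing: prints the bucket as a tree.
// Assumes $bucket is a Google\Cloud\Storage\Bucket instance.
function walk($bucket, $prefix = '')
{
    $objects = $bucket->objects([
        'prefix' => $prefix,
        'delimiter' => '/'
    ]);

    // Iterate the objects fully before reading prefixes().
    foreach ($objects as $object) {
        var_dump($object->name());
    }

    // Recurse into each "subdirectory" reported at this level.
    foreach ($objects->prefixes() as $subdirectory) {
        walk($bucket, $subdirectory); // e.g. "directoryA/directoryA/"
    }
}

walk($bucket); // Start at the bucket root.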
> Running this several times, it will sometimes list the contents of the other folders, but sometimes not.
Could you expand on this a bit more? I have not seen anything like this in my testing; maybe I am just misunderstanding :). Are you using multiple buckets as directories, or a single bucket with the directory structure under it?
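One thing worth double checking while you look into it (a guess on my part, not something I have confirmed against your setup): your first snippet restricts the response with 'fields' => 'items'. A partial response like that omits nextPageToken, and without that token the iterator cannot request the following pages, so a listing can silently stop after the first page. If you restrict fields, keep the token in the selection:

$objects = $bucket->objects([
    // Keep nextPageToken in the partial response so paging still works.
    'fields' => 'items/name,nextPageToken',
    'prefix' => ''
]);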
@dwsupplee Thanks for the info, this makes much more sense.
With regards to getting different results: I only had that problem while running the 'indexer' during the upload process, when I would seemingly get back only the directory that was being uploaded at the time. It is not a major concern, as I see now that I was navigating the structure incorrectly.
I did build some helper functions that might be useful, though in the process I ran into a quirk. I can put them into the folder upload PR if they seem useful (I have to work out some of the minutiae, but it is essentially working, except for the weird behavior mentioned after the code).
<?php
require_once 'vendor/autoload.php';

use Google\Cloud\Storage\StorageClient;

///////////////////////////////////////////
/////             UTILS              //////
///////////////////////////////////////////

// Get Objects in Path
function get_objects($bucket, $path = [])
{
    // An empty path means the bucket root; otherwise join with "/" and
    // keep a trailing slash so the prefix matches whole "directories".
    $path_str = empty($path) ? '' : implode('/', $path) . '/';

    $objects = $bucket->objects([
        'prefix' => $path_str,
        'delimiter' => '/'
    ]);
    return $objects;
}

// Generate File List
function files_from_objects($objects)
{
    $arr = [];
    foreach ($objects as $os) {
        array_push($arr, $os->name());
    }
    return $arr;
}

// Generate Folder List
function folders_from_objects($objects)
{
    return $objects->prefixes();
}

// Combined directory list
function ls($objects)
{
    $fil = files_from_objects($objects);
    $dir = folders_from_objects($objects);
    $res = array_merge($dir, $fil);

    // Keep only entries below the top level (more than one "/").
    $k = [];
    foreach ($res as $r) {
        if (substr_count($r, "/") > 1) {
            array_push($k, $r);
        }
    }
    return $k;
}
///////////////////////////////////////////
///// SCRIPT //////
/////////////////////////////////////////
$projectId = 'projectid';

# Instantiates a client
$storage = new StorageClient([
    'projectId' => $projectId
]);
$bucket = $storage->bucket('bucket');

$objects = get_objects($bucket, ['root_directory']);
echo "ls" . json_encode(ls($objects));
echo PHP_EOL;
echo "files_from_objects" . json_encode(files_from_objects($objects));
echo PHP_EOL;
echo "folders_from_objects" . json_encode(folders_from_objects($objects));
echo PHP_EOL;
echo "ls" . json_encode(ls($objects));
echo PHP_EOL;
You get what you expect, namely that the two calls to ls have the same result. But when you reverse the order inside ls (folders_from_objects before files_from_objects), you get different results for the two ls calls. I will just use the working order for now, but there could be something wrong with objects.
> I can put them into the folder upload PR if they seem useful
First off, you're a hero for contributing :) - but a separate PR for each feature would be ideal.
> when you reverse the order inside ls (folders_from_objects before files_from_objects), you get different results
This is due to the fact that prefixes are not completely populated until you have fully iterated the results. It will be important to maintain the order you have designated.
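To make that concrete, here is a minimal sketch of the safe ordering:

$objects = $bucket->objects([
    'prefix' => 'directoryA/',
    'delimiter' => '/'
]);

// Exhaust the iterator first; this is what fills in the prefixes.
$files = [];
foreach ($objects as $object) {
    $files[] = $object->name();
}

// Only after full iteration is prefixes() guaranteed to be complete.
$folders = $objects->prefixes();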
Okay, thanks for the info, I'll close this issue when I add the PR for this.
Hey there @1tron1, just a friendly ping. Is this something you were still thinking of contributing?
@dwsupplee yes, I can provide this. Thanks for reminding me, have been very busy.
@dwsupplee, do I understand correctly that this issue is about adding an analog of @1tron1's code to the Storage API? For example, like this:
use Google\Cloud\Storage\Bucket;

class BucketFileSystem
{
    const SEPARATOR = '/';

    private $bucket, $path, $objects, $objNames, $prefixes;

    public function __construct(Bucket $bucket, $path = [])
    {
        $this->bucket = $bucket;
        $this->path = $path;
        $this->refresh();
    }

    private function refresh()
    {
        $this->objects = $this->bucket->objects([
            // Keep a trailing separator so the prefix matches whole "directories".
            'prefix' => count($this->path) ? implode(self::SEPARATOR, $this->path) . self::SEPARATOR : '',
            'delimiter' => self::SEPARATOR
        ]);

        // Iterate fully first; the prefixes are only complete afterwards.
        $this->objNames = [];
        foreach ($this->objects as $obj) {
            $this->objNames[] = $obj->name();
        }
        $this->prefixes = $this->objects->prefixes();
    }

    public function folders()
    {
        return $this->prefixes;
    }

    public function files()
    {
        return $this->objNames;
    }

    public function ls()
    {
        $combined_list = array_merge($this->folders(), $this->files());

        // Keep only entries below the top level (more than one separator).
        $k = [];
        foreach ($combined_list as $r) {
            if (substr_count($r, self::SEPARATOR) > 1) {
                $k[] = $r;
            }
        }
        return $k;
    }
}
Rewritten code (using the new class):
$projectId = 'projectid';
$storage = new StorageClient(['projectId' => $projectId]);

$fs = new BucketFileSystem(
    $storage->bucket('bucket'),
    ['root_directory']
);

echo "ls: " . json_encode($fs->ls()), PHP_EOL;
echo "files_from_objects: " . json_encode($fs->files()), PHP_EOL;
echo "folders_from_objects: " . json_encode($fs->folders()), PHP_EOL;
Hi, I am in favor of closing this issue.
Although it works for this use case, BucketFileSystem as implemented above is likely not very helpful for all users (repeated API calls, caching, very large buckets, etc.), and we do not see a lot of value in making this change specific to the PHP library.
There are more suitable ways to get a file system which mirrors GCS, such as the gcloud storage CLI.
It is good to keep this contribution for future reference. Thanks for the contributions.