dvc.org
images: how to improve searchability within site?
We have many figures and other images in our docs. Notably the ones in Use Cases have a good amount of work behind them. Often these are reused in 3rd party publications, presentations, etc. But going backwards -- finding the use case or other doc from the image -- can be difficult to impossible. Example:
Searching "Shared Cache" in our docs engine finds pages, but none have an image in them. No other phrases from the image come close to finding the source page or at least specifically relevant content.
In fact this image (used in our courses) is a reorganized version of what's in https://dvc.org/doc/use-cases/fast-data-caching-hub#example-shared-development-server but figuring that out would be almost impossible without prior knowledge about it.
- [ ] What other images have this problem?
- [ ] How can we fix it?
Cc (brought up by) @jendefig
It's a research problem. A few (thousand?) Ph.Ds can be written. May I delve into this? :)
Kidding aside, manual tagging or an image gallery for internal consumption may be good. We could keep higher resolution/editable versions of these files that way as well.
Maybe adding custom keywords to `<img alt=` will fix this for our internal content search box? Would need to ask web team cc @julieg18. Same question for search engines, actually. I'm not sure.
The other approach is to use different words in the images themselves, but that could be redundant if the surrounding context already has the same text in titles/captions.
> Maybe adding custom keywords to `<img alt=` will fix this for our internal content search box? Would need to ask web team cc @julieg18. Same question for search engines, actually. I'm not sure.
I don't think our internal search box scans image alts, but Google does scan image alts along with the visible text on a website. For example, I can find the image using just the alt text of one of the images on the page you mentioned:
But it's bad practice to just place keywords inside image alts. Image alts are supposed to be a description of the image itself. It would be better if we updated our image alts to be actual descriptions of the images, with the keywords included in those descriptions.
cc @iterative/websites
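To get a first answer to "what other images have this problem?", the alt texts could be audited automatically. Below is a minimal sketch that assumes the rendered pages are available as HTML strings. The names `ImgAltCollector` and `find_weak_alts`, the `min_words` threshold, and the sample paths are made up for illustration and are not part of the site's code.

```python
from html.parser import HTMLParser


class ImgAltCollector(HTMLParser):
    """Collects (src, alt) pairs for every <img> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            # A missing or valueless alt attribute is treated as an empty string.
            self.images.append((attr_map.get("src", ""), attr_map.get("alt") or ""))


def find_weak_alts(html, min_words=4):
    """Return the src of each image whose alt text has fewer than min_words words.

    min_words is an arbitrary threshold chosen for illustration.
    """
    collector = ImgAltCollector()
    collector.feed(html)
    return [src for src, alt in collector.images if len(alt.split()) < min_words]


# Hypothetical sample page: the first alt is keyword-like, the second is descriptive.
sample_page = (
    '<img src="/img/shared-cache.png" alt="Shared cache">'
    '<img src="/img/dev-server.png" '
    'alt="Diagram of a shared DVC cache on a development server">'
)
weak = find_weak_alts(sample_page)
```

Here `weak` contains only `/img/shared-cache.png`. A word count is a crude proxy for "is this an actual description", so the flagged list would still need a human pass.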
Hmmm yeah maybe it's not that hard to find them, on search engines at least. Did you try image search @jendefig ?
It may be possible to run OCR on the images and associate each image with the words found in it. It's a fairly straightforward process. It could even be an example project for pipelines, e.g. searching a set of images on S3 by the words found in them. We can run this from time to time to classify the images we have.
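A rough sketch of the indexing half of that idea, assuming the OCR step is supplied as a callable that maps an image path to its recognized text (for instance a wrapper around an OCR engine such as Tesseract). The function names and the fake OCR results below are hypothetical, used only to show the shape of the word-to-image index.

```python
import re
from collections import defaultdict


def build_word_index(image_paths, ocr):
    """Map each lowercase word found by OCR to the set of images containing it.

    `ocr` is any callable taking an image path and returning the text
    recognized in that image (an assumption; plug in a real OCR wrapper).
    """
    index = defaultdict(set)
    for path in image_paths:
        for word in re.findall(r"[a-z0-9]+", ocr(path).lower()):
            index[word].add(path)
    return index


def search_images(index, query):
    """Return the images whose OCR text contains every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))


# Stand-in for real OCR output, keyed by hypothetical file names.
fake_ocr_text = {
    "shared-cache.png": "Shared Cache\nDevelopment Server",
    "pipeline.png": "Pipeline stages and cache",
}
index = build_word_index(fake_ocr_text, fake_ocr_text.get)
hits = search_images(index, "shared cache")
```

With the fake data, `hits` is `{"shared-cache.png"}`. The index could be regenerated whenever images change and stored next to the page each image appears on, so a search hit leads back to the source doc.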
> Hmmm yeah maybe it's not that hard to find them, on search engines at least. Did you try image search @jendefig ?
If this is the case for all of them, that's good; then it just needs to become properly searchable within our own website. But is it possible to work toward showing up in this search? (I.e. without including "DVC" in the query. I've heard several community members say they found our site because they googled what they were looking for, not through word of mouth.) It would be great if images could be found that way too.
What shows up when you click on Images from the same search:
> needs to become properly searchable within our own website
Renamed and relabeled the issue then. I'm not sure how to best approach that. It's a technical question for @iterative/websites, I think.
> is it possible to work to showing up with this search
Yes but it seems difficult, "machine learning storage layers" is a broad idea and the Internet is a huge place. An SEO specialist could come up with a strategy but Idk if it's worth the investment. Or in theory once DVC grows a lot more and our site becomes an even more authoritative source, we'll naturally start to rank on that search. Maybe we do already but not on the first page (you can find out with https://search.google.com/search-console 🙂).
> is it possible to work to showing up with this search
>
> Yes but it seems difficult
p.s. there are basic improvements we can try, like making sure the phrase or at least its key words are part of the text on that page (it's missing "storage layers", actually).
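That kind of check is easy to script. A small sketch, assuming the page text has already been extracted as a plain string; `missing_keywords` and the sample sentence are invented for illustration and do not reflect the actual page content.

```python
import re


def missing_keywords(phrase, page_text):
    """Return the words of `phrase` that never appear in `page_text`.

    Matching is case-insensitive and on exact words only (no stemming),
    so "layer" in the page does not count as a match for "layers".
    """
    page_words = set(re.findall(r"[a-z0-9]+", page_text.lower()))
    return [w for w in re.findall(r"[a-z0-9]+", phrase.lower()) if w not in page_words]


# Made-up page text, not copied from the real docs.
sample_text = "DVC adds a caching layer for machine learning data storage."
missing = missing_keywords("machine learning storage layers", sample_text)
```

For the made-up sentence, `missing` is `["layers"]`. Exact-word matching is deliberately strict here; a search engine is more forgiving, so this only points at candidates for wording tweaks.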
> you can find out with https://search.google.com/search-console
I checked and we do not rank on that query ATM.