dvc.org

images: how to improve searchability within site?

Open jorgeorpinel opened this issue 3 years ago • 11 comments

We have many figures and other images in our docs. Notably the ones in Use Cases have a good amount of work behind them. Often these are reused in 3rd party publications, presentations, etc. But going backwards -- finding the use case or other doc from the image -- can be difficult to impossible. Example:

image

Searching "Shared Cache" in our docs engine finds pages, but none have an image in them. No other phrases from the image come close to finding the source page or at least specifically relevant content.

In fact this image (used in our courses) is a reorganized version of what's in https://dvc.org/doc/use-cases/fast-data-caching-hub#example-shared-development-server but figuring that out would be almost impossible without prior knowledge about it.

  • [ ] What other images have this problem?
  • [ ] How can we fix it?

jorgeorpinel avatar Jan 11 '22 05:01 jorgeorpinel

Cc (brought up by) @jendefig

jorgeorpinel avatar Jan 11 '22 05:01 jorgeorpinel

It's a research problem. A few (thousand?) Ph.Ds can be written. May I delve into this? :)

Kidding aside, manual tagging or an image gallery for internal consumption may be good. We could keep higher resolution/editable versions of these files that way as well.

iesahin avatar Jan 11 '22 13:01 iesahin

Maybe adding custom keywords to <img alt= will fix this for our internal content search box? Would need to ask web team cc @julieg18. Same question for search engines, actually. I'm not sure.

jorgeorpinel avatar Jan 11 '22 16:01 jorgeorpinel

The other approach is to use different words in the images themselves, but that could be redundant if the surrounding context already has the same text in titles or captions.

jorgeorpinel avatar Jan 11 '22 16:01 jorgeorpinel

> Maybe adding custom keywords to <img alt= will fix this for our internal content search box? Would need to ask web team cc @julieg18. Same question for search engines, actually. I'm not sure.

I don't think our internal search box scans image alts, but Google does scan image alts along with the visible text of a website. For example, I can find the image with just the alt text of one of the images on the page you mentioned:

image

But it's bad practice to just place keywords inside image alts. Image alts are supposed to be a description of the image itself. It would be better if we updated our image alts to be actual descriptions of the images, including the keywords inside the alt descriptions.
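One way to review whether the alt text on a page is descriptive enough to be worth indexing is to extract it first. The following is a minimal sketch using Python's standard `html.parser`; the sample HTML, file name, and alt text are made up for illustration, and this is not a description of how the dvc.org search box works today.

```python
from html.parser import HTMLParser


class AltTextCollector(HTMLParser):
    """Collect (src, alt) pairs from every <img> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            # Attribute values can be None (e.g. a bare `alt`), so fall back to "".
            src = attr_map.get("src") or ""
            alt = attr_map.get("alt") or ""
            self.images.append((src, alt))


def collect_alt_text(html):
    """Return a list of (src, alt) tuples for all images in the HTML string."""
    collector = AltTextCollector()
    collector.feed(html)
    collector.close()
    return collector.images


# Hypothetical page fragment; the path and description are assumptions.
sample_html = (
    "<p>Intro</p>"
    '<img src="/img/shared-cache.png" '
    'alt="Diagram of a shared cache on a development server">'
)
for src, alt in collect_alt_text(sample_html):
    print(src, "->", alt or "(missing alt text)")
```

Images that come back with an empty alt string are the first candidates for a written description.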

cc @iterative/websites

julieg18 avatar Jan 11 '22 17:01 julieg18

Hmmm yeah maybe it's not that hard to find them, on search engines at least. Did you try image search @jendefig ?

image

jorgeorpinel avatar Jan 11 '22 18:01 jorgeorpinel

It may be possible to run OCR on the images and associate the images with the words found on them. It's a fairly straightforward process. It can even be an example project for pipelines, e.g. searching a set of images on S3 by the words found on them. We can run this from time to time to classify the images we have.
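As a rough illustration of the OCR idea, the sketch below builds a word-to-image index and searches it. It assumes the text has already been extracted from each image by some OCR tool (Tesseract would be one option); the image paths and OCR text in the example are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(ocr_text_by_image):
    """Map each word to the set of images whose OCR text contains it."""
    index = defaultdict(set)
    for image_path, text in ocr_text_by_image.items():
        for word in tokenize(text):
            index[word].add(image_path)
    return index


def search(index, query):
    """Return the images that contain every word of the query, sorted by path."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(matches)


# Hypothetical OCR output keyed by image path (assumed, not real data).
ocr_text_by_image = {
    "img/shared-cache.png": "Shared Cache development server",
    "img/remote-storage.png": "Remote storage cache",
}
index = build_index(ocr_text_by_image)
print(search(index, "shared cache"))
```

Requiring every query word to match keeps results narrow; a ranked "any word" match would be an alternative design if recall matters more than precision.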

iesahin avatar Jan 12 '22 11:01 iesahin

> Hmmm yeah maybe it's not that hard to find them, on search engines at least. Did you try image search @jendefig ?

If that's the case for all of them, that's good; then it just needs to become properly searchable within our own website. But is it possible to work toward showing up for this search? (i.e. a search not including "DVC"; I've heard several community members say they found our site because they googled what they were looking for, not through word of mouth.) It would be great if images could be found that way too.

image

jendefig avatar Jan 13 '22 22:01 jendefig

What shows up when you click on Images from the same search:

image

jendefig avatar Jan 13 '22 22:01 jendefig

> needs to become properly searchable within our own website

Renamed and relabeled the issue then. I'm not sure how to best approach that. It's a technical question for @iterative/websites, I think.

> is it possible to work to showing up with this search

Yes, but it seems difficult: "machine learning storage layers" is a broad idea and the Internet is a huge place. An SEO specialist could come up with a strategy, but I don't know if it's worth the investment. Or, in theory, once DVC grows a lot more and our site becomes an even more authoritative source, we'll naturally start to rank for that search. Maybe we do already, but not on the first page (you can find out with https://search.google.com/search-console) 🙂

jorgeorpinel avatar Jan 14 '22 01:01 jorgeorpinel

> > is it possible to work to showing up with this search
>
> Yes but it seems difficult

p.s. there are basic improvements we can try like making sure the phrase or at least its key words are part of the text in that page (it's missing "storage layers", actually).
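A quick way to spot such gaps is to compare the words of a target query against the page text. This is only a minimal sketch; the page text below is invented for the example and is not the actual content of the page.

```python
import re


def words_in(text):
    """Lowercase text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def missing_keywords(page_text, phrase):
    """Return the words of the phrase that never appear in the page text."""
    page_words = set(words_in(page_text))
    return [word for word in words_in(phrase) if word not in page_words]


# Hypothetical page text (an assumption for illustration only).
page_text = "DVC speeds up machine learning projects with a shared cache."
print(missing_keywords(page_text, "machine learning storage layers"))
```

An empty result means every word of the query already appears somewhere on the page.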

> you can find out with https://search.google.com/search-console

I checked and we do not rank on that query ATM.

jorgeorpinel avatar Jan 14 '22 01:01 jorgeorpinel