Upgrade SOLR - currently documented versions are End of Life
I'm concerned that following our CKAN install instructions results in installing old versions of SOLR, which no longer receive security patches. This has been flagged a few times, but I thought it worth collecting the key info together to help agree a way forward.
The SOLR project only supports the last two major versions. At time of writing that's 8.x and 7.7.x. Older versions are 'End of Life' - i.e. no support.
The SOLR versions you get by following CKAN's 'Install from source' and 'Install from package' instructions:
- Ubuntu 18.04 - solr-jetty includes SOLR 3.6.2, released 2013-01-16
- Ubuntu 16.04 - solr-jetty includes SOLR 3.6.2, released 2013-01-16
- Ubuntu 14.04 - is now in [ESM], which requires payment for security patches, so we should probably stop supporting it anyway
I think in reality most production installs of CKAN deviate from these instructions, using more recent docker images or installing later SOLR versions, so there is no need for a security panic. However, different people take different approaches, the tech team receives lots of requests for help, and tips are scattered across lots of different locations (listed below), so it would be helpful if we centrally documented a good approach.
- SOLR 3.6.2 - @boykoc has a hack to fix the solr-jetty problem: https://github.com/ckan/ckan/issues/4762
- SOLR 6.5 - @boykoc has an install recipe: https://github.com/ckan/ckan/issues/4916#issuecomment-515564751
- SOLR 6.5 - @jakubklimek has a similar install recipe https://github.com/ckan/ckan/wiki/Install-and-use-Solr-6.5-with-CKAN
- SOLR 7 - @smotornyuk had a PR for SOLRv7 schema https://github.com/ckan/ckan/pull/4387 although that has been dropped now in favour of a SOLRv8 version
- SOLR 8.4 - @smotornyuk has done a PR for SOLR 8.4 https://github.com/ckan/ckan/pull/5143 although he notes it is experimental and there are potential issues with ckanext-spatial
@TkTech suggests each CKAN version supports the two latest/supported SOLR versions, which I think sounds sensible.
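As a rough illustration of that policy, here is a minimal sketch of a check for whether a given SOLR version falls within the two latest major versions. The function name `solr_major_supported` and the `LATEST_MAJOR` variable are hypothetical, and `LATEST_MAJOR=8` simply reflects the "8.x and 7.7.x" situation at time of writing; it would need updating by hand.

```shell
#!/usr/bin/env bash
# Assumption: the latest SOLR major version at time of writing is 8.
LATEST_MAJOR=8

# Hypothetical helper: succeeds if the version's major number is the
# latest major or the one before it, fails otherwise.
solr_major_supported() {
    local version="$1"
    local major="${version%%.*}"   # "7.7.2" -> "7"
    if [ "$major" -ge $((LATEST_MAJOR - 1)) ] && [ "$major" -le "$LATEST_MAJOR" ]; then
        return 0
    fi
    return 1
}
```

For example, `solr_major_supported 3.6.2 || echo "End of Life"` would flag the version that solr-jetty currently installs. Note it only looks at the major number, so it does not distinguish 7.7.x from earlier 7.x releases.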
Because Ubuntu's packages are always behind SOLR's, @TkTech suggests we document installing a recommended recent SOLR version from source. We should bear in mind that, compared to the single command sudo apt-get install solr-jetty, installing from source needs a bit more work. According to https://www.digitalocean.com/community/tutorials/how-to-install-solr-on-ubuntu-14-04 and https://lucene.apache.org/solr/guide/6_6/taking-solr-to-production.html the steps are:
- download and unpack the tarball
Then either use
sudo bash ./install_solr_service.sh solr-x.y.z.tgz
or manually:
- create /opt/solr/etc/jetty-logging.xml, /etc/init.d/jetty
- create the solr user and home directory (/opt/solr)
(not counting editing /etc/default/jetty which we document already)
I guess we should suggest using install_solr_service.sh, and advanced users can always revert to the manual steps if they see fit.
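The install_solr_service.sh route could be sketched roughly as below. This is not a tested recipe for CKAN: the function names are hypothetical, the download location assumes the Apache archive layout used for SOLR releases up to 8.x, and the tar command that pulls the installer out of the tarball follows the "Taking Solr to Production" guide linked above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the version is passed in, e.g. "x.y.z".

solr_tarball_name() {
    echo "solr-$1.tgz"
}

# Assumption: releases are published under the Apache archive at this path.
solr_tarball_url() {
    echo "https://archive.apache.org/dist/lucene/solr/$1/$(solr_tarball_name "$1")"
}

install_solr() {
    local version="$1"
    local tarball
    tarball="$(solr_tarball_name "$version")"

    # Download the release tarball into the current directory.
    wget "$(solr_tarball_url "$version")" || return 1

    # Extract only the installer script from the tarball.
    tar xzf "$tarball" "solr-$version/bin/install_solr_service.sh" \
        --strip-components=2 || return 1

    # Run the installer, which sets SOLR up as a service.
    sudo bash ./install_solr_service.sh "$tarball"
}

# Nothing is installed when this file is sourced; call install_solr explicitly.
```

Usage would be `install_solr x.y.z` with a real version number. It needs network access and sudo, and CKAN's schema and configuration steps would still follow afterwards.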
I personally would love to have an actual package repository in the official docs for installation. We install solr by hand (or in our case with ansible), but keeping it up to date is manual work as one can't just run apt-get upgrade. This is more of a nitpick though, as such a repository might not exist.
I couldn't see a PPA repo package for it anywhere - everyone just uses the tar it seems.
It's a good point about what you lose by not using OS packaging. Patching is annoying to do, but surely it's just a matter of getting the tar, unpacking it alongside the old version and changing the symbolic link to point to the new version?
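A minimal sketch of that unpack-and-relink idea, assuming the tarball contains a single top-level directory such as solr-x.y.z and that the service runs from a symlink such as /opt/solr. The function name `switch_solr_version` and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Arguments:
#   $1 tarball path, $2 parent directory (e.g. /opt), $3 symlink name (e.g. solr)
switch_solr_version() {
    local tarball="$1" parent="$2" link_name="$3"
    local new_dir

    # Name of the top-level directory inside the tarball, e.g. "solr-x.y.z".
    new_dir="$(tar tzf "$tarball" | head -n 1 | cut -d/ -f1)"
    [ -n "$new_dir" ] || return 1

    # Unpack alongside the existing version; the old directory is left in place.
    tar xzf "$tarball" -C "$parent" || return 1

    # -s: symbolic, -f: replace the existing link,
    # -n: treat an existing link to a directory as a file rather than following it.
    ln -sfn "$parent/$new_dir" "$parent/$link_name"
}
```

In practice this would need sudo, with the SOLR service stopped beforehand and restarted afterwards, e.g. `switch_solr_version solr-x.y.z.tgz /opt solr`. Keeping the old directory makes rolling back a matter of pointing the link back.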
The other thing you lose by not being on apt-get is being reminded to patch. So I'd be interested if anyone has any good ideas about that. Someone operating a CKAN will want to be patching CKAN itself, they can get GitHub to remind them about security updates needed for Python libraries, and use apt-get to update their OS libraries, but you're right that if we install SOLR from a tar then you won't get alerts about that. Or maybe everyone is using docker containers in production, and this would be caught when your daily CI's docker build is uploaded to a container registry, which tends to do a security scan?
I wondered whether patching is a big deal, and maybe it's not so bad. If we look at SOLR 6.6 there were 5 patch updates over 2 years, and 4 of those were due to CVEs:
- 6.6.0 Kerberos token
- 6.6.1 XXE & RCE through XML Query Parser
- 6.6.2 XXE through dataConfig request parameter
- 6.6.5 Request deserialization error
However, most of these would require the SOLR URL to be exposed to the internet. There was one in recent years that was a mild concern to CKAN, and we put something in CKAN itself to mitigate it. So maybe patching SOLR is not worth worrying about too much.
I've mentioned this elsewhere before, but we've had many stackoverflow and mailing list questions over the years about out of date or confusing documentation, because it's simply not possible for us to keep every combination of distro and version up to date.
I want to strip down our documentation and reference the official Postgres and Solr documentation for installation instructions instead. Providing installation documentation for all the available platforms for solr/postgres is their job, not ours. All our documentation should say is the minimum supported version. This is much more in line with the vast, vast majority of other projects.
It's important to remember that the vast majority of CKAN installations are governments (and usually small teams within them), and a large % of them are using (relatively) very old versions of distros such as CentOS/Redhat that have exceptionally slow package updates.
It's also important to keep in mind that the reality of our user base is that outside of the tech-savvy users like data.gov, most CKAN installations are very rarely updated. There are still 50+ sites running CKAN 2.2. Most of these sites are running archaic versions of apache or nginx with outstanding CVEs, let alone CKAN/solr/postgres. They usually get a budget to get the site running, no budget to maintain it, and then a little budget to update it when something embarrassing happens like a homepage defacement. These users are the lowest common denominator we target with our current approach to documentation.
Instead, our documentation should focus on supporting technical users, and provide more direction for others to find managed CKAN services.
My ideal timeline:
- Decide on a supported version policy (my suggestion was the last 2 versions at the time of release)
- Remove the current documentation on solr and postgres, point to the official solr and postgres installation documentation instead (which may very well just be an apt-get install!)
- Include version checking functionality in CKAN core, which provides a notice to sysadmin users when there's a new CKAN update available. This kind of nag has proved successful in WordPress, Drupal, JIRA, etc.
- Make the opening documentation clearer and immediately direct non-technical users to hosted CKAN providers. We should be strongly recommending using a managed service to users that do not have a team available to run it properly, which at this point is practically all public CKAN installations.
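The version check in the timeline above boils down to comparing the running version against the latest release. Here is a minimal sketch of that comparison, assuming a `sort` that supports `-V` (as GNU coreutils does). The function names are hypothetical, and where the latest version number comes from is left open; an implementation in CKAN core would be written in Python rather than shell.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if version $1 is strictly older than $2.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Hypothetical helper: print a notice only when an update is available.
update_notice() {
    local installed="$1" latest="$2"
    if version_lt "$installed" "$latest"; then
        echo "Update available: $installed -> $latest"
    fi
}
```

Version-aware sorting matters here: a plain string comparison would rank 2.10 below 2.9.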
Lots of useful ideas here @TkTech. However I'm not convinced about just pointing everyone at the core docs for Postgres and SOLR. I think Postgres and SOLR do a pretty bad job of explaining how to install, imho.
Postgres installs fine in our instructions on this line:
sudo apt-get install python-dev postgresql libpq-dev python-pip python-virtualenv git-core solr-jetty openjdk-8-jdk redis-server
You don't have to lift the lid and learn all about postgres. I for one am very happy to have that simplification.
Yet postgresql's docs, for Ubuntu, have 3 screens worth. Literally nowhere does it say sudo apt-get install postgres and it is mostly irrelevant information to most of our installing users, technical or otherwise. Does CKAN need the enterprisedb thingy? The client library package as well? At least the building from source is hidden for once. https://www.postgresql.org/download/linux/ubuntu/
And SOLR's instructions focus on running it in a command-prompt. Unless I've missed something there seems to be no appreciation of the things you need to run it on a server, like starting on boot, logging config, running it as a different user.
I think we should document a few lines that quickly install SOLR, which is valuable to all our technical users. Even if they are not on Ubuntu then it's going to be a better place to start than the official docs.
You say that we struggle to document all combinations of distros and versions and I feel that. But I think our original intention is to just document the Ubuntu path, and maybe we should cut that down to a single version of Ubuntu, to simplify the docs.
It's not the install of these things that takes up lots of the docs in general - it is the configuration of them. This seems useful to document. And then there are reminders about how to restart them etc, which again seem pertinent and useful even to experienced operators.
I'm happy to be put straight on any of this :)
I love the idea to add a notice for sysadmins when a new ckan version is available. If we run the service that reports latest versions ourselves this would also give us some real data on how many ckan sites are out there.