Allow CEPH OSD to use devices on the systems through values.yml or cli override #158 (still needs docs)
v1k0d3n
left a comment
thank you for submitting a PR! we're going to need a lot of expertise with ceph in the near future.
    privileged: true
    env:
      - name: CEPH_DAEMON
        value: osd_directory
I'd like to see a boolean conditional used for this approach; it allows the deployment to be more flexible. this is partially our fault...we are in the process of defining where we would want conditionals. docs take a while, but will really help new developers. we still need to define what this ceph conditional would be; let me nail this down with @alanmeadows and submit it in the PR notes.
Example of a conditional in a deployment: https://github.com/att-comdev/openstack-helm/blob/master/mariadb/templates/deployment.yaml#L157-L172
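For illustration, a minimal sketch of the boolean-conditional pattern being suggested; the `ceph.enabled` flag is hypothetical, not an existing chart value:

```yaml
# values.yaml (hypothetical flag):
#   ceph:
#     enabled: true
#
# In the deployment template, the guarded block is only rendered
# when the flag is set:
{{- if .Values.ceph.enabled }}
- name: CEPH_DAEMON
  value: osd_directory
{{- end }}
```

The same pattern generalizes to any chart feature that should be switchable from values.yaml or a `--set` override.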
Not necessarily a problem, but there are more than two ceph osd options, I think. Now we may want validators that will fail (if that is possible in helm).
i agree. this is something we have to discuss as well: "what do we do when there are a, b, c, etc. options?" this is something we need to lay out in docs for developers who are contributing. we're at that point now, and we're taking just a little step back from developing so we can lay out clear, sane goals for developers to follow. up to this point in our two-month project we've written issues and folks have plucked out issues they know we want to collectively solve, but this is changing now that we're trying to find a permanent home for the project (i want these docs drafted now).
    be overridden by adding additional --set parameters. This will tell each OSD to use physical disks
    /dev/sdb and /dev/sdc as drives.

    admin@kubenode01:~$ helm install --set network.public=$osd_public_network,osd.daemon="osd",osd.zap="1",osd.disks="0:sdb 1:sdc" --name=ceph local/ceph --namespace=ceph
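For reference, the same overrides could be carried in a values file and passed with helm's `-f` flag instead of a long `--set` string. A hypothetical equivalent, with key names taken from the command above and a placeholder network value:

```yaml
# ceph-overrides.yaml (hypothetical file name)
network:
  public: 192.168.0.0/24   # substitute your $osd_public_network
osd:
  daemon: osd
  zap: "1"
  disks: "0:sdb 1:sdc"
```

Usage would then be `helm install -f ceph-overrides.yaml --name=ceph local/ceph --namespace=ceph`, which keeps site-specific disk layout under version control rather than in shell history.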
very nice. i like the override option. ceph is going to be one of those charts where we're going to need a lot of conditionals, and it will end up very different from how it stands today!
alanmeadows
left a comment
This is a good "real world" enhancement, but I think we need to think through how a single host with multiple disks should be handled (one OSD per disk) and how to get helm to manage that. Will that fit within the existing daemonset model, or do we need to step outside that to support this?
An armchair idea we had was that what would be required is something like:

    hostname:
      /dev/sdc
      /dev/sdd

and so on, as values input, allowing helm to create host-locked deployments (node labels pinned to specific hosts), with specific physical disks fed into each specific deployment.
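As a sketch of what "host-locked" could mean in practice, a per-host deployment pinned to one node via a label selector; the deployment name is illustrative, and `kubernetes.io/hostname` is the label the kubelet sets on every node:

```yaml
# Hypothetical sketch: one OSD deployment locked to kubenode01.
# Specific disks for that host would be fed in as env/values below this.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: ceph-osd-kubenode01
spec:
  replicas: 1
  template:
    spec:
      nodeSelector:
        kubernetes.io/hostname: kubenode01
```

Helm would need to stamp out one such deployment per entry in the host map, which is the part that steps outside the current daemonset model.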
    value: osd_directory
    value: {{ .Values.osd.daemon }}
    # Type, Disks, and Zap are ignored for directory-based OSDs.
    - name: OSD_TYPE
Do any of these need to be conditional if .Values.osd.daemon is osd_directory? It seems like they should, even if just from a clarity perspective, e.g.
{{- if ne .Values.osd.daemon "osd_directory" }}
...
{{- end }}
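Applied to the env vars in this diff, the guard might look like the following; the OSD_TYPE/OSD_DISKS/OSD_ZAP names and the corresponding values keys are assumed from the surrounding context, not confirmed against the chart:

```yaml
env:
  - name: CEPH_DAEMON
    value: {{ .Values.osd.daemon }}
  {{- /* Only meaningful for disk-based OSDs; skipped for osd_directory. */}}
  {{- if ne .Values.osd.daemon "osd_directory" }}
  - name: OSD_TYPE
    value: {{ .Values.osd.type }}
  - name: OSD_DISKS
    value: {{ .Values.osd.disks | quote }}
  - name: OSD_ZAP
    value: {{ .Values.osd.zap | quote }}
  {{- end }}
```

This keeps zap-related settings out of the rendered manifest entirely when they cannot apply, which matters given what zapping does to a disk.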
The entrypoint.sh script doesn't require it, but it would be a safety thing more than anything else.
I'm all for safety when we talk about zapping.
    # Change daemon to osd to use real disks, and set zap to 1
    daemon: osd_directory
    type: devices
    disks: "0:sdb 1:sdc 2:sdd 3:sde"
We may want to examine this approach more closely: isn't the ceph best practice recommendation to leverage one osd per physical disk?
Helm can be used to iterate over the disks, creating an OSD per disk.
That is what the entrypoint script does when you specify the type devices: it actually launches, and aggregates the logging for, each osd. It is all in the single container.
Offhand, I think it would be better to launch a container per device rather than rely on entrypoint.sh launching multiple processes. The suggestion Alan has made from his armchair would be worth exploring here, I think. The primary downside I see is the need to specify the OSD/host topology at chart launch rather than have a somewhat organic approach to deployment, but I'm on the fence as to whether this constraint is actually a limitation or a sensible belt-and-braces approach to ensuring our cluster does what we expect.
If we explored this option (and from that decide to go down this path) it would be fantastic to try and get a somewhat standardised format for this values specification between different storage backends, as I believe Gluster may use a similar pattern currently.
The reason for articulating this at the chart level would be twofold: one is that we do what ceph wants us to do, one osd process per physical device. The second is that it is very clear what is going to happen, at the expense of forcing the operator to be verbose. I think this is a must when we talk about "zapping" disks or start working through workflows on physical host disk replacement. Once a container is running multiple processes, you've just given up several kubernetes process-management resiliency features, such as restarting the container when the process dies. I think supervisor and other intermediaries (entrypoint.sh spawning multiple processes) need to have good reason for being in the middle.
If we find that this causes tremendous operational pain, I'm willing to entertain having entrypoint.sh manage the processes but I think we should try to do the right thing first and see what issues we uncover.
I think ending up with chart overrides that look like:

    osd_map:
      $hostname:
        device_regexp: /dev/sd[e-z]+  # may prove difficult to allow helm to iterate to create container:osd definitions, but would be convenient

or, more likely, to enable helm to iterate over the list creating an osd container per device:

    osd_map:
      $hostname:
        device_list:
          - /dev/sdb
          - /dev/sdc
          ...

is not really requiring the operator to go to extreme lengths, and it supports varying disk capabilities for each host. Also, if there is a discrepancy between what the operator expects to be on the host and what we're able to launch an OSD against, it is immediately realized. For someone doing this at scale, that is an important distinction. I believe an entrypoint.sh may likely hide an issue such as six hosts each with six disks, and one of them only having four online, for instance.
This is further complicated if operators want to dedicate some devices (e.g. SSDs) to journals, as they need to help us identify those.
I would also add that we should think about how we can support /dev/disk/by-* type identification/wildcards for this, as opposed to spelling out actual device paths such as /dev/sdb. In other words, the operator may know each of his hosts has two SSDs for journaling, but they may not be equivalent /dev/sdX on all hosts.
Even if the gluster use case doesn't require anything like an "osd per device", the mapping of disks to rope in for dedicated storage is likely similar so this has value to work through for any storage backend.
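To make the iteration idea concrete, a rough sketch of a template loop over such an osd_map, emitting one OSD container per listed device. The `osd_map`/`device_list` keys and the `OSD_DEVICE` variable are assumptions drawn from the discussion, not existing chart values, and the image is illustrative:

```yaml
{{- /* One OSD container per disk, per host entry in .Values.osd_map. */}}
{{- range $hostname, $host := .Values.osd_map }}
{{- range $disk := $host.device_list }}
- name: osd-{{ base $disk }}
  image: ceph/daemon   # illustrative image reference
  securityContext:
    privileged: true
  env:
    - name: OSD_DEVICE
      value: {{ $disk }}
{{- end }}
{{- end }}
```

A loop like this makes the operator's disk expectations explicit in the rendered manifest, so a host with fewer disks online than declared fails visibly instead of being papered over by entrypoint.sh.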
@alanmeadows - I agree, but I was trying to get something to play with more quickly, to get better feedback on the actual problem space. I also agree that the way ceph chooses to run multiple OSD processes in the same container misses out on some of the k8s/container value, but it does understand the nuances of running multiple OSD processes on the nodes (like port assignment and rotation).
What you describe is fine and actually works quite nicely with what DR can provide from a hardware-enumeration perspective. DR can build drives for nodes in this environment and update the map as appropriate (even with disk/by-* ids). My helm-fu is not good enough to understand or build this yet. I was mostly building this to see what could be done.
So, what do we want to do with this one?
there hasn't been activity on this PR in quite some time. the issue will end up getting dropped once the project is imported into openstack proper (potentially next week).
there doesn't seem to be any interest in following up on the changes that have been requested. closing this PR.
Let's make sure to note links to these somewhere, perhaps in the roadmap document. It contains discussion points we want to preserve because even though the issue itself is closed, the work still needs to be done. The new request, blueprint, or whatever form that takes within OpenStack should benefit from this conversation/direction.
@alanmeadows yes, I'm actually in the process of including these items in the roadmap docs and slating the rework for a release (i'm looking at 0.3.0 or 0.4.0 as a target). the discussion points you made about drive zapping and general drive-layout definition in the chart values are a perfectly valid ask/approach for this rework effort.
I have been trying to bring up OpenStack on Kubernetes deployed by DigitalRebar with Kargo on KVM instances. :-)
On my KVM systems, I have "real" disks attached for ceph consumption. These changes allow the current defaults to be maintained, but real disks can be injected into the configuration. It assumes system homogeneity, but it is a start.