Quick use case: remove a field from a batch of documents in an Elasticsearch index
I have an index that holds inventory data from network devices, and I tried to enrich these documents using Logstash with attributes from different systems, such as billing.
Surprise! After 2 hours of the batch running with Logstash, I discovered that the new object added (the set of attributes from billing) was not what I expected, and I started looking for a quick way to delete this object from all documents.
The solution was to use _update_by_query like this:
POST snmp-inventory-devices/_update_by_query?wait_for_completion=true&conflicts=proceed
{
  "script": {
    "source": "ctx._source.remove('billing')",
    "lang": "painless"
  },
  "query": {
    "bool": {
      "must": [
        {
          "exists": {
            "field": "billing"
          }
        }
      ]
    }
  }
}
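Elasticsearch answers with a summary of the run. The numbers below are purely illustrative, but the fields are the ones _update_by_query actually returns:

{
  "took": 147234,
  "timed_out": false,
  "total": 512000,
  "updated": 511987,
  "batches": 512,
  "version_conflicts": 13,
  "noops": 0,
  "failures": []
}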
Internally, Elasticsearch does a scan/scroll to collect batches of documents and then updates them through the bulk interface. This is faster than doing it manually with your own scan/scroll loop because it avoids the network and serialization overhead of round-tripping every batch through a client. Each record still has to be loaded into RAM, modified, and then written back.
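For comparison, a hand-rolled version of the same job would look roughly like the sketch below (the scroll id and document ids are placeholders); this is essentially the loop that _update_by_query runs for you server-side:

# 1. open a scroll over the matching documents
POST snmp-inventory-devices/_search?scroll=1m
{
  "size": 1000,
  "_source": false,
  "query": {
    "exists": {
      "field": "billing"
    }
  }
}

# 2. for each page of hits, send one scripted update per document via _bulk
POST _bulk
{ "update": { "_index": "snmp-inventory-devices", "_id": "<doc id from a hit>" } }
{ "script": { "source": "ctx._source.remove('billing')", "lang": "painless" } }

# 3. fetch the next page, repeating until no hits remain
POST _search/scroll
{ "scroll": "1m", "scroll_id": "<scroll id from the previous response>" }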
Look into setting conflicts=proceed if the cluster has other update traffic; otherwise the whole job will stop the first time it hits a version conflict, which happens when one of the records is updated underneath one of the batches.
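Without conflicts=proceed, the run aborts and the conflict shows up in the failures array of the response; the entry looks roughly like this (index name, document id, and versions are illustrative):

{
  "failures": [
    {
      "index": "snmp-inventory-devices",
      "id": "device-42",
      "cause": {
        "type": "version_conflict_engine_exception",
        "reason": "[device-42]: version conflict, current version [3] is different than the one provided [2]"
      },
      "status": 409
    }
  ]
}

With conflicts=proceed, such documents are simply counted in version_conflicts and the job keeps going.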
Similarly, setting wait_for_completion=false will cause the _update_by_query to run via the task management interface: Elasticsearch returns a task id immediately instead of holding the connection open. Otherwise, a long job can terminate early when the connection is closed by a timeout.
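A minimal sketch of the asynchronous variant, assuming the same index and script; the task id in the GET is the one returned by the POST:

POST snmp-inventory-devices/_update_by_query?wait_for_completion=false&conflicts=proceed
{
  "script": {
    "source": "ctx._source.remove('billing')",
    "lang": "painless"
  },
  "query": {
    "exists": {
      "field": "billing"
    }
  }
}

# the response is just {"task": "<node id>:<task number>"}; poll or cancel it via the tasks API
GET _tasks/<task id from the response>
POST _tasks/<task id from the response>/_cancel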