Quick use case: remove a field from a batch of documents in an Elasticsearch index

I have an index that holds inventory data from network devices, and I tried to enrich these documents using Logstash with attributes from different systems, such as billing.

Surprise! After two hours of batch processing with Logstash, I discovered that the new object (the set of billing attributes) was not what I expected, and I started looking for a quick way to delete it from all documents.

The solution was to use _update_by_query, like this:

POST snmp-inventory-devices/_update_by_query?wait_for_completion=true&conflicts=proceed
{
  "script": "ctx._source.remove('billing')",
  "query": {
    "bool": {
      "must": [
        {
          "exists": {
            "field": "billing"
          }
        }
      ]
    }
  }
}

Internally, Elasticsearch does a scan/scroll to collect batches of documents and then updates them through the bulk interface. This is faster than doing it manually with my own scan/scroll client, since it avoids the network and serialization overhead. Each record still has to be loaded into RAM, modified, and written back.
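That internal scan-then-bulk-update behaviour can be sketched with plain Python dicts standing in for Elasticsearch hits (no cluster involved; the function name and default batch size here are illustrative, not part of the real API):

```python
def build_update_batches(hits, field="billing", batch_size=1000):
    """Mimic what _update_by_query does: walk the scrolled hits, drop
    `field` from each source (the equivalent of the script step
    ctx._source.remove('billing')), and group the resulting bulk update
    actions into batches."""
    batch = []
    for hit in hits:
        source = hit["_source"]
        if field in source:
            source.pop(field)  # the script step on one document
            batch.append({"update": {"_id": hit["_id"]}, "doc": source})
        if len(batch) >= batch_size:
            yield batch  # one bulk request's worth of updates
            batch = []
    if batch:  # flush the final partial batch
        yield batch
```

The exists-query part of the request corresponds to the `if field in source` check: documents without the field never generate an update action.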

It's worth setting conflicts=proceed if the cluster has other update traffic; otherwise the whole job will stop the moment it hits a version conflict, i.e. when one of the records is updated underneath one of the batches.
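The difference between proceed and the default abort behaviour can be illustrated with a small sketch, using _seq_no as the version marker the way Elasticsearch does (the function and field names around it are assumptions for the sake of the example):

```python
def apply_updates(store, updates, proceed=True):
    """Sketch of the conflicts= behaviour. `store` maps _id to a doc with
    a _seq_no; each update records the _seq_no it was read at. If another
    writer bumped the _seq_no since the read, that's a version conflict:
    with proceed=True it is counted and skipped, with proceed=False the
    whole job aborts (like the default conflicts=abort)."""
    updated, conflicts = 0, 0
    for u in updates:
        doc = store[u["_id"]]
        if doc["_seq_no"] != u["read_seq_no"]:
            if not proceed:
                raise RuntimeError("version conflict on " + u["_id"])
            conflicts += 1  # skip the stale update and keep going
            continue
        doc["_source"] = u["source"]
        doc["_seq_no"] += 1
        updated += 1
    return {"updated": updated, "version_conflicts": conflicts}
```

The real _update_by_query response reports the skipped documents in a version_conflicts counter in much the same way.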

Similarly, setting wait_for_completion=false makes _update_by_query run via the Tasks API, returning a task id immediately instead of holding the connection open. Otherwise the job will terminate if the connection is closed, for example due to a client timeout.
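With wait_for_completion=false, the response is just a task id (something like {"task": "<node>:<id>"}), which you then poll via GET _tasks/<task_id>. The polling loop can be sketched with the HTTP call injected, so the logic itself runs without a cluster (the function names here are my own, not client API):

```python
import time

def wait_for_task(get_task, task_id, interval=1.0):
    """Poll the Tasks API until the task reports completed. `get_task` is
    an injected callable, e.g. a thin wrapper around an HTTP GET against
    _tasks/<task_id>, returning the parsed JSON body."""
    while True:
        status = get_task(task_id)
        if status.get("completed"):
            return status  # final status, including the task's counters
        time.sleep(interval)
```

Because the long-running work happens server-side, the client can disconnect, reconnect, and resume polling with the same task id without killing the job.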