General

Why should I use Arkime?

If you want a standalone open-source full packet capture (FPC) system with metadata parsing and searching, then Arkime is the solution! Full packet capture systems allow network and security analysts to see exactly what happened from a network point of view. Since Arkime is open source, you have complete control of the deployment and architecture. Other FPC systems are available.

How do you pronounce our name?

Arkime is pronounced /ɑːrkɪˈmi/. Read more about why we changed our name here.

Upgrading Arkime

Upgrading Arkime requires you to install major versions in order, as described in the chart below. If your current version isn’t listed, please upgrade to the next-highest version in the chart; you can then install the major releases in order to catch up. New installs can start from the latest version. Unless otherwise stated, you should only need to run db.pl upgrade between versions.

Arkime 4.0+
  Min version to upgrade from: 3.3.0+ (3.4.0 recommended)
  OpenSearch versions: 1.0.0+ (2.3 recommended)
  Elasticsearch versions: 7.10+
  Special instructions: Arkime 4.x instructions

Arkime 3.0+
  Min version to upgrade from: 2.4.0
  OpenSearch versions: 1.0.0+ (2.3 recommended)
  Elasticsearch versions: 7.10+, not 8.x
  Special instructions: Arkime 3.x instructions

Arkime 2.7+
  Min version to upgrade from: 2.0.0
  OpenSearch versions: N/A
  Elasticsearch versions: 7.4+ (7.9.0+ recommended, 7.7.0 broken)
  Special instructions: Elasticsearch 7 instructions

Moloch 2.2+
  Min version to upgrade from: 1.7.0 (1.8.0 recommended)
  OpenSearch versions: N/A
  Elasticsearch versions: 6.8.2+ (6.8.6+ recommended), 7.1+ (7.8.0+ recommended, 7.7.0 broken)
  Special instructions: Moloch 2.x instructions
  Notes: Must already be on Elasticsearch 6.8.x or 7.1+ before upgrading to 2.2

Moloch 2.0, 2.1
  Min version to upgrade from: 1.7.0 (1.8.0 recommended)
  OpenSearch versions: N/A
  Elasticsearch versions: 6.7, 6.8, 7.1+
  Special instructions: Moloch 2.x instructions
  Notes: Must already be on Elasticsearch 6.7 or 6.8 (6.8.6 recommended) before upgrading to 2.0

Moloch 1.8
  Min version to upgrade from: 1.0.0 (1.1.x recommended)
  OpenSearch versions: N/A
  Elasticsearch versions: 5.x or 6.x
  Special instructions: Elasticsearch 6 instructions
  Notes: Must have finished the 1.x reindexing; stop captures for best results

Moloch 1.1.1
  Min version to upgrade from: 0.20.2 (0.50.1 recommended)
  OpenSearch versions: N/A
  Elasticsearch versions: 5.x, or 6.x (new clusters only)
  Special instructions: Instructions
  Notes: Must be on Elasticsearch 5 already

Moloch 0.20.2
  Min version to upgrade from: 0.18.1 (0.20.2 recommended)
  OpenSearch versions: N/A
  Elasticsearch versions: 2.4, 5.x
  Special instructions: Elasticsearch 5 instructions

What operating systems are supported?

We have RPMs/DEBs/ZSTs available on the downloads page. Our own deployments run on RHEL 7 and RHEL 8, using either the pcap or afpacket reader depending on the installation. We recommend using afpacket (tpacketv3) whenever possible. A large amount of development is done on macOS 12.5 using MacPorts or Homebrew; however, it has never been tested in a production setting. :) Arkime is no longer supported on 32-bit machines. Currently we do not support non-LTS Ubuntu releases; there may be library issues.

The following operating systems should work out of the box:

  • Arch
  • CentOS/RHEL 7, 8, 9
  • Amazon Linux 2
  • Ubuntu 18.04, 20.04, 22.04

Arkime is not working

Here is the common checklist to perform when diagnosing a problem with Arkime (replace /opt/arkime with /data/moloch for Moloch builds):

  1. Check that OpenSearch/Elasticsearch is running and GREEN by running curl http://localhost:9200/_cat/health on the machine running OpenSearch/Elasticsearch. An Unauthorized response probably means that you need user:pass in all OpenSearch/Elasticsearch URLs or that you are using the wrong URL.
  2. Check that the db has been initialized with the /opt/arkime/db/db.pl http://elasticsearch.hostname:9200 info command. You should see information about the database version and number of sessions.
  3. Check that viewer is reachable by visiting http://arkime-viewer.hostname:8005 from your browser.
    1. If it doesn’t render, looks strange, or warns of an old browser, use a newer supported browser.
    2. If the browser can't connect and you are sure viewer.js is running, verify there are no firewalls blocking access between your browser and the viewer host.
    3. Make sure viewHost=localhost is NOT set in the config.ini file. Test that curl http://IP:8005 works from the host viewer is running on.
  4. Check for errors in /opt/arkime/logs/viewer.log and that viewer is running with the pgrep -lf viewer command. If the UI looks strange or isn't working, viewer.log will usually have information about what is wrong.
  5. Check for errors in /opt/arkime/logs/capture.log and that capture is running with the pgrep -lf capture command. If packets aren't being processed, or there are other metadata generation issues, capture.log will usually have information about what is wrong, with links to the relevant FAQ answers on how to fix it.
  6. To check that the stats page shows the capture nodes you are expecting, visit http://arkime-viewer.hostname:8005/stats?statsTab=1 in your browser.
    1. If the number of packets being received by any node is low, that node is having issues; check its capture.log.
    2. If the timestamp for any node is over 5 seconds old, that node is having issues; check its capture.log.
    3. If the Disk Q or ES Q for any node is above 50, that node is having issues; check its capture.log.
  7. Disable any bpf= in /opt/arkime/etc/config.ini. If that fixes the issue, read the BPF FAQ answer.
  8. If the browser has "Oh no, Arkime is empty! There is no data to search." but the stats tab shows packets are being captured:
    1. Arkime in live capture mode only writes records when a session has ended. It may take several minutes for a session to show up after a fresh start. See /opt/arkime/etc/config.ini to shorten the timeouts.
    2. OpenSearch/Elasticsearch will only refresh the indices once a minute with the default Arkime configuration. Force a refresh with the curl http://elasticsearch.hostname:9200/_refresh command.
    3. Verify that your time frame for search covers the data (try switching to ALL).
    4. Check that you don’t have a view set.
    5. Check that your user doesn’t have a forced expression set. You might need to ask your Arkime admin.
  9. If you are having packet capture issues, restarting capture after adding a --debug option may print out useful information about what is wrong. You can add multiple --debug options to get even more information. Capture will print out the configuration settings it is using; verify that they are what you expect. Usually this setting is changed in /etc/systemd/system/molochcapture.service. Then run systemctl daemon-reload.
  10. If you are having issues viewing packets that were captured, restarting viewer after adding a --debug option may print out useful information about what is wrong. Usually this setting is changed in /etc/systemd/system/molochviewer.service. Then run systemctl daemon-reload.
    1. Make sure the plugins and parsers directories are correctly set in /opt/arkime/etc/config.ini and readable by the viewer process.
  11. Check the output of the following:
    grep moloch_packet_log /opt/arkime/logs/capture.log | tail
    Verify that the packets number is greater than 0. If not, then no packets were processed. Verify that the first pstats number is greater than 0. If not, Arkime didn't know how to decode any packets.
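
The following is a minimal shell sketch that runs the first few checks from the list above in one pass, assuming a single host running both OpenSearch/Elasticsearch and Arkime (adjust hostnames and paths for your deployment):

# quick health pass; all commands taken from the checklist above
curl http://localhost:9200/_cat/health            # step 1: cluster should be GREEN
/opt/arkime/db/db.pl http://localhost:9200 info   # step 2: db version and session counts
pgrep -lf viewer; pgrep -lf capture               # steps 4/5: are both processes running?
tail /opt/arkime/logs/viewer.log /opt/arkime/logs/capture.log   # recent errors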

How do I reset Arkime?

  1. Leave OpenSearch/Elasticsearch running.
  2. Shut down all running viewer or capture processes so that no new data is recorded.
  3. To delete all the SPI data stored in OpenSearch/Elasticsearch, use the db.pl script with either the init or wipe commands. The only difference between the two commands is that wipe leaves the added users so that they don’t need to be re-added.
    /opt/arkime/db/db.pl http://ESHOST:9200 wipe
  4. Delete the PCAP files. The PCAP files are stored on the file system in raw format. You need to do this on all of the capture machines.
    /bin/rm -f /opt/arkime/raw/*
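
Putting the steps together, a minimal sketch (the arkimeviewer/arkimecapture systemd unit names are assumptions; use whatever your install created):

systemctl stop arkimeviewer arkimecapture      # step 2: stop viewer and capture (assumed unit names)
/opt/arkime/db/db.pl http://ESHOST:9200 wipe   # step 3: delete SPI data, keep users
/bin/rm -f /opt/arkime/raw/*                   # step 4: run on every capture machine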

Self-Signed or Private CA TLS Certificates

The core Arkime team does not support or recommend self-signed certificates, although it is possible to make them work. We suggest using the money saved by not buying a commercial full packet capture product to purchase certificates. Wildcard certificates are now inexpensive, and you can even use free Let's Encrypt certificates. Members of the Arkime Slack workspace may be willing to help out, but the core developers may just link to this answer. Private CA certificates have the same issues and solutions as self-signed certificates.

Potentially the easiest solution is to add the self-signed certificate to the operating system's list of valid certificates or chains. Googling is the best way to figure out how to do this; it is different for almost every OS release and version. You may need to add the certificate to several lists because node (viewer), curl (capture), and perl (db.pl) sometimes use different locations for their list of trusted certificates. Viewer supports a caTrustFile option that was contributed to the project, and since 4.2.0 all pieces of Arkime should support the caTrustFile setting.

Another option is to just turn off certificate checking. Capture, viewer, arkime_add_user.sh, and db.pl can run with --insecure to turn off certificate checking. You will need to add this option to the startup command for both capture and viewer. For example, in the /etc/systemd/system/arkimecapture.service file, change the ExecStart line from ... capture -c ... to ... capture --insecure -c .... You would need to do the same thing for any viewer systemd files.
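
A drop-in override avoids editing the packaged unit file directly; this is a sketch assuming the arkimecapture.service unit name and default paths used elsewhere in this FAQ:

systemctl edit arkimecapture.service
# in the editor, add the following (the empty ExecStart= clears the packaged one):
#   [Service]
#   ExecStart=
#   ExecStart=/bin/sh -c '/opt/arkime/bin/capture --insecure -c /opt/arkime/etc/config.ini'
systemctl daemon-reload
systemctl restart arkimecapture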

How do I upgrade to Moloch 1.x?

Moloch 1.x has some large changes and updates that will require all session data to be reindexed. The reindexing is done in the background AFTER upgrading, so there is little downtime. Large changes in 1.0 include the following:

  • All the field names have been renamed, and analyzed fields have been removed.
  • Country codes are being changed from three characters to two characters.
  • Tags will NOT be migrated if added before 0.14.1.
  • The data for http.hasheader and email.hasheader will NOT migrate.
  • IPv6 is fully supported and uses the OpenSearch/Elasticsearch ip type.

If you have any special parsers, taggers, plugins, or WISE sources, you may need to change configurations.

  • All db fields will need -term removed, or capture won’t start and will warn you.

To upgrade:

  • First make sure you are using Elasticsearch 5.5.x (5.6 recommended) and Moloch 0.20.2 or 0.50.x before continuing. Upgrade to those versions first!
  • Download 1.1.1 from the downloads page.
  • Shut down all capture, viewer, and WISE processes.
  • Install Moloch 1.1.1.
  • Run /data/moloch/bin/moloch_update_geo.sh on all capture nodes; it will download the new mmdb-style MaxMind files.
  • Run db.pl http://ESHOST:9200 upgrade once.
  • Start WISE, then capture, then viewers. Especially watch the capture.log file for any warnings/errors.
  • Verify that NEW data is being collected and is showing up in viewer. All old data will NOT show up yet.

Once 1.1.1 is working, you need to reindex the old session data:

  • Disable any db.pl expire or optimize jobs or curator.
  • Start screen or tmux, because this will take several days.
  • In the /data/moloch/viewer directory, run /data/moloch/viewer/reindex2.js --slices X.
    • The number of slices should be between 2 and the number of shards each index has; the more slices, the faster the conversion, but the more OpenSearch/Elasticsearch CPU will be used. We recommend half the number of shards.
    • You can optionally add an --index option if there are indices you need to reindex first. Otherwise, it will work from newest to oldest.
    • You can optionally add --deleteOnDone, which will delete indices as they are converted, but you may want to try a reindex on one index first to make sure it is working.
  • As reindex runs, old data will show up in viewer.
  • Once the reindex finishes, delete ALL old indices with the following:
    curl -XDELETE 'http://localhost:9200/sessions-*'
  • Then run the db.pl expire/optimize or curator job manually. This will take a while.
  • Now you can reenable any db.pl expire or optimize jobs or curator. Do NOT reenable crons until you let them run and finish manually.

How do I upgrade to Moloch 2.x?

Upgrading to Moloch 2.x is a multistep process that requires an outage. An outage is required because all the captures must be stopped before upgrading the database so that there are no schema issues or corruption. Most of the administrative indices will have new version numbers after this upgrade so that Elasticsearch knows they were created with 6.7 or 6.8. This is very important when upgrading to Elasticsearch 7.x or later.

  • You must be using Moloch 1.7 or 1.8 (Moloch 1.8.0 recommended) BEFORE trying to upgrade to Moloch 2.x.
  • You must be using Elasticsearch 6.7 or 6.8 (Elasticsearch 6.8.6 or later is recommended) BEFORE trying to upgrade to Moloch 2.x.
  • Install Moloch >= 2.0 without restarting captures/viewers.
  • Optional: Run ./db.pl http://ESHOST:9200 backup pre20 to back up all administrative indices.
  • Shut down captures.
  • Run ./db.pl http://ESHOST:9200 upgrade.
  • Restart all capture, multies, and viewers (including both standalone viewers and those running with captures).
  • Verify that everything is working.

How do I upgrade to Arkime 3.x?

Upgrading to Arkime 3.x is a multistep process that requires an outage. An outage is required because all the captures MUST be stopped before upgrading the database so that there are no schema issues or corruption. Do not restart the capture processes until the db.pl upgrade has finished! All of the administrative indices will have new version numbers after this upgrade so that Elasticsearch knows they were created with version 7. This is very important when upgrading to Elasticsearch 8.x or later.

Breaking Changes

  • Elasticsearch before 7.10 is not supported.
  • All indices will now start with arkime_ after upgrading if a prefix was not previously used.
  • multies – The multiESNodes setting requires a name: attribute per entry. Versions 3.0.0–3.3.0 require a prefix: setting; starting with 3.3.1 it defaults to prefix:arkime_.
  • wise – Custom sources will need to be modified to use the new JavaScript class design.
  • wise – Redis URLs have a new standard format.
  • wise – For JSON data, keyColumn has been renamed keyPath.
  • You may need to set the usersPrefix setting if your users index lives on an Arkime cluster that hasn't been upgraded to use arkime_ yet.
  • ilm – You will need to run the db.pl ilm command again after upgrading.

Instructions

  • You must be using Moloch 2.4+ (Moloch 2.7.1 is recommended) BEFORE trying to upgrade to Arkime 3.x.
  • You must be using Elasticsearch 7.10+ (Elasticsearch 7.10.2 or later is recommended) BEFORE trying to upgrade to Arkime 3.x.
  • Optional: Run ./db.pl http://ESHOST:9200 backup pre30 to back up all administrative indices.
  • Install Arkime or Arkime/Moloch Hybrid >= 3.0 without restarting captures/viewers (the hybrid distribution still uses /data/moloch and old binary names).
  • Shut down captures, multies, and viewers.
  • Run ./db.pl http://ESHOST:9200 upgrade [other options], and don't forget to include any other options you usually use, like --replicas or --ilm.
  • Verify that your config.ini and systemd files have the new /opt/arkime path instead of /data/moloch if moving from the Moloch/Hybrid to the Arkime distribution. If you continue to use the Hybrid distribution you do not need to change the paths.
  • If using ILM, run the db.pl ilm command again with all the same options that were used previously.
  • Restart all captures, multies, and viewers (including both standalone viewers and those running with captures).
  • Verify that everything is working.

How do I upgrade to Arkime 4.x?

Upgrading to Arkime 4.x requires that you are already using Arkime 3.3.0 or later. Arkime 4.x uses a new permissions model with roles.

Breaking Changes

  • systemd files are auto installed, but you still need to enable them manually.
  • Permissions are now checked with roles; the userAdmin role is required to edit users.
    For addUser.js, use either the new --roles option or --admin, which sets the superAdmin role.
  • In header auth mode, userAuthIps allows only localhost by default.
  • Encrypted PCAP files now use the .arkime extension.
  • The WISE multiES prefix now defaults to arkime_.
  • There are new defaults for the maxFileSizeG=12 and compressES=true settings.
  • PCAP compression is turned on by default with simpleCompression=gzip; set it to none to disable compression or to zstd to use zstd instead.
  • The right-click group name was changed to value-actions in the configuration file.
  • The userId search on the history page no longer adds the surrounding wildcards automatically. This search box is only available for admin users.

Instructions

  • Install the new Arkime rpm/deb.
  • Shut down ALL captures, multies, and viewers.
  • Run ./db.pl http://ESHOST:9200 upgrade [other options], and don't forget to include any other options you usually use, like --replicas or --ilm.
  • Verify that your config.ini and systemd files have the new /opt/arkime path instead of /data/moloch if moving from the Moloch/Hybrid to the Arkime distribution. If you continue to use the Hybrid distribution, you do not need to change the paths.
  • Restart all captures, multies, and viewers (including both standalone viewers and those running with captures).
  • Verify that everything is working.

OpenSearch/Elasticsearch

Arkime supports both OpenSearch and Elasticsearch, and our goal is to continue to support both. Some older documentation and settings may only refer to Elasticsearch, but OpenSearch should work for Arkime versions supporting Elasticsearch 7+. As OpenSearch and Elasticsearch diverge, we may add features that are only enabled based on which is being used. Arkime will never require any Elasticsearch pay features but may optionally support them.

How many OpenSearch/Elasticsearch nodes or machines do I need?

The answer, of course, is "it depends." Factors include:

  • How much memory each box has.
  • For how many days you want to store metadata (SPI data).
  • How fast the disks are.
  • What percentage of the traffic is HTTP/DNS; these sessions use more OpenSearch/Elasticsearch resources.
  • The average transfer rate of all the interfaces.
  • Whether the sessions are long lived or short lived.
  • How fast response times should be for operators.
  • How many operators are querying at the same time.

The following are some important things to remember when designing your cluster:

  • SPI data is usually kept longer than PCAP data. For example, you may store PCAP data for a week but SPI data for a month.
  • Have OpenSearch/Elasticsearch heap memory equal to at least 1% of the disk space used by OpenSearch/Elasticsearch. For example, if the cluster holds 7 TB of data, then 7 TB * 0.01, or 70 GB, of heap memory is the minimum recommended.
  • Assign half the machine's memory to OpenSearch/Elasticsearch (but no more than 30 G per node; read https://www.elastic.co/blog/a-heap-of-trouble) and half to disk cache.
  • Use at least version 7 of Elasticsearch or version 2.3 of OpenSearch.
  • A quick disk requirement estimate is 5% of the PCAP storage, if storing SPI data for the same amount of time.
  • If you have large machines, you can run multiple nodes per machine, although this complicates deployments.

We have some estimators that may help.

The good news is that it is easy to add new nodes in the future, so feel free to start with fewer nodes. As a temporary fix for capacity problems, you can reduce the number of days of metadata that are stored. You can use the Arkime ES Indices tab to delete the oldest sessions2 or sessions3 index.

Data never gets deleted

The SPI data in OpenSearch/Elasticsearch and the PCAP data are not deleted at the same time. The PCAP data is deleted as the disk fills up on the capture machines. See here for more information. PCAP deletion happens automatically, and nothing needs to be done. The SPI data is deleted either by ILM or when the ./db.pl expire command is run, usually from cron during off-peak hours. Unless you use ILM, SPI data deletion does NOT happen automatically, and a cron job MUST be set up. A cron setup that keeps only 90 days of data and expires at midnight might look like this:

 0 0 * * * /opt/arkime/db/db.pl http://localhost:9200 expire daily 90

So deleting a PCAP file will NOT delete the SPI data, and deleting the SPI data will not delete the PCAP data from disk.

The UI does have commands to delete and scrub individual sessions, but the user must have the Remove Data ability on the users tab. This feature is used for things you don’t want operators to see, such as bad images, and not as a general solution for freeing disk space.

ERROR - Dropping request

This error means that your OpenSearch/Elasticsearch cluster cannot keep up with the number of sessions that the capture nodes are trying to send it, or that too many messages are being sent. You may only see the error message on your busiest capture nodes because capture tries to buffer the requests.

Check the following:

  • If OpenSearch/Elasticsearch is running on the same machine as capture, that is almost certainly the issue. While that is fine for a proof of concept, you will continue to run into problems.
  • The ES Nodes tab of the Stats section has the ability to turn on Write Task completed and rejected columns. Look for OpenSearch/Elasticsearch nodes having issues. Make sure those nodes don’t have disk issues.
  • Make sure each OpenSearch/Elasticsearch node has 30 G of memory and 30 G of disk cache (at least) available to it. So for example, if you are on a 64 G machine, only run 1 OpenSearch/Elasticsearch node on the machine.
  • Try increasing the dbBulkSize to a larger value. Start with 4000000 (4MB); we don't recommend larger than 20MB.
  • Try decreasing the packetThreads to a smaller value. Many folks have set packetThreads too large, which causes extra messages to OpenSearch/Elasticsearch. We recommend starting with packetThreads = 2 x Gbps.
  • OpenSearch/Elasticsearch does NOT perform well if there is one node that is sick. Check all the node hardware, disks, RAID, etc. Make sure that on the ES Nodes tab there isn't a single node with a high OS Load and low Write/s, which might indicate an issue.
  • Make sure swap is turned off or swappiness is 0 on OpenSearch/Elasticsearch machines.
  • If you are running multiple OpenSearch/Elasticsearch nodes, make sure the disks can support the IOPS load. It is usually best to have each OpenSearch/Elasticsearch node use its own disk.
  • Make sure you are running the latest OpenSearch/Elasticsearch version that the version of Arkime supports; for example, 7.10.2+ if using Elasticsearch 7 or 2.3+ if using OpenSearch.
  • If using replication on the sessions index, turn off replication for the current day and replicate only previous days. To do this, first turn off replication in the sessions template by running ./db.pl upgrade without the --replicas option, then add --replicas 1 to your daily ./db.pl expire run.
  • Make sure there is at most one shard of each sessions index per node. If there are more, run ./db.pl upgrade again.

If these don’t help, you need to add more nodes or reduce the number of sessions being monitored. You can reduce the number of sessions with packet-drop-ips, bpf filters, or rules files, for example. A config sketch with starting values for the settings above follows.
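
A config.ini sketch of starting points for the two settings discussed above (values are illustrative, not prescriptions; tune them against your own drop rate):

[default]
dbBulkSize=4000000    # ~4MB bulk requests; we don't recommend going above ~20MB
packetThreads=4       # roughly 2 x Gbps of monitored traffic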

When do I add additional nodes? Why are queries slow?

If queries are too slow, the easiest fix is to add additional OpenSearch/Elasticsearch nodes. OpenSearch/Elasticsearch doesn’t perform well if Java hits an OutOfMemory condition. If you ever have one, you should immediately delete the oldest *sessions* index, update the daily expire cron to delete more often, and restart the OpenSearch/Elasticsearch cluster. Then you should order more machines. :)

Removing Nodes

  1. Go into the Arkime stats page and the ES Shards subtab.
  2. Click on the nodes you want to remove and exclude them.
  3. Wait for the shards to be moved.
  4. If no shards move, you may need to configure OpenSearch/Elasticsearch to allow two shards per node, although a larger number may be required if you are removing many nodes.
    curl -XPUT 'localhost:9200/sessions*/_settings' -H 'Content-Type: application/json' -d '{
      "index.routing.allocation.total_shards_per_node": 2
    }'
  5. If there are many shards that need to be redistributed, the defaults might take days, which is easy on the cluster. To go faster, increase the speed from the default 3 streams at 20 mb (60 mb/sec) to something higher, like 6 streams at 50 mb (300 mb/sec). Adjust for the speed of the new nodes' disks and network.
    curl -XPUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{"transient":{
      "indices.recovery.concurrent_streams":6,
      "indices.recovery.max_bytes_per_sec":"50mb"}
    }'

How to enable OpenSearch/Elasticsearch replication

Turning on replication will consume twice the disk space on the nodes and increase the network bandwidth between nodes, so make sure you actually need replication.

To change future days, run the following command:

db/db.pl http://ESHOST:9200 upgrade --replicas 1

To change past days but not the current day, run the following command:

db/db.pl http://ESHOST:9200 expire <type> <num> --replicas 1

We recommend the second solution because it allows current traffic to be written to OpenSearch/Elasticsearch once, and during off peak the previous day's traffic will be replicated.
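
For example, combining this with the daily expire cron job shown earlier in this FAQ (90 days of SPI data, replicating all but the current day):

0 0 * * * /opt/arkime/db/db.pl http://localhost:9200 expire daily 90 --replicas 1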

How do I upgrade OpenSearch/Elasticsearch?

In general, if upgrading between minor or build versions of Elasticsearch, you can perform a rolling upgrade with no issues. Follow Elastic's instructions for the best results. Make sure you select the matching version of that document for your version of Elasticsearch from the dropdown menu on the right side of the screen.

Upgrading between major versions of Elasticsearch usually requires an upgrade of Arkime. See the following instructions:

How do I upgrade to Elasticsearch 8.x?

Elasticsearch 8.x is NOT supported before Arkime 3.4.1, and we recommend that you use Arkime 4.x.

  1. You must first upgrade to Arkime 3.4.1 or higher and run db.pl http://ESHOST:9200 upgrade while still using Elasticsearch 7.
  2. Elasticsearch 8 can only perform upgrades from Elasticsearch 7.17 or later, so you will need to upgrade Elasticsearch to 7.17 or later.
  3. Make sure your Elasticsearch configuration files are ready for Elasticsearch 8. We do not provide sample Elasticsearch configurations, but here are some things to look out for:
    • By default, Elasticsearch 8 enables HTTPS and passwords; make sure you update your Arkime configuration file to use them.
    • There are several configuration variable changes. We suggest trying your elasticsearch.yml configuration file with a test cluster.
    • You may want to read the Elasticsearch 8.0 breaking changes.
  4. Follow Elastic's rolling upgrade instructions.

How do I upgrade to Elasticsearch 7.x?

  • Elasticsearch 7.x is supported by Moloch 2.x only if there are no Elasticsearch 5.x–created indices remaining. We recommend you upgrade to Elasticsearch 7.8.x or later.
  • Upgrading to Elasticsearch 7 MAY REQUIRE downtime.

  1. If you are NOT using Arkime DB version 63 (or later), you must follow these instructions while still using Elasticsearch 6.x and upgrade to Moloch 2.x. To find what DB version you are using, either run db.pl http://ESHOST:9200 info or mouse over the logo in Arkime.
  2. Make sure your Elasticsearch configuration files are ready for Elasticsearch 7. We do not provide sample Elasticsearch configurations, so review the Elasticsearch 7 breaking changes for things to look out for.
  3. Now you need to upgrade from Elasticsearch 6 to Elasticsearch 7. There are two options:
    1. Upgrading to Elasticsearch 7 if using Elasticsearch 6.8.6 (or later) can be done with a rolling upgrade. Follow Elastic's instructions for the best results. You do NOT need to stop capture/viewer, but after the rolling upgrade is finished, you may want to restart capture everywhere.
    2. If you are not using Elasticsearch 6.8.6, or if you would prefer to perform a full restart, follow the instructions below:
      1. Make sure you delete any old indices that db.pl notified you about when you installed Moloch 2.x.
      2. Shut down everything: Elasticsearch, capture, and viewer.
      3. Upgrade Elasticsearch to 7.x (7.8.0 or later is recommended).
      4. Start the Elasticsearch cluster.
      5. Wait for the cluster to go GREEN. This will take LONGER than usual as Elasticsearch upgrades indices from the 6.x to the 7.x format.
        curl http://localhost:9200/_cat/health
      6. Start viewers and captures.

How do I upgrade to Elasticsearch 6.x?

Elasticsearch 6.x is supported by Moloch 1.x for NEW clusters and >= 1.5 for UPGRADING clusters.

NOTE – If upgrading, you must FIRST upgrade to Moloch 1.0 or 1.1 (1.1.1 is recommended) before upgrading to 1.5 or later. Also, all reindex operations need to be finished.

We do NOT provide Elasticsearch 6 startup scripts or configuration, so if upgrading, make sure you get startup scripts working on test machines before shutting down your current cluster.

Upgrading to Elasticsearch 6 will REQUIRE two downtimes.

First outage: If you are NOT using Moloch DB version 51 (or later), you must follow these steps while still using Elasticsearch 5.x. To find what DB version you are using, either run db.pl http://ESHOST:9200 info or mouse over the logo in Moloch.

  • Install Moloch >= 1.5.
  • Shut down capture.
  • Run ./db.pl http://ESHOST:9200 upgrade.
  • Restart capture.
  • Verify that everything is working.
  • Make sure you delete the old indices that db.pl notified you about.

Second outage: Upgrade to Elasticsearch 6.

  • Make sure you delete the old indices that db.pl notified you about.
  • Shut down everything.
  • Upgrade Elasticsearch to 6.x.
  • WARNING – path.data will have to be updated to access your old data. If you had path.data: /data/foo, you will probably need to change to /data/foo/<clustername>.
  • Start the Elasticsearch cluster.
  • Wait for the cluster to go GREEN. This will take LONGER than usual as Elasticsearch upgrades indices from the 5.x to the 6.x format.
    curl http://localhost:9200/_cat/health
  • Start viewers and captures.

How do I upgrade to Elasticsearch 5.x?

Elasticsearch 5.x is supported by Moloch 0.17.1 for NEW clusters and 0.18.1 for UPGRADING clusters.

Elasticsearch 5.0.x, 5.1.x, and 5.3.0 are NOT supported because of Elasticsearch bugs/issues. We currently use 5.6.7.

WARNING – If you have sessions-* indices created with Elasticsearch 1.x, you can NOT upgrade. Those indices will need to be deleted.

We do NOT provide Elasticsearch 5 startup scripts, so if upgrading, make sure you get startup scripts working on test machines before shutting down your current cluster.

Upgrading to Elasticsearch 5 may REQUIRE 2 downtime periods of about 5–15 minutes each.

First outage: If you are NOT using Moloch DB version 34 (or later), you must follow these steps while still using Elasticsearch 2.4. To find what DB version you are using, either run db.pl http://ESHOST:9200 info or mouse over the logo in Moloch.

  • Upgrade to Elasticsearch 2.4.x.
  • Check for a GREEN Elasticsearch cluster:
    curl http://localhost:9200/_cat/health
  • Install Moloch 0.18.1 to 0.20.2.
  • Shut down all capture nodes.
  • Run ./db.pl http://ESHOST:9200 upgrade.
  • Start up captures and make sure everything works.
  • You can remain on Elasticsearch 2.4.x until you want to try Elasticsearch 5.

Second outage: Upgrade to Elasticsearch 5.

  • You MUST be using Elasticsearch 2.4.x and Moloch DB version 34 (or later) before using Elasticsearch 5 (see above).
  • Shut down EVERYTHING (Elasticsearch, viewer, capture).
  • Upgrade Elasticsearch to 5.6.x.
  • Start the Elasticsearch cluster.
  • Wait for the cluster to go GREEN. This will take LONGER than usual as Elasticsearch upgrades indices from the 2.x to the 5.x format.
    curl http://localhost:9200/_cat/health
  • Start viewers and captures.

version conflict, current version [N] is higher or equal to the one provided [M]

This error usually happens when the capture process is trying to update the stats data and falls behind. Arkime will continue to function while this error occurs with the stats or dstats index; however, it does usually mean that your Elasticsearch cluster is overloaded. You should consider increasing your Elasticsearch capacity by adding more nodes, CPU, and/or more memory. If increasing Elasticsearch capacity isn't an option, then reduce the amount of traffic that Arkime processes.
If the N vs. M version numbers are very different from each other, it usually means that you are running two nodes with the same node name at the same time, which is not supported.

Recommended OpenSearch/Elasticsearch settings

Here are some of our recommended OpenSearch/Elasticsearch settings. Many of these can be updated on the fly, but it is still best to put them in your elasticsearch.yml file. We strongly recommend using the same elasticsearch.yml file on all hosts. Things that need to be different per host can be set with variables.

Disk Watermark

You will probably want to change the watermark settings so you can use more of your disk space. You have the option to use ALL percentages or ALL values, but you can’t mix them. The most common sign of a problem with these settings is an error that has FORBIDDEN/12/index read-only / allow delete in it. You can use ./db.pl http://ESHOST:9200 unflood-stage _all to clear the error once you adjust the settings and/or delete some data. Elasticsearch Docs

cluster.routing.allocation.disk.watermark.low: 97%
cluster.routing.allocation.disk.watermark.high: 98%
cluster.routing.allocation.disk.watermark.flood_stage: 99%

Or, if you want more control, use values instead of percentages:

cluster.routing.allocation.disk.watermark.low: 300gb
cluster.routing.allocation.disk.watermark.high: 200gb
cluster.routing.allocation.disk.watermark.flood_stage: 100gb

Shard Limit

If you have a lot of shards that you want to be able to search against at once, raise the shard count limit. Elasticsearch Docs

action.search.shard_count.limit: 100000

Write Queue Limit

This no longer needs to be set since Elasticsearch 7.9. If you hit a lot of bulk failures, raising it can help, but Elastic doesn’t recommend raising it too much. In older versions of Elasticsearch it is named thread_pool.bulk.queue_size, so check the docs for your version. Elasticsearch Docs

thread_pool.write.queue_size: 10000

HTTP Compression

HTTP compression is on by default in most versions. Elasticsearch Docs

http.compression: true

Recovery Time

To speed up recovery times and startup times, there are a few controls to experiment with. Make sure you test them in your environment and slowly increase them because they can break things badly. Elasticsearch Allocation Docs and Elasticsearch Recovery Docs

cluster.routing.allocation.cluster_concurrent_rebalance: 10
cluster.routing.allocation.node_concurrent_recoveries: 5
cluster.routing.allocation.node_initial_primaries_recoveries: 5
indices.recovery.max_bytes_per_sec: "400mb"

Logging

By default, Elasticsearch has logging set to debug level in log4j2.properties. For busy clusters, change this to info level to lower CPU and disk usage.

logger.action.level = info

Using ILM with Arkime

Since Moloch 2.2, you can easily use ILM to move indices from hot to warm, force merge, and delete. We recommend only using ILM with newer versions (7.2+) of Elasticsearch because older versions had some issues. Once ILM is enabled, you no longer have to use the db.pl expire cron job but should occasionally run db.pl optimize-admin.

ILM is only included in the free "basic" Elasticsearch license, so it is not part of the Elasticsearch OSS distribution, and you may need to upgrade. Arkime does NOT currently support the ILM auto-rollover feature, for search performance reasons.

These instructions assume you are using db.pl or the Arkime UI to set up ILM, which use a special molochtype attribute name. You can also create the ILM config with Kibana and not use the molochtype attribute name, but you will then need to do everything on your own. In order for ILM to work correctly with Arkime, follow these five important steps:

  1. If you are using a hot/warm design, or might in the future, add a line to the elasticsearch.yml file of each Elasticsearch node with node.attr.molochtype: warm or node.attr.molochtype: hot.
  2. Create the molochsessions and molochhistory ILM policies. This can be done with Kibana, but we recommend the db.pl ilm command.
  3. Assign the molochsessions and molochhistory ILM policies to all the existing indices. Kibana or db.pl ilm can perform this action.
  4. Change the moloch templates to use the ILM policies for NEW indices. You'll need to rerun db.pl upgrade ... with --ilm added to the command. Also add --hotwarm if using a hot/warm design.
  5. Replace the previous db.pl expire cron job with db.pl optimize-admin.

So for example, to create a new policy that keeps 30 weeks of history, 90 days of SPI data, and 1 replica, and optimizes all indices older than 25 hours, you would run:

./db.pl http://localhost:9200 ilm 25h 90d --history 30 --replicas 1

You would then need to run upgrade with all the arguments you usually use, plus --ilm:

./db.pl http://localhost:9200 upgrade --replicas 1 --shards 5 --ilm


Capture

What kind of capture machines should I buy?

The goal of Arkime is to use commodity hardware. If you start thinking about using SSDs or expensive NICs, research whether it would just be cheaper to buy one more box. An extra box gains you more retention and can bring down the cost of each machine.

Some things to remember when selecting a machine:

  • An average of 1 Gbps of network traffic requires 11 TB of disk a day. For example, to store 7 days of traffic averaging 2.5 Gbps, you need 7*2.5*11, or 192.5 TB of disk space.
  • The total bandwidth number must include both RX and TX bandwidth. For example, a 10 G link can really produce up to 20 G of traffic to capture: 10 G in each direction. Include both directions in your calculations.
  • Don’t overload network links. Monitoring a 10 G link with an average of 4 Gbps RX AND 4 Gbps TX should use two 10 G capture links because 8 Gbps is close to the max.
  • Arkime requires all packets from the same 5-tuple to be processed by the same capture process.

When selecting Arkime capture boxes, standard "Big Data" boxes might be the best bet ($10k–$25k each). Look for:

  • CASE: There are many 4RU boxes out there. If space is an issue, there are more expensive 2RU boxes that hold over 20 drives (examples: HPE Apollo 4200, Supermicro 6028R-E1CR24L, or Dell R740XD2).
  • MEMORY: 64 GB to 96 GB (or more if running other tools)
  • OS DISKS: We like RAID 1 small drives. SSDs are nice but not required.
  • CAPTURE DISKS: 20+ x 4 TB or larger SATA drives. Don’t waste money on enterprise/SAS/15k drives.
  • RAID: A hardware RAID card with at least 1 G cache (2 G is better). We like RAID 5 with 1 hot spare or RAID 6 (with better cards).
  • NIC: We like newer Intel-based NICs, but most should work fine (you might want to get one compatible with PF_RING).
  • CPU: At least 2 x 6 cores. The higher the average Gbps, the more speed/cores required.

We are big fans of using network packet brokers (NPBs) ($6k+). They allow multiple taps/mirrors to be aggregated and load balanced across multiple capture machines. Read more in the following sections.

What kind of NPB should I buy?

We are big fans of using NPBs, and we recommend that medium or large Arkime deployments use one. See the MolochON 2017 NPB Preso.

Main advantages:

  • Easy horizontal scaling of Arkime
  • Load balancing of traffic
  • Filtering of traffic before it hits the Arkime boxes
  • Easier to add more Arkime capacity or other security tools
  • Don’t have to worry as much about new links being added by the network team

Features to look for:

  • Load balancing
  • Consistent symmetric hashing (this means each direction of the flow goes out the same tool port)
  • MPLS/VLAN/VPN header stripping (optional; some tools don’t like all the headers)
  • Tool link detection and failover
  • Automation capability (can you use Ansible/APIs, or are you stuck using a web UI?)
  • Enough ports to support future tap and tool growth
  • Whether the features desired require an extra (expensive?) component and/or license

Just like with Arkime with commodity hardware, you don’t necessarily have to pay a lot of money for a good NPB. Some switch vendors offer switches that can operate in switch mode or NPB mode, so you might already have gear you can use.

Sample vendors

What kind of packet capture speeds can arkime-capture handle?

On basic commodity hardware, it is easy to get 3 Gbps or more, depending on the number of CPUs available to Arkime and what else the machine is doing. Many times, the limiting factor is the speed of the disks and RAID system. See Architecture and Multiple Host for more information. Arkime allows multiple threads to be used to process the packets.

To test the local RAID device, use:

dd bs=256k count=50000 if=/dev/zero of=/THE_ARKIME_PCAP_DIR/test oflag=direct

To test a NAS, leave off the oflag=direct and make sure you write at least 3x the amount of memory so that cache isn't a factor:

dd bs=256k count=150000 if=/dev/zero of=/THE_ARKIME_PCAP_DIR/test

This is the MAX disk performance. Run it several times if desired and take the average. If you don’t want to drop any packets, you shouldn’t average more than ~80% of the MAX disk performance. If you are using RAID and don’t want to drop packets during a future rebuild, ~60% is a better value. Remember that most network numbers will be in bits, while the disk performance will be in bytes, so you’ll need to adjust the values before comparing. For example, if dd reports 800 MB/s, that is roughly 6.4 Gbps, so with the ~80% rule you should plan for no more than about 5 Gbps of sustained traffic.

Arkime requires full packet captures error

When you get an error about the capture length not matching the packet length, it is NOT an issue with Arkime. The issue is with the network card settings.

By default, modern network cards offload work that the CPUs would otherwise need to do. They will defragment packets or reassemble TCP sessions and pass the results to the host. However, this is NOT what we want for packet capture; we want what is actually on the network. So you will need to configure the network card to turn off all the features that hide the real packets from Arkime.

The sample config files (/opt/arkime/bin/arkime_config_interfaces.sh) turn off many common features, but there are still some possible problems:

  1. If using containers or VMs for Arkime, you may need to turn off the features on the physical interface the VM interface is mapped to from the host OS, instead of inside the container/VM.
  2. If using a fancy card, there may be other features that need to be turned off.
    1. You can usually find them with ethtool -k INTERFACE | grep on. Anything that is still on, turn off and see if that fixes the problem. Items that say [fixed] can NOT be disabled with ethtool.
    2. For example: ethtool -K INTERFACE tx off sg off gro off gso off lro off tso off

There are two workarounds:

  1. If you are reading from a file, you can set readTruncatedPackets=true in the config file; this is the only solution for saved .pcap files.
  2. You can increase the max packet length with snapLen=65536 in the config file; this is not recommended.

Why am I dropping packets? (and Disk Q issues)

There are several different types of packet drops and reasons for packet drops:

Arkime Version

Please make sure you are using a recent version of Arkime. Constant improvements are made and it is hard for us to support older versions.

Kernel and TPACKET_V3 support

The most common cause of packet drops with Arkime is leaving the reader default of libpcap instead of switching to tpacketv3, pfring, or one of the other high-performance packet readers. We strongly recommend tpacketv3. See the plugin settings for more information.

Network Card Config

Make sure the network card is configured correctly by increasing the ring buffer to its max size and turning off most of the card’s features. The features are not useful anyway, since we want to capture what is on the network instead of what the local OS sees. Example configuration:

# Set ring buffer size; see the max with ethtool -g eth0
ethtool -G eth0 rx 4096 tx 4096
# Turn off features; see available features with ethtool -k eth0
ethtool -K eth0 rx off tx off sg off tso off gso off

If Arkime was installed from the deb/rpm and the Configure script was used, this should already be done by /data/moloch/bin/moloch_config_interfaces.sh (or /opt/arkime/bin/arkime_config_interfaces.sh on Arkime builds).

packetThreads and the PacketQ is overflowing error

The packetThreads config option controls the number of threads processing the packets, not the number of threads reading the packets off the network card. You only need to change the value if you are getting the Packet Q is overflowing error. The packetThreads option is limited to 24 threads, but usually you only need a few. Configuring too many packetThreads is actually worse for performance; start with a lower number and increase it slowly. You can also change the size of the packet queue by increasing the maxPacketsInQueue setting.

To increase the number of threads the reader uses, please see the documentation for the reader you are using on the settings page.
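
A config.ini sketch of the settings discussed above (values are illustrative; only raise maxPacketsInQueue if the overflow error persists after tuning threads):

[default]
packetThreads=4           # packet-processing threads; start low and increase slowly
maxPacketsInQueue=300000  # size of the packet queue (illustrative value)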

Disk and Disk Q issues

In general, errors about the Disk Q being exceeded are NOT a problem with Arkime but usually an issue with either the hardware or the packet rate exceeding what the hardware can save to disk. You will usually need to either fix/upgrade the hardware or reduce the amount of traffic being saved to disk.

  • Make sure swap has been disabled, swappiness is 0, or at the very least, isn’t writing to the disk being used for PCAP.
  • Make sure the RAID isn’t in the middle of a rebuild or something worse. Most RAID cards will have a status of OPTIMAL when things are all good and DEGRADED or SUBOPTIMAL when things are bad.
  • To test the RAID device use:
    dd bs=256k count=50000 if=/dev/zero of=/THE_ARKIME_PCAP_DIR/test oflag=direct
    This is the MAX disk performance. Run it several times if desired and take the average. If you don’t want to drop any packets, you shouldn’t average more than ~80% of the MAX disk performance. If using RAID and you don’t want to drop packets during a future rebuild, ~60% is a better value. Remember that most network numbers will be in bits while the disk performance will be in bytes, so you’ll need to adjust the values before comparing.
  • Make sure you actually have enough disk write throughput capacity and disks. For example, for a 1G link with RAID 5 you may need:
    • At least 4 spindles if using a RAID 5 card with write cache enabled.
    • At least 8 spindles (or more) if using a RAID 5 card with write cache disabled.
  • Make sure your RAID card can actually handle the write rate. Many onboard RAID 5 controllers can not handle sustained 1G write rates.
  • Switch to RAID 0 from RAID 5 if you can live with the TOTAL data loss on a single disk failure.
  • If you are using xfs, make sure you use the mount options defaults,inode64,noatime (see the fstab sketch after this list).
  • Don’t run capture and OpenSearch/Elasticsearch on the same machine.
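
For example, an /etc/fstab entry with those xfs options might look like the following (device and mount point are placeholders):

/dev/sdb1  /opt/arkime/raw  xfs  defaults,inode64,noatime  0 0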

If using EMC for disks:

  • Make sure write cache is enabled for the LUNs.
  • If it is a CX with SATA drives, RAID-3 is optimized for large sequential I/O.
  • Monitor the EMC LUN queue depth; too many hosts may be sharing it.

To check your disk IO run iostat -xm 5 and look at the following:

  • wMB/s gives the current write rate; does it match what you expect?
  • avgqu-sz should be near or less than 1; otherwise Linux is queueing I/O instead of doing it.
  • await should be near or less than 10; otherwise the IO system is slow, which will slow capture down.

Other things to do/check:

  • If using RAID 5, make sure you have write cache enabled on the RAID card; sometimes this is called WriteBack. Make sure the BBU is still good, or that write cache stays enabled even when the BBU isn't working or is missing.
    • Adaptec Example: arcconf SETCACHE 1 LOGICALDRIVE 1 WBB
    • HP Example: hpssacli ctrl slot=0 modify dwc=enable
    • MegaCLI: MegaCli64 -LDSetProp -ForcedWB -Immediate -Lall -aAll ; MegaCli64 -LDSetProp Cached -L0 -a0 -NoLog

Other

  • There are conflicting reports that disabling irqbalance may help.
  • Check that the CPU you are giving capture isn’t handling lots of interrupts (cat /proc/interrupts).
  • Make sure other processes aren’t using the same CPU as capture.

WISE

  • Cyclical packet drops may be caused by bad connectivity to the WISE server. Verify that WISE responds quickly by running the following on the capture host that is dropping packets:
    curl http://arkime-wise.hostname:8081/views

High Performance Settings

See settings

How do I import existing PCAPs?

Think of the capture binary much like you would tcpdump. The capture binary can listen to live network interface(s), or read from historic packet capture files. Currently Arkime works best with PCAP files, not PCAPng.

${install_dir}/bin/capture -c [config_file] -r [PCAP file]

For an entire directory, use -R [PCAP directory]

See ${install_dir}/bin/capture --help for more info. Common options include --monitor to monitor non-NFS directories, --skip to skip already-loaded PCAP files, and -R to process directories. Multiple -r and -R options can be used. An example follows.
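
For example, to load an entire directory of PCAP files while skipping any that were already loaded (the directory path is a placeholder):

/opt/arkime/bin/capture -c /opt/arkime/etc/config.ini -R /path/to/pcaps --skip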

If Arkime is failing to load a PCAP file, check the following things:

  • Use PCAP-formatted files, not PCAPng.
  • Make sure the PCAP files contain IP traffic; Arkime currently ignores ARP and other traffic.
  • Try running capture with --debug, which might warn of not understanding the link type or GRE tunnel type. (Please open issues for unknown link or GRE types.)

Enable Arkime UI to upload

It is also possible to enable uploads in the Arkime UI. This is less efficient than using capture directly, since the file is uploaded and then capture is run on it for you. Just uncomment the uploadCommand in the config.ini file.

How do I monitor multiple interfaces?

The easy way is to use the interface setting in your config.ini. It supports a semicolon (';') separated list of interfaces to listen on for live traffic. If you want to set a tag or another field per interface, use the interfaceOps setting.
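
For example (interface names are placeholders):

[default]
interface=eth2;eth5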

The hard way is to run multiple capture processes:

  • Arkime by default uses the unqualified hostname as the name of the Arkime node, so you’ll need to come up with a naming scheme. Appending a, b, c, or the interface number to the hostname are possible methods.
  • Edit /opt/arkime/etc/config.ini and create a section for each of the Arkime nodes. Assuming the defaults are correct in the [default] section, the only thing that MUST be set is the interface item. It is also common to have each Arkime node talk to a different OpenSearch/Elasticsearch node if running a cluster of OpenSearch/Elasticsearch nodes. arkime-m01 is an EXAMPLE node name.
    [arkime-m01a]
    interface=eth2
    [arkime-m01b]
    interface=eth5
  • If hostname + domainname on the machine doesn’t return a FQDN, you’ll also need to set a viewUrl or, more easily, use the --host option.
  • You'll need two systemd service files, modified to use the two different node names. Something like:
    mv /etc/systemd/system/arkimecapture.service /etc/systemd/system/arkimecapture1.service
    cp /etc/systemd/system/arkimecapture1.service /etc/systemd/system/arkimecapture2.service
  • Now edit those two files and add the -n option to the ExecStart lines after the capture. Something like ExecStart=/bin/sh -c '/opt/arkime/bin/capture -n arkime-m01a -c /opt/arkime/etc/config.ini' and ExecStart=/bin/sh -c '/opt/arkime/bin/capture -n arkime-m01b -c /opt/arkime/etc/config.ini'
  • Now you can use systemd to start them: systemctl daemon-reload; systemctl start arkimecapture1; systemctl start arkimecapture2

You only need to run one viewer on the machine. Unless it is started with the -n option, it will still use the hostname as the node name, so any special settings need to be set there (although default is usually good enough).

Arkime capture crashes

Please file an issue on github with the stack trace.

  • You’ll need to allow suid/user-changing programs to save core dumps. Use sysctl to change the setting until the next reboot; setting it back to 0 restores the default.
    sysctl -w fs.suid_dumpable=2
  • The user that Arkime switches to must be able to write to the directory that capture is running in.
  • Run capture and get it to crash.
  • Look for the most recent core file.
  • Run gdb (you may need to install the gdb package first)
    gdb /opt/arkime/bin/capture corefilename
  • Get the backtrace using the bt command.

If it is easy to reproduce, sometimes it’s easier to just run gdb as root:

  • Run gdb capture as root.
  • Start Arkime in gdb with run ALL_THE_ARGS_USED_FOR_ARKIME-CAPTURE_GO_HERE.
  • Wait for crash.
  • Get the backtrace using bt command.
  • Sometimes you need to put a breakpoint in g_log first: b g_log

ERROR - pcap open failed

Usually capture is started as root so that it can open the interfaces and then it immediately drops privileges to dropUser and dropGroup, which are by default nobody:daemon. This means that all parent directories need to be either owned or at least executable by nobody:daemon and that the pcapDir itself must be writeable.

How to reduce amount of traffic/pcap?

Listed in order from highest to lowest benefit to Arkime (a config sketch follows the list):

  1. Setting the bpf= filter will stop Arkime from seeing the traffic.
  2. Adding CIDRs to the packet-drop-ips section will stop Arkime from adding packets to the PacketQ.
  3. Using Rules, it is possible to control whether the packets are written to disk or the SPI data is sent to OpenSearch/Elasticsearch.
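
A config.ini sketch of the first two options (the filter and CIDR are placeholders; see the settings page for the exact packet-drop-ips syntax in your version):

[default]
bpf=not host 10.20.30.40   # example filter; capture never sees this host's traffic

[packet-drop-ips]
10.0.0.0/8=drop            # assumed CIDR=drop form; dropped before the PacketQ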

Life of a packet

Arkime capture supports many options for controlling which packets are captured, processed, and saved to disk.

  • The first gatekeeper, and the most important, is the bpf filter, bpf= in the config file. This filter can be implemented in the kernel, the network card, libpcap, or network drivers. It is a single filter, and it controls what Arkime capture "sees" or doesn’t "see". Any packet dropped because of the bpf filter is usually not counted in ANY Arkime stats, although some implementations do expose stats.
  • Arkime does a high-level decode of the ethernet, IP, and IP protocol information and sees if it understands it. If it doesn’t support it, Arkime will discard the packet.
  • Arkime checks the packet-drop-ips config section to see if the IPs involved are marked to be discarded. If there are only a few IPs to drop, bpf= should be used; otherwise this is much more efficient than a huge bpf.
  • For TCP packets, Arkime checks previously matched rules that set a _dropByDst or _dropBySrc timeout; packets that match are discarded.
  • Arkime picks a packet queue to send the packet to; if the packet queue is too busy, it will drop the packet. Potentially increase packetThreads or maxPacketsInQueue if too many packets are being dropped here.
  • A packet queue will start processing a packet and update all the stats and basic information for the session the packet is associated with.
  • The sessionSetup rules for first packets in a session are executed, which might set operations that control packet saving.
  • If this is the first packet of the session the packet queue will then check all the dontSaveBPFs, and if one matches it will save off the max number of packets to save for the session. This will override the maxPackets config setting.
  • If this is the first packet of the session AND no dontSaveBPFs matched, the packet queue will then check all the minPacketsSaveBPFs and save off a min number of packets that must be received.
  • Finally, Arkime goes to save the packet. If it has already saved the max number of packets for the session (set by rules or dontSaveBPFs), OR if another method (plugin) said to stop saving packets for the session, the packet won’t be saved.
  • If the number of packets for the session is greater than maxPackets, the session will be saved and a new linked session will be started for future packets. The beforeMiddleSave and beforeBothSave rules will be executed before saving.
  • The packet queue sends the packet off to the various classifiers and parsers to gather more metadata. The afterClassify rules will be executed, and if any fields are set during this processing, the fieldSet rules will be executed. Rules may change whether future packets are saved.
  • At some point in the future the session will hit one of the timeouts, and the session will be saved if enough packets have been saved to meet the minimum number of packets received setting per session (defaults to 0). The beforeFinalSave and beforeBothSave rules will be executed.

PCAP Deletion

PCAP deletion is actually handled by the viewer process, so make sure the viewer process is running on all capture boxes. The viewer process checks on startup and then every minute to see how much space is available, and if it is below freeSpaceG, then it will start deleting the oldest file. The viewer process will log every time a file is deleted, so you can figure out when a file is deleted if you need to. If the viewer complains about not finding the PCAP data, make sure you check the viewer.log.

Note: freeSpaceG can also be a percentage; the default is freeSpaceG=5%. The viewer process will always leave at least 10 PCAP files on disk, so make sure there is room for at least maxFileSizeG * 10 capture files, or 120G by default.
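
As a sketch, the relevant settings look like this (the values shown are the defaults described above):

[default]
# Start deleting the oldest PCAP files when free space drops below 5%
freeSpaceG=5%
# Max size of each PCAP file; since viewer always leaves at least 10 files,
# budget at least maxFileSizeG * 10 (120G by default) of disk
maxFileSizeG=12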

If still having PCAP delete issues:

  1. Make sure freeSpaceG is set correctly for the environment and verify the setting by running viewer with --debug.
  2. Make sure there is free space where viewer is writing its logs.
  3. Make sure viewer can reach OpenSearch/Elasticsearch.
  4. Make sure that dropUser or dropGroup can actually delete files in the PCAP directory and has read/execute permissions in all parent directories. For example, you need to check the permissions on /opt, /opt/arkime, and /opt/arkime/raw. The PCAP files will have read/write permissions, which is normal.
  5. Make sure the PCAP directory is on a filesystem with at least maxFileSizeG * 10 space available.
  6. If there is a mismatch between the files in the directory and the files on the Files tab run the db.pl http://localhost:9200 sync-files command
  7. Make sure the files on the Files tab don’t have locked set; viewer won’t delete locked files
  8. Try restarting viewer
  9. The viewer.log should show that viewer is trying to delete files; look for the string "Deleting" in the viewer.log
  10. Restart viewer with debugging turned on: either add --debug to the start line or add debug=1 to the [default] section of your config.ini file.
  11. If using SELinux (check with sestatus), temporarily disable it (setenforce 0) and see if that fixes the problem.

dontSaveBPFs doesn’t work

There are several common reasons dontSaveBPFs might not work for you.

  1. Look at the saved PCAP, not the packet count in the UI; Arkime will still count the packets, it just won't save them
  2. Make sure you've spelled it dontSaveBPFs, case matters
  3. Make sure you've placed dontSaveBPFs in the correct section; you can verify by starting capture with --debug and looking at the output
  4. Turns out BPF filters are tricky. :) When the network is using vlans, then at compile time, BPFs need to know that fact. So instead of a nice simple dontSaveBPFs=tcp port 443:10 use something like dontSaveBPFs=tcp port 443 or (vlan and tcp port 443):10. Basically FILTER or (vlan and FILTER). Information from here.
  5. Try testing your filter manually with tcpdump, you should only see the traffic you want to drop. So something like tcpdump -i INTERFACE tcp port 443 for example.

If still having issues, you might just try out an Arkime Rules file. Arkime converts dontSaveBPFs into a rule for you behind the scenes, so Arkime Rules are actually more powerful.
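
For example, here is a minimal rules-file sketch, modeled on the Arkime rules documentation, that stops saving packets for TLS sessions after 10 packets. Field and op names may vary by version, so treat this as a starting point rather than a drop-in file.

---
version: 1
rules:
  - name: "Save only 10 packets of TLS sessions"
    when: "fieldSet"
    fields:
      protocols:
        - tls
    ops:
      _maxPacketsToSave: 10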

Zero or missing bytes PCAP files

Arkime buffers writes to disk, which is great for high-bandwidth networks, but bad for low-bandwidth networks. How much data is buffered is controlled with pcapWriteSize, which defaults to 262144 bytes. An important thing to remember is that the buffer is per thread, so set packetThreads to 1 on low-bandwidth networks. Buffered PCAP will be written after 10 seconds of no writes; however, the last pagesize bytes (usually 4096) will remain buffered.
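
On a low-bandwidth network the settings described above would look like this (a sketch; the pcapWriteSize line just restates the default for clarity):

[default]
# Per-thread PCAP write buffer in bytes (this is the default)
pcapWriteSize=262144
# A single packet thread means a single write buffer to wait on
packetThreads=1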

An error that looks like ERROR - processSessionIdDisk - SESSIONID in file FILENAME couldn't read packet at FILEPOS packet # 0 of 2 usually means either that the PCAP is still being buffered and you need to wait for it to be written to disk, or that capture or the host previously crashed/restarted before the PCAP could be written to disk.

You can also end up with many zero byte PCAP files if the disk is full, see PCAP Deletion.

Can I virtualize Arkime with KVM using OpenVswitch?

In small environments with low amounts of traffic this is possible. With Open vSwitch you can create a mirror port from a physical or virtual adapter and send the data to another virtual NIC as the listening interface. In KVM, one issue is that it isn’t possible to increase the buffer size past 256 on the adapter when using the Virtio network adapter (mentioned in another part of the FAQ); without a larger buffer, Arkime capture will continuously crash. To solve this in KVM, use the E1000 adapter and configure the buffer size accordingly. Then set up the SPAN port on Open vSwitch to send traffic to it: https://www.rivy.org/2013/03/configure-a-mirror-port-on-open-vswitch/.

Installing MaxMind Geo free database files

MaxMind recently changed how you download their free database files. You now need to sign up for an account and set up the geoipupdate program. If using a version of Moloch before 2.2, you will need to edit your config.ini file and update the geolite paths.

Instructions:

  1. Sign up for a MaxMind account (no purchase required)
  2. Wait for MaxMind email and set your password
  3. Install the geoipupdate tool and note the version installed; for many distributions you can just run yum install geoipupdate or apt-get install geoipupdate
  4. Create a license key
  5. Select Yes when asked "Will this key be used for GeoIP Update?" and select the version you have
  6. Use the MaxMind feature to generate a config file for you; usually you will replace /etc/GeoIP.conf with this file (see the sketch after this list)
  7. Run geoipupdate as root and see if it works
  8. If you are using Moloch before 2.2, update your /data/moloch/etc/config.ini file so that geoLite2Country is now /usr/share/GeoIP/GeoLite2-Country.mmdb and geoLite2ASN is now /usr/share/GeoIP/GeoLite2-ASN.mmdb
  9. Restart capture
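
The generated /etc/GeoIP.conf will look roughly like the sketch below for recent geoipupdate versions (older versions use different key names; the account values here are placeholders):

# /etc/GeoIP.conf - generated for you by MaxMind in step 6
AccountID 123456
LicenseKey 0123456789abcdef
EditionIDs GeoLite2-ASN GeoLite2-Country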

What do these log lines mean?

Arkime logs a lot of information for debugging purposes. Much of this information is for bug reports, but it can also be used to figure out what is going on. You may need to use --debug to enable these messages.

HTTP Responses

Jan 01 01:01:01 http.c:369 moloch_http_curlm_check_multi_info(): 8000/30 ASYNC 200 http://eshost:9200/_bulk 250342/5439 14ms 12345ms
Field                                 Meaning
Jan 01 01:01:01                       Date
http.c:369                            File name:line number
moloch_http_curlm_check_multi_info    Function name
8000/30                               8000 queued requests to the server / 30 connections to the server
ASYNC                                 Asynchronous request (SYNC for a synchronous request)
200                                   HTTP status code
http://eshost:9200/_bulk              Requested URL
250342/5439                           250342 bytes uploaded (CURLINFO_SIZE_UPLOAD) / 5439 bytes downloaded (CURLINFO_SIZE_DOWNLOAD)
14ms                                  Time to connect to the server (CURLINFO_CONNECT_TIME)
12345ms                               Total request time (CURLINFO_TOTAL_TIME)

Periodic Packet Progress

Jan 01 01:01:01 packet.c:1185 moloch_packet_log(): packets: 3911000000 current sessions: 41771/45251 oldest: 0 - recv: 4028852297 drop: 123 (0.00) queue: 1 disk: 2 packet: 3 close: 4 ns: 5 frags: 0/1988 pstats: 4132185901/1/2/3/4/5/6
Field                           Meaning
Jan 01 01:01:01                 Date
packet.c:1185                   File name:line number
moloch_packet_log               Function name
packets: 3911000000             3911000000 packets will be processed by the packet queues; these packets made it past the corrupt and packet-drop-ips checks and are most likely understood
current sessions: 41771/45251   41771 monitored sessions of the current session type (usually tcp) / 45251 monitored sessions total
oldest: 0                       In the current session type queue, the oldest session should be idled out in 0 seconds
recv: 4028852297                4028852297 packets received by the interface since process start, as reported by the reader's stats api
drop: 123                       123 packets dropped by the interface, as reported by the reader's stats api
(0.00)                          Percentage of packets dropped by the interface, as reported by the reader's stats api
queue: 1                        1 bulk request waiting to be sent to the OpenSearch/Elasticsearch servers; each bulk request may hold multiple sessions
disk: 2                         2 disk buffer writes outstanding; each buffer holds multiple packets
packet: 3                       3 packets waiting to be processed across all the packet queues
close: 4                        4 tcp sessions marked for closing (RST/FIN), waiting on the last few packets
ns: 5                           5 sessions ready to be saved but waiting on a plugin doing async work, such as WISE
frags: 0/1988                   First number is always 0 / 1988 current ip frags waiting to be matched
pstats: 4132185901/1/2/3/4/5/6  4132185901 packets successfully sent to a packet queue / 1 dropped by the packet-drop-ips config / 2 dropped because the packet queues were overloaded / 3 dropped as corrupt / 4 dropped because how to process them was unknown / 5 dropped by ipport rules / 6 dropped by packet deduping (2.7.1 enablePacketDedup)


Viewer

Where do I learn more about the expressions available

Click on the owl and read the Search Bar section. The Fields section is also useful for discovering fields you can use in a search expression.

Exported PCAP files are corrupt, sometimes session detail fails

The most common cause of this problem is that the timestamps between the Arkime machines are different. Make sure ntp is running everywhere so that the timestamps stay in sync.

Map counts are wrong

  • The source and destination IPs are each counted, so the map should total twice the number of sessions.
  • Currently OpenSearch/Elasticsearch only has accurate counts up to 2 billion uniques.
  • Some countries aren’t shown, but they can still be searched using their ISO-3 code (versions before 1.0) or ISO-2 code (1.0 and later).

What browsers are supported?

Recent versions of Chrome, Firefox, and Safari should all work fairly equally. Below are the minimum versions required. We aren’t kidding.

Arkime Version Chrome Firefox Opera Safari Edge IE
Prior to 3.0 53 54 40 10 14 Not Supported
3.0 and beyond 80 74 67 13.1 80 Not Supported

Development and testing is done mostly with Chrome on a Mac, so it gets the most attention.

Error: getaddrinfo EADDRINFO

This seems to be caused when proxying requests from one viewer node to another and the machines don’t use FQDNs for their hostnames and the short hostnames are not resolvable by DNS. You can check if your machine uses FQDNs by running the hostname command. There are several options to resolve the error:

  1. Use the --host option on capture
  2. Configure the OS to use FQDNs.
  3. Make it so DNS can resolve the shortnames or add the shortnames to the hosts file.
  4. Edit config.ini and add a viewUrl for each node. This part of the config file must be the same on all machines (we recommend you just use the same config file everywhere). Example:
    [node1_eth0]
    interface=eth0
    viewUrl=http://node1.fqdn
    [node1_eth1]
    interface=eth1
    viewUrl=http://node1.fqdn
    [node2]
    interface=eth1
    viewUrl=http://node2.fqdn

How do I proxy Arkime using Apache

Apache, and other web servers, can be used to provide authentication or other services for Arkime when set up as a reverse proxy. When a reverse proxy is used for authentication it must be inline, and authentication in Arkime will not be used; however, Arkime will still do the authorization. Arkime will use a username that the reverse proxy passes to it as an HTTP header for settings and authorization. See the architecture page for diagrams. While operators will use the proxy to reach the Arkime viewer, the viewer processes still need direct access to each other.

  • If you are using SELinux in enforcing mode you may need to make changes for things to work, or disable SELinux. It has been reported that
    setsebool -P httpd_can_network_connect 1
    is required.
  • Install Apache, turn on the auth method of your choice. This example also uses HTTPS from Apache to Arkime, but if on localhost that isn’t required. Configure it to set a special header for Arkime to check. In this example ARKIME_USER is the header that is being set from a variable, if your auth method already sets a header use that.
    AuthType your_auth_method
    Require valid-user
    RequestHeader set ARKIME_USER %{your_auth_method_concept_of_username_variable_probably_REMOTE_USER}e
  • Make sure mod_ssl is loaded, and set up an SSL proxy:
    SSLProxyEngine On
    #ProxyRequests On # You probably don't want this line
    ProxyPass        /arkime/ https://localhost:8005/ retry=0
    ProxyPassReverse /arkime/ https://localhost:8005/
  • Restart Apache.
  • Using the Arkime UI (by going directly to a non-proxied Arkime), make sure "Web Auth Header" is checked for the users.
  • Edit Arkime’s config.ini
    • Create a new arkime-proxy section (you can use any name) for the Arkime proxy.
    • Set userNameHeader to the lower-case version of the header Apache is setting. NOTE: the userNameHeader setting is only needed on viewers that Apache talks to; don't set it on all of them.
    • Set the webBasePath to the ProxyPass location used above. All other sections should NOT have a webBasePath.
    • Add a viewHost=localhost, so externals can’t just set the userNameHeader and access Arkime with no auth:
      [arkime-proxy]
      userNameHeader=arkime_user
      webBasePath = /arkime/
      viewPort = 8005
      viewHost = localhost
  • Start the arkime-proxy viewer. For this example you would add -n arkime-proxy to the ExecStart line of your systemd file (/etc/systemd/system/molochviewer.service by default), after viewer.js, so viewer uses that section
  • To prevent the users from going directly to Arkime in the future, scramble their passwords. You might want to leave an admin user that doesn’t use the Apache auth. Or you can temporarily add one with the addUser.js script.
  • If experiencing issues, try running viewer with --debug by editing the systemd file and restarting viewer

I still get prompted for password after setting up Apache auth

  1. Make sure the user has the "Web Auth Header" checked
  2. Make sure in the viewer config userNameHeader is the lower case version of the header Apache is using.
  3. Run viewer.js with a --debug and see if the header is being sent.

How do I search multiple Arkime clusters

It is possible to search multiple Arkime clusters by setting up a special Arkime MultiViewer and a special MultiES process. The MultiES process is similar to Elasticsearch tribe nodes, except it was created before tribe nodes and can deal with multiple indices having the same name. The MultiViewer talks to MultiES instead of a real OpenSearch/Elasticsearch instance. Currently one big limitation is that all Arkime clusters must use the same serverSecret.

To use MultiES, create another config.ini file or section in a shared config file. Both multies.js and the special "all" viewer can use the same node name. See Multi Viewer Settings for more information.

# viewer/multies node name (-n allnode)
[allnode]
# The host and port multies is running on, set with multiESHost:multiESPort usually just run on the same host
elasticsearch=127.0.0.1:8200
# This is a special multiple arkime cluster viewer
multiES=true
# Port the multies.js program is listening on, elasticsearch= must match
multiESPort = 8200
# Host the multies.js program is listening on, elasticsearch= must match
multiESHost = localhost
# Semicolon list of OpenSearch/Elasticsearch instances, one per arkime cluster.  The first one listed will be used for settings
# You MUST have a name set
multiESNodes = http://escluster1.example.com:9200,name:escluster1,prefix:PREFIX;http://escluster2.example.com:9200,name:escluster2
# Uncomment if not using different rotateIndex settings
#queryAllIndices=false

Now you need to start up both the multies.js program and viewer.js with the same config file AND -n allnode. All other viewer settings, including webBasePath can still be used.
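
For example, assuming a default install where multies.js and viewer.js live in /opt/arkime/viewer (an assumption; adjust paths to your install):

cd /opt/arkime/viewer
/opt/arkime/bin/node multies.js -c /opt/arkime/etc/config.ini -n allnode
/opt/arkime/bin/node viewer.js -c /opt/arkime/etc/config.ini -n allnode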

By default, the users table comes from the first cluster listed in multiESNodes. This can be overridden by setting usersElasticsearch and optionally usersPrefix in the multi viewer config file.

How do I use self-signed SSL/TLS Certificates with MultiES?

Since 4.2.0, MultiES supports the caTrustFile setting.
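
A minimal sketch of the setting (the section name matches the MultiES example above; the path is an example):

[allnode]
# PEM file containing one or more trusted CA certificates
caTrustFile=/opt/arkime/etc/CAcerts.pem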

Prior to 4.2.0, you will need to create a file, for example CAcerts.pem, containing one or more trusted certificates in PEM format.

Then start MultiES with the NODE_EXTRA_CA_CERTS environment variable set to the path of the file you just created, for example:

NODE_EXTRA_CA_CERTS=./CAcerts.pem /opt/arkime/bin/node multies.js -c /opt/arkime/etc/config.ini -n allnode

How do I reset my password?

An admin can change anyone’s password on the Users tab by clicking the Settings link in the Actions column next to the user.

A password can also be changed by using the addUser script, which will replace the entire account if the same userid is used. All preferences and views will be cleared, so creating a secondary admin account may be a better option if you need to change an admin user's password. After creating a secondary admin account, change the user's password and then delete the secondary admin account.

node addUser -c <configfilepath> <user id> <user friendly name> <password> [--admin]
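
For example, creating a hypothetical secondary admin account (all values here are illustrative):

node addUser -c /opt/arkime/etc/config.ini admin2 "Secondary Admin" S3cretPassw0rd --admin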

Error: Couldn’t connect to remote viewer, only displaying SPI data

Viewers have the ability to proxy traffic for each other. This relies on Arkime node names being mapped to hostnames. Common problems arise when systems don’t use FQDNs or certs don’t match.

How do viewers find each other

First the SPI records are created on the capture side.

  1. Each capture gets a nodename, either from the -n command line option or everything in front of the first period of the hostname.
  2. Each capture writes a stats record every few seconds that has the mapping from the nodename to the FQDN. It is possible to override the FQDN with the --host option to capture.
  3. Each SPI record has a nodename in it.

When PCAP is retrieved from a viewer it uses the nodename associated with the SPI record to find which capture host to connect to.

  1. Each arkime-viewer process gets a nodename, either by the -n command line option or everything in front of the first period of the hostname.
  2. If the SPI record nodename is the same as the arkime-viewer nodename it can be processed locally, STOP HERE. This is the common case with one arkime node.
  3. If the stats[nodename].hostname is the same as the arkime-viewer’s hostname (exact match) then it can be processed locally, STOP HERE. Remember this is written by capture above, either the FQDN or --host. This is the common case with multiple capture processes per capture node.
  4. If we make it here, the PCAP data isn’t local and it must be proxied.
  5. If there is a viewUrl set in the [nodename] section, use that.
  6. If there is a viewUrl set in the [default] section, use that.
  7. Use stats[nodename].hostname:[nodename section - viewPort setting]
  8. Use stats[nodename].hostname:[default section - viewPort setting]
  9. Use stats[nodename].hostname:8005

Possible fixes

First, look at viewer.log on both the viewer machine and the remote machine and see if there are any obvious errors. The most common problems are:

  1. Not using the same config.ini on all nodes can make things a pain to debug and sometimes not even work. It is best to use the same config with different sections for each node name, [nodename].
  2. The remote machine doesn’t return a FQDN from the hostname command AND the viewer machine can’t resolve just the hostname. To fix this, do ONE of the following:
    1. Use the --host option to capture and restart capture
    2. Make it so the remote machine returns a FQDN (run hostname "fullname" as root and edit /etc/sysconfig/network)
    3. Set a viewUrl in each node section of the config.ini. If you don’t have a node section for each host, you’ll need to create one.
    4. Edit /etc/resolv.conf and add search foo.example.com, where foo.example.com is the subdomain of the hosts. Basically, you want it so "telnet shortname 8005" works on the viewer machine to the remote machine.
  3. The remote machine’s FQDN doesn’t match the CN or SANs in the cert it is presenting. The fixes are the same as #2 above.
  4. The remote machine is using a self-signed cert. To fix this, either turn off HTTPS or see the certificate answer above.
  5. The remote machine can’t open the PCAP. Make sure the dropUser user or dropGroup group can read the PCAP files. Check the directories in the path too.
  6. Make sure all viewers are either using HTTPS or not using HTTPS; if only some are using HTTPS, then you need to set viewUrl for each node.
    1. When troubleshooting this issue, it is sometimes easier to disable HTTPS everywhere
  7. If you want to change the hostname of a capture node:
    1. Change your mind :)
    2. Reuse the same node name as previously with a -n option
    3. Use the viewUrl for that old node name that points to the new host.

Compiled against a different Node.js version error

Arkime uses Node.js for the viewer component and requires many packages to work fully. These packages must be compiled with, and run using, the same version of Node.js. An error like … was compiled against a different Node.js version using NODE_MODULE_VERSION 48. This version of Node.js requires NODE_MODULE_VERSION 57. means that the version of Node.js used to install the packages and the version used to run them are different.

This shouldn’t happen when using the prebuilt Arkime releases. If it does, then double check that /opt/arkime/bin/node is being used to run viewer.

If you built Arkime yourself, this usually happens when you have a different version of node in your path (see the check after this list). You will need to rebuild Arkime and either:

  • Remove the OS version of node
  • Make sure /opt/arkime/bin is in your path before the OS version of node
  • Use the --install option to easybutton which will add to the path for you
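
A quick sanity check (a sketch; paths assume a default install):

# Which node is first in your PATH, and what version is it?
which node
node --version
# Viewer should be run with the bundled node:
/opt/arkime/bin/node --version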

How do I change the port viewer listens on?

By default viewer listens on port 8005. Changing this can be tricky, especially for a port less than 1024, like 443. You should definitely read the How do viewers find each other section.

Scenario: change all nodes to a port > 1024
Set viewPort in the [default] section on ALL nodes.

Scenario: change a single node to a port < 1024 (such as 443), remaining nodes (if any) unchanged
Unless a program runs as root, it can NOT listen on ports less than 1024. Since viewer by default drops privileges before listening, even if you start it as root, it isn't root anymore when it tries to listen on the port. Possible solutions:
  • Use a reverse proxy like Apache/Nginx. This is a great option for a central node that needs to be behind SSO, with all other nodes blocked from direct user access
  • Use iptables to forward from the new port to 8005. Something like
    iptables -t nat -I PREROUTING -p tcp --dport 443 -j REDIRECT --to-ports 8005
  • Fool around with the systemd CAP_NET_BIND_SERVICE setting
  • Comment out the dropUser setting and change the viewPort setting in a [$nodename] section.

Scenario: all nodes, port < 1024
Just don't. :) If you must, most of the solutions above will work, but don't use the reverse proxy solution, since viewer nodes need to talk to each other WITHOUT external authentication.

Hunts not working

For Hunts to work properly you must set cronQueries=true on one and only one node.
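
For example, in the config of the single designated node (a minimal sketch; the section depends on your layout):

[default]
# Enable periodic queries and hunts; set on exactly one node
cronQueries=true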

If cronQueries is properly set up on a single node, and hunts still aren't working, make sure the cronQueries node is running and checking in. You can check this on the Stats -> ES Nodes tab and/or check the viewer logs.

See more information about the cronQueries setting here.


Parliament

Sample Apache Config

Parliament is designed to run behind a reverse proxy such as Apache. Basically, you just need to tell Apache to send all root requests and any /parliament requests to the Parliament server.

ProxyPassMatch   ^/$ http://localhost:8008/parliament retry=0
ProxyPass        /parliament/ http://localhost:8008/parliament/ retry=0


WISE

WISE is not working

Here is the common checklist:

  1. Check that WISE is running
    curl http://localhost:8081/fields
    You should see a list of fields that WISE knows about.
  2. Check that your config.ini file includes the WISE settings (the plugins, wiseHost, and wiseURL settings mentioned in step 4; see the sketch after this list)
  3. Check that from the capture/viewer hosts you can reach the WISE host and there are no ACL issues.
    curl http://WISEHOST:8081/fields
  4. Restarting capture with a --debug option may print useful information about what is wrong. Look to make sure that WISE is being called with the correct URL. Verify that the plugins, wiseHost, and wiseURL settings are what you actually think they are.
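
A minimal sketch of the relevant config.ini entries (placement and exact settings vary by version, and wiseURL replaced the older wiseHost/wisePort pair on newer releases, so adapt this to your setup):

[default]
# Capture side: load the WISE plugin and point it at the WISE server
plugins=wise.so
wiseURL=http://WISEHOST:8081
# Viewer side: load the WISE viewer plugin
viewerPlugins=wise.js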

arkime.com

How can I contribute?

Want to add or edit this FAQ? Found an issue on this site? This site's code is open source. Please contribute!
