:orphan:

.. _developers-mapping:

MBID Mapping
============

The MBID mapping scripts allow us to take metadata from the messybrainz database and
look up recording MBIDs from the MusicBrainz database.

.. note::

    The MBID Mapping source code lives in ``listenbrainz/mbid_mapping`` but is run independently
    from the main listenbrainz web docker image. You can use your own virtual environment or use
    ``listenbrainz/mbid_mapping/build.sh`` to build a standalone docker image.

Database tables
^^^^^^^^^^^^^^^

The MBID Mapping supplemental tables hold preprocessed data from the MusicBrainz database.

* ``mapping.canonical_musicbrainz_data``: The MBID and name of recordings, artists (and credits),
  and releases for all recordings in MusicBrainz
* ``mapping.canonical_recording_redirect``: A mapping to find the "canonical" recording given
  an artist credit + recording name
* ``mapping.canonical_release_redirect``: A mapping to find the "canonical" release given
  an artist credit + release name

These tables can be populated by running

.. code:: bash

    python mapper/manage.py canonical-data

The update process builds the new data in temporary tables and then swaps them into place
in a single transaction. This means that lookups can continue to run on the existing tables
while the new ones are being built.

Fuzzy lookups
^^^^^^^^^^^^^

We use Typesense to perform quick, fuzzy lookups based on artist name and recording name.

Build the Typesense index with

.. code:: bash

    python mapper/manage.py build-index

As with the data tables, a new Typesense collection is created and then swapped into place
in a single operation.

Build the mapping tables and then the Typesense index directly afterwards with

.. code:: bash

    python mapper/manage.py create-all

MBID Mapper
^^^^^^^^^^^

The mapper looks for new MSIDs submitted to messybrainz and finds a matching MBID in MusicBrainz:

``python3 -u -m listenbrainz.mbid_mapping_writer.mbid_mapping_writer``

A background thread pushes items to be processed onto a queue: recent submissions first and,
when there is nothing else to do, old items. The processing thread pops items off the queue,
looks them up, and adds the results to the ``mbid_mapping`` table.

There is also a background thread that fires once a day and looks for listens that have been
written to the listens table but, for some reason, do not have a matching mapping entry.
(This could happen due to restarts or problems with the mapper itself.) These are called
legacy listens. The background thread walks the entire listens table once a day to find
these legacy listens and attempts to map them.

In the same thread we also look for mapping items with a timestamp of the Unix epoch
(1970-01-01 00:00:00), which indicates that they ought to be re-checked. Currently we have no
automated mechanism in place for setting any mapping entries to the epoch.

TODO: Detuning algorithm

TODO: match quality types
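
To illustrate the queueing behaviour described under MBID Mapper (recent submissions are
processed before old items), here is a minimal sketch. It is not the mapper's actual
implementation: the use of ``queue.PriorityQueue``, the priority constants, the
``lookup_mbid`` helper and the dict standing in for the ``mbid_mapping`` table are all
assumptions made for illustration.

.. code:: python

    import queue
    import threading

    # Hypothetical priorities: a lower number is served first.
    RECENT, OLD = 0, 1


    def lookup_mbid(msid):
        """Stand-in for the real lookup; returns a fake MBID string."""
        return "mbid-for-" + msid


    def feeder(work_queue, recent_msids, old_msids):
        """Push work onto the queue, tagging each item with its priority."""
        for seq, msid in enumerate(recent_msids):
            work_queue.put((RECENT, seq, msid))
        for seq, msid in enumerate(old_msids):
            work_queue.put((OLD, seq, msid))
        # Sentinel that sorts after all real work and stops the processor.
        work_queue.put((OLD + 1, 0, None))


    def processor(work_queue, mapping, order):
        """Pop items in priority order and record each lookup result."""
        while True:
            _, _, msid = work_queue.get()
            if msid is None:
                break
            mapping[msid] = lookup_mbid(msid)  # dict stands in for mbid_mapping
            order.append(msid)


    def run_mapper(recent_msids, old_msids):
        work_queue = queue.PriorityQueue()
        mapping, order = {}, []
        producer = threading.Thread(
            target=feeder, args=(work_queue, recent_msids, old_msids)
        )
        producer.start()
        producer.join()  # fill the queue first so the order is deterministic
        consumer = threading.Thread(
            target=processor, args=(work_queue, mapping, order)
        )
        consumer.start()
        consumer.join()
        return mapping, order

Calling ``run_mapper(["r1", "r2"], ["o1"])`` processes both recent items before the old
one. In a long-running service the feeder and processor would run concurrently rather
than one after the other; they are run in sequence here only to keep the example
deterministic.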