Improving relevancy when working with large documents

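    Meilisearch is optimized for paragraph-sized chunks of text. Datasets made up of many documents that each contain a large amount of text may lead to less relevant search results.

    In this guide, you will see how to use JavaScript with Node.js to split a single large document, and how to configure Meilisearch with a distinct attribute to prevent duplicated results.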

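    Requirements

    Node.js and a running Meilisearch instance (the commands in this guide assume one at http://localhost:7700).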

    Dataset

    stories.json contains two documents, each storing the full text of a short story in its text field:

    [
      {
        "id": 0,
        "title": "A Haunted House",
        "author": "Virginia Woolf",
        "text": "Whatever hour you woke there was a door shutting. From room to room they went, hand in hand, lifting here, opening there, making sure—a ghostly couple.\n\n \"Here we left it,\" she said. And he added, \"Oh, but here too!\" \"It's upstairs,\" she murmured. \"And in the garden,\" he whispered. \"Quietly,\" they said, \"or we shall wake them.\"\n\nBut it wasn't that you woke us. Oh, no. \"They're looking for it; they're drawing the curtain,\" one might say, and so read on a page or two. \"Now they've found it,\" one would be certain, stopping the pencil on the margin. And then, tired of reading, one might rise and see for oneself, the house all empty, the doors standing open, only the wood pigeons bubbling with content and the hum of the threshing machine sounding from the farm. \"What did I come in here for? What did I want to find?\" My hands were empty. \"Perhaps it's upstairs then?\" The apples were in the loft. And so down again, the garden still as ever, only the book had slipped into the grass.\n\nBut they had found it in the drawing room. Not that one could ever see them. The window panes reflected apples, reflected roses; all the leaves were green in the glass. If they moved in the drawing room, the apple only turned its yellow side. Yet, the moment after, if the door was opened, spread about the floor, hung upon the walls, pendant from the ceiling—what? My hands were empty. The shadow of a thrush crossed the carpet; from the deepest wells of silence the wood pigeon drew its bubble of sound. \"Safe, safe, safe,\" the pulse of the house beat softly. \"The treasure buried; the room ...\" the pulse stopped short. Oh, was that the buried treasure?\n\nA moment later the light had faded. Out in the garden then? But the trees spun darkness for a wandering beam of sun. So fine, so rare, coolly sunk beneath the surface the beam I sought always burnt behind the glass. Death was the glass; death was between us; coming to the woman first, hundreds of years ago, leaving the house, sealing all the windows; the rooms were darkened. He left it, left her, went North, went East, saw the stars turned in the Southern sky; sought the house, found it dropped beneath the Downs. \"Safe, safe, safe,\" the pulse of the house beat gladly. \"The Treasure yours.\"\n\nThe wind roars up the avenue. Trees stoop and bend this way and that. Moonbeams splash and spill wildly in the rain. But the beam of the lamp falls straight from the window. The candle burns stiff and still. Wandering through the house, opening the windows, whispering not to wake us, the ghostly couple seek their joy.\n\n\"Here we slept,\" she says. And he adds, \"Kisses without number.\" \"Waking in the morning—\" \"Silver between the trees—\" \"Upstairs—\" \"In the garden—\" \"When summer came—\" \"In winter snowtime—\" The doors go shutting far in the distance, gently knocking like the pulse of a heart.\n\nNearer they come; cease at the doorway. The wind falls, the rain slides silver down the glass. Our eyes darken; we hear no steps beside us; we see no lady spread her ghostly cloak. His hands shield the lantern. \"Look,\" he breathes. \"Sound asleep. Love upon their lips.\"\n\nStooping, holding their silver lamp above us, long they look and deeply. Long they pause. The wind drives straightly; the flame stoops slightly. Wild beams of moonlight cross both floor and wall, and, meeting, stain the faces bent; the faces pondering; the faces that search the sleepers and seek their hidden joy.\n\n\"Safe, safe, safe,\" the heart of the house beats proudly. \"Long years—\" he sighs. \"Again you found me.\" \"Here,\" she murmurs, \"sleeping; in the garden reading; laughing, rolling apples in the loft. Here we left our treasure—\" Stooping, their light lifts the lids upon my eyes. \"Safe! safe! safe!\" the pulse of the house beats wildly. Waking, I cry \"Oh, is this _your_ buried treasure? The light in the heart."
      },
      {
        "id": 1,
        "title": "Monday or Tuesday",
        "author": "Virginia Woolf",
        "text": "Lazy and indifferent, shaking space easily from his wings, knowing his way, the heron passes over the church beneath the sky. White and distant, absorbed in itself, endlessly the sky covers and uncovers, moves and remains. A lake? Blot the shores of it out! A mountain? Oh, perfect—the sun gold on its slopes. Down that falls. Ferns then, or white feathers, for ever and ever——\n\nDesiring truth, awaiting it, laboriously distilling a few words, for ever desiring—(a cry starts to the left, another to the right. Wheels strike divergently. Omnibuses conglomerate in conflict)—for ever desiring—(the clock asseverates with twelve distinct strokes that it is midday; light sheds gold scales; children swarm)—for ever desiring truth. Red is the dome; coins hang on the trees; smoke trails from the chimneys; bark, shout, cry \"Iron for sale\"—and truth?\n\nRadiating to a point men's feet and women's feet, black or gold-encrusted—(This foggy weather—Sugar? No, thank you—The commonwealth of the future)—the firelight darting and making the room red, save for the black figures and their bright eyes, while outside a van discharges, Miss Thingummy drinks tea at her desk, and plate-glass preserves fur coats——\n\nFlaunted, leaf-light, drifting at corners, blown across the wheels, silver-splashed, home or not home, gathered, scattered, squandered in separate scales, swept up, down, torn, sunk, assembled—and truth?\n\nNow to recollect by the fireside on the white square of marble. From ivory depths words rising shed their blackness, blossom and penetrate. Fallen the book; in the flame, in the smoke, in the momentary sparks—or now voyaging, the marble square pendant, minarets beneath and the Indian seas, while space rushes blue and stars glint—truth? or now, content with closeness?\n\nLazy and indifferent the heron returns; the sky veils her stars; then bares them."
      }
    ]
    
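    What counts as a large document in Meilisearch?

    Meilisearch works best with documents smaller than 1kb. That corresponds to roughly two or three paragraphs of text at most.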
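
    To check which documents in a dataset exceed that size, a minimal Node.js sketch such as the following may help. It assumes that "1kb" means 1024 bytes of UTF-8 encoded JSON. The names SIZE_LIMIT_BYTES, documentSize, and findLargeDocuments, as well as the sample data, are illustrative and not part of Meilisearch.

```javascript
// Assumption: "1kb" is read here as 1024 bytes of UTF-8 encoded JSON.
const SIZE_LIMIT_BYTES = 1024;

// Returns the size in bytes of a document once serialized to JSON.
function documentSize(doc) {
  return Buffer.byteLength(JSON.stringify(doc), "utf8");
}

// Returns the documents whose serialized size exceeds the limit.
function findLargeDocuments(documents, limit = SIZE_LIMIT_BYTES) {
  return documents.filter((doc) => documentSize(doc) > limit);
}

// Usage with two in-memory documents (hypothetical data).
const sample = [
  { id: 0, text: "A short paragraph." },
  { id: 1, text: "x".repeat(2000) },
];
console.log(findLargeDocuments(sample).map((doc) => doc.id)); // [ 1 ]
```

    Documents reported by a check like this are candidates for the splitting step described in this guide.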

    Splitting documents

    Create a file named split_documents.js in your working directory:

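    #!/usr/bin/env node

    // Node's built-in file system module, needed to read and write the datasets
    const fs = require("fs");

    // Read the dataset given as the first command-line argument and parse it
    const datasetPath = process.argv[2];
    const datasetFile = fs.readFileSync(datasetPath, "utf8");
    const documents = JSON.parse(datasetFile);

    const splitDocuments = [];

    for (let documentNumber = documents.length, i = 0; i < documentNumber; i += 1) {
      const document = documents[i];
      const story = document.text;

      // Paragraphs are separated by a blank line
      const paragraphs = story.split("\n\n");

      // Create one new document per paragraph, copying the story's metadata
      for (let paragraphNumber = paragraphs.length, o = 0; o < paragraphNumber; o += 1) {
        splitDocuments.push({
          "id": document.id,
          "title": document.title,
          "author": document.author,
          "text": paragraphs[o]
        });
      }
    }

    // Write the split documents to a new file
    fs.writeFileSync("stories-split.json", JSON.stringify(splitDocuments));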
    

    Next, run the script from your console, passing the path to the JSON dataset:

    node ./split_documents.js ./stories.json
    

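    This script accepts one argument: the path to the JSON dataset. It reads the file and parses each document in it. For each paragraph in a document's text field, it creates a new document that keeps the original id, title, and author and stores that paragraph in text. Finally, it writes the new documents to stories-split.json.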
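
    The paragraph-splitting step can be tried in isolation. The sketch below uses the same split on "\n\n" as the script, and additionally trims whitespace and drops empty paragraphs. That extra cleanup, the splitIntoParagraphs name, and the sample text are assumptions made for illustration only.

```javascript
// Splits a story into paragraphs on blank lines, then trims each paragraph
// and drops empty ones (the trimming and filtering are an added assumption).
function splitIntoParagraphs(text) {
  return text
    .split("\n\n")
    .map((paragraph) => paragraph.trim())
    .filter((paragraph) => paragraph.length > 0);
}

// Hypothetical text with a leading space and a run of extra blank lines.
const story = "First paragraph.\n\n Second paragraph.\n\n\n\nThird paragraph.";
console.log(splitIntoParagraphs(story));
```

    Whether to trim or discard empty paragraphs depends on the dataset; the script in this guide keeps every piece exactly as split.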

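    Generating unique IDs

    At this point, Meilisearch will not accept the new dataset because many documents share the same primary key.

    Update the script from the previous step to create a new field, story_id: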

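    #!/usr/bin/env node

    // Node's built-in file system module, needed to read and write the datasets
    const fs = require("fs");

    const datasetPath = process.argv[2];
    const datasetFile = fs.readFileSync(datasetPath, "utf8");
    const documents = JSON.parse(datasetFile);

    const splitDocuments = [];

    for (let documentNumber = documents.length, i = 0; i < documentNumber; i += 1) {
      const document = documents[i];
      const story = document.text;

      const paragraphs = story.split("\n\n");

      for (let paragraphNumber = paragraphs.length, o = 0; o < paragraphNumber; o += 1) {
        splitDocuments.push({
          // Keep a reference to the story this paragraph belongs to
          "story_id": document.id,
          // Combine the story id and the paragraph position into a unique id
          "id": `${document.id}-${o}`,
          "title": document.title,
          "author": document.author,
          "text": paragraphs[o]
        });
      }
    }

    // Write the split documents to a new file, as in the previous step
    fs.writeFileSync("stories-split.json", JSON.stringify(splitDocuments));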
    

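    The script now stores the id of the original document in story_id. It then creates a new unique identifier for each new document and stores it in the primary key field, id.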
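
    One way to confirm that the new identifiers are unique before indexing is to look for repeated values. In the sketch below, findDuplicateIds and both sample arrays are hypothetical and only illustrate the difference between the two versions of the script.

```javascript
// Returns the primary key values that appear more than once.
function findDuplicateIds(documents, primaryKey = "id") {
  const seen = new Set();
  const duplicates = new Set();
  for (const doc of documents) {
    const key = doc[primaryKey];
    if (seen.has(key)) duplicates.add(key);
    seen.add(key);
  }
  return [...duplicates];
}

// Shape of the ids produced by the first and second versions of the script.
const before = [{ id: 0 }, { id: 0 }, { id: 1 }];
const after = [{ id: "0-0" }, { id: "0-1" }, { id: "1-0" }];
console.log(findDuplicateIds(before)); // [ 0 ]
console.log(findDuplicateIds(after)); // []
```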

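    Configuring the distinct attribute

    The dataset is now valid, but because many of its documents point to the same story, queries are likely to return duplicated search results.

    To prevent this, configure story_id as the index's distinct attribute: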

    curl \
      -X PUT 'http://localhost:7700/indexes/INDEX_NAME/settings/distinct-attribute' \
      -H 'Content-Type: application/json' \
      --data-binary '"story_id"'
    

    Users searching this dataset will now find more relevant results within large bodies of text, with no loss of performance and no duplicates.
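
    To make the effect of a distinct attribute more concrete, the sketch below imitates it locally: given hits that are already ordered by relevancy, it keeps only the first hit for each story_id. This is an illustration of the idea rather than Meilisearch's implementation, and keepFirstPerValue and the sample hits are hypothetical.

```javascript
// Keeps only the first (best-ranked) hit for each value of the distinct field.
function keepFirstPerValue(rankedHits, distinctField) {
  const seen = new Set();
  return rankedHits.filter((hit) => {
    const value = hit[distinctField];
    if (seen.has(value)) return false;
    seen.add(value);
    return true;
  });
}

// Hypothetical hits, already ordered from most to least relevant.
const hits = [
  { id: "0-3", story_id: 0, title: "A Haunted House" },
  { id: "0-1", story_id: 0, title: "A Haunted House" },
  { id: "1-2", story_id: 1, title: "Monday or Tuesday" },
];
console.log(keepFirstPerValue(hits, "story_id").map((hit) => hit.id)); // [ '0-3', '1-2' ]
```

    With story_id set as the distinct attribute, Meilisearch performs this kind of deduplication itself at search time, so no client-side filtering is needed.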

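    Conclusion

    You have seen how to split large documents to improve search relevancy. You have also seen how to configure a distinct attribute to prevent Meilisearch from returning duplicated results.

    Although this guide uses JavaScript, you can reproduce the process in any programming language you are comfortable with.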