Desktop Search Hackfest/Day one notes

lightning talks

beagle
• gmail
• compatibility between extractors and similar components should be discussed
• avahi, web services provided by the daemon itself
• online results shown in firefox
• external devices/sources working
• feature-wise, beagle feels they have reached a complete state
• more interested in dashboard now
• a masters student working on dashboard at the moment
• a million command line tools to prove the concept
• how to tie documents and objects together (associations) without consuming lots of processing power all the time for the creation of the associations

Strigi

• ported to mac and windows
• basic requirement of kde4
• modularized into 4 c++ libraries
• libstreams
  ∘ efficient access to files as streams
  ∘ reading nested streams without overhead: no copying of intermediate data
  ∘ tar, bz2, zip, rpm, deb, pdf, ole2, mail, .a
  ∘ jstream://
    ‣ efficient for getting at the inner contents of objects without the need to create an intermediate copy of the objects
• libstreamanalyzer
  ∘ api for accessing metadata from streams
  ∘ uses libstreams
  ∘ uses 5 plugins for analyzers, each optimized for speed on a specific data type
  ∘ most analyzers can be used in parallel
  ∘ all analyzers always run, with little overhead
  ∘ an index object gets the data from each content item
  ∘ analyzers also run on embedded files' content
  ∘ plugins for the indexes, so in theory different indexes can be used
    ‣ most important: clucene and soprano
      • clucene very fast
      • soprano used by nepomuk for semantic storage
    ‣ could use tracker as the index as well
• libsearchclient
  ∘ socket access to strigidaemon
• libstrigihtmlgui
• libstrigiqtdbusclient
• strigidaemon
  ∘ c++
  ∘ relatively little code in the daemon
  ∘ abstract interface for communications
  ∘ thread pool
  ∘ initial xesam support, partially shared with pinot
• THE POWER OF STRIGI IS IN THE LIBRARIES THAT CAN BE SHARED
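The nested-stream idea above can be sketched in Python (purely illustrative; Strigi's libstreams is C++): reading a member of an in-memory .tar.bz2 as a stream, with no intermediate file on disk, which is roughly what the jstream:// access model provides.

```python
import bz2
import io
import tarfile

# Build a small .tar.bz2 in memory so the example is self-contained.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:bz2") as tar:
    data = b"hello from inside the archive"
    info = tarfile.TarInfo(name="docs/readme.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
buf.seek(0)

# Stream the nested entry: the bz2 layer is decompressed on the fly and
# the tar member is read as a stream; no intermediate copy of the inner
# object is written out, which is the core idea behind jstream://.
with tarfile.open(fileobj=buf, mode="r:bz2") as tar:
    member = tar.extractfile("docs/readme.txt")
    content = member.read()

print(content.decode())
```

An analyzer plugin would consume `member` incrementally instead of calling `read()` on the whole thing, but the access pattern is the same.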

Xesam
• sd
• Using mirror
• in the beginning, dashbar was used
• it seemed poor to make a compatibility layer for all engines (strigi, beagle, ...)
• unite existing apis
• hide implementation details of backends
• allow simpler indexes such as grep
• support for web services such as google
• keeping it simple was the leading thought
• does not support everything, but instead mostly desktop and mobile devices
• keep a low barrier of entry
• proactive openness
  ∘ community effort
• history
  ∘ 2 apis
  ∘ simple query and 'live'
  ∘ the simple api was too simple - so, live was the way to go
  ∘ xesam
    ‣ dbus search api
    ‣ ontology
    ‣ xml query language
    ‣ user search language
  ∘ full draft online - the devil is in the details
  ∘ summarizing the critique:
    ‣ too complex ontology
    ‣ feature creep (it's getting too complex)
    ‣ not extensible enough
    ‣ vulnerability to embrace-and-extend tactics

tracker
• in the last half year, work has revolved around splitting the daemon and indexer into separate entities
• the extractor process spawns and pipes data to the indexer
• filters handle the files
• weighting data items as 'hot' based on heading-ness or boldness
• the daemon has proper introspection
• the dbus api has had only small changes
• planning to change that quite a lot to support xesam
• throttling - pausing indexing with monitoring events, nice, dynamic blacklisting of files (blacklisting of dirs?)
• q: a wrapper for streamanalyzer?
  ∘ kde4 metadata extraction uses streamanalyzers - all metadata is provided with those
  ∘ suggestion for tracker to consider streamanalyzers
• suggestion from kubasik, jos, jamie: use a common extraction piping model
• jos: suggestion not to use dbus between the extractor and daemon process
• the daemon is only there to serve the data
• the indexer is there to handle the writing of the actual data
• suggestion to keep this in the tips and tricks session
• last issue before merge - fix jamie's test on his laptop
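The split discussed above - a separate extractor process feeding the indexer over a plain pipe rather than D-Bus - can be sketched like this (hypothetical record format and names, not Tracker's actual protocol):

```python
import json
import subprocess
import sys

# Hypothetical extractor: runs as a separate process and writes one JSON
# record per file to stdout. A crash in some parser then cannot take
# down the indexer, which is the motivation for the process split.
EXTRACTOR = r"""
import json, sys
for path in sys.argv[1:]:
    # A real extractor would parse the file; here we fake the metadata.
    print(json.dumps({"uri": path, "words": 42}))
"""

proc = subprocess.Popen(
    [sys.executable, "-c", EXTRACTOR, "/tmp/a.txt", "/tmp/b.txt"],
    stdout=subprocess.PIPE, text=True,
)

# Indexer side: consume the pipe and write into the (toy) index.
index = {}
for line in proc.stdout:
    record = json.loads(line)
    index[record["uri"]] = record
proc.wait()

print(sorted(index))
```

A plain pipe matches jos's suggestion of not putting D-Bus between the extractor and the daemon: D-Bus stays the client-facing query interface, while bulk extraction data flows over a cheaper channel.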

ontologies:
• evgeny giving introduction
• basic design decisions
• why rdf triples
• inheritance is needed
• gives an example of hierarchy (xesam:contributor in core; a client extends this as xesam:majorContributor and xesam:minorContributor)
• this lets applications that do not know of the client extensions still interoperate with the data
• how many features do we want to implement? 95% (but not more?)
• idea to drop corner cases
• it is not only which features we implement but also that we implement those well; this will lead to a change in user habits (example: file navigation used to cover 95% of file location needs; after indexers, this can be diminished to 60-70% - but only if indexers work well, are reliable and easy to use)
• thus, we need a simple enough framework that is powerful enough, but easy enough to implement and start to use
• content and source split
  ∘ source is what the file is and how it is stored
  ∘ content deals with documents, messages and such
  ∘ for the content tree, it does not matter how it is stored
  ∘ source is the place for acl
  ∘ a document has the created date - it is the actual creation date of the document
  ∘ the source created date is the filesystem-level creation date
  ∘ also, modified on source could be when e.g. the acl was changed; on the document tree, changed means when someone actually edited the document
  ∘ evgeny gives the example of a scientific paper
  ∘ xesam:author links the item to xesam:contact
  ∘ the xesam:contact country could be e.g. UK
  ∘ now, if we want to find/list documents of people from the uk
  ∘ the multiple-query solution for this use case can be very slow if the contacts contain tens of thousands of people
  ∘ so, the dbus traffic is unnecessarily large
  ∘ and there are many other use cases with this type of relationship that will be too difficult to implement efficiently with the current xesam model
  ∘ photo annotations
    ‣ defining a region of a photo and who is there
    ‣ thus, the annotation is complex
    ‣ example - photos of old people
    ‣ move toward a semantic representation of data; FOAF example of the photo annotations
    ‣ semantic web activities are going on in this area at the moment
    ‣ another example: papers referring to other papers via identifiers without directly pointing to a file
    ‣ the point being that the client will have to do work that really makes sense to be done on the backend side
    ‣ suggestion: use xesam (tm) if it correctly implements the core set of the functionality
    ‣ we will need to be future proof and not stagnate and have people making hacks and workarounds
    ‣ philip van hoof: the standard will be successful only if client developers take it into use
    ‣ martyn: it's an iterative process
    ‣ client developers are our users
    ‣ using xesam needs to be easy and it needs to be able to easily provide the needed information
    ‣ evgeny: how do we solve this conflict of interest? we need to define xesam in a flexible enough way
    ‣ evgeny: it is not simple enough for the users as it is
    ‣ mikkel: two camps: the semantic camp and the field-based camp; my opinion is more on the field-based side - thus, I don't see needs of this type as very relevant for our needs
    ‣ kubasik: I agree that multidimensional graphs are not so interesting - maybe we could provide a convenience library to do these kinds of deep relationships. so, perhaps xesam could just provide the field-based stuff and let the wrappers worry about the more extended multidimensional use cases
    ‣ evgeny: my opinion would be to make it possible for the backend to do this, and provide the wrapper for the simple backends to do the field-to-graph use cases
    ‣ maybe we could roadmap it so that only some parts of the query language and ontology would be available
    ‣ kubasik: if we do paged browsing, the deeper relationships will be completely awful to implement efficiently - so most often the deeper levels should not need to be done
    ‣ evgeny: this need will be implemented by either the client or the server; the use cases are there
    ‣ jos: can you provide a list of items to change
    ‣ evgeny: we need to decide whether we want to support these structures or not
    ‣ evgeny: can provide a list of drawbacks
    ‣ the current situation is undecided on what the type of e.g. author is (a reference as a url, or a string of the author name) - needs to be decided
    ‣ nepomuk guy:
    ‣ jos: my suggestion - if I have the author's email address, then when we go to the next version, where we have contacts, I will store the contact url in the database
    ‣ urho: suggesting inner queries for the query api
    ‣ evgeny: let me give more examples
      • sparql - and the current query language
      • we can replace the current query language later on
    ‣ jamie: the mapping really doesn't exist in the actual contents; the documents usually don't map to the actual authors, but instead just contain the free text of the author names
    ‣ so, as long as the applications do not contain the possibilities, the graphs won't happen
    ‣ evgeny: we need to make it possible for the applications to do this, otherwise the applications will never support it
    ‣ nepomuk guy: shouldn't we make the inner queries optional, as well as the graph queries
      • intermediate solution - a flat view would map queries to text, and on graph-capable implementations to the object url instead
      • possibility to create a dummy item for each url, containing the string in that object
    ‣ kubasik: this will need a lot more support from the backend and it will be much more complicated to do - it would be more future compatible to be in xesam
    ‣ nepomuk guy: internal stuff could create the internal representations in a hackish way and just expose the data in a graph way
    ‣ evgeny: proposing a far-reaching goal to make it easy for users in the mid-term future to transition to this - maybe a roadmap for applications to be able to produce more complex queries
    ‣ urho: so, do you propose SPARQL?
    ‣ evgeny: nope - we should be able to extend the current XML
    ‣ nepomuk guy: we could create something related to inner queries
    ‣ mikkel: why not just allow the results of one query to be the source for the next one
    ‣ urho: so inner queries - yeah
    ‣ evgeny: makes an example of the paper question in sparql
    ‣ evgeny: we are not looking at using sparql, just similar functionality
    ‣ evgeny: mapping sparql to sql can be done
    ‣ jos: I propose that we don't implement this for 1.0
      • just make the ontology future proof
      • make a temporary solution and a future path
    ‣ evgeny: list of use cases
      • we have no way to query - the semantic desktop guys will not be able to use xesam the way they can use nepomuk
      • maybe we could at least make a wrapper library?
    ‣ jos: making the proposal
      • usually we store the name of the author in a field on the document
      • in a graph we would store it on a separate object
      • so, I propose that we do it like this now, and if the engine supports the denormalized model, it will do the conversion from the object to the name
      • and we extend the ontology the denormalized way
      • nepomuk guy: we have the proper ontology in nepomuk, we could just translate that to xesam
      • kubasik: suggests that we make a small team to discuss this in real detail
      • agreed: 1st item: FAIL -
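Evgeny's UK-authors example can be illustrated with a toy field-based store (all field names here are hypothetical, not the real xesam ontology): when the author field is a link to a contact object, resolving the link on the backend in one inner query avoids shipping thousands of contact ids over D-Bus and issuing a second query with them.

```python
# Toy in-memory "store": field-based records, with author stored as a
# link to a contact object rather than a plain string.
contacts = {
    "contact:1": {"name": "Ada", "country": "UK"},
    "contact:2": {"name": "Grace", "country": "US"},
}
documents = [
    {"uri": "doc:1", "title": "Paper A", "author": "contact:1"},
    {"uri": "doc:2", "title": "Paper B", "author": "contact:2"},
    {"uri": "doc:3", "title": "Paper C", "author": "contact:1"},
]

def docs_by_country(country):
    # One backend-side inner query: first select matching contacts,
    # then the documents linking to them. The client makes a single
    # request instead of two round trips with a huge id list between.
    hits = {uri for uri, c in contacts.items() if c["country"] == country}
    return [d["uri"] for d in documents if d["author"] in hits]

print(docs_by_country("UK"))
```

With tens of thousands of contacts, the intermediate `hits` set is exactly the data that would otherwise have to cross the bus twice, which is the inefficiency the inner-query proposal targets.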

Search API (d-bus api)
• notes taken by urho
• latest changes are for sorting and some session properties
• we simply list all the problems that people have with it
• inner queries agreed to be handled separately
• jamie raising the issue of ranged search
• philip: why don't we remove the session completely - it's completely unused
• jos: i'll explain the session - it's for live queries
  ∘ you can have multiple searches per session
  ∘ a search thus should have a separate id so that it can be closed; there is no need for the session at all
• kubasik:
• rob: you need to know what searches are running
• mikkel: if we create the session at the time of the start
• martyn: we need to have a separate id for searches
• rob: we have a separate id for searches
  ∘ can't we add the properties to the query itself?
  ∘ go for large and infrequent messages
• kubasik: we should decide whether to put the properties in the xml or on dbus
• mikkel: properties can go on the search object instead (written)
• philip: I agree with sebastian (nepomuk guy), hit fields should be in the query
• philip: we can instead leave the session changes for xesam 2.0 and not yet do them for 1.0
• philip: if we change the query language, that'll break everything anyway
• rob: it seems like you are using ids for the objects on the server side - perhaps they actually should be d-bus objects
• jos: so, the current model is imperfect, but enough for 1.0
• jos: let's keep it like it is for now, as it works
• versioning discussion - should dbus have the version number in the dbus namespace?
• rob checks - yes you can
• kubasik: we need a somewhat better way to decide which engine to use
  ∘ but we don't need to specify the way to do it
• philip: we should have an announcement of capabilities to allow applications to choose the right engine with the right capabilities
• kubasik: distributions need to be taken into account here
• mikkel: we need to have documentation about the capabilities and make the apis to discover them
• mikkel: can be considered for 1.0 (at least guidelines)
  ∘ let's make a group to specify this
• jos: gethitdata will be capable of handling the ranges quite easily
• urho: we have the case where a hitsadded hit is on a sorted live query and in a place in the visible result set. without the id and uri in the hitsadded, you will have to re-query your entire set
• kubasik: most of the time the issue is really that a single item has been modified
• mikkel: so, we would emit: livehitsadded
• jos: a live search never sends searchdone
• mikkel: yes it does, once the indexing is done
• mikkel: let's add a livesearchadded that would also contain the hit metadata
• jos: we need to minimize the changes needed
• everyone can live with even the paging
• jamie: so, no changes for 1.0 for hitsadded, but for 2.0 yes
• jos: the idea for ranged search will mean that we supply the ids of the hits
• michael: d-bus api
  ∘ several search engines in parallel is sometimes an issue
  ∘ one engine refuses to start if the d-bus api is not available
  ∘ we should make search engines queue on d-bus
  ∘ no way to decide which engine is started
  ∘ mikkel: we also have a xesam interface on tracker, so that can be used: org.freedesktop.tracker.xesam?
  ∘ philip: we don't
  ∘ mikkel: perhaps that would be a recommendation
  ∘ urho: how about org.freedesktop.xesam.tracker and org.freedesktop.beagle
  ∘ mikkel: good idea
  ∘ michael raising issues about specialist engines existing on the same system
• philip: wants to raise something
  ∘ this suggestion is unimplementable on sqlite as it is
  ∘ ottela: we get hits 1 - 1000, then we change the value of item 1 to be last on the list
  ∘ ottela: if one file changes, hitsmodified tells that 1 changed to 1000 and 2 changed to 1, 3 to 2, ...
  ∘ jos: why don't we say that gethitsdata needs to be sequential
  ∘ philip: issues like 1, 10000, 500 will mean the complexity of the corner cases will become very difficult
  ∘ and this will make it quite complex to do properly on the server side
  ∘ urho: how about adding getrangedhits
  ∘ jos: if we don't remove the gethitsdata that already exists
  ∘ agreed by everyone
• martyn: I'd rather have a new hitsmoved rather than having hitsremoved and hitsadded
• jos: currently the position should not be changing?
• kubasik explains beagle issues
• martyn: sounds nice, but the point is that the new api would be clearer for the client applications to use
• kubasik: it's going to be difficult to implement the live queries
• martyn: surely we should add the possibility to handle things without the need to re-create the live query
• sebastian: how about another working group?
• mikkel agreed
• so, a group to discuss the live sorting and hits added / moved issue
• kubasik: we'll have the group and I can tell how beagle is doing this
• mikkel: one group for this and another group for the accessibility
• sebastian proposes using floating points for the ordering
• ben raises the issue of how hard it will be to sort by strings, by text
• mikkel: live query sorting - who is in the group?
• jos: I don't think live queries will be efficiently possible on the server side
• group: half an hour - jos, martyn, kubasik, mikael, mikkel (?)
• LUNCH
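Sebastian's floating-point ordering proposal might look roughly like this (an illustrative sketch, not any engine's actual implementation): each live hit carries a float sort key, so inserting or moving a hit only needs a key between its two neighbours, and the engine can report a single position change instead of a cascade of hitsremoved/hitsadded renumberings.

```python
import bisect

class LiveResults:
    """Toy live result list ordered by float sort keys."""

    def __init__(self):
        self.keys = []   # sorted float sort keys
        self.hits = []   # hit ids, kept parallel to self.keys

    def insert_between(self, hit, lo, hi):
        # Midpoint key leaves room for later inserts on either side;
        # neighbouring hits keep their keys, so their positions need
        # no renumbering (the problem ottela described above).
        key = (lo + hi) / 2.0
        pos = bisect.bisect(self.keys, key)
        self.keys.insert(pos, key)
        self.hits.insert(pos, hit)
        return pos       # position to report in a hitsadded-style signal

results = LiveResults()
results.insert_between("hit-a", 0.0, 1.0)   # key 0.5
results.insert_between("hit-b", 0.5, 1.0)   # key 0.75, lands after hit-a
results.insert_between("hit-c", 0.0, 0.5)   # key 0.25, lands before hit-a
print(results.hits)
```

Repeated midpoint inserts eventually exhaust float precision, so a real engine would have to rebalance keys occasionally; the sketch only shows the signalling benefit.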

Ontology discussion: link by id - mikkel presenting on post-its
• originally evgeny's idea, as a middle way between the semantic graph and the current xesam
• the problem is mainly linking related objects (like linking a document and its author, or a document contained in an email)
• relating e.g. two documents to each other with an isbn
• there are multiple ids (not just uris)
• they will need to be 'sufficiently unique'
• solving the problem
  ∘ a xesam id field (any child of it must have a value that is unique over all xesam ids)
  ∘ all of the links must always be to ids
  ∘ we make a new special kind of field that is a child of xesam:related, or it could be a special data type
  ∘ it should always hold a xesam id
  ∘ with this we can find all objects that point to X (or any that X points to)
  ∘ so, e.g. a pdf file references an isbn - query the xesam id field with that isbn
  ∘ sebastian: subqueries should be enough to allow good usage of this in the search api
  ∘ urho: agreed
  ∘ everybody agrees to investigate the idea
  ∘ mikkel: suggests using a new id-field in the xesam api
  ∘ kubasik: we would most probably need to do nested queries, and probably not an optimized index for this
  ∘ sebastian: normally you would say "give me all emails that have a sender, a contact object, with a name like this"
    ‣ now you could do it with sender as a range, of a person object; it would check all xesam id values in the relation
  ∘ mikkel: it should be very easy to do these queries
    ‣ and it should be very easy to show the values of these fields
  ∘ jos: we should make two fields: a readable value and a pointer/relation property. also, the stronger engines could say that the value is just an alias of the actual object's relation property
  ∘ sebastian: why not use the uri for data representation plugins that would then be able to create the visible widgets for the data items
  ∘ sebastian: i see that there are some issues with people not wanting to do two queries
  ∘ evgeny: you would need inference for it, and it's easier if we make it so that xesam:strings would contain the string representation of the url / link values
  ∘ so, this way, we use the string representation only for full text search
  ∘ evgeny: I feel that we have a bit of a problem in the specification of what should be in the FTI
  ∘ anders: we also need to be able to show how/why the relation was created, and to be able to do that for all the related content
  ∘ evgeny: we cannot have the stuff linking to everything in the database
  ∘ sebastian: we don't need to present how a relation was found; it's a GUI thing
  ∘ jamie: two fields would be cleaner
  ∘ jamie draws an example below evgeny's example
  ∘ jos agrees on the two-field model
  ∘ evgeny isn't completely convinced
  ∘ the point is that the UI will likely match the actual contact
  ∘ jamie: this way the application would have the choice
  ∘ jamie: if we have just a name in the author text, then we don't want to try to create the dummy object
  ∘ evgeny: the compromise solution in my opinion is that we have to use the subqueries
  ∘ discussion on wanting the values of properties in a single query in xesam
  ∘ jamie: we could always deprecate the fields
  ∘ jos: i liked the suggestion of author text and author link, and grouping the fields on a string text
  ∘ sebastian: why not make all dummy objects have the same data layout
  ∘ at least the data is syntactically correct if we use dummy objects
  ∘ jamie: it's not worth it if we don't have the semantic info
  ∘ sebastian: this way we handle the data in the same way for the 'proper' objects as well as the dummy ones
  ∘ conclusion:
    ‣ we make some older string properties URIs instead of the previous string values
    ‣ all new URI properties also have a .label that allows getting the textual representation of the property, which in essence is the old xesam ontology value of the property. the ideal is that this is a pointer to a specific property of the linked object
    ‣ we will draft nested queries and they will most probably go to 1.1
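The two-field conclusion can be sketched as follows (field names illustrative only, not the final xesam ontology): the link field holds a URI, and a parallel .label field keeps the old textual value, so field-based clients keep working while graph-capable engines can follow the link.

```python
# Toy object store: the link target is a separate contact object.
store = {
    "contact:9": {"name": "Ada Lovelace"},
}

# A document in the two-field model: the link as a URI, plus a .label
# carrying the old-style readable value of the same property.
doc = {
    "uri": "doc:7",
    "author": "contact:9",            # pointer/relation property
    "author.label": "Ada Lovelace",   # readable value (old ontology style)
}

def author_display(document):
    # A simple field-based client just reads the label; a graph-aware
    # client can instead dereference the URI and follow further
    # relations (country, email, ...) on the contact object.
    return document.get("author.label") or store[document["author"]]["name"]

print(author_display(doc))
```

A stronger engine could treat `author.label` as an alias of `store[author]["name"]`, as jos suggested, so the two fields never drift apart.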